Semi-supervised learning is a machine learning paradigm that falls between supervised and unsupervised learning.
In supervised learning, an algorithm is trained on a labelled dataset, where each data point is associated with the correct output or target. In unsupervised learning, algorithms work with unlabeled data to discover patterns or structures within it. Semi-supervised learning combines elements of both: it leverages a limited amount of labelled data together with a larger pool of unlabeled data to improve the performance of machine learning models. The key ideas in semi-supervised learning include:
- Leveraging Limited Labels: Semi-supervised learning is useful in situations where obtaining labelled data is expensive or time-consuming. By making the most of the limited labelled data available, it can often achieve better performance than supervised learning on the labelled data alone.
- Semi-supervised Models: Various machine learning algorithms and models can be adapted for semi-supervised learning, such as self-training, co-training, and approaches based on generative models; closely related ideas also appear in self-supervised learning. A minimal self-training sketch is given after this list.
- Smoothness Assumption: A fundamental assumption in semi-supervised learning is that points that are close to each other in the input space are likely to have the same class label. This assumption is based on the idea that the decision boundary between different classes should be smooth and continuous.
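As an illustration of the self-training idea mentioned above, the following sketch uses scikit-learn's SelfTrainingClassifier, which wraps a base classifier and iteratively pseudo-labels the points marked as unlabeled (label -1) whose predicted-class probability exceeds a threshold. The synthetic dataset, the 0.9 threshold, and the logistic-regression base model are illustrative assumptions, not choices prescribed by this text.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.semi_supervised import SelfTrainingClassifier

# Toy dataset: 500 points, of which only ~5% keep their labels.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
rng = np.random.default_rng(0)
y_partial = y.copy()
y_partial[rng.random(len(y)) > 0.05] = -1   # -1 marks "unlabeled" in scikit-learn

# Self-training: pseudo-label points whose top-class probability exceeds 0.9.
base = LogisticRegression(max_iter=1000)
model = SelfTrainingClassifier(base, threshold=0.9, max_iter=10)
model.fit(X, y_partial)

print("Accuracy on all points:", accuracy_score(y, model.predict(X)))
```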
Semi-supervised learning finds application in a variety of fields where obtaining labelled data is challenging due to its scarcity or high acquisition cost. The core idea is to harness unlabeled data, which is far more readily available, in conjunction with a limited set of labelled data for model training. To leverage unlabeled data effectively, a common approach is to assign pseudo-labels (pseudo annotations) to unlabeled data points. These pseudo-labeled points are then combined with the originally labelled data to refine the model. Pseudo-labeling is typically implemented as an iterative process, allowing the model to progressively improve the quality of its pseudo-labels.
The procedure typically begins by training a machine learning model on the limited set of labelled data. The choice of algorithm, whether a neural network, decision tree, support vector machine, or another model, depends on the problem at hand. This trained model is then used to generate predictions on the larger pool of unlabeled data, and these predictions serve as pseudo-labels for the unlabeled instances. The level of trust placed in each pseudo-label can be determined from the model's confidence in its prediction: one may accept only pseudo-labels whose confidence exceeds a high threshold and discard the rest, as in the sketch below.
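A minimal sketch of this step, assuming a scikit-learn-style classifier with predict_proba; the helper name, the random-forest model, and the 0.9 threshold are assumptions for illustration, not part of the original description.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def generate_pseudo_labels(X_labeled, y_labeled, X_unlabeled, threshold=0.9):
    """Train an initial model on the labelled set and pseudo-label the
    unlabeled points the model is confident about."""
    model = RandomForestClassifier(random_state=0)
    model.fit(X_labeled, y_labeled)

    # Class-probability estimates on the unlabeled pool.
    proba = model.predict_proba(X_unlabeled)
    confidence = proba.max(axis=1)                       # top-class probability
    pseudo_labels = model.classes_[proba.argmax(axis=1)]  # predicted class labels

    # Keep only predictions above the confidence threshold.
    keep = confidence >= threshold
    return model, X_unlabeled[keep], pseudo_labels[keep]
```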
Next, the pseudo-labeled points are merged with the original labelled data, so that the combined dataset contains both the initial annotations and the freshly generated pseudo-labels. The model is then retrained on this combined dataset, which enables it to learn from the newly generated labels as well as the original ones.
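Continuing the sketch above (reusing the same hypothetical helpers and imports), merging and retraining might look like this:

```python
def retrain_on_combined(X_labeled, y_labeled, X_pseudo, y_pseudo):
    """Retrain the model on the union of labelled and pseudo-labeled data."""
    X_combined = np.vstack([X_labeled, X_pseudo])
    y_combined = np.concatenate([y_labeled, y_pseudo])
    model = RandomForestClassifier(random_state=0)
    model.fit(X_combined, y_combined)
    return model
```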
The process then repeats: the updated model makes predictions on the remaining unlabeled data, fresh pseudo-labels are generated, and the model is retrained on the enlarged dataset. Over successive iterations this can gradually improve the model's performance (see Figure 1); an end-to-end sketch of the loop is given below.
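Putting the pieces together, a hedged end-to-end version of the iterative loop, reusing the imports and assumptions from the sketches above (the iteration count, threshold, and model choice are all illustrative):

```python
def pseudo_label_loop(X_labeled, y_labeled, X_unlabeled,
                      n_iterations=5, threshold=0.9):
    """Iterative pseudo-labeling: label confident points, retrain, repeat."""
    model = RandomForestClassifier(random_state=0)
    model.fit(X_labeled, y_labeled)

    for _ in range(n_iterations):
        if len(X_unlabeled) == 0:
            break
        proba = model.predict_proba(X_unlabeled)
        confidence = proba.max(axis=1)
        keep = confidence >= threshold
        if not keep.any():
            break  # no confident predictions left to add

        # Move confident points from the unlabeled pool to the labelled set.
        new_labels = model.classes_[proba.argmax(axis=1)][keep]
        X_labeled = np.vstack([X_labeled, X_unlabeled[keep]])
        y_labeled = np.concatenate([y_labeled, new_labels])
        X_unlabeled = X_unlabeled[~keep]

        # Retrain on the enlarged labelled set.
        model = RandomForestClassifier(random_state=0)
        model.fit(X_labeled, y_labeled)

    return model
```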
Figure 1. In the upper part, an initial model is trained using a limited set of labelled data. In the lower part, pseudo-labels are generated for the unlabeled dataset from this model's predictions. The labelled dataset is then combined with the pseudo-labeled dataset to update and fine-tune the initial model. The training procedure alternates between these two stages and continues until a predefined number of iterations is reached.