Semi-supervised learning explained


Jeff Bezos wrote something interesting about Alexa, Amazon’s voice-driven intelligent assistant:

In the U.S., U.K., and Germany, we’ve improved Alexa’s spoken language understanding by more than 25% over the last 12 months through enhancements in Alexa’s machine learning components and the use of semi-supervised learning techniques. (These semi-supervised learning techniques reduced the amount of labeled data needed to achieve the same accuracy improvement by 40 times!)

Given those results, it might be interesting to try semi-supervised learning on our own classification problems. But what is semi-supervised learning? What are its advantages and disadvantages? How can we use it?

What is semi-supervised learning?

As you might expect from the name, semi-supervised learning is intermediate between supervised learning and unsupervised learning. Supervised learning starts with training data that are tagged with the correct answers (target values). After the learning process, you wind up with a model with a tuned set of weights, which can predict answers for similar data that haven’t already been tagged.

Semi-supervised learning uses both tagged and untagged data to fit a model. In some cases, such as Alexa’s, adding the untagged data actually improves the accuracy of the model. In other cases, the untagged data can make the model worse; different algorithms have vulnerabilities to different data characteristics, as I’ll discuss below.

In general, tagging data costs money and takes time. That isn’t always an issue, since some data sets already have tags. But if you have a lot of data, only some of which is tagged, then semi-supervised learning is a good technique to try.

Semi-supervised learning algorithms

Semi-supervised learning goes back at least 15 years, possibly more; Jerry Zhu of the University of Wisconsin wrote a literature survey on the topic. Semi-supervised learning has had a resurgence in recent years, not only at Amazon, because it reduces the error rate on important benchmarks.

Some semi-supervised learning algorithms work by creating proxy labels for the untagged data. These include self-training, multi-view learning, and self-ensembling.

Self-training uses a model’s own predictions on unlabeled data to add to the labeled data set. You essentially set some threshold for the confidence level of a prediction, often 0.5 or higher, above which you believe the prediction and add it to the labeled data set. You keep retraining the model on the enlarged data set until none of the remaining unlabeled examples gets a confident prediction.
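To make the self-training loop concrete, here is a minimal sketch in plain Python. Everything in it is an assumption chosen for illustration, not something from Amazon or any particular library: the "model" is a toy nearest-centroid classifier, the confidence score is a softmax over negative distances, and the names `centroids`, `predict_proba`, and `self_train` as well as the 0.8 threshold are hypothetical.

```python
import math

def centroids(points, labels):
    """Fit the toy model: the mean of the 2-D points in each class."""
    sums, counts = {}, {}
    for (x, y), c in zip(points, labels):
        sx, sy = sums.get(c, (0.0, 0.0))
        sums[c] = (sx + x, sy + y)
        counts[c] = counts.get(c, 0) + 1
    return {c: (sums[c][0] / counts[c], sums[c][1] / counts[c]) for c in sums}

def predict_proba(model, point):
    """Confidence per class: softmax over negative distance to each centroid."""
    scores = {c: math.exp(-math.dist(point, m)) for c, m in model.items()}
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

def self_train(labeled, labels, unlabeled, threshold=0.8, max_rounds=10):
    """Repeatedly fit, pseudo-label confident points, and refit."""
    labeled, labels, pool = list(labeled), list(labels), list(unlabeled)
    for _ in range(max_rounds):
        model = centroids(labeled, labels)
        still_unlabeled = []
        added = 0
        for p in pool:
            probs = predict_proba(model, p)
            best = max(probs, key=probs.get)
            if probs[best] >= threshold:
                # Believe the prediction and move the point to the labeled set.
                labeled.append(p)
                labels.append(best)
                added += 1
            else:
                still_unlabeled.append(p)
        pool = still_unlabeled
        if added == 0 or not pool:
            break  # no confident predictions left, or nothing left to label
    return centroids(labeled, labels), pool

# Two tagged points and five untagged ones (made-up data).
tagged = [(0, 0), (4, 4)]
tags = ["a", "b"]
untagged = [(0.5, 0.5), (1, 0), (3.5, 4), (4, 3), (2, 2)]
model, leftover = self_train(tagged, tags, untagged)
print(model, leftover)
```

In this toy run the point halfway between the two clusters never reaches the threshold, so it stays unlabeled. A lower threshold would label it too, at the risk of training on a wrong proxy label.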

Zhu also considers a number of other algorithms. These include generative models (such as ones that assume a Gaussian distribution for each class), semi-supervised support vector machines, and graph-based algorithms.
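As one illustration of the graph-based family, the following is a simplified label propagation sketch in plain Python. It assumes a fully connected graph with Gaussian edge weights, clamps the tagged points to their known labels, and lets each untagged point repeatedly take the weighted average of its neighbors' label scores. The function name `label_propagation` and the `sigma` and `iterations` parameters are hypothetical, and this is only one of several graph-based formulations.

```python
import math

def label_propagation(points, labels, sigma=1.0, iterations=50):
    """labels[i] is a class name, or None for an untagged point."""
    classes = sorted({c for c in labels if c is not None})
    n = len(points)
    # Edge weights: Gaussian kernel on Euclidean distance, no self-loops.
    w = [[0.0 if i == j else
          math.exp(-math.dist(points[i], points[j]) ** 2 / (2 * sigma ** 2))
          for j in range(n)] for i in range(n)]

    def initial(c):
        # Tagged points start one-hot; untagged points start uniform.
        if c is None:
            return [1.0 / len(classes)] * len(classes)
        return [1.0 if k == c else 0.0 for k in classes]

    f = [initial(c) for c in labels]
    for _ in range(iterations):
        new_f = []
        for i in range(n):
            if labels[i] is not None:
                new_f.append(f[i])  # clamp tagged points to their labels
                continue
            total = sum(w[i])
            new_f.append([sum(w[i][j] * f[j][k] for j in range(n)) / total
                          for k in range(len(classes))])
        f = new_f
    # Report the highest-scoring class for every point.
    return [classes[max(range(len(classes)), key=lambda k: f[i][k])]
            for i in range(n)]

# Six 1-D points, only the two end points tagged (made-up data).
points = [(0.0,), (1.0,), (2.0,), (5.0,), (6.0,), (7.0,)]
labels = ["low", None, None, None, None, "high"]
print(label_propagation(points, labels))
```

The labels spread outward from the two tagged points along the dense parts of the graph. This is also where such methods are vulnerable: if the classes are not separated by a low-density gap, labels can leak across the boundary.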

Semi-supervised learning in the cloud

Semi-supervised learning is slowly making its way into mainstream machine learning services. For example, Amazon SageMaker Ground Truth uses Amazon Mechanical Turk for manual labeling and boundary determination of part of an image set and uses neural network training to label the rest of the image set.

Similar schemes can be used for other kinds of semi-supervised learning, including classification and regression, on several services. However, you’ll have to write your own glue code for the semi-supervised algorithm on most of them.
