You’re not a data scientist. According to the tech and business press, machine learning will supposedly stop global warming. Maybe machine learning can find fake news (a classification problem)?
But what can machine learning do for you? And how will you find out? There’s a good place to start close to home, if you’re already using Apache Spark for batch and stream processing. Along with Spark SQL and Spark Streaming, which you’re probably already using, Spark provides MLlib, which is, among other things, a library of machine learning and statistical algorithms in API form.
Here is a brief guide to four of the most essential MLlib APIs, what they do, and how you might use them.
Mainly you’ll use the basic statistics APIs for A-B testing or A-B-C testing. Frequently in business we assume that if two averages are the same, then the two things are roughly equivalent. That isn’t necessarily true. Consider a car manufacturer that replaces the seat in a car and surveys customers on how comfortable it is. At one end, the shorter customers may say the seat is much more comfortable. At the other end, taller customers may say it is so uncomfortable that they wouldn’t buy the car, and the people in the middle balance out the difference. On average the new seat might be slightly more comfortable, but if no one over 6 feet tall buys the car anymore, we’ve failed somehow. MLlib’s hypothesis testing allows you to do a chi-squared test or a Kolmogorov–Smirnov test to see how well something “fits” or whether the distribution of values is “normal.” This can be used almost anywhere we have two series of data. That “fit” might be “did you like it” or whether the new algorithm provided “better” results than the old one. You’re just in time to enroll in a course on Coursera.
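To make the car seat point concrete, here is a minimal sketch in plain Python. It is not the MLlib API. It computes a two-sample Kolmogorov–Smirnov statistic by hand, meaning the largest gap between the two empirical distributions. The comfort scores and the `ks_statistic` helper are made up for illustration.

```python
from bisect import bisect_right
from statistics import mean


def ks_statistic(a, b):
    """Largest gap between the empirical CDFs of two samples."""
    sa, sb = sorted(a), sorted(b)
    gap = 0.0
    for x in sa + sb:
        cdf_a = bisect_right(sa, x) / len(sa)  # share of a that is <= x
        cdf_b = bisect_right(sb, x) / len(sb)  # share of b that is <= x
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


# Hypothetical comfort scores on a 1-10 scale (invented data).
old_seat = [5, 6, 6, 6, 6, 6, 6, 7]    # everyone is lukewarm
new_seat = [2, 2, 3, 6, 6, 9, 10, 10]  # some love it, some hate it

print("means:", mean(old_seat), mean(new_seat))
print("KS statistic:", ks_statistic(old_seat, new_seat))
```

The two averages come out the same, yet the statistic is well above zero, which is the signal that the two sets of answers are distributed differently. Turning that number into a p-value is the part you would leave to a library.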
Classification is what it sounds like. If you think of someone looking through a set of forms and sorting them into categories, this is classification. You’ve run into this with spam filters, which use a list of words spam usually has. You may also be able to determine which customers are likely to cancel their subscriptions (people who don’t watch live sports). Essentially, classification “learns” to label things based on labels applied to past data and can apply those labels in the future. In Coursera’s Machine Learning Specialization there is a course on classification that started on July 10, but I’m sure you can still get in.
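The spam filter idea can be sketched in a few lines of plain Python. This is a toy multinomial naive Bayes classifier with add-one smoothing, written from scratch, and is not MLlib’s implementation. The messages, the labels, and the `train` and `classify` helpers are all invented for illustration.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Count words per label from past, already-labeled messages."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in labeled_docs:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts


def classify(text, word_counts, label_counts):
    """Return the label with the highest log-probability for this text."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total_docs)  # prior from label frequency
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Add-one smoothing so unseen words do not zero out the score.
            seen = word_counts[label][word] + 1
            score += math.log(seen / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training data: labels applied to past messages.
past = [
    ("win free money now", "spam"),
    ("free prize click now", "spam"),
    ("meeting notes for monday", "ham"),
    ("lunch on monday", "ham"),
]
model = train(past)
print(classify("free money prize", *model))
print(classify("notes for lunch", *model))
```

The “learning” is nothing more than counting which words showed up under which label, and the labeling of new messages reuses those counts.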
Clustering is often used to sort people into groups. The big difference between “clustering” and “classification” is that we don’t know the labels (or groups) up front for clustering. We do for classification. Customer segmentation is a very common use. There are different flavors of that, such as sorting customers into credit or retention risk groups, or into buying groups (fresh produce or prepared foods), but it is also used for things like fraud detection. Here’s a course with a lecture series specifically on clustering and yes, they cover the material for that next interview, but I find it slightly creepy when half the professor floats over the board (you’ll see what I mean).
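As an illustration of grouping without labels, here is a bare-bones k-means (Lloyd’s algorithm) in plain Python. It is a sketch of the general technique, not MLlib’s implementation. The weekly spend figures and the `assign` and `kmeans` helpers are hypothetical, and the starting centroids are simply the first `k` points, which a real library would choose more carefully.

```python
from math import dist


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
        for p in points
    ]


def kmeans(points, k, iterations=10):
    """Alternate between assigning points and moving centroids to the mean."""
    centroids = list(points[:k])  # naive, deterministic starting guess
    for _ in range(iterations):
        labels = assign(points, centroids)
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:  # leave the centroid alone if its cluster is empty
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assign(points, centroids)


# Hypothetical weekly spend per customer: (fresh produce, prepared foods).
shoppers = [(40, 5), (6, 35), (45, 8), (4, 42), (38, 3), (8, 38)]
centroids, labels = kmeans(shoppers, k=2)
print(labels)
print(centroids)
```

Nobody told the code which shoppers were “produce” or “prepared food” buyers. The groups fall out of the distances, and naming them is up to you.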
Collaborative filtering is a popularity contest. The company I work for uses this to improve search results. If enough people click on the second cat picture, it must be better than the first cat picture. In a social or e-commerce setting, you can use the same kind of signal to figure out which is the “best” result for most users or even specific sets of people. This can be done on multiple properties for recommender systems. You see this on Google Maps or Yelp when you search for restaurants (you can then filter by service, food, decor, good for kids, romantic, nice view, cost). There is a course on the subject, which started on July 10 (but you can still get in).
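Here is a small sketch of the idea in plain Python. It uses a neighborhood approach, weighting other users’ ratings by how similar their tastes are to yours. That is a different and simpler technique than the matrix factorization a library would typically use, and is shown only to convey the intuition. The restaurant ratings and the `cosine` and `recommend` helpers are invented.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity over the items both users rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = sqrt(sum(b[i] ** 2 for i in shared))
    return dot / (norm_a * norm_b)


def recommend(user, ratings):
    """Rank items the user has not rated by similarity-weighted ratings."""
    scores, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], other_ratings)
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
                weights[item] = weights.get(item, 0.0) + sim
    predicted = {i: scores[i] / weights[i] for i in scores if weights[i] > 0}
    return sorted(predicted, key=predicted.get, reverse=True)


# Hypothetical 1-5 ratings (invented data).
ratings = {
    "ana": {"tacos": 5, "sushi": 1},
    "ben": {"tacos": 5, "sushi": 1, "pizza": 5, "ramen": 2},
    "cat": {"tacos": 1, "sushi": 5, "pizza": 1, "ramen": 5},
}
print(recommend("ana", ratings))
```

Because ana’s ratings line up with ben’s, ben’s opinions count for more when ranking the places ana hasn’t tried. That is the “specific sets of people” part of the popularity contest.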
This is not all you can do (by far) but these are some of the common uses along with the algorithms to accomplish them. Within each of these broad categories are often several alternative algorithms or derivatives of algorithms. Which to pick? Well, that’s a combination of mathematical background, experimentation, and knowing the data. Remember, just because you get the algorithm to run doesn’t mean the result isn’t nonsense.
A machine learning course on Coursera is a good place to start, despite the creepy floating half-professor.