Apache PredictionIO: Easier machine learning with Spark


The Apache Software Foundation has added a new machine learning project to its roster: Apache PredictionIO, an open source version of a project originally devised by a subsidiary of Salesforce.

What PredictionIO does for machine learning and Spark

Apache PredictionIO is built atop Spark and Hadoop, and serves Spark-powered predictions from data using templates for common tasks. Apps send data to PredictionIO’s event server to train a model, then query the engine for predictions based on the model.
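The send-then-query flow can be sketched with Python's standard library. This is a minimal sketch rather than the official SDK: the hosts and ports (7070 for the event server, 8000 for a deployed engine), the access key, the `rate` event, and the `user`/`num` query fields are assumptions modeled on a typical recommendation setup, and both helper names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Assumed local endpoints; adjust these to match the actual deployment.
EVENT_SERVER = "http://localhost:7070"
ENGINE_SERVER = "http://localhost:8000"


def build_event_request(access_key, user_id, item_id, rating):
    """Build (but do not send) a POST that records one rating event."""
    payload = {
        "event": "rate",              # assumed event name
        "entityType": "user",
        "entityId": user_id,
        "targetEntityType": "item",
        "targetEntityId": item_id,
        "properties": {"rating": rating},
    }
    query_string = urlencode({"accessKey": access_key})
    return Request(
        f"{EVENT_SERVER}/events.json?{query_string}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_query_request(user_id, num):
    """Build (but do not send) a POST that asks the engine for predictions."""
    payload = {"user": user_id, "num": num}  # assumed query fields
    return Request(
        f"{ENGINE_SERVER}/queries.json",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

With both servers running, either request could be passed to `urllib.request.urlopen` to send it; the functions here only construct the requests so the payloads can be inspected first.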

Spark, MLlib, HBase, Spray, and Elasticsearch all come bundled with PredictionIO, and Apache offers supported SDKs for working in Java, PHP, Python, and Ruby. Data can be stored in a variety of back ends: JDBC, Elasticsearch, HBase, HDFS, and local file systems are all supported out of the box. Back ends are pluggable, so a developer can create a custom back-end connector.

How PredictionIO templates make it easier to serve predictions from Spark

PredictionIO’s most notable advantage is its template system for creating machine learning engines. Templates reduce the heavy lifting needed to set up the system to serve specific kinds of predictions. They describe any third-party dependencies that might be needed for the job, such as a machine learning app framework.

 enhancements for Spark.

PredictionIO can also evaluate an engine to determine the best hyperparameters to use with it. The developer needs to define the metrics for how to do this, but there’s generally less work involved in doing this than in tuning hyperparameters by hand.
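The idea behind metric-driven tuning can be illustrated with a small grid search. This is a generic sketch in plain Python, not PredictionIO's own evaluation API; `grid_search`, `train_line`, and `neg_mse` are hypothetical names, and the toy "model" is just a line with a slope and an intercept.

```python
from itertools import product


def grid_search(train, metric, param_grid, train_data, eval_data):
    """Try every parameter combination and return the best one with its score."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        model = train(train_data, **params)
        score = metric(model, eval_data)  # higher is better
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def train_line(data, slope, intercept):
    """Toy trainer: ignores the data and returns a fixed line as the model."""
    return lambda x: slope * x + intercept


def neg_mse(model, data):
    """Developer-chosen metric: negative mean squared error."""
    return -sum((model(x) - y) ** 2 for x, y in data) / len(data)


points = [(0, 1), (1, 3), (2, 5)]  # samples of y = 2x + 1
best, score = grid_search(
    train_line,
    neg_mse,
    {"slope": [1, 2, 3], "intercept": [0, 1]},
    points,
    points,
)
```

The developer's only real decisions are the metric and the candidate values; the loop over combinations is mechanical, which is why automating it saves effort compared with tuning by hand.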

When running as a service, PredictionIO can accept prediction queries singly or in batches. Batched predictions are automatically parallelized across a Spark cluster, as long as the algorithms used in a batch prediction job are all serializable. (PredictionIO’s default algorithms are.)
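A batch of queries might be prepared as a text file with one JSON object per line. The sketch below assumes that newline-delimited layout and the `user`/`num` query fields; both are assumptions to check against the engine in use, and the helper names are hypothetical.

```python
import json


def to_batch_lines(queries):
    """Serialize queries as newline-delimited JSON, one query per line."""
    return "".join(json.dumps(q, sort_keys=True) + "\n" for q in queries)


def from_batch_lines(text):
    """Parse newline-delimited JSON back into a list of query dicts."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


queries = [{"user": "u1", "num": 3}, {"user": "u2", "num": 5}]
batch_text = to_batch_lines(queries)
```

Because each line is an independent query, a cluster can split the file and score the pieces in parallel without any coordination between them.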
