Apache Spark 2.2 gets streaming, R language boosts


With version 2.2 of Apache Spark, a long-awaited feature of the data processing framework is now available for production use.

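Structured Streaming, as that feature is called, allows Spark to process streaming data with the same DataFrame and Dataset APIs it uses for batch jobs. It’s part of Spark’s long-term push to become, if not all things to all people in data science, then at least the best thing for most of them.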

Structured Streaming in 2.2 gains more than the loss of its experimental designation. It can now act as both a source and a sink for Apache Kafka, reading data from and writing data to Kafka topics, with lower latency for Kafka connections than previously.
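As a rough illustration of the source-and-sink arrangement described above, here is a minimal PySpark sketch that reads records from one Kafka topic, uppercases each value, and writes the result to another topic. The broker address, topic names, checkpoint path, and the helper names `kafka_options` and `run` are all assumptions made up for this example, and it presumes the `spark-sql-kafka-0-10` connector package is available to Spark.

```python
def kafka_options(bootstrap_servers, topic, role):
    """Build the option map for a Kafka source or sink (hypothetical helper)."""
    opts = {"kafka.bootstrap.servers": bootstrap_servers}
    if role == "source":
        opts["subscribe"] = topic   # topic to read from
    elif role == "sink":
        opts["topic"] = topic       # topic to write to
    else:
        raise ValueError("role must be 'source' or 'sink'")
    return opts


def run(bootstrap_servers="localhost:9092",           # assumed broker address
        in_topic="events-in",                         # assumed topic names
        out_topic="events-out",
        checkpoint_dir="/tmp/spark-kafka-checkpoint"):  # assumed path
    # Imported here so kafka_options() can be used without Spark installed.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, upper

    spark = SparkSession.builder.appName("kafka-roundtrip-sketch").getOrCreate()

    # Source: each Kafka record arrives as a row with binary key/value columns.
    records = (spark.readStream.format("kafka")
               .options(**kafka_options(bootstrap_servers, in_topic, "source"))
               .load())

    # Transform the stream with the same DataFrame API used for batch jobs.
    shouted = records.select(
        col("key"),
        upper(col("value").cast("string")).alias("value"))

    # Sink: the Kafka writer needs a "value" column and a checkpoint location.
    query = (shouted.writeStream.format("kafka")
             .options(**kafka_options(bootstrap_servers, out_topic, "sink"))
             .option("checkpointLocation", checkpoint_dir)
             .start())
    query.awaitTermination()
```

Calling `run()` from a script submitted with `spark-submit` would start the streaming query and block until it is stopped. The option-building helper is kept separate from the Spark calls so the connection settings can be checked without a running cluster.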

Kafka, itself an Apache Software Foundation project, is a distributed messaging bus widely used in streaming applications. Kafka has typically been paired with another stream-processing framework, Apache Storm, but Storm is limited to stream processing only, and Spark presents less complex APIs to the developer.

than running Spark batch jobs intermittently.

The native collection of machine learning libraries in Spark, MLlib, has been outfitted with new algorithms for analyzing datasets (e.g., predicting which current hit movie a person in a given demographic category will probably like best). Machine learning is a common use case for Spark.

Machine learning in Spark also gets a major boost from improved support for the R language. Earlier versions of Spark had wider support for Java and Python than for R, but Spark 2.2 adds R support for 10 distributed algorithms. Structured Streaming and the Catalog API (used for accessing query metadata in Spark SQL) can now also be used from R.