Apache Beam, a unified programming model for both batch and streaming data, has graduated from the Apache Incubator to become a top-level Apache project.
Aside from becoming another full-fledged widget in the ever-expanding Apache tool belt of big-data processing software, Beam addresses ease of use and dev-friendly abstraction, rather than simply offering raw speed or a wider array of included processing algorithms.
Beam us up!
Beam provides a single programming model for creating batch and stream processing jobs (the name is a hybrid of “batch” and “stream”), and it offers a layer of abstraction for dispatching to the various engines used to run those jobs. The project originated at Google, where it’s currently a service called GCD (Google Cloud Dataflow). Beam uses the same API as GCD, and it can use GCD as an execution engine, along with Apache Spark and other engines, including a stream processing engine with a highly memory-efficient design and, now, one for working closely with Hadoop deployments.
The Beam model involves five components: the pipeline (the pathway for data through the program); the “PCollections,” or data streams themselves; the transforms, for processing data; the sources and sinks, where data is fetched and eventually sent; and the “runners,” or components that allow the whole thing to be executed on an engine.
One attraction of Beam, noted in early 2016, is that it makes migrations between processing systems less of a headache. Likewise, Apache says Beam “cleanly [separates] the user’s processing logic from details of the underlying engine.”
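To make the five components and the separation between processing logic and engine more concrete, here is a minimal toy sketch in plain Python. It is not the Apache Beam SDK: every class and function name below (Pipeline, PCollection, AllAtOnceRunner, ChunkedRunner, build_pipeline) is a hypothetical stand-in, and the "engines" are just two ways of looping over an in-memory list. The point is only that the same pipeline description can be handed to different runners.

```python
from typing import Callable, Iterable, List

# Toy, in-memory model of the five concepts. All names are hypothetical;
# this is NOT the Apache Beam SDK.

# A transform maps one element to zero or more output elements.
Transform = Callable[[str], Iterable[str]]


class PCollection:
    """The data moving through the pipeline (here, a plain list)."""

    def __init__(self, elements: Iterable[str]) -> None:
        self.elements: List[str] = list(elements)


class Pipeline:
    """Describes source, transforms, and sink; it does not execute anything."""

    def __init__(self, source: Callable[[], Iterable[str]],
                 sink: Callable[[List[str]], None]) -> None:
        self.source = source
        self.sink = sink
        self.transforms: List[Transform] = []

    def apply(self, transform: Transform) -> "Pipeline":
        self.transforms.append(transform)
        return self


def _apply_all(pipeline: Pipeline, elements: Iterable[str]) -> PCollection:
    """Push a group of elements through every transform, in order."""
    pcoll = PCollection(elements)
    for transform in pipeline.transforms:
        pcoll = PCollection(out for el in pcoll.elements for out in transform(el))
    return pcoll


class AllAtOnceRunner:
    """Runner that processes the whole input in one pass (batch-like)."""

    def run(self, pipeline: Pipeline) -> None:
        pipeline.sink(_apply_all(pipeline, pipeline.source()).elements)


class ChunkedRunner:
    """Runner that processes the input in small chunks (loosely stream-like)."""

    def __init__(self, chunk_size: int = 2) -> None:
        self.chunk_size = chunk_size

    def run(self, pipeline: Pipeline) -> None:
        items = list(pipeline.source())
        output: List[str] = []
        for i in range(0, len(items), self.chunk_size):
            chunk = items[i:i + self.chunk_size]
            output.extend(_apply_all(pipeline, chunk).elements)
        pipeline.sink(output)


def build_pipeline(sink: Callable[[List[str]], None]) -> Pipeline:
    """The user's processing logic, written once, with no runner in sight."""
    lines = ["batch and stream", "beam"]
    return (Pipeline(source=lambda: lines, sink=sink)
            .apply(lambda line: line.split())      # split lines into words
            .apply(lambda word: [word.upper()]))   # uppercase each word


results_a: List[str] = []
results_b: List[str] = []
AllAtOnceRunner().run(build_pipeline(results_a.extend))
ChunkedRunner(chunk_size=1).run(build_pipeline(results_b.extend))
```

In this sketch, switching engines means changing only the runner passed the pipeline; build_pipeline stays untouched, which is the kind of migration the Beam model is meant to simplify.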
Separation of concerns and ease of migration will be good to have if the ongoing rivalries and competition among the various big data processing engines continue. Granted, Apache Spark has emerged as one of the undisputed champs of the field and become a de facto standard choice. But there’s always room for improvement or an entirely new streaming or processing paradigm. Beam is less about offering a specific alternative than about providing developers and data-wranglers with more breadth of choice among them.