Stateful stream processing with Apache Flink


Stream processing is rapidly growing in adoption and expanding to more and more use cases as the technology matures. While in the early days stream processors were used to compute approximate aggregates, today’s stream processors are capable of powering accurate analytics applications and evaluating complex business logic on high-throughput streams. One of the most important aspects of stream processing is state handling—i.e., remembering past input and using it to influence the processing of future input.
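To make the idea of state handling concrete, here is a minimal plain-Python sketch (illustrative only, not a Flink API): an operator that remembers past input, a per-key counter, and uses it to shape the output for each new event.

```python
# Stateful stream processing in miniature: a keyed running count.
# The per-key counters are the operator's state; each output depends
# on events seen earlier in the stream.
from collections import defaultdict
from typing import Iterable, Iterator, Tuple

def keyed_running_count(events: Iterable[str]) -> Iterator[Tuple[str, int]]:
    state = defaultdict(int)   # operator state: one counter per key
    for key in events:
        state[key] += 1        # update state with the new event
        yield key, state[key]  # emit a result influenced by past input

clicks = ["page_a", "page_b", "page_a", "page_a"]
print(list(keyed_running_count(clicks)))
# [('page_a', 1), ('page_b', 1), ('page_a', 2), ('page_a', 3)]
```

A stateless operator, by contrast, could only look at each event in isolation; the counter is what turns the stream of clicks into a meaningful aggregate.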

In this article, the first of a two-part series, we will discuss stateful stream processing as a design pattern for real-time analytics and event-driven applications. As we’ll see, stream processing has evolved to support a huge array of use cases, spanning both analytical and transactional applications. We will also introduce Apache Flink and the features that have been specifically designed to run stateful streaming applications. The second article of our series will be more hands-on and show in detail how to implement stateful stream processing applications with Flink.

What is stateful stream processing?

Virtually all business-relevant data is produced as a stream of events. Sensor measurements, website clicks, interactions with mobile applications, database modifications, application and machine logs, stock trades, and financial transactions: all of these are examples of continuously generated data. In fact, there are very few bounded data sets that are produced all at once rather than being recorded from a stream of data.

If we look at how data is accessed and processed, we can identify two classes of problems: 1) use cases in which the data changes faster than the processing logic and 2) use cases in which the code or query changes faster than the data. The first scenario is a stream processing problem, while the latter indicates a data exploration problem. Examples of data exploration use cases include offline data analysis, data mining, and data science tasks. The clear separation of data streaming and data exploration problems leads to the insight that the majority of all production applications address stream processing problems. However, only a few of them are implemented using stream processing technology today.


Introducing Apache Flink

Apache Flink is a distributed data processor that has been specifically designed to run stateful computations over data streams. Its runtime is optimized for processing unbounded data streams as well as bounded data sets of any size. Flink is able to scale computations to thousands of cores and thereby process data streams with high throughput at low latency. Flink applications can be deployed to resource managers including Hadoop YARN, Apache Mesos, and Kubernetes, or to stand-alone Flink clusters. Fault tolerance is a very important aspect of Flink, as it is for any distributed system. Flink can operate in a highly available mode without a single point of failure and can recover stateful applications from failures with exactly-once state consistency guarantees. Moreover, Flink offers many features to ease the operational aspects of running stream processing applications in production. It integrates nicely with existing logging and metrics infrastructure and provides a REST API to submit and control running applications.

Flink features multiple APIs with different trade-offs between expressiveness and conciseness for implementing stream processing applications. The DataStream API is the basic API and provides familiar primitives found in other data-parallel processing frameworks, such as map, flatMap, split, and union. These primitives are extended by common stream processing operations, such as windowed aggregations, joins, and an operator for asynchronous requests against external data stores.
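One of the operations mentioned above, a windowed aggregation, can be sketched in plain Python (illustrative only; Flink's actual window operators handle parallelism, out-of-order events, and state for you). This version sums values in tumbling windows of a fixed size, assuming events arrive in timestamp order.

```python
# Sketch of a tumbling-window sum over (timestamp, value) events.
# Each window covers [start, start + size); when an event falls into
# a new window, the previous window's aggregate is emitted.
from typing import Iterable, Iterator, Tuple

def tumbling_window_sum(events: Iterable[Tuple[int, int]],
                        size: int) -> Iterator[Tuple[int, int]]:
    window_start, acc = None, 0
    for ts, value in events:
        start = (ts // size) * size       # window this event belongs to
        if window_start is None:
            window_start = start
        if start != window_start:         # window closed: emit and reset
            yield window_start, acc
            window_start, acc = start, 0
        acc += value
    if window_start is not None:          # flush the final window
        yield window_start, acc

stream = [(1, 10), (3, 20), (6, 5), (7, 5), (11, 1)]
print(list(tumbling_window_sum(stream, size=5)))
# [(0, 30), (5, 10), (10, 1)]
```

The accumulator held between events is, again, operator state; in Flink this state is managed by the runtime and survives failures.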

Flink’s ProcessFunctions are low-level interfaces that expose precise control over state and time. For example, a ProcessFunction can be implemented to store each received event in its state and register a timer for a future point in time. Later, when the timer fires, the function can retrieve the event and possibly other events from its state to perform a computation and emit a result. This fine-grained control over state and time enables a broad scope of applications.
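The buffer-and-timer pattern just described can be simulated in a few lines of plain Python (a logical-time sketch, not Flink's actual ProcessFunction interface; the class and method names here are hypothetical).

```python
# Simulation of the ProcessFunction pattern: buffer events in state,
# register a timer, and emit the buffered events when the timer fires.
class BufferAndFire:
    DELAY = 10  # hypothetical delay before the timer fires

    def __init__(self):
        self.state = []   # buffered events (the function's state)
        self.timers = []  # registered timer timestamps

    def process_element(self, event, timestamp):
        self.state.append(event)                    # remember the event
        self.timers.append(timestamp + self.DELAY)  # register a future timer

    def on_timer(self, fire_time):
        # when the timer fires, compute over the buffered events,
        # emit a result, and clear the state
        result = list(self.state)
        self.state.clear()
        return fire_time, result

fn = BufferAndFire()
fn.process_element("a", timestamp=1)
fn.process_element("b", timestamp=2)
print(fn.on_timer(fn.timers[0]))  # (11, ['a', 'b'])
```

In Flink, both the buffered events and the timers would be managed, fault-tolerant state, and the runtime would invoke the timer callback for you.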

Finally, Flink’s SQL support and Table API offer declarative interfaces to specify unified queries over bounded and unbounded data. This means that the same query can be executed with the same semantics on a bounded data set and a stream of real-time events. Both ProcessFunctions and SQL queries can be seamlessly integrated with the DataStream API, which gives developers maximum flexibility in choosing the right API.
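The unified semantics can be illustrated with a toy example (plain Python, not Flink SQL): a grouped count evaluated once over a complete, bounded data set and, equivalently, maintained incrementally as events stream in. Both views converge to the same answer.

```python
# The same "query" (count of events per key), evaluated two ways.
from collections import Counter

def batch_query(rows):
    # batch view: evaluate over the complete, bounded data set
    return dict(Counter(key for key, _ in rows))

def streaming_query(rows):
    # streaming view: maintain the same result incrementally, event by event
    state = {}
    for key, _ in rows:
        state[key] = state.get(key, 0) + 1
    return state

rows = [("page_a", 1), ("page_b", 2), ("page_a", 3)]
assert batch_query(rows) == streaming_query(iter(rows)) == {"page_a": 2, "page_b": 1}
```

The incrementally maintained dictionary is exactly the kind of query state that Flink's SQL runtime manages on the user's behalf.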

[Figure: Apache Flink APIs. (Image: Data Artisans)]

In addition to Flink’s core APIs, Flink features domain-specific libraries for graph processing and analytics as well as complex event processing (CEP). The CEP library provides an API to define and evaluate patterns on event streams. This pattern API can be used to monitor processes or raise alarms in case of unexpected sequences of events.

Streaming applications never run as isolated services. Instead, they need to ingest and typically also emit event streams. Apache Flink provides a rich library of connectors for the most commonly used stream and storage systems. Applications can ingest streams from or publish streams to Apache Kafka and Amazon Kinesis. Streams can also be ingested by reading files as they appear in directories, or persisted by writing events to bucketed files. Flink supports a number of different file systems, including HDFS, S3, and NFS. Moreover, Flink applications can “sink” data via JDBC (i.e., export it into a relational database) or insert it into Apache Cassandra and Elasticsearch.

Flink is powering business-critical applications in many companies around the world and across many industries, such as ecommerce, telecommunication, finance, gaming, and entertainment. Users are reporting applications that run on thousands of cores, maintain terabytes of state, and process billions of events per day. The open source community that is developing Flink is continuously growing and adding new users.

State management in Apache Flink

Many of Flink’s outstanding features revolve around the handling of application state. Flink always co-locates state and computation on the same machine, such that all state accesses are local. How the state is locally stored and accessed depends on the state back end that is configured for an application. State back ends are pluggable components. Flink provides implementations that store application state as objects on the JVM heap or serialized in RocksDB, an embedded, disk-based storage engine. While the heap-based back end provides in-memory performance, it is limited by the size of the available memory. In contrast, the RocksDB back end, which leverages RocksDB’s efficient LSM-tree-based disk storage, can easily maintain much larger state sizes.
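The pluggable state back end idea can be sketched as two implementations of one small get/put interface: one holding state as in-memory objects (fast, but bounded by memory) and one serializing state to an embedded on-disk store (larger state, at the cost of serialization). This is a plain-Python analogue using a dict and the standard-library dbm module, not Flink's heap or RocksDB back ends.

```python
# Two interchangeable "state back ends" behind the same interface.
import dbm
import os
import tempfile

class HeapBackend:
    """State lives as objects in memory: fastest access, memory-bound."""
    def __init__(self):
        self._store = {}
    def put(self, key, value):
        self._store[key] = value
    def get(self, key):
        return self._store.get(key)

class DiskBackend:
    """State serialized to an embedded on-disk store: larger state sizes."""
    def __init__(self, path):
        self._db = dbm.open(path, "c")
    def put(self, key, value):
        self._db[key.encode()] = value.encode()   # serialize on write
    def get(self, key):
        k = key.encode()
        return self._db[k].decode() if k in self._db else None

# The application logic is identical regardless of the configured back end.
for backend in (HeapBackend(),
                DiskBackend(os.path.join(tempfile.mkdtemp(), "state"))):
    backend.put("user42", "3 clicks")
    print(type(backend).__name__, backend.get("user42"))
```

The trade-off mirrors the one in the text: the heap-style back end avoids serialization entirely, while the disk-backed one pays a serialization cost per access in exchange for state far larger than available memory.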