I was in Berlin last week for Flink Forward, the inaugural Apache Flink conference. I’m still learning about Flink, and Flink Forward was a great place to learn more. In this post, I’ll share some of what I consider its coolest features and highlight some of the talks I especially enjoyed. Videos of the talks should all be online soon, so you’ll be able to check them out as well.
Background
Apache Flink is a data processing framework for the JVM that is most popular for streaming workloads with high throughput and low latency, although it is also general enough to support batch processing. Flink has a pleasant collection-style API, offers stateful elementwise transformations on streams (think of a fold function), can be configured to support fault-tolerance with exactly-once delivery, and does all of this while achieving extremely high performance. Flink is especially attractive for use in contemporary multitenant environments because it manages its own memory, and thus Flink jobs can run well in containers on overprovisioned systems (where CPU cycles may be relatively abundant but memory may be strictly constrained).
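To give a flavor of that collection-style API, here's a minimal streaming word count in the Scala DataStream API. This is a hedged sketch, not code from the conference: exact method names and imports vary across Flink versions, and the in-memory source stands in for a real one like Kafka.

```scala
import org.apache.flink.streaming.api.scala._

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // A toy in-memory stream; a real job might read from Kafka or a socket
    val lines: DataStream[String] = env.fromElements(
      "flink forward berlin", "stream processing with flink")

    val counts = lines
      .flatMap(_.toLowerCase.split("\\s+"))
      .map(word => (word, 1))
      .keyBy(_._1)   // partition the stream by word
      .sum(1)        // emit a running count per word

    counts.print()
    env.execute("streaming word count")
  }
}
```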
Keynotes and lightning talks
Kostas Tzoumas and Stephan Ewen (both of data Artisans) shared a keynote in which they presented the advancements in Flink 0.10 (to be released soon) and shared the roadmap for the next release, which will be Flink 1.0. The most interesting parts of this keynote for me were the philosophical arguments for the generality and importance of stream processing in contemporary event-driven data applications. Many users of batch-processing systems simulate streaming workflows by explicitly encoding windows in the structure of their input data (e.g., by using one physical file or directory to correspond to a day, month, or year's worth of records) or by using various workarounds inspired by technical limitations (e.g., the “lambda architecture” or bespoke but narrowly-applicable stream processors). However, mature stream processing frameworks not only enable a wide range of applications that process live events, but they are also general enough to handle batch workloads as a special case (i.e., by processing a stream with only one window).¹
Of course, the workarounds that data engineers have had to adopt to handle streaming data in batch systems are only necessary given an absence of mature stream processing frameworks. The streaming space has improved a great deal recently, and this talk gave a clear argument that Flink was mature enough for demanding and complex applications. Flink offers a flexible treatment of time: events can be processed immediately (one at a time), in windows based on when the events arrived at the processor, or in windows based on when the events were actually generated (even if they arrived out of order). Flink supports failure recovery with exactly-once delivery but also offers extremely high throughput and low latency: a basic Flink stream processing application offers two orders of magnitude more throughput than an equivalent Storm application. Flink also provides a batch-oriented API with a collection-style interface and an optimizing query planner.
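As an illustration of that flexible treatment of time, here's a small sketch of event-time windowing in the Scala API. The Click type and its data are hypothetical, and API details differ between Flink releases; the point is that the windows below are defined by when events happened, not when they arrived.

```scala
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

// Hypothetical event type carrying its own timestamp
case class Click(userId: String, timestampMs: Long)

object EventTimeWindows {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

    val clicks: DataStream[Click] = env.fromElements(
      Click("alice", 1000L), Click("bob", 2000L), Click("alice", 65000L))

    val counts = clicks
      // tell Flink where to find each event's timestamp
      .assignAscendingTimestamps(_.timestampMs)
      .map(c => (c.userId, 1))
      .keyBy(_._1)
      // one-minute windows in event time, not arrival time
      .window(TumblingEventTimeWindows.of(Time.minutes(1)))
      .sum(1)

    counts.print()
    env.execute("event-time windows")
  }
}
```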
After the keynote, there were several lightning talks. Lightning talks at many events are self-contained (and often speculative, provocative, or describing promising work in progress). However, these lightning talks were abbreviated versions of talks on the regular program. In essence, they were ads for talks to see later (think of how academic CS conference talks are really ads for papers to read later). This was a cool idea and definitely helped me navigate a two-track program that was full of interesting abstracts.
Everyday Flink
Michael Häusler of ResearchGate gave a talk about the process of evaluating new data processing frameworks, focusing in particular on determining whether a framework makes simple tasks simple. (Another step, following Alan Kay’s famous formulation, is determining whether or not a framework makes complex tasks possible.) The “simple task” that Häusler set out to solve was finding the top 5 coauthors for every author in a database of publications; he implemented this task in Hive (running on Tez), Hadoop MapReduce, and Flink. Careful readers will note that this is not really a fair fight: SQL and HiveQL do not admit straightforward implementations of top-k queries, and MapReduce applications are not known for elegant and terse codebases; indeed, Häusler acknowledged as much. However, it was still impressive to see how little code was necessary to solve this problem with Flink, especially when contrasted with the boilerplate of MapReduce or all of the machinery to implement a user-defined aggregate function to support top-k in Hive. The Flink solution was also twice as fast as the custom MapReduce implementation, which was in turn faster than Hive on Tez.
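I didn't take down Häusler's actual code, but a Flink DataSet solution to the coauthor problem could plausibly be about as compact as the following sketch. The schema is an assumption on my part: one (publicationId, authorId) pair per authorship record.

```scala
import org.apache.flink.api.scala._
import org.apache.flink.api.common.operators.Order

object TopCoauthors {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // Hypothetical input: one (publicationId, authorId) pair per authorship
    val authorships: DataSet[(Long, String)] = env.fromElements(
      (1L, "a"), (1L, "b"), (1L, "c"), (2L, "a"), (2L, "b"))

    val topCoauthors = authorships
      // self-join on publication to enumerate coauthor pairs
      .join(authorships).where(0).equalTo(0)
      .apply((l, r) => (l._2, r._2))
      .filter(p => p._1 != p._2)
      // count how often each pair collaborated
      .map(p => (p._1, p._2, 1))
      .groupBy(0, 1)
      .sum(2)
      // keep the five most frequent coauthors per author
      .groupBy(0)
      .sortGroup(2, Order.DESCENDING)
      .first(5)

    topCoauthors.print()
  }
}
```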
Declarative Machine Learning with the Samsara DSL
Sebastian Schelter introduced Samsara, a DSL for machine learning and linear algebra. Samsara supports in-memory vectors (both dense and sparse), in-memory matrices, and distributed row matrices, and provides an R-like syntax embedded in Scala for operations. The distributed row matrices are a unique feature of Samsara; they support only a subset of matrix operations (i.e., ones that admit efficient distributed implementations) and go through a process of algebraic optimization (including generating logical and physical plans) to minimize communication during execution. Samsara can target Flink, Spark, and H2O.
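To show the R-like flavor, here's a tiny sketch in the spirit of the Samsara (Apache Mahout) Scala DSL. Treat the imports and entry points as assumptions: they depend on the Mahout version and the chosen backend, and `drmParallelize` requires a distributed context in scope.

```scala
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm.RLikeDrmOps._

// An in-core dense matrix, constructed R-style
val x = dense((1.0, 2.0), (3.0, 4.0), (5.0, 6.0))

// Distribute it as a row matrix (assumes a distributed context is in scope)
val drmX = drmParallelize(x, numPartitions = 2)

// X^T X expressed algebraically; the optimizer generates logical and
// physical plans to minimize communication before anything executes
val xtx = (drmX.t %*% drmX).collect
```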
Streaming and parallel decision trees in Flink
Training decision trees in batch frameworks requires a view of the entire learning set (and sufficient training data to generate a useful tree). In streaming applications, each event is seen only once, the classifier must be available immediately (even if there is little data to train on), and the classifier should take feedback into account in real time. In this talk, Anwar Rizal of Amadeus presented a technique for training decision trees on streaming data by building and aggregating approximate histograms for incoming features and labels.
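To make the idea concrete, here's a hypothetical sketch of a mergeable approximate histogram in the spirit of such streaming tree techniques (this is my illustration, not Rizal's code): points are added as unit bins, and the histogram is compressed back to a fixed size by repeatedly merging the two closest bins into their weighted mean.

```scala
// Approximate histogram with a bounded number of (centroid, count) bins
case class Histogram(maxBins: Int, bins: Vector[(Double, Long)] = Vector.empty) {

  // Insert a point as a unit bin, then shrink back to maxBins
  def add(x: Double): Histogram =
    copy(bins = compress((bins :+ ((x, 1L))).sortBy(_._1)))

  // Merge another histogram (e.g., one built on a parallel partition)
  def merge(other: Histogram): Histogram =
    copy(bins = compress((bins ++ other.bins).sortBy(_._1)))

  // Repeatedly merge the two closest adjacent bins into their weighted mean
  private def compress(bs: Vector[(Double, Long)]): Vector[(Double, Long)] = {
    var current = bs
    while (current.size > maxBins) {
      val i = (0 until current.size - 1)
        .minBy(j => current(j + 1)._1 - current(j)._1)
      val (c1, n1) = current(i)
      val (c2, n2) = current(i + 1)
      val merged = ((c1 * n1 + c2 * n2) / (n1 + n2), n1 + n2)
      current = current.patch(i, Vector(merged), 2)
    }
    current
  }
}
```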
Juggling with bits and bytes – how Apache Flink operates on binary data
Applications using the Java heap often exhibit appalling memory efficiency; the heap footprint of Java library data structures can be 75% overhead or more. Since data processing applications frequently create, manipulate, and serialize many objects – some of which may be quite short-lived – there are potentially significant performance pitfalls to using the JVM directly for memory allocation. In this talk, Fabian Hueske of data Artisans presented Flink’s approach: combining a custom memory-management and serialization stack with algorithms that operate directly on serialized binary data. Flink jobs are thus more memory-efficient than programs that use the Java heap directly, exhibit predictable space usage, and handle running out of memory gracefully by spilling intermediate results to disk. In addition, Flink’s use of database-style algorithms to sort, filter, and join serialized data reduces computation and communication costs.
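As a toy illustration of the principle (not Flink's actual internals), the sketch below compares two records' fixed-width key prefixes directly in a byte buffer, the way a binary sorter can order serialized records without deserializing them. The record layout here is a made-up assumption.

```scala
import java.nio.ByteBuffer

object BinaryCompareSketch {
  // Hypothetical layout: fixed-size records, each starting with a
  // 4-byte big-endian key prefix
  val RecordSize = 12
  val PrefixLen = 4

  // Compare the key prefixes of records i and j without deserializing
  def compareKeys(buf: ByteBuffer, i: Int, j: Int): Int = {
    var k = 0
    while (k < PrefixLen) {
      val a = buf.get(i * RecordSize + k) & 0xff
      val b = buf.get(j * RecordSize + k) & 0xff
      if (a != b) return Integer.compare(a, b)
      k += 1
    }
    0
  }
}
```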
Stateful Stream Processing
Data processing frameworks like Flink and Spark support collection-style APIs where distributed collections or streams can be processed with operations like map, filter, and so on. In addition to these, it is useful to support transformations that include state, analogously to the fold function on local collections. Of course, fold by itself is fairly straightforward, but a reliable fold-style operation that can recover in the face of worker failures is more interesting. In this talk, Márton Balassi and Gábor Hermann presented an overview of several different approaches to supporting reliable stream processing with state: the approaches used by Flink (both versions 0.9.1 and 0.10), Spark Streaming, Samza, and Trident. As one might imagine, Spark Streaming and Samza get a lot of mileage out of delegating to underlying models (immutable RDDs in Spark’s case and a reliable unified log in Samza’s). Flink’s approach of using distributed snapshots exhibits good performance and enables exactly-once semantics, but it also seems simpler to use than the alternatives. This has become a recurring theme in my investigation of Flink: technical decisions that are advertised as improving performance (latency, throughput, etc.) also, by happy coincidence, admit a more elegant programming model.
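Here's a minimal sketch of what fold-like state on a keyed stream looks like in practice (in recent Flink the DataStream fold has given way to reduce and aggregate; the data below is made up):

```scala
import org.apache.flink.streaming.api.scala._

object RunningTotals {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val purchases: DataStream[(String, Double)] = env.fromElements(
      ("alice", 3.50), ("bob", 1.25), ("alice", 2.00))

    // The running sum per key is operator state: it persists across
    // events and, with checkpointing enabled, across worker failures
    val totals = purchases
      .keyBy(_._1)
      .reduce((a, b) => (a._1, a._2 + b._2))

    totals.print()
    env.execute("running totals per user")
  }
}
```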
Fault-tolerance and job recovery in Apache Flink
This talk was an apt chaser for the Stateful Stream Processing talk. Till Rohrmann presented Flink’s approaches to checkpointing and recovery, showing how Flink can be configured to support at-most-once delivery (the weakest guarantee), at-least-once delivery, or exactly-once delivery (the strongest guarantee). The basic approach Flink uses for checkpointing operator state is the Chandy-Lamport snapshot algorithm, which enables consistent distributed snapshots in a way that is transparent to the application programmer. This approach also enables configurable tradeoffs between throughput and snapshot interval, but it’s far faster (and nicer to use) than Storm’s approach in any case. Recovering operator state is only part of the fault-tolerance picture, though; Till’s talk also introduced Flink’s approach for supporting a highly-available Job Manager.
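Configuring this is nearly a one-liner in the current API (a sketch; option names have shifted across versions): the checkpoint interval and mode are how you trade throughput against recovery guarantees.

```scala
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.CheckpointingMode

object CheckpointConfigSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Snapshot operator state every 10 seconds via the barrier-based,
    // Chandy-Lamport-style mechanism; EXACTLY_ONCE is the strongest mode,
    // AT_LEAST_ONCE trades weaker guarantees for lower latency
    env.enableCheckpointing(10000L, CheckpointingMode.EXACTLY_ONCE)

    // Leaving checkpointing disabled altogether corresponds to the
    // weakest option, at-most-once state recovery
  }
}
```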
Other talks worth checking out
Here are a few talks that I’d like to briefly call out as worth watching:
- “Automatic detection of web trackers”, presented by Vasia Kalavri, was a cool application of graph processing in Flink.
- “Applying Kappa architecture in the telecom industry”, presented by Ignacio Mulas Viela, showed how to put a realistic streaming topology into production.
- In “A tale of squirrels and storms”, Matthias Sax introduced the Storm compatibility layer for Flink, enabling users to run Storm topologies on Flink with minimal code changes.
- Aljoscha Krettek’s talk covered the different approaches Flink supports for defining windows over streams.
- My colleague Suneel Marthi presented on a Flink port of the BigPetStore big data application blueprints.
General notes
The data Artisans team and the Flink community clearly put a lot of hard work towards making this a really successful conference. The venue (pictured above) was unique and cool, the overall vibe was friendly and technical, and I didn’t see a single talk that I regretted attending. (This is high praise indeed for a technical conference; I may have been lucky, but I suspect it’s more likely that the committee picked a good program.) I especially appreciated the depth of technical detail in the talks by Flink contributors on the second afternoon, covering both design tradeoffs and implementation decisions. I’m hoping to be back for a future iteration.
Footnotes
¹ Indeed, as we saw later in the conference, the difference between “batch” workloads and “streaming” workloads can be as simple as a set of policy decisions by the scheduler.