Notes from Flink Forward

Published October 20, 2015

Brandenburger Tor lightshow

I was in Berlin last week for Flink Forward, the inaugural Apache Flink conference. I’m still learning about Flink, and Flink Forward was a great place to learn more. In this post, I’ll share some of what I consider its coolest features and highlight some of the talks I especially enjoyed. Videos of the talks should all be online soon, so you’ll be able to check them out as well.

Background

Apache Flink is a data processing framework for the JVM that is most popular for streaming workloads with high throughput and low latency, although it is also general enough to support batch processing. Flink has a pleasant collection-style API, offers stateful elementwise transformations on streams (think of a fold function), can be configured to support fault tolerance with exactly-once delivery, and does all of this while achieving extremely high performance. Flink is especially attractive for use in contemporary multitenant environments because it manages its own memory, so Flink jobs can run well in containers on overprovisioned systems (where CPU cycles may be relatively abundant but memory may be strictly constrained).
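To make the collection-style API concrete, here's roughly the canonical streaming word count against Flink's Scala API. Treat it as a sketch: some of these names shifted a bit across the 0.9 and 0.10 releases discussed below.

```scala
import org.apache.flink.streaming.api.scala._

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    env.socketTextStream("localhost", 9999)     // a line-oriented text source
      .flatMap { _.toLowerCase.split("\\W+") }  // split each line into words
      .filter { _.nonEmpty }
      .map { (_, 1) }                           // pair each word with a count
      .keyBy(0)                                 // partition the stream by word
      .sum(1)                                   // maintain running per-word totals
      .print()

    env.execute("streaming word count")
  }
}
```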

Keynotes and lightning talks

Kostas Tzoumas and Stephan Ewen (both of data Artisans) shared a keynote in which they presented the advancements in Flink 0.10 (to be released soon) and laid out the roadmap for the next release, which will be Flink 1.0. The most interesting parts of this keynote for me were the philosophical arguments for the generality and importance of stream processing in contemporary event-driven data applications. Many users of batch-processing systems simulate streaming workflows by explicitly encoding windows in the structure of their input data (e.g., by using one physical file or directory to correspond to a day's, a month's, or a year's worth of records) or by using various workarounds inspired by technical limitations (e.g., the “lambda architecture” or bespoke but narrowly applicable stream processors). However, mature stream processing frameworks not only enable a wide range of applications that process live events, but they are also general enough to handle batch workloads as a special case (i.e., by processing a stream with only one window).1

Of course, the workarounds that data engineers have had to adopt to handle streaming data in batch systems are only necessary given an absence of mature stream processing frameworks. The streaming space has improved a great deal recently, and this talk gave a clear argument that Flink was mature enough for demanding and complex applications. Flink offers a flexible treatment of time: events can be processed immediately (one at a time), in windows based on when the events arrived at the processor, or in windows based on when the events were actually generated (even if they arrived out of order). Flink supports failure recovery with exactly-once delivery but also offers extremely high throughput and low latency: a basic Flink stream processing application offers two orders of magnitude more throughput than an equivalent Storm application. Flink also provides a batch-oriented API with a collection-style interface and an optimizing query planner.
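The distinction between arrival time and event time is easiest to see in code. Here's a hedged sketch of event-time windowing; the names below (`assignAscendingTimestamps`, `timeWindow`, and friends) come from Flink's Scala API in releases somewhat after the 0.10 preview discussed at the conference, so take it as illustrative.

```scala
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

object EventTimeCounts {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

    // a toy source of (key, event-timestamp-in-millis) pairs
    val events = env.fromElements(("a", 1000L), ("b", 3500L), ("a", 12000L))

    events
      .assignAscendingTimestamps(_._2)  // tell Flink where event time lives
      .map { case (key, _) => (key, 1) }
      .keyBy(0)
      .timeWindow(Time.seconds(10))     // tumbling windows in *event* time,
      .sum(1)                           // not in arrival time
      .print()

    env.execute("event-time windowed counts")
  }
}
```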

After the keynote, there were several lightning talks. Lightning talks at many events are self-contained (and often speculative, provocative, or describing promising work in progress). However, these lightning talks were abbreviated versions of talks on the regular program. In essence, they were ads for talks to see later (think of how academic CS conference talks are really ads for papers to read later). This was a cool idea and definitely helped me navigate a two-track program that was full of interesting abstracts.

Declarative Machine Learning with the Samsara DSL

Sebastian Schelter introduced Samsara, a DSL for machine learning and linear algebra. Samsara supports in-memory vectors (both dense and sparse), in-memory matrices, and distributed row matrices, and provides an R-like syntax embedded in Scala for operations. The distributed row matrices are a unique feature of Samsara; they support only a subset of matrix operations (i.e., ones that admit efficient distributed implementations) and go through a process of algebraic optimization (including generating logical and physical plans) to minimize communication during execution. Samsara can target Flink, Spark, and H2O.
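To give the flavor of the syntax, here's a hedged sketch using names from the Apache Mahout Samsara bindings (`%*%` is matrix multiplication, `.t` is transpose). I haven't verified it against the exact version presented, and it assumes an implicit distributed context for the chosen backend is already in scope.

```scala
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import RLikeOps._
import RLikeDrmOps._

// an in-memory dense matrix...
val a = dense((1.0, 2.0), (3.0, 4.0))

// ...promoted to a distributed row matrix (assumes an implicit
// DistributedContext for the Flink, Spark, or H2O backend is in scope)
val drmA = drmParallelize(a, numPartitions = 2)

// A^T A is expressed algebraically; logical and physical plans are
// generated behind the scenes to minimize communication at execution time
val ata = (drmA.t %*% drmA).collect
```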

Stateful Stream Processing

Data processing frameworks like Flink and Spark support collection-style APIs where distributed collections or streams can be processed with operations like map, filter, and so on. In addition to these, it is useful to support transformations that include state, analogously to the fold function on local collections. Of course, fold by itself is fairly straightforward, but a reliable fold-style operation that can recover in the face of worker failures is more interesting. In this talk, Márton Balassi and Gábor Hermann presented an overview of several different approaches to supporting reliable stream processing with state: the approaches used by Flink (both versions 0.9.1 and 0.10), Spark Streaming, Samza, and Trident. As one might imagine, Spark Streaming and Samza get a lot of mileage out of delegating to underlying models (immutable RDDs in Spark’s case and a reliable unified log in Samza’s). Flink’s approach of using distributed snapshots exhibits good performance and enables exactly-once semantics, but it also seems simpler to use than alternatives. This has become a recurring theme in my investigation of Flink: technical decisions that are advertised as improving performance (latency, throughput, etc.) also, by happy coincidence, admit a more elegant programming model.
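As a concrete example of that fold-like shape, here's a sketch of a per-key running sum whose state participates in Flink's checkpoints. (`mapWithState` is a convenience in Flink's Scala API; the precise state interfaces changed across the releases compared in this talk, so take this as illustrative rather than as any one version's API.)

```scala
import org.apache.flink.streaming.api.scala._

object RunningTotals {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.enableCheckpointing(5000) // snapshot operator state every five seconds

    // a toy source of (account, deposit) events
    val deposits = env.fromElements(("alice", 10L), ("bob", 5L), ("alice", 7L))

    deposits
      .keyBy(_._1)
      .mapWithState[(String, Long), Long] { (event, state) =>
        val (account, amount) = event
        val total = state.getOrElse(0L) + amount // the fold step
        ((account, total), Some(total))          // emit result, persist new state
      }
      .print()

    env.execute("per-key running totals")
  }
}
```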

Other talks worth checking out

There were several other talks on the program that I'd like to call out as worth watching once the videos are online.

General notes

Kulturbrauerei

The data Artisans team and the Flink community clearly put a lot of hard work into making this a really successful conference. The venue (pictured above) was unique and cool, the overall vibe was friendly and technical, and I didn’t see a single talk that I regretted attending. (This is high praise indeed for a technical conference; I may have been lucky, but I suspect it’s more likely that the committee picked a good program.) I especially appreciated the depth of technical detail in the talks by Flink contributors on the second afternoon, covering both design tradeoffs and implementation decisions. I’m hoping to be back for a future iteration.

Footnotes

  1. Indeed, as we saw later in the conference, the difference between “batch” workloads and “streaming” workloads can be as simple as a set of policy decisions by the scheduler.