I’m thrilled to be in Brno for DevConf again! This year I’m speaking about structures and techniques for scalable data processing, and this post is a virtual handout for my talk.

  • Here’s a Jupyter notebook containing all of the code I discussed in the talk, complete with a few exercises and some links to other resources.
  • There are lots of implementations of these techniques that you can use in production. Apache Spark, for example, uses some of these structures to support library operations like aggregates in structured queries. Algebird provides scalable and parallel implementations of all of the techniques I discussed and it is especially nice if you’re an algebraically-inclined fan of functional programming (guilty as charged).
  • A very cool data structure that I didn’t have time to talk about is the t-digest, which calculates approximate cumulative distributions (so that you can take a stream of metric observations and ask, e.g., what’s the median latency, or the latency at the 99th percentile?) My friend Erik Erlandson has a really elegant scala implementation of the t-digest and has also given several great talks on the t-digest and some really clever applications for scalable cumulative distribution estimates. Start with this one.
  • radanalytics.io is a community effort to enable scalable data processing and intelligent applications on OpenShift, including tooling to manage compute resources in intelligent applications and a distribution of Apache Spark for OpenShift.

My slides are available here.

  mlscaleparallelsketches • You may reply to this post on Twitter or