Apache Spark for Library Developers

Published: June 5, 2018

Presented at Spark Summit (San Francisco, CA)

As a developer, data engineer, or data scientist, you’ve seen how Apache Spark is expressive enough to let you solve problems elegantly and efficient enough to let you scale out to handle more data. However, if you’re solving the same problems again and again, you probably want to capture and distribute your solutions so that you can focus on new problems and so other people can reuse and remix them: you want to develop a library that extends Spark.

You faced a learning curve when you first started using Spark, and you’ll face a different learning curve as you start to develop reusable abstractions atop Spark. In this talk, two experienced Spark library developers will give you the background and context you’ll need to turn your code into a library that you can share with the world.

We’ll back up our advice with concrete examples from real packages built atop Spark. You’ll leave this talk informed and inspired to take your Spark proficiency to the next level by developing and publishing an awesome library of your own.
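To give a flavor of the kind of reusable abstraction the talk is about, here is a minimal, hypothetical sketch of a typed aggregator that a library might ship. It is not taken from the talk or from any of the packages it covers; the `MeanBuffer` and `MeanAggregator` names are illustrative only.

```scala
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// Hypothetical library component: a typed aggregator that computes the
// arithmetic mean of a Dataset[Double]. All names here are illustrative.
case class MeanBuffer(sum: Double, count: Long)

object MeanAggregator extends Aggregator[Double, MeanBuffer, Double] {
  // Neutral starting value for an empty partition
  def zero: MeanBuffer = MeanBuffer(0.0, 0L)

  // Fold one input value into a partial result
  def reduce(b: MeanBuffer, x: Double): MeanBuffer =
    MeanBuffer(b.sum + x, b.count + 1)

  // Combine partial results computed on different partitions
  def merge(b1: MeanBuffer, b2: MeanBuffer): MeanBuffer =
    MeanBuffer(b1.sum + b2.sum, b1.count + b2.count)

  // Produce the final answer from the merged buffer
  def finish(b: MeanBuffer): Double =
    if (b.count == 0L) Double.NaN else b.sum / b.count

  // Encoders tell Spark how to serialize intermediate and final values
  def bufferEncoder: Encoder[MeanBuffer] = Encoders.product[MeanBuffer]
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

// Example use (e.g., in spark-shell):
//   import spark.implicits._
//   Seq(1.0, 2.0, 3.0).toDS().select(MeanAggregator.toColumn).show()
```

Packaging logic like this behind Spark’s `Aggregator` interface lets users of your library apply it through the same typed Dataset API they already know.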

Note

Erik and I also delivered an extended deep-dive session covering similar material at Spark Summit in London in October 2018; both parts of that session are linked below.

- Talk video
- Deep dive part 1
- Deep dive part 2
- Slides
- Handout