I’ll be speaking later this afternoon at ApacheCon EU. The title of my talk is “Iteratively Improving Spark Application Performance.” The great thing about Apache Spark is that simple prototype applications are very easy to develop, and even a first attempt at realizing a new analysis will usually work well enough that evaluating it on real data isn’t frustrating. However, simple prototypes often exhibit performance problems that aren’t obvious until you know where to look.
In this talk, we’ll introduce Spark’s execution model and discuss how to use Spark effectively. I’ll use a prototype implementation of my bike data analytics application as a running example and will present four general principles to keep in mind when writing efficient Spark applications, complete with detailed code explanations. My slides are available here as a PDF.
If you’re interested in more detail about the bike data analytics application, you can watch a brief video demo or a video of my talk about this application from Spark Summit earlier this year. Finally, I’ve published a blog post covering similar principles for improving Spark applications, which will be a useful reference whether or not you’re able to attend the talk.[1]
If you’re also at ApacheCon today, I hope to see you at 15:50 CET!
Footnotes
[1] The talk will feature visual explanations and other material that are not in that post, but the post has links to full code examples suitable for offline experimentation.