Improving Spark Application Performance
Presented at ApacheCon EU (Budapest, Hungary)
Apache Spark presents an elegant, powerful set of high-level abstractions for developing distributed data-processing applications. Analysts who use Spark can rapidly prototype applications and experiment with new techniques at scale. However, to make the most of Spark, developers need to understand both these abstractions and how Spark schedules and executes their code.
This talk will show you how to improve Spark application performance by working with, not against, Spark’s operational model. We’ll start with a real prototype Spark application and apply several simple, generally applicable transformations to make it more efficient and scalable. For each transformation, we’ll look at both why it works, examining the relevant details of Spark’s internals, and how well it works, considering its impact on overall application performance. You’ll leave this talk with an improved understanding of how Spark runs your code and some additional tools to make your big data apps even more efficient.
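As a taste of the style of transformation the talk covers: the abstract doesn’t name specific techniques, so the sketch below is an assumed example of one classic, generally applicable change, replacing groupByKey with reduceByKey so that partial aggregation happens on each partition before the shuffle. The input and output paths are hypothetical.

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCount {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("word-count"))

        // Tokenize a text file into (word, 1) pairs.
        val pairs = sc.textFile("hdfs:///data/corpus.txt") // hypothetical path
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))

        // Naive approach: groupByKey ships every (word, 1) pair across the
        // network before any counting happens. (Shown for contrast only;
        // Spark's laziness means this RDD is never materialized here.)
        val slowCounts = pairs.groupByKey().mapValues(_.sum)

        // Better approach: reduceByKey sums counts within each partition
        // first (a map-side combine), so far less data is shuffled.
        val fastCounts = pairs.reduceByKey(_ + _)

        fastCounts.saveAsTextFile("hdfs:///out/word-counts") // hypothetical path
        sc.stop()
      }
    }

Both versions produce the same results, so the change is a drop-in swap; the win comes entirely from moving aggregation ahead of the shuffle, which is exactly the kind of “work with Spark’s operational model” improvement the talk describes.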