Modernize Your Analytics Workloads for Apache Spark 3.0 and Beyond


May 27, 2021

Presented at Data+AI Summit North America (Virtual)

Apache Spark 3.0 has been out for almost a year, and you’re probably running at least some production workloads against it today. However, many production Spark jobs may have evolved over the better part of a decade, and your code, configuration, and architecture may not be taking full advantage of all that Spark 3 has to offer.

In this talk, we’ll discuss the changes you might need to make to legacy applications to get the most out of Apache Spark 3.0. You’ll learn about common sources of technical debt in mature Apache Spark applications and how to pay them down; when to replace hand-tuned configurations with Adaptive Query Execution; how to ensure that your queries can take advantage of columnar processing, including execution on GPUs; and how your Spark analytics workloads can directly incorporate accelerated ML training.

We’ll provide several concrete examples drawn from an end-to-end analytics application for customer churn modeling, from recent experience modernizing Apache Spark applications, and from lessons learned while maintaining a library of Apache Spark extensions across three major versions of Apache Spark.
