Some Things You Learn Running Apache Spark in Production for Three Years

Published April 4, 2017

Apache Spark is one of the most exciting open-source data-processing frameworks available today. It offers a range of useful capabilities and an unusually developer-friendly programming model. However, the ease of getting a simple Spark application running can hide some of the challenges you face when moving from a proof of concept to a real-world application. This talk distills our experiences as early adopters of Spark in production, presents a case study in which using Spark effectively provided huge benefits over legacy solutions, and offers concrete advice on:

How to integrate Spark with external data sources
How best to deploy and manage Spark in the cloud
How to weigh the tradeoffs of various archive storage options for Spark
How to configure machines for data processing
How to evaluate predictive models and make sense of the analytic components of insightful applications