Some Things You Learn Running Apache Spark in Production for Three Years
Presented at Enterprise Data World (Atlanta, Georgia)
Apache Spark is one of the most exciting open-source data-processing frameworks today. It offers a range of useful capabilities and an unusually developer-friendly programming model. However, the ease of getting a simple Spark application running can hide the challenges of moving from a proof of concept to a real-world application. This talk distills our experience as early adopters of Spark in production, presents a case study in which using Spark effectively provided huge benefits over legacy solutions, and provides concrete advice regarding:
- How to integrate Spark with external data sources
- How best to deploy and manage Spark in the cloud
- The tradeoffs of various archive storage options for Spark
- How to configure machines for data processing
- How to evaluate predictive models and make sense of the analytic components of insight-driven applications