Big Data In Production: Bare Metal to OpenShift

Published

January 29, 2017

Presented at DevConf.cz (Brno, Czechia)

Apache Spark is one of the most exciting open-source data-processing frameworks today. It features a range of useful capabilities and an unusually developer-friendly programming model. However, the ease of getting a simple Spark application running can hide some of the challenges you might face when moving from a proof of concept to a real-world application. This talk will distill our experiences as early adopters of Spark in production, present a case study in which effective use of Spark provided huge benefits over legacy solutions, explain why we migrated from a dedicated Spark cluster to OpenShift, and provide concrete advice regarding:

how to integrate Spark with external data sources (including databases, in-memory data grids, and message queues),
how best to deploy and manage Spark in the cloud,
the tradeoffs of various archive storage options for Spark,
how to evaluate predictive models and make sense of the analytic components of insightful applications, and
how to integrate Spark into microservice applications on OpenShift.

This talk assumes some familiarity with Apache Spark but will provide context for attendees who are new to Spark. You’ll learn from a seasoned Red Hat engineer with over three years of experience running Spark in production and contributing to the Spark community.

Talk video