I’ll be speaking about Spark on Kubernetes at Spark Summit EU this week. The main thesis of my talk is that the old way of running Spark in a dedicated cluster that is shared between applications makes sense when analytics is a separate workload. However, analytics is no longer a separate workload — instead, analytics is now an essential part of long-running data-driven applications. This realization motivated my team to switch from a shared Spark cluster to multiple logical clusters that are co-scheduled with the applications that depend on them.
I’m glad for the opportunity to get together with the Spark community and present some of the cool work my team has done lately. Here are some links you can visit to learn more about our work and other topics related to running Spark on Kubernetes and OpenShift:
- You can download a PDF of my slide deck, but it doesn’t include animations, so you may want to wait for the video, which should be available shortly after the event.
- We’re doing all of our work in the radanalytics.io GitHub organization. In particular, check out:
  - openshift-spark, our container image for running Spark under OpenShift as a non-root user,
  - Oshinko (REST service, web UI), a management console for containerized Spark clusters,
  - source-to-image builders for PySpark applications, so you can have a seamless developer workflow: commit, push, and see your changes built and deployed to production, and
  - scorpion-stare, some prototypes integrating Spark’s dynamic resource allocation with the Kubernetes API.
- You may also be interested in the Kubernetes Spark example, which enables standalone Spark clusters on vanilla Kubernetes.
- Finally, there has recently been some discussion in the Kubernetes community about developing first-class support in Spark for Kubernetes-managed clusters.
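As a concrete illustration of the source-to-image workflow mentioned above, here is roughly what building and redeploying a PySpark application looks like with the OpenShift CLI. The builder image name and repository URL below are placeholders for illustration, not the actual radanalytics.io artifact names; consult the project documentation for those.

```shell
# Hypothetical sketch of the source-to-image (s2i) workflow described above.
# "pyspark-builder-image" and the git URL are illustrative placeholders.

# Build and deploy a PySpark application directly from a git repository,
# using OpenShift's builder-image~source-repo syntax:
oc new-app pyspark-builder-image~https://github.com/example/my-pyspark-app

# After you commit and push changes, trigger a rebuild; OpenShift rolls
# out the freshly built image automatically:
oc start-build my-pyspark-app
```

The appeal of this workflow is that the build and deployment machinery lives in the cluster, so the developer loop really is just commit, push, and rebuild.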