One of the great things about Apache Spark is that you can experiment with new analyses interactively. In the past, I’ve used the sbt console to try out new data transformations and models; the console is especially convenient since you can set it up as a custom Scala REPL with your libraries loaded and some test fixtures already created. However, some of Spark’s coolest new functionality depends on aspects of Scala reflection that aren’t compatible with how sbt uses classloaders for tasks, so you’re liable to see MissingRequirementError exceptions when you try to run code that exercises parts of Spark SQL or the DataFrame API from the sbt console.[1]
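To make the failure concrete, here’s a sketch of the kind of code that’s affected; the case class, the sample values, and the sc SparkContext are assumptions for illustration. Inferring a DataFrame schema from a case class goes through Scala’s runtime reflection (TypeTags), and that’s the step that can blow up with MissingRequirementError when it runs inside the sbt console.

// Illustrative only: schema inference from a case class uses Scala
// reflection, which is what breaks under the sbt console's classloaders.
import org.apache.spark.sql.SQLContext

case class Record(id: Int, label: String)

val sqlc = new SQLContext(sc)   // assumes `sc` is your SparkContext
val df = sqlc.createDataFrame(sc.parallelize(Seq(Record(1, "a"), Record(2, "b"))))
df.printSchema()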
You can certainly run a regular Scala REPL or the spark-shell, but doing so sacrifices a lot of the flexibility of running from sbt: every time your code or dependencies change, you’ll need to package your application and set your classpath, ensuring that all of your application classes and dependencies are available to the REPL application.
Fortunately, there’s an easier way: you can make a small application that runs a Scala REPL set up the way you like and ask sbt how to set its classpath. First, write up a simple custom Scala REPL, like this one:
object ReplApp {
  import scala.tools.nsc.interpreter._
  import scala.tools.nsc.Settings

  def main(args: Array[String]) {
    val repl = new ILoop {
      override def loop(): Unit = {
        // ConsoleApp is just a simple container for a Spark context
        // and configuration
        val app = new com.redhat.et.silex.app.ConsoleApp()
        intp.addImports("org.apache.spark.SparkConf")
        intp.addImports("org.apache.spark.SparkContext")
        intp.addImports("org.apache.spark.SparkContext._")
        intp.addImports("org.apache.spark.rdd.RDD")
        intp.bind("app", app)
        intp.bind("spark", app.context)
        intp.bind("sqlc", app.sqlContext)
        intp.addImports("sqlc._")
        super.loop()
      }
    }

    val settings = new Settings
    settings.Yreplsync.value = true
    settings.usejavacp.value = true
    repl.process(settings)
  }
}
The ReplApp application sets up a Scala REPL with imports for some common Spark classes and bindings to a SparkContext and SQLContext. (The ConsoleApp object is just a simple wrapper for a Spark context and configuration; see the Silex project, where my team is collecting and generalizing infrastructure code from Spark applications, for more details, or just change this code to set up a SparkContext as you see fit.)
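If you’d rather not pull in Silex, a minimal stand-in might look something like the sketch below. The class body here is an assumption for illustration and not Silex’s actual ConsoleApp; anything that exposes a SparkContext as context and a SQLContext as sqlContext will work with the ReplApp above.

// Hypothetical minimal stand-in for Silex's ConsoleApp: it just owns a
// local SparkConf plus lazily-created SparkContext and SQLContext.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

class ConsoleApp {
  val conf = new SparkConf()
    .setMaster("local[*]")
    .setAppName("console")

  lazy val context = new SparkContext(conf)
  lazy val sqlContext = new SQLContext(context)
}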
In order to run this application, you’ll need to set its classpath, and sbt gives you a way to find out exactly what environment it would be using so you can run the application manually.[2] First, make sure you have a copy of sbt-extras either in your repository or somewhere pointed to by SBT in your environment. Then, create a shell script (I’ll call it repl.sh) that looks like this:
#!/bin/sh
# set SBT to the location of a current sbt-extras script,
# or bundle one in your repository
export SBT=${SBT:-./sbt}
export SCALA_VERSION=$(${SBT} "export scalaVersion" | tail -1)
export APP_CP=$(${SBT} -batch -q "export compile:dependencyClasspath" | tail -1)
export JLINE_CP=$(find $HOME/.ivy2 | grep org.scala-lang/jline | grep ${SCALA_VERSION}.jar$ | tail -1)
${SBT} package && java -cp ${APP_CP}:${JLINE_CP} ReplApp
stty sane
You can then run repl.sh and get a Scala REPL that has all of your app’s dependencies and classes loaded and will let you experiment with structured data manipulation in Spark.
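Once the REPL is up, the bound spark and sqlc values are ready to use, so the reflection-dependent DataFrame conversions that fail under the sbt console behave as expected. Here’s a sketch of such a session; the case class and data are made up for illustration.

// In the ReplApp REPL: `spark` is the bound SparkContext and `sqlc` is the
// bound SQLContext, as set up in ReplApp above.
case class Sale(region: String, amount: Double)

val df = sqlc.createDataFrame(
  spark.parallelize(Seq(Sale("east", 10.0), Sale("west", 42.0))))

df.registerTempTable("sales")
sqlc.sql("select region, sum(amount) from sales group by region").show()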
Footnotes

[1] Frustratingly, apps that use these features will work, since the classes Scala reflection depends on will be loaded by the bootstrap classloader, and test cases will work as long as you have sbt fork a new JVM to execute them! Unfortunately, sbt cannot currently fork a new JVM to run a console.

[2] The run-main task is the right way to run most applications from sbt, but it seems to be somewhat flaky when launching interactive console applications.