One of the great things about Apache Spark is that you can experiment with new analyses interactively. In the past, I’ve used the sbt console to try out new data transformations and models; the console is especially convenient since you can set it up as a custom Scala REPL with your libraries loaded and some test fixtures already created. However, some of Spark’s coolest new functionality depends on aspects of Scala reflection that aren’t compatible with how sbt uses classloaders for tasks, so you’re liable to see MissingRequirementError exceptions when you try to run code that exercises parts of Spark SQL or the DataFrame API from the sbt console.[1]
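To make the failure concrete, here’s a sketch of the kind of code that’s affected; the case class, the sample values, and the sc SparkContext are assumptions for illustration. Inferring a DataFrame schema from a case class goes through Scala’s runtime reflection (TypeTags), and that’s the step that can blow up with MissingRequirementError when it runs inside the sbt console.

// Illustrative only: schema inference from a case class uses Scala
// reflection, which is what breaks under the sbt console's classloaders.
import org.apache.spark.sql.SQLContext

case class Record(id: Int, label: String)

val sqlc = new SQLContext(sc)   // assumes `sc` is your SparkContext
val df = sqlc.createDataFrame(sc.parallelize(Seq(Record(1, "a"), Record(2, "b"))))
df.printSchema()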
You can certainly run a regular Scala REPL or the spark-shell, but doing so sacrifices a lot of the flexibility of running from sbt: every time your code or dependencies change, you’ll need to package your application and set your classpath, ensuring that all of your application classes and dependencies are available to the REPL application.
Fortunately, there’s an easier way: you can make a small application that runs a Scala REPL set up the way you like and ask sbt how to set its classpath. First, write up a simple custom Scala REPL, like this one:
object ReplApp {
  import scala.tools.nsc.interpreter._
  import scala.tools.nsc.Settings

  def main(args: Array[String]) {
    val repl = new ILoop {
      override def loop(): Unit = {
        // ConsoleApp is just a simple container for a Spark context
        // and configuration
        val app = new com.redhat.et.silex.app.ConsoleApp()
        intp.addImports("org.apache.spark.SparkConf")
        intp.addImports("org.apache.spark.SparkContext")
        intp.addImports("org.apache.spark.SparkContext._")
        intp.addImports("org.apache.spark.rdd.RDD")
        intp.bind("app", app)
        intp.bind("spark", app.context)
        intp.bind("sqlc", app.sqlContext)
        intp.addImports("sqlc._")
        super.loop()
      }
    }

    val settings = new Settings
    settings.Yreplsync.value = true
    settings.usejavacp.value = true
    repl.process(settings)
  }
}
The ReplApp application sets up a Scala REPL with imports for some common Spark classes and bindings to a SparkContext and SQLContext. (The ConsoleApp object is just a simple wrapper for a Spark context and configuration; see the Silex project, where my team is collecting and generalizing infrastructure code from Spark applications, for more details, or just change this code to set up a SparkContext as you see fit.)
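If you’d rather not pull in Silex, a minimal stand-in might look something like the sketch below. The class body here is an assumption for illustration and not Silex’s actual ConsoleApp; anything that exposes a SparkContext as context and a SQLContext as sqlContext will work with the ReplApp above.

// Hypothetical minimal stand-in for Silex's ConsoleApp: it just owns a
// local SparkConf plus lazily-created SparkContext and SQLContext.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

class ConsoleApp {
  val conf = new SparkConf()
    .setMaster("local[*]")
    .setAppName("console")

  lazy val context = new SparkContext(conf)
  lazy val sqlContext = new SQLContext(context)
}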
In order to run this application, you’ll need to set its classpath, and sbt gives you a way to find out exactly what environment it would be using so you can run the application manually.[2] First, make sure you have a copy of sbt-extras either in your repository or somewhere pointed to by SBT in your environment. Then, create a shell script (I’ll call it repl.sh) that looks like this:
#!/bin/sh
# set SBT to the location of a current sbt-extras script,
# or bundle one in your repository
export SBT=${SBT:-./sbt}
export SCALA_VERSION=$(${SBT} "export scalaVersion" | tail -1)
export APP_CP=$(${SBT} -batch -q "export compile:dependencyClasspath" | tail -1)
export JLINE_CP=$(find $HOME/.ivy2 | grep org.scala-lang/jline | grep ${SCALA_VERSION}.jar$ | tail -1)
${SBT} package && java -cp ${APP_CP}:${JLINE_CP} ReplApp
stty sane
You can then run repl.sh and get a Scala REPL that has all of your app’s dependencies and classes loaded and will let you experiment with structured data manipulation in Spark.
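Once the REPL is up, the bound spark and sqlc values are ready to use, so the reflection-dependent DataFrame conversions that fail under the sbt console behave as expected. Here’s a sketch of such a session; the case class and data are made up for illustration.

// In the ReplApp REPL: `spark` is the bound SparkContext and `sqlc` is the
// bound SQLContext, as set up in ReplApp above.
case class Sale(region: String, amount: Double)

val df = sqlc.createDataFrame(
  spark.parallelize(Seq(Sale("east", 10.0), Sale("west", 42.0))))

df.registerTempTable("sales")
sqlc.sql("select region, sum(amount) from sales group by region").show()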
Footnotes

[1] Frustratingly, apps that use these features will work, since the classes Scala reflection depends on will be loaded by the bootstrap classloader, and test cases will work as long as you have sbt fork a new JVM to execute them! Unfortunately, sbt cannot currently fork a new JVM to run a console.

[2] The run-main task is the right way to run most applications from sbt, but it seems to be somewhat flaky when launching interactive console applications.