One of the great things about Apache Spark is that you can experiment with new analyses interactively. In the past, I’ve used the
sbt console to try out new data transformations and models; the console is especially convenient since you can set it up as a custom Scala REPL with your libraries loaded and some test fixtures already created.
However, some of Spark’s coolest new functionality depends on aspects of Scala reflection that aren’t compatible with how
sbt uses classloaders for tasks, so you’re liable to see
MissingRequirementError exceptions when you try to run code that exercises parts of Spark SQL or the DataFrame API from the
sbt console.1 You can certainly run a regular Scala REPL or the
spark-shell, but doing so sacrifices a lot of the flexibility of running from
sbt: every time your code or dependencies change, you’ll need to package your application and set your classpath, ensuring that all of your application classes and dependencies are available to the REPL application.
Fortunately, there’s an easier way: you can make a small application that runs a Scala REPL set up the way you like and ask
sbt how to set its classpath. First, write up a simple custom Scala REPL, like this one:
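A minimal sketch of such a REPL, assuming Scala 2.11's ILoop API and Spark 1.x's SQLContext; the class and binding names here are illustrative, and you can swap the inline SparkContext setup for whatever wrapper your project uses:

```scala
import scala.tools.nsc.Settings
import scala.tools.nsc.interpreter.ILoop

// A custom REPL that pre-imports common Spark classes and binds a
// SparkContext, so every session starts ready for experimentation.
object ReplApp {
  def main(args: Array[String]): Unit = {
    val settings = new Settings
    settings.usejavacp.value = true // build the REPL classpath from the JVM's

    new SparkLoop().process(settings)
  }

  class SparkLoop extends ILoop {
    override def createInterpreter(): Unit = {
      super.createInterpreter()
      // run setup code silently before handing the prompt to the user
      intp.beQuietDuring {
        intp.interpret("""
          import org.apache.spark.{SparkConf, SparkContext}
          import org.apache.spark.sql.SQLContext
          val sc = new SparkContext(
            new SparkConf().setMaster("local[*]").setAppName("repl"))
          val sqlc = new SQLContext(sc)
          import sqlc.implicits._
        """)
      }
    }
  }
}
```

Since ILoop lives in the Scala compiler, your build will also need a dependency on scala-compiler for this to resolve.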
The ReplApp application sets up a Scala REPL with imports for some common Spark classes and bindings to a
SparkContext. (The ConsoleApp object is just a simple wrapper for the Spark context and configuration; see the Silex project, where my team is collecting and generalizing infrastructure code from Spark applications, for more details, or just change this code to set up a
SparkContext as you see fit.)
In order to run this application, you’ll need to set its classpath, and
sbt gives you a way to find out exactly what environment it would use, so you can run the application manually.2 First, make sure you have a copy of
sbt-extras either in your repository or somewhere pointed to by the
SBT variable in your environment. Then, create a shell script that looks like this:
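A sketch of such a script, assuming sbt 0.13's `export` command (which prints the value of a task, here the full runtime classpath) and the ReplApp main class from above; adjust the task name and main class to match your project:

```shell
#!/bin/sh
# Use a local sbt-extras launcher, or one pointed to by $SBT.
SBT=${SBT:-./sbt}

# Ask sbt for the application's full runtime classpath; with -batch and
# -no-colors, the last line of output is the classpath string itself.
CLASSPATH=$("$SBT" -batch -no-colors "export runtime:fullClasspath" | tail -n 1)

# Launch the custom REPL manually with that classpath.
exec java -cp "$CLASSPATH" ReplApp "$@"
```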
You can then run
repl.sh to get a Scala REPL with all of your app’s dependencies and classes loaded, letting you experiment with structured data manipulation in Spark.
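For example, a session in the resulting REPL might exercise the DataFrame API directly; the sc and sqlc bindings match the REPL setup sketched earlier, and the sample data is an illustration:

```scala
// assumes the REPL pre-binds sc (SparkContext) and sqlc (SQLContext)
// and has already imported sqlc.implicits._
case class Record(name: String, count: Int)

// build a small DataFrame from local data and filter it
val df = sc.parallelize(Seq(Record("a", 1), Record("b", 2))).toDF()
df.filter(df("count") > 1).show()
```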
Frustratingly, apps that use these features will work, since the classes Scala reflection depends on will be loaded by the bootstrap classloader, and test cases will work as long as you have
sbt fork a new JVM to execute them! Unfortunately,
sbt cannot currently fork a new JVM to run a console. ↩
The run-main task is the right way to run most applications from
sbt, but it seems to be somewhat flaky when launching interactive console applications. ↩