I’m currently on my way back from the first-ever Spark Summit, where I presented a talk on some of my work with the Fedora Big Data SIG to package Apache Spark and its infrastructure for Fedora. (My slides are online, but they aren’t particularly useful without the talk. I’ll post a link to the video when it’s available, though.)
If you’re interested in learning more about Spark, a great place to start is the guided exercises that the Spark team put together; simply follow their instructions to fire up an EC2 cluster with Spark installed and then work through the exercises. In one of the exercises, you’ll have an opportunity to build up one of the classic Spark demos: distributed k-means clustering in about a page of code.
Implementing k-means on resilient distributed datasets is an excellent introduction to key Spark concepts and idioms. With recent releases of Spark, though, machine learning can be simpler still: MLLib includes an implementation of k-means clustering (as well as several other fundamental algorithms). One of my spare-time projects has been experimenting with featurizing bicycling telemetry data (coordinates, altitude, mean maximal power, and heart rate) in order to aid self-coaching, and I’ve been using MLLib for this project. I don’t have any results yet that are interesting from a coaching perspective, but simply using GPS coordinates as feature vectors leads naturally to an expressive visualization:
The above map visualizes about six weeks of road rides in late summer and early fall. It does so by plotting the centers of clusters; darker markers correspond to clusters that contain more trackpoints. I’ve generated similar maps by hand before, and Strava offers automatic activity heatmaps now, but I like the clustering visualization since it can plot routes (when run with hundreds of clusters) or plot hot areas (when run with dozens of clusters).
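To make the featurization concrete: each trackpoint becomes a small array of doubles, and for the map above only the GPS coordinates are used as features. Here's a rough sketch of that step (the Trackpoint case class and its fields are placeholders for illustration, not the actual types from my code):

// Hypothetical trackpoint record; the real app parses these from TCX files.
case class Trackpoint(lat: Double, lon: Double, alt: Double, hr: Int)

// Using only the GPS coordinates gives a two-dimensional feature vector.
def featurize(tp: Trackpoint): Array[Double] = Array(tp.lat, tp.lon)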
Some fairly rough code to generate such a map is available in my cycling data analysis sandbox; you can download and run the app yourself. First, place a bunch of TCX files in a directory (here we’re using “activities”). Then build and run the app, specifying the location of your activities directory with the “-d” parameter:
% sbt console
scala> com.freevariable.surlaplaque.GPSClusterApp.main(Array("-dactivities"))
You can influence the output and execution of the app with several environment variables: SLP_MASTER sets the Spark master (defaults to local with 8 threads); SLP_OUTPUT_FILE sets the name of the GeoJSON output file (defaults to slp.json); SLP_CLUSTERS sets the number of clusters; and SLP_ITERATIONS sets the number of k-means iterations. Once you have the GeoJSON file, you can publish it by posting it to GitHub or your favorite map hosting service.
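If you're curious what those options amount to, the configuration boils down to something like the following sketch (the cluster and iteration defaults here are placeholders, not necessarily what the app ships with):

// Read configuration from the environment, falling back to defaults.
// (Sketch only; the defaults for clusters and iterations below are placeholders.)
val master = sys.env.getOrElse("SLP_MASTER", "local[8]")
val outputFile = sys.env.getOrElse("SLP_OUTPUT_FILE", "slp.json")
val numClusters = sys.env.getOrElse("SLP_CLUSTERS", "256").toInt
val numIterations = sys.env.getOrElse("SLP_ITERATIONS", "10").toInt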
To get started with MLLib in your own projects, make sure to add spark-mllib to your build.sbt file:
libraryDependencies += "org.apache.spark" % "spark-core_2.9.3" % "0.8.0-incubating"

libraryDependencies += "org.apache.spark" % "spark-mllib_2.9.3" % "0.8.0-incubating"
From there, it’s extremely straightforward to get k-means running; here are the relevant lines from my app (vectors is an RDD of Array[Double]):
import org.apache.spark.mllib.clustering.KMeans

// Configure and run k-means over the feature vectors.
val km = new KMeans()
km.setK(numClusters)
km.setMaxIterations(numIterations)
val model = km.run(vectors)

// Label each trackpoint with the cluster the model assigns it to.
val labeledVectors = vectors.map((arr: Array[Double]) => (model.predict(arr), arr))
These few lines initialize a k-means object, train a model, and label each trackpoint with the cluster the model assigns it to. Since this functionality is blazing fast and available interactively from the Spark shell, we can easily experiment with different feature extraction policies and see what helps us get some insight from our data.
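As one example of that kind of exploration, here's a sketch of how the labeled vectors can be turned into the data behind the map above: count the trackpoints in each cluster and attach that count to each cluster center as a GeoJSON property (the property names are my own illustration, not necessarily what the app emits):

// Count how many trackpoints the model assigns to each cluster;
// these counts drive the marker shading (darker marker = more points).
val counts = labeledVectors.map(_._1).countByValue()

// One GeoJSON Point feature per cluster center, tagged with its count.
// GeoJSON expects [longitude, latitude], so the coordinate order is flipped.
val features = model.clusterCenters.zipWithIndex.map { case (center, idx) =>
  """{"type": "Feature", "geometry": {"type": "Point", "coordinates": [%f, %f]}, "properties": {"count": %d}}"""
    .format(center(1), center(0), counts.getOrElse(idx, 0L))
}
val geojson = """{"type": "FeatureCollection", "features": [%s]}""".format(features.mkString(", "))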