I gave a talk at Spark Summit earlier this month about my work using Apache Spark to analyze my bike power meter data, and the conference videos are now online. You can watch my talk here:

There were a lot of great talks at Spark Summit; check out the other videos as well!

I’m looking forward to giving a talk at Spark Summit this week about some of my recent work using Apache Spark to make sense of my bike data (see also previous posts here and here).

I’ll post a link to the video of my talk once it’s online, but in the meantime, I’ve made a short (~3 minute) video demonstrating one of the visualizations I was able to make with Spark; it’s embedded below:

Demo of cycling analytics with Apache Spark from William Benton on Vimeo.

In an earlier post, I showed how I had used Apache Spark to cluster points from GPS traces of bike rides and plot the convex hulls of each cluster, coloring each hull based on the best power outputs at given durations I’d produced starting from within each cluster. This visualization had some interesting results and pointed me to some roads that I hadn’t thought of for interval workouts, but it was a little coarser than one might like.

To get a better picture of exactly where I should be working out, I set out to plot the actual road segments that I’d set my power bests on, but the naïve approach was completely unsatisfactory. Because I regularly do repeats on a local hill-climb time trial course, I often produce power for three to four minutes on this hill that is good enough to beat many of my other three-minute efforts (which happen during rides in which I am not working on those sorts of efforts). As a consequence, I might have thirty to sixty overlapping three-minute sample windows from the same climb, many of which could represent top-20 or top-50 three-minute efforts. When I plotted my top 20 three-minute efforts, the result looked like this:

Twin valley

This visualization is correct but trivial. I already know that I can produce a lot of power for three to four minutes on this hill; that’s one of the reasons that I ride it so often. I changed my analysis to only consider the best output from several temporally-overlapping windows:

Non overlapping

This represented an improvement over just taking the best efforts overall, since it wasn’t merely a plot of the best windows of my many climbs up the Twin Valley hill. It also showed something interesting that was abstracted away in the cluster-based analysis: some strong three-minute efforts began in the same cluster but involved different roads. In particular, the inverted “T” shape at the bottom of the map excerpt shows two different route fragments beginning on the same hill, both of which lend themselves well to high three-minute average power. However, merely excluding windows that temporally overlapped still wound up plotting many (just not quite as many) trips up Twin Valley from separate climbs.
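One way to keep only the best of several temporally-overlapping windows is a greedy pass from strongest to weakest effort. The following is a minimal sketch in plain Scala, not the code from my app: the `Effort` record, its field names, and the greedy strategy itself are assumptions made for illustration.

```scala
object NonOverlappingEfforts {
  // Hypothetical record for one sample window: the activity it came from,
  // its start time and duration in seconds, and its mean power in watts.
  final case class Effort(activity: String, start: Long, duration: Long, watts: Double) {
    def end: Long = start + duration

    // Two windows overlap only if they come from the same activity
    // and their half-open time intervals intersect.
    def overlaps(other: Effort): Boolean =
      activity == other.activity && start < other.end && other.start < end
  }

  // Visit efforts from strongest to weakest, keeping each one only if it
  // does not overlap in time with an effort that has already been kept.
  def bestNonOverlapping(efforts: Seq[Effort], n: Int): Seq[Effort] =
    efforts.sortBy(e => -e.watts).foldLeft(Vector.empty[Effort]) { (kept, e) =>
      if (kept.size < n && !kept.exists(_.overlaps(e))) kept :+ e else kept
    }
}
```

Because the comparison is purely temporal, two separate trips up the same hill on different days both survive this filter, which is exactly the limitation described above.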

My next approach was to combine the clustering information with a best-effort-path visualization. Instead of considering only the best effort starting in a given cluster (as the hull visualization from the previous post did) or considering only non-temporally-overlapping efforts (but potentially considering multiple spatially-overlapping efforts), I considered only the best efforts starting and ending in a given pair of clusters. This is obviously far more straightforward than identifying which roads were involved or applying similarity metrics to efforts (as Strava does to match efforts to segments), but it manages to eliminate many (but not all) overlapping efforts while being fine-grained enough to not exclude nearby but non-overlapping efforts.
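The pair-of-clusters filter can be sketched as a group-and-maximize step. This is an illustrative sketch rather than the code from my app; the `ClusteredEffort` record and its field names are hypothetical, and it assumes each effort has already been labeled with the clusters of its first and last trackpoints.

```scala
object BestPerClusterPair {
  // Hypothetical record: cluster labels for the first and last trackpoint
  // of an effort, plus the mean power in watts for that effort.
  final case class ClusteredEffort(startCluster: Int, endCluster: Int, watts: Double)

  // Keep only the strongest effort for each (start cluster, end cluster) pair.
  def bestPerPair(efforts: Seq[ClusteredEffort]): Map[(Int, Int), ClusteredEffort] =
    efforts
      .groupBy(e => (e.startCluster, e.endCluster))
      .map { case (pair, es) => pair -> es.maxBy(_.watts) }
}
```

Efforts that share a starting cluster but end somewhere else land in different groups, which is how the two arms of the inverted “T” can both be plotted.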

Here are the results from plotting my best one-, three-, and ten-minute efforts, considering only the best effort starting and ending in a given pair of clusters; as usual, you can click and drag or zoom to inspect the map:

On this map, each path is labeled with the average wattage for the effort (click on a path to see its label). This showed me a few surprising things:

  • I only get really strong, consistent shorter efforts in a few places; once we filter out all but the best results from these few places, the “top efforts” for a given duration include some surprisingly quotidian ones. I don’t have a lot of trouble consistently producing strong efforts on the trainer, but I’d rather be outside whenever possible, so I should probably do more structured workouts and fewer “rides.”
  • Many of the one- and three-minute efforts overlap, as do the three- and ten-minute efforts. I should probably also plot best efforts with low power variability in order to plan workouts.
  • One of my ten-minute bests (shown below) was during a local criterium. Since I wasn’t in a break at the time, I was clearly racing inefficiently.

GDVC criterium laps

If you’re interested in trying this visualization out on your own data, download the code, put a bunch of TCX files with wattage and GPS coordinates in a directory called activities, and fire up sbt console. Then you can run the application from the Scala REPL. Here are the commands I used to generate the above plot:

Finally, I’ll be talking about my work with Spark for bicycling data at Spark Summit this year; if you’re interested in this stuff and will be there, find me and we can chat!

One problem that a lot of enthusiastic amateur cyclists encounter is how to make sense of all the workout telemetry data that their smartphone or cycle computer captures. Most riders have some sense of how their cadence, heart rate, speed, road grade, and wattage come into play at any given moment in a ride as it’s happening, but answering bigger-picture questions about how these fit together over time remains more difficult. I’ve been experimenting with cycling data analytics using Apache Spark for some time now, but I thought I’d share some visualizations that I put together recently to answer a question that’s been nagging me as the weather warms up here in Wisconsin.

In my last post on using Spark to process fitness data, I presented a very simple visualization based on plotting the centers of clustered GPS traces. By plotting darker center markers for denser clusters (and generating a large number of clusters), I was able to picture which roads and intersections I spent the most time riding on in the set of activities that I analyzed. This time, however, I was more interested in a visualization that would tell me what to do rather than a visualization that would tell me what I had already done.


One of the most useful tools for a cyclist who is interested in quantifying his or her performance and training is a direct-force power meter. By measuring the actual force applied at some point on the bicycle drivetrain, these devices can accurately tell riders how many calories they’re burning in a ride, whether or not they’re “burning matches” (that is, using anaerobic metabolism instead of aerobic metabolism) at a given point in a race, how to pace long steady efforts to maximize performance, and precisely how hard to work in interval training in order to improve various kinds of fitness. The last of these capabilities will be our focus in this post.

It’s obvious that there is a difference between ultra-endurance efforts and sprint efforts; no one would try to sprint for an entire 40km time trial (or run a marathon at their 100m pace), and it would be pointless to do sprint-duration efforts at the sort of pace one could maintain for a 12-hour race. More generally, every athlete has a power-duration curve of the best efforts they could produce over time: an athlete’s best 5-second power might be double their best one-minute power and four times their best one-hour power, for example. There are several points where this curve changes for most people, and these correspond to various physiological systems (for example, the shift from anaerobic to aerobic metabolism). By targeting interval workouts to certain power zones, athletes can improve the corresponding physiological systems.
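To make the idea of a power-duration curve concrete, here is a small sketch in plain Scala that finds the best mean power for each of several durations within a single ride. It assumes one power sample per second, which real files may not guarantee, and the object and method names are made up for this example.

```scala
object PowerCurve {
  // Best mean power over any window of `seconds` consecutive samples,
  // or None if the ride is shorter than the window.
  // Assumes one sample per second.
  def bestMeanPower(watts: IndexedSeq[Double], seconds: Int): Option[Double] = {
    require(seconds > 0, "window length must be positive")
    if (watts.length < seconds) None
    else {
      // Prefix sums turn each window total into a single subtraction.
      val prefix = watts.scanLeft(0.0)(_ + _)
      val bestTotal =
        (seconds to watts.length).map(i => prefix(i) - prefix(i - seconds)).max
      Some(bestTotal / seconds)
    }
  }

  // One point on the curve per requested duration; durations longer
  // than the ride are simply left out.
  def curve(watts: IndexedSeq[Double], durations: Seq[Int]): Map[Int, Double] =
    durations.flatMap(d => bestMeanPower(watts, d).map(best => d -> best)).toMap
}
```

For example, `PowerCurve.curve(samples, Seq(60, 180, 600))` would give the one-, three-, and ten-minute bests for a ride stored in `samples`.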


I began by clustering points from GPS traces, but instead of plotting the cluster centers, I plotted the convex hulls of all of the points in each cluster. By giving me polygons containing every point from my data set, this gave me a pretty good picture of where I’d actually been. I then calculated my mean power for three durations — corresponding roughly to anaerobic, VO2max, and just-above-aerobic efforts — at every point in each activity. In other words, I mapped each point in each ride to the mean power I was about to produce in that ride. Then, for each duration, I found the best efforts starting in each cluster and used these data to shade the convex hulls so that hulls where better “best efforts” originated would thus appear more saturated.
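For readers who have not computed a convex hull before, the sketch below shows one standard approach, Andrew's monotone chain algorithm, in plain Scala. It is not necessarily the method my own code uses, and it treats longitude and latitude as planar coordinates, which is only a reasonable approximation for small areas.

```scala
object ConvexHull {
  // (longitude, latitude), treated here as planar x and y.
  type Point = (Double, Double)

  // Cross product of (a - o) and (b - o); positive for a counter-clockwise turn.
  private def cross(o: Point, a: Point, b: Point): Double =
    (a._1 - o._1) * (b._2 - o._2) - (a._2 - o._2) * (b._1 - o._1)

  // Build one half of the hull as a stack (most recent point first),
  // popping points that would not make a strict left turn.
  private def halfHull(points: Seq[Point]): List[Point] =
    points.foldLeft(List.empty[Point]) { (stack, p) =>
      def prune(s: List[Point]): List[Point] = s match {
        case b :: a :: rest if cross(a, b, p) <= 0 => prune(a :: rest)
        case _ => s
      }
      p :: prune(stack)
    }

  // Returns the hull vertices in counter-clockwise order.
  def hull(points: Seq[Point]): List[Point] = {
    val sorted = points.distinct.sortWith { (p, q) =>
      p._1 < q._1 || (p._1 == q._1 && p._2 < q._2)
    }
    if (sorted.size < 3) sorted.toList
    else {
      val lower = halfHull(sorted)
      val upper = halfHull(sorted.reverse)
      // Drop the last point added to each half, since it is also
      // the first point of the other half.
      lower.tail.reverse ++ upper.tail.reverse
    }
  }
}
```

Running `hull` once per cluster yields one polygon per cluster, which can then be shaded by whatever statistic is of interest.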

Because Spark is expressive and can work interactively, it was straightforward to experiment with various techniques and constant factors to make the most sense of these data. Debugging is straightforward; since I stick to effect-free code as much as possible, I can test my logic without running it under Spark. Furthermore, Spark is fast enough to make trying a bunch of different options completely painless, even on my desktop computer.


I’m including here three plots of cluster hulls, shaded by the best mean power I achieved starting in that cluster for one minute (green), three minutes (blue), and ten minutes (red). With these visualizations (and with increasingly friendly road cycling weather here in Wisconsin), I can decide where to go to do interval workouts based on where I’ve had my best efforts in the past. The data tell me that if I want to work on my one-minute power, I should focus on the Timber Lane climb up from Midtown; if I want to work on my three-minute power, it’s either Barlow Road or the east side of Indian Lake; and if I want to work on my ten-minute power, it’s off to Mounds Park Road for the same climb that made everyone suffer in the national championship road race last year.

(Click and drag or zoom to inspect any map; if one is missing polygons, drag and they should render.)

Future work

I have many ideas for where to take this work next and have some implementation in progress that is producing good results but not (yet) perspicuous visualizations. However, even the more mundane things on my to-do list are pretty interesting: among other things, I’d like to do some performance evaluation and see just how much cycling data we could feasibly process on a standard workstation or small cluster (my code is currently unoptimized); to add a web-based front end allowing more interactive analysis; and to improve my (currently very simple) computational geometry and power-analysis code to make better use of Spark’s abstractions and distributed execution. (The code itself, of course, is available under the Apache license and I welcome your feedback or pull requests.)

I love tools that make it easy to sketch solutions to hard problems interactively (indeed, I spent a lot of time in graduate school developing an interactive tool for designing program analyses — although in general it’s more fun to think about bicycling problems than whether or not two references alias one another), and Spark is one of the most impressive interactive environments I’ve seen for solving big problems. I’m looking forward to prototyping and refining more tools for understanding cycling training and performance in the future.

sbt is in Fedora 20


Longtime Chapeau readers may recall last summer’s lament about the state of the Scala ecosystem in Fedora. We’ve taken a lot of steps since then. After a rough patch for the Fedora Scala package, Scala 2.10 is available and works again on all current Fedora releases and Rawhide. We’ve added more Scala packages to Fedora as well, including scalacheck, sbinary, and test-interface. Today I’m especially pleased to announce that, by the time you read this, sbt 0.13.1 will be available in Fedora 20 testing.

Having sbt available in Fedora means that we can start packaging more of the Scala ecosystem in Fedora. In fact, the sbt package in Fedora is primarily intended for use building Scala packages for Fedora. While you’ll be able to use it for general Scala development, it has several important limitations in order to comply with the Fedora packaging guidelines: most notably, you won’t be able to cross-compile libraries for different Scala versions with it or launch different versions of sbt for different projects. (I use sbt-extras, which I’ve renamed to xsbt, for my general Scala development.)

In the future (i.e. F21 and on), sbt-based Fedora builds will be greatly streamlined by improved Ivy support in an upcoming version of xmvn. For now, we have to manage dependencies somewhat manually using scripts and macros, but it’s absolutely straightforward. To get started building Scala projects for Fedora right now, check out these guidelines I wrote up for the Big Data SIG and let me know if you have any trouble. There are many example spec files using sbt available among my GitHub repositories.

This was a big effort and thanks are due to several people, including Mark Harrah, who offered a lot of advice on sbt itself and gave prompt and thorough feedback on my patches; Mikołaj Izdebski, who helped a lot with my understanding of Java dependency resolution in Fedora and implemented improved support for Ivy resolution in xmvn; Rob Rati, who took on the task of reviewing the sbt package and did a thoughtful and careful job; and the Fedora Packaging Committee for their quick and helpful response to my request for a bootstrap-binary exception.

I’m looking forward to seeing new Scala projects packaged for Fedora soon!

The GNU project’s Four Freedoms present the essential components of Free software: freedom to use the program for any purpose, freedom to study and change the program’s source code, freedom to redistribute the author’s version of the code, and freedom to distribute your modifications. While not everyone who publishes their source code for others to use cares about freedom as the FSF defines it, these principles also motivate common open-source licenses. This post will discuss a surprisingly common obstacle to software freedom and show how you can avoid it in your own projects.

I'm currently on my way back from the first-ever Spark Summit, where I presented a talk on some of my work with the Fedora Big Data SIG to package Apache Spark and its infrastructure for Fedora. (My slides are online, but they aren't particularly useful without the talk. I'll post a link to the video when it's available, though.)

If you're interested in learning more about Spark, a great place to start is the guided exercises that the Spark team put together; simply follow their instructions to fire up an EC2 cluster with Spark installed and then work through the exercises. In one of the exercises, you'll have an opportunity to build up one of the classic Spark demos: distributed k-means clustering in about a page of code.

Implementing k-means on resilient distributed datasets is an excellent introduction to key Spark concepts and idioms. With recent releases of Spark, though, machine learning can be simpler still: MLlib includes an implementation of k-means clustering (as well as several other fundamental algorithms). One of my spare-time projects has been experimenting with featurizing bicycling telemetry data (coordinates, altitude, mean maximal power, and heart rate) in order to aid self-coaching, and I've been using MLlib for this project. I don't have any results yet that are interesting from a coaching perspective, but simply using GPS coordinates as feature vectors leads naturally to an expressive visualization:

The above map visualizes about six weeks of road rides in late summer and early fall. It does so by plotting the centers of clusters; darker markers correspond to clusters that contain more trackpoints. I've generated similar maps by hand before, and Strava offers automatic activity heatmaps now, but I like the clustering visualization since it can plot routes (when run with hundreds of clusters) or plot hot areas (when run with dozens of clusters).

Some fairly rough code to generate such a map is available in my cycling data analysis sandbox; you can download and run the app yourself. First, place a bunch of TCX files in a directory (here we're using "activities"). Then build and run the app, specifying the location of your activities directory with the "-d" parameter:

% sbt console
scala> com.freevariable.surlaplaque.GPSClusterApp.main(Array("-dactivities"))

You can influence the output and execution of the app with several environment variables: SLP_MASTER sets the Spark master (defaults to local with 8 threads); SLP_OUTPUT_FILE sets the name of the GeoJSON output file (defaults to slp.json); SLP_CLUSTERS sets the number of clusters; and SLP_ITERATIONS sets the number of k-means iterations. Once you have the GeoJSON file, you can publish it by posting it to GitHub or your favorite map hosting service.
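If you want the same kind of environment-driven configuration in your own app, a sketch like the following is one way to do it in plain Scala. It is not the code from my app: the `Settings` class is hypothetical, "local[8]" is my rendering of "local with 8 threads" in Spark's master URL syntax, and the numeric defaults for clusters and iterations are placeholders, since the actual defaults are not stated here.

```scala
object SlpSettings {
  // Hypothetical container for the four settings described above.
  final case class Settings(master: String, outputFile: String, clusters: Int, iterations: Int)

  // Read each setting from the environment, falling back to a default when
  // the variable is unset. Passing the environment in as a Map makes this
  // easy to exercise without touching the real process environment.
  def fromEnv(env: Map[String, String] = sys.env): Settings =
    Settings(
      master = env.getOrElse("SLP_MASTER", "local[8]"),
      outputFile = env.getOrElse("SLP_OUTPUT_FILE", "slp.json"),
      clusters = env.getOrElse("SLP_CLUSTERS", "64").toInt,     // placeholder default
      iterations = env.getOrElse("SLP_ITERATIONS", "10").toInt  // placeholder default
    )
}
```

Note that `toInt` throws a `NumberFormatException` on malformed input; a real app might prefer to report the offending variable by name.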

To get started with MLlib in your own projects, make sure to add spark-mllib to your build.sbt file:

libraryDependencies += "org.apache.spark" % "spark-core_2.9.3" % "0.8.0-incubating"

libraryDependencies += "org.apache.spark" % "spark-mllib_2.9.3" % "0.8.0-incubating"

From there, it's extremely straightforward to get k-means running; here are the relevant lines from my app (vectors is an RDD of Array[Double]):

val km = new KMeans()

val model = km.run(vectors)

val labeledVectors = vectors.map((arr:Array[Double]) => (model.predict(arr), arr))

In just a few lines, this code initializes a k-means object, optimizes a model, and labels each trackpoint with the cluster the model expects it to belong to. Since this functionality is blazing fast and available interactively from the Spark shell, we can easily experiment with different feature extraction policies and see what helps us get some insight from our data.
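Once every trackpoint has a cluster label, shading the markers by density is a matter of counting labels and scaling the counts. The sketch below does this on an ordinary Scala collection of labels rather than an RDD, and the linear scaling to an opacity value is an assumption for illustration, not necessarily how my map was styled.

```scala
object ClusterDensity {
  // Count how many trackpoints were assigned to each cluster label.
  def counts(labels: Seq[Int]): Map[Int, Int] =
    labels.groupBy(identity).map { case (label, ls) => label -> ls.size }

  // Scale counts linearly so the densest cluster gets opacity 1.0
  // and sparser clusters get proportionally lighter markers.
  def opacity(counts: Map[Int, Int]): Map[Int, Double] =
    if (counts.isEmpty) Map.empty
    else {
      val max = counts.values.max.toDouble
      counts.map { case (label, c) => label -> c / max }
    }
}
```

With very skewed data, a logarithmic scale may produce a more readable map than a linear one.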

Apache Thrift in Fedora


You probably already know that Apache Thrift is a framework for developing distributed services and clients to access these in multiple languages. You probably also know that Thrift is extremely popular among the sorts of cool projects that those of us in the Fedora Big Data SIG find interesting. What you might not have known before you saw the title of this post is that Thrift is currently available in Fedora 19 and later versions for your testing, development, and general yum install-based happiness! Please check it out and let us know how it works with your favorite upstream projects.

(Thanks to Gil Cattaneo for an extremely heroic effort reviewing this package.)

Scala combines a lot of excellent features (functional-style pattern matching, an expressive type system, closures, etc.) with JVM compatibility and a very interesting developer ecosystem (e.g., Akka, Play, Lift, scalacheck, and Spark, just to name a few notable projects). Fedora has included a package for Scala itself for some time, but it doesn't include any of the ecosystem projects. The main obstacle to having Scala ecosystem projects in Fedora is that many projects use sbt, the Simple Build Tool, but there is no native Fedora sbt package. In this post, I'm going to discuss some of the things that make sbt very interesting as a build tool but challenging to package for Fedora, as well as the solutions I've come up with to these problems. First, however, we'll discuss some background.

I've noted in the past that there is a big tension between the Fedora model of dependency management and the models adopted by many language-specific dependency managers. Put simply, the Fedora model is that projects should depend upon system copies of the latest versions of libraries, which have been built with system tools, which were themselves built from pristine sources in a controlled environment. Language-specific models, such as the ones we see with RubyGems and rvm; Python eggs; Java projects using Maven or Ivy; and Erlang releases (especially those managed with rebar), typically allow developers more flexibility to install multiple versions of libraries, fetch dependencies from canonical locations on the web or from source repositories, and rely on different versions of language environments and language runtimes. sbt, which provides both build and dependency management, is no exception in this regard; in fact, it provides as much flexibility as any other language-specific build tool I've encountered.


You don't actually download sbt. Instead, you download a small, self-contained JAR file that will run on any Java 1.6 JRE and includes enough of sbt, Apache Ivy, and the Scala standard library to fetch the whole Scala standard library and compiler, sbt itself, and its dependencies. It can also fetch multiple versions of each of these. This approach means that it's absolutely straightforward to get started using sbt in almost any environment with a JVM, but it conflicts with Fedora policies on bundling, single versions of libraries, and pristine sources.

My solution to this problem is to develop a Fedora-specific sbt launcher that is willing to run against system copies of sbt itself, the Scala compiler and libraries, and other locally-installed JAR files.

Dependency management

sbt uses Apache Ivy to manage dependencies. Fedora has excellent support for building packages that use Maven, but Ivy is still not well-represented in Fedora. Just as with Maven, most of the concerns that Ivy is meant to handle are either addressed by RPM itself (specifying versions of dependencies, finding transitive dependencies, etc.) or do not apply to packages that meet Fedora guidelines (e.g. running different projects against different versions of their dependencies).

It is possible (but clearly suboptimal) to build Ivy packages against RPM-installed dependencies by specifying an Ivy resolver pattern that ignores version numbers and finds JAR artifacts where Fedora packages put them in the filesystem, like this:


However, /usr/share/java isn't set up as a proper Ivy repository; it contains no Ivy module descriptor files (i.e., ivy.xml files). This isn't a problem if we're using Ivy from Ant or standalone, but sbt calls out to Ivy in a way that requires module descriptors and doesn't expose the setting to make them optional.

I have solved this problem in two ways: the first is a simple script that makes an ersatz Ivy repository from locally-installed packages, which can then be used by an sbt build. The second is a small patch to sbt that exposes the Ivy setting to make module descriptor files optional. (I use the former to build sbt binaries that include the latter.)


sbt is used to build itself, as well as some of its dependencies. Fedora has a policy for packaging projects that need to bootstrap in this way, and some other build tools (like rebar) also depend on libraries that are built with that tool. Because of how sbt uses a launcher and because of its dependency management, it is trickier to bootstrap in Fedora than other similar projects (since the initial sbt binary must run locally and must incorporate other Fedora-specific patches, like the module descriptor patch above and patches to work with versions of libraries that ship in Fedora).

Where from here?

Having sbt in Fedora would remove the biggest barrier to getting a lot of the Scala ecosystem into Fedora, and sbt is a really interesting framework in its own right. However, it's one of the projects where the mismatch between what Fedora requires of upstream projects and the assumptions that contemporary developers work under is particularly pronounced. These difficulties aren't insurmountable, although I found the way that they combine and interweave somewhat daunting when I started investigating sbt.

I'm envisioning an sbt package for Fedora that provides the best of both worlds: an unrestricted sbt environment for developers who want to use Fedora but have the flexibility to target development to other Scala versions (or to use libraries that are available in Ivy repositories but not in Fedora) and a Fedora-specific sbt script that builds software against system packages in a Fedora-friendly way, much like the xmvn and mvn-rpmbuild tools were for Maven. This way, Fedora packagers would have a straightforward way to generate high-quality RPMs from Scala sources and Scala hackers who just want Fedora to meet their needs today could use the system package without restrictions (while having a path to package their projects for Fedora in the future should they choose to).

I welcome feedback and collaboration from Scala hackers who'd like to use Fedora (or other downstream distributions with similar packaging constraints) and from Fedora hackers who'd like to see Fedora as a better place for Scala.

Installing Spark on Fedora 18


The Spark project is an actively-developed open-source engine for data analytics on clusters using Scala, Python, or Java. It offers map, filter, and reduce operations over in-memory collections, data from local files, or data taken from HDFS, but unlike standard map-reduce frameworks, it offers the opportunity to cache intermediate results across the cluster (and can thus offer orders-of-magnitude improvements over standard map-reduce when implementing iterative algorithms). I’ve been using it lately and have been really impressed. However — as with many cool projects in the “big data” space — the chain of dependencies to get a working installation can be daunting.

In this post, we’ll walk through setting up Spark to run on a stock Fedora 18 installation. We’ll also build the Mesos cluster manager so that we can run Spark jobs under Mesos, and we’ll build Hadoop with support for Mesos (so that we’ll have the option to run standard Hadoop MapReduce jobs under Mesos as well). By following these steps, you should be up and running with Spark quickly and painlessly.


About Chapeau

  • I work for Red Hat on distributed computing projects. I hold a PhD in computer sciences from the University of Wisconsin, where I mainly worked on program analysis and concurrency.
  • On this site, I write about topics related to things I'm working on now and things I've worked on in the past: distributed computing and programming languages. I don't speak for my employer, and any opinions on this site are mine alone.
