Finding top bicycling efforts with Spark

Tags: spark, bicycling, mllib

Published May 27, 2014

In an earlier post, I showed how I had used Apache Spark to cluster points from GPS traces of bike rides and plot the convex hull of each cluster, coloring each hull based on the best power outputs I’d produced at given durations starting from within that cluster. This visualization had some interesting results and pointed me to some roads that I hadn’t thought of for interval workouts, but it was a little coarser than one might like.

To get a better picture of exactly where I should be working out, I set out to plot the actual road segments that I’d set my power bests on, but the naïve approach was completely unsatisfactory. Because I regularly do repeats on a local hill-climb time trial course, I often produce power for three to four minutes on this hill that is good enough to beat many of my other three-minute efforts (which happen during rides in which I am not working on those sorts of efforts). As a consequence, I might have thirty to sixty overlapping three-minute sample windows from the same climb, many of which could represent top-20 or top-50 three-minute efforts. When I plotted my top 20 three-minute efforts, the result looked like this:

Twin valley

This visualization is correct but trivial. I already know that I can produce a lot of power for three to four minutes on this hill; that’s one of the reasons that I ride it so often. I changed my analysis to only consider the best output from several temporally-overlapping windows:

Non overlapping

This represented an improvement over just taking the best efforts overall, since it wasn’t merely a plot of the best windows of my many climbs up the Twin Valley hill. It also showed something interesting that was abstracted away in the cluster-based analysis: some strong three-minute efforts began in the same cluster but involved different roads. In particular, the inverted “T” shape at the bottom of the map excerpt shows two different route fragments beginning on the same hill, both of which lend themselves well to high three-minute average power. However, merely excluding windows that temporally overlapped still wound up plotting many (just not quite as many) trips up Twin Valley from separate climbs.
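The temporal-overlap filter described above can be sketched in plain Scala, with no Spark required. This is a minimal illustration rather than the code behind the plots: the `Window` type, the function names, and the assumption of one power sample per second are all hypothetical. It computes the average power of every fixed-length window in a ride, then greedily keeps the strongest windows that do not overlap in time with one already kept.

```scala
object EffortWindows {
  // Hypothetical type: one fixed-duration window over a single activity's power samples.
  // `start` and `end` are sample offsets (seconds, assuming 1 Hz data); `end` is exclusive.
  final case class Window(activity: String, start: Int, end: Int, avgWatts: Double) {
    // Two windows overlap only if they come from the same activity
    // and their half-open sample ranges intersect.
    def overlaps(other: Window): Boolean =
      activity == other.activity && start < other.end && other.start < end
  }

  // Average power of every `duration`-sample window in one activity,
  // computed from prefix sums so each window costs O(1).
  def windows(activity: String, watts: Vector[Double], duration: Int): Vector[Window] = {
    require(duration > 0, "duration must be positive")
    if (watts.length < duration) Vector.empty
    else {
      val prefix = watts.scanLeft(0.0)(_ + _)
      (0 to watts.length - duration).map { i =>
        Window(activity, i, i + duration, (prefix(i + duration) - prefix(i)) / duration)
      }.toVector
    }
  }

  // Greedy selection: visit windows from strongest to weakest and keep one
  // only if it does not overlap any window that has already been kept.
  def bestNonOverlapping(candidates: Seq[Window], n: Int): Vector[Window] =
    candidates.sortBy(-_.avgWatts).foldLeft(Vector.empty[Window]) { (kept, w) =>
      if (kept.length < n && !kept.exists(_.overlaps(w))) kept :+ w else kept
    }
}
```

Calling `EffortWindows.bestNonOverlapping(EffortWindows.windows("ride", watts, 180), 20)` would give up to twenty three-minute windows from one ride, none of which share a sample. Windows from separate trips up the same hill never overlap in time, which is why this filter alone still plots many Twin Valley climbs.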

My next approach was to combine the clustering information with a best-effort-path visualization. Instead of considering only the best effort starting in a given cluster (as the hull visualization from the previous post did) or considering only non-temporally-overlapping efforts (but potentially considering multiple spatially-overlapping efforts), I considered only the best efforts starting and ending in a given pair of clusters. This is obviously far more straightforward than identifying which roads were involved or applying similarity metrics to efforts (as Strava does to match efforts to segments), but it manages to eliminate many (but not all) overlapping efforts while being fine-grained enough to not exclude nearby but non-overlapping efforts.
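As a rough sketch of the cluster-pair idea, assuming each effort has already been labeled with the cluster IDs of its first and last GPS points (the `Effort` type and its field names are hypothetical, not taken from the project's code), keeping the best effort per pair is a group-and-maximize over a composite key:

```scala
object ClusterPairs {
  // Hypothetical type: an effort tagged with the clusters containing its
  // starting and ending points, plus its average power.
  final case class Effort(startCluster: Int, endCluster: Int, avgWatts: Double)

  // Keep only the strongest effort for each (start cluster, end cluster) pair.
  def bestPerClusterPair(efforts: Seq[Effort]): Map[(Int, Int), Effort] =
    efforts
      .groupBy(e => (e.startCluster, e.endCluster))
      .map { case (pair, group) => pair -> group.maxBy(_.avgWatts) }
}
```

On a Spark RDD, the same shape could plausibly be expressed by keying each effort on its cluster pair with `keyBy` and combining with `reduceByKey`, which avoids collecting each group in memory. Two efforts that start in the same cluster but end in different ones land under different keys, so both survive.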

Here are the results from plotting my best one-, three-, and ten-minute efforts, considering only the best effort starting and ending in a given pair of clusters; as usual, you can click and drag or zoom to inspect the map:

On this map, each path is labeled with the average wattage for the effort (click on a path to see its label). This showed me a few surprising things:

GDVC criterium laps

If you’re interested in trying this visualization out on your own data, download the code, put a bunch of TCX files with wattage and GPS coordinates in a directory called activities, and fire up sbt console. Then you can run the application from the Scala REPL. Here are the commands I used to generate the above plot:

Finally, I’ll be talking about my work with Spark for bicycling data at Spark Summit this year; if you’re interested in this stuff and will be there, find me and we can chat!