Notes for my HTCondor Week talk

I’m delighted to have a chance to present at HTCondor Week this year and am looking forward to seeing some old friends and collaborators. The thesis of my talk is that HTCondor users who aren’t already leading data science initiatives are well-equipped to start doing so. The talk is brief and high-level, so here are a few quick links to learn more if you’re interested:

Contemporary data processing frameworks like Apache Spark and Apache Flink offer superior programmability, flexibility, and performance. Both projects have excellent documentation and vibrant user communities.
I’ve written regularly about Spark in particular, but the best place to start is probably my ApacheCon EU ’14 talk on Spark performance, which both introduces Spark and shows how to use its fundamental abstractions idiomatically and efficiently.

I also gave a quick overview of some of my team’s recent data science projects; visit these links to learn more:

Diagnosing open-source community health with Spark by William Benton,
Insights into Customer Behavior from Clickstream Data by RJ Nowling (also see the video),
Using a Relative Index of Performance (RIP) to Determine Optimum Configuration Settings Compared to Random Forest Assessment Using Spark by Diane Feddema,
Random Forest Clustering with Apache Spark by Erik Erlandson (see also Erik’s blog post), and
Analyzing endurance-sports activity data with Spark by William Benton.