Notes from Strata + Hadoop World 2014

I went to Strata + Hadoop World last week. This event targets a pretty broad audience and is an interesting mix of trade show, data science conference, and software conference. However, I’ve been impressed by the quality and relevance of the technical program both of the years that I’ve gone. The key to finding good talks at this kind of event is to target talks focusing on applications, visualization, and fundamental techniques, rather than ostensibly technical talks.¹ In the rest of this post, I’ll share some of my notes from the talks and tutorials I enjoyed the most.

D3.js

The first event I attended was Sebastian Gutierrez’s D3 tutorial. (Gutierrez’s slides and sample code are available online.) A great deal of this session was focused on introducing JavaScript and manipulating the DOM with D3, which was great for me; I’ve used D3 before but my knowledge of web frontend development practice is spotty and ad-hoc at best.

This was a great tutorial: the pace, scope, content, and presentation were all really excellent. It provided enough context to make sense of D3’s substantial capabilities and left me feeling like I could dive in to using D3 for some interesting projects. Finally, Gutierrez provided lots of links to other resources to learn more and experiment, both things I’d seen, like bl.ocks.org, and things that were new to me, like tributary.io and JSFiddle.

The standalone D3 tutorial on DashingD3JS.com seems like an excellent way to play along at home.

Engineering Pipelines for Learning at Scale

I attended the “Hardcore Data Science” track in the afternoon of the tutorial day; there were several interesting talks but I wanted to call out in particular Ben Recht’s talk, which mentioned BDAS in the title but was really about generic problems in developing large-scale systems that use machine learning.

Recht’s talk began with the well-known but oft-ignored maxim that machine learning is easy once you’ve turned your analysis problem into an optimization problem. I say “oft-ignored” here not because I believe practitioners are getting hung up on the easy part of the problem, but because it seems like raw modeling performance is often the focus of marketing for open-source and commercial projects alike.² The interesting engineering challenges are typically in the processing pipeline from raw data to an optimizable model: acquiring data, normalizing and pruning values, and selecting features all must take place before modeling and classification. Learning problems often have this basic structure; Recht provided examples of these pipeline stages in object recognition and text processing domains.

The main motivating example was Recht’s attempt to replicate the Oregon State Digital Scout project, which analyzed video footage of American football, and answered questions including what plays were likely from given offensive formations. So Recht’s team set out to stich together video footage into a panorama, translate from video coordinates to inferred field coordinates, and track player positions and movements. This preprocessing code, which they had implemented with standard C++ and the OpenCV library, took ten hours to analyze video of a four-second NFL play. Expert programmers could surely have improved the runtime of this code substantially, but it seems like we should be able to support software engineering of machine-learning applications without having teams of experts working on each stage of the learning pipeline.

The talk concluded by presenting some ideas for work that would support software engineering for learning pipelines. These were more speculative but generally focused on using programming language technology to make it easier to develop, compose, and reason about learning pipelines.

Keynotes

Since the talks from the plenary sessions are all fairly brief (and freely available as videos), I’ll simply call out a couple that were especially enjoyable.

John Rauser from Pinterest gave a great talk on using simulation techniques (in particular random permutation) to identify statistical significance in a way that’s more intuitive for many of us than the well-known (but poorly-understood) a priori methods for doing the same.
Bob Mankoff’s talk about the New Yorker Caption Contest was enormously entertaining and had some really interesting insights about how people receive, analyze, and rank humor.

Domain-Specific Languages for Data Transformation

Joe Hellerstein and Sean Kandel of Trifacta gave a talk on domain-specific languages for data transformation in general and on their Wrangle system in particular. (Trifacta is also responsible for the extremely cool Vega project.) The talk led with three rules for successful data wrangling, which I’ll paraphrase here:

your processes, not your data, are most important;
reusable scripts to process data are more valuable than postprocessed and cleaned data; and
“agile” models in which preprocessing is refined iteratively in conjunction with downstream analysis of the processed data are preferable to having preprocessing take place as an isolated phase.³

The first act of the talk also referenced this quotation from Alfred North Whitehead, which set the tone for a survey of DSLs for data transformation and query construction:

By relieving the brain of all unnecessary work, a good notation sets it free to concentrate on more advanced problems…

In the 1990s, several data processing DSLs emerged that were based on second-order logic, proximity joins, and composable transformations.⁴ Hellerstein argued that, since these underlying formalisms weren’t any simpler to understand than the relational model, the DSLs weren’t particularly successful at simplifying data transformation for end-users.

Raman and Hellerstein’s Potter’s Wheel system presented a visual interface to SQL for data cleaning; it could automatically detect bad data and infer structure domains but wasn’t particularly easier to use than SQL: putting a visual, menu-driven interface over a language doesn’t generally lower the cognitive load of using that language.⁵

The Wrangler system (which forms the basis for the Wrangle DSL in Trifacta’s product) attacks the problem from a different angle: data cleaning operations are suggested based on explicit user guidance, semantic information about types and domains, and inferences of user intent based on prior actions. The entire problem of suggesting queries and transformations is thus expressible as an optimization problem over the space of DSL clauses. With the interface to the Wrangle DSL, the user highlights features in individual rows of the data to appear in the transformed data or in a visualization, and the system presents several suggested queries that it had inferred. The Wrangle system also infers regular expressions for highlighted text transformations, but since regular expressions – and especially machine-generated ones – are often painful to read, it translates these to a more readable (but less expressive) representation before presenting them to the user.

Kandel gave a very impressive live demo of Wrangle on a couple of data sets, including techniques for sensibly treating dirty or apparently-nonsensical rows (in this example, the confounding data included negative contribution amounts from election finance data, which represented refunds or revised accounting). This is excellent work at the intersection of databases, programming languages, machine learning, and human-computer interaction – which is a space that I suspect a lot of future contributions to data processing will occupy.

Clustering for Anomaly Detection

Sean Owen’s talk on using Spark and k-means clustering for anomaly detection was absolutely packed. Whether this was more due to the overwhelming popularity of talks with “Spark” in the title or because Sean is widely known as a sharp guy and great presenter, I’m not sure, but it was an excellent talk and was about more than just introducing clustering in Spark. It was really a general discussion of how to go from raw data to a clustering problem in a way that would give you the most useful results – Spark was essential to make the presentation of ETL and clustering code simple enough to fit on a slide, but the techniques were generally applicable.

The running example in this talk involved system log data from a supervised learning competition. Some of the log data was generated by normal activity and some of it was generated by activity associated with various security exploits. The original data were labeled (whether as “normal” or by the name of the associated exploit), since they were intended to evaluate supervised learning techniques. However, Owen pointed out that we could identify anomalies by clustering points and looking at ones that were far away from any cluster center. Since k-means clustering is unsupervised, one of Owen’s first steps was to remove the labels.

With code to generate an RDD of unlabeled records, Owen then walked through the steps he might use if he were using clustering to characterize these data:

The first question involves choosing a suitable k. Since Owen knew that there were 23 different kinds of records (those generated by normal traffic as well as by 22 attack types), it seems likely that there would be at least 23 clusters. A naïve approach to choosing a cluster count (viz., choosing a number that minimizes the distance from each point to its nearest centroid) falls short in a couple of ways. First, since cluster centers are originally picked randomly, the results of finding k clusters may not be deterministic. (This is easy to solve by ensuring that the algorithm runs for a large number of iterations and has a small distance threshold for convergence.) More importantly, though, finding n clusters, where n is the size of the population, would be optimal under this metric, but it would tell us nothing.
The vectors in the raw data describe 42 features but the distance between them is dominated by two values: bytes read and bytes written, which are relatively large integer values (as opposed to many of the features, which are booleans encoded as 0 or 1). Normalizing each feature by its z-score made each feature contribute equally to distance.
Using entropy (or normalized mutual information) to evaluate the clustering can be a better bet since fitness by entropy won’t increase until k = n as fitness by Euclidean distance will. Another option is restoring the labels from the original data and verifying that clusters typically have homogeneous labels.

Summary

Many of the talks I found interesting featured at least some of the following common themes: using programming-language technology, improving programmer productivity, or focusing on the pipeline from raw data to clean feature vectors. I especially appreciated Sean Owen’s talk for showing how to refine clustering over real data and demonstrating substantial improvements by a series of simple changes. As Recht reminds us, the optimization problems are easy to solve; it’s great to see more explicit focus on making the process that turns analysis problems into optimization problems as easy as possible.

Footnotes

Alas, talks with technical-sounding abstracts often wind up having a nontrivial marketing component!↩︎
This is related to the reason why “data lakes” have turned out to be not quite as useful as we might have hoped: a place to store vast amounts of sparse, noisy, and irregular data and an efficient way to train models from these data are necessary but not sufficient conditions for actually answering questions!↩︎
Hellerstein pointed out that the alternative to his recommendation is really the waterfall model (seriously, click that link), which makes it all the more incredible that anyone follows it for data processing. Who would willingly admit to using a practice initially defined as a strawman?↩︎
SchemaSQL and AJAX were provided as examples.↩︎
I was first confronted with the false promise of visual languages in the 1980s. I had saved my meager kid earnings for weeks and gone to HT Electronics in Sunnyvale to purchase the AiRT system for the Amiga. AiRT was a visual programming language that turned out to be neither more expressive nor easier to use than any of the text-based languages I was already familiar with. With the exception of this transcript of a scathing review in Amazing Computing magazine (“…most of [its] potential is not realized in the current implementation of AiRT, and I cannot recommend it for any serious work.”), it is now apparently lost to history.↩︎