I had a lot of fun presenting a tutorial at Strata Data NYC with my teammate Sophie Watson yesterday. In just over three hours, we covered a variety of hash-based data structures for answering interesting queries about large data sets or streams. These structures all have the following properties:
- they’re incremental, meaning that you can update a summary of a stream by adding a single observation to it,
- they’re parallel, meaning that you can combine a summary of A and a summary of B to get a summary of the combination of A and B, and
- they’re scalable, meaning that it’s possible to summarize an arbitrary number of observations in a fixed-size structure.
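
To make those three properties concrete, here is a minimal sketch of one hash-based structure, a Bloom filter, in Python. It is an illustration rather than code from the tutorial notebooks: the class name, the `num_bits` and `num_hashes` parameters, and the choice of salted SHA-256 as the hash family are all assumptions made for readability, not tuned for performance.

```python
import hashlib


class BloomFilter:
    """Illustrative fixed-size summary of set membership (hypothetical names)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit vector of num_bits bits

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        # Incremental: one observation updates the summary in place.
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        # May report false positives, but never false negatives.
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

    def merge(self, other):
        # Parallel: the summary of A combined with B is the bitwise OR of the two.
        if (self.num_bits, self.num_hashes) != (other.num_bits, other.num_hashes):
            raise ValueError("can only merge filters with identical parameters")
        merged = BloomFilter(self.num_bits, self.num_hashes)
        merged.bits = self.bits | other.bits
        return merged


# Summarize two partitions of a stream independently, then combine them.
left, right = BloomFilter(), BloomFilter()
for word in ["apple", "banana"]:
    left.add(word)
for word in ["cherry", "date"]:
    right.add(word)

combined = left.merge(right)
print("apple" in combined, "date" in combined)  # True True
```

The structure is scalable in the sense above because `bits` never grows past `num_bits` no matter how many items are added; the cost of that fixed size is a false-positive rate that rises as the filter fills up.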
I’ve been interested in these sorts of structures for a while, and it was great to have a chance to develop a tutorial covering the magic of hashing and some fun applications, like Sophie’s recent work on using MinHash for recommendation engines.
If you’re interested in the tutorial, you can run through our notebooks at your own pace.