Probabilistic Structures for Scalable Computing

Published

January 26, 2018

In this talk you’ll learn about streaming algorithms and approximate data structures to characterize data sources that are too big to keep around or difficult to replay. We’ll start simple, with an algorithm for on-line mean and variance estimates of a stream of samples. Then we’ll look at Bloom filters (for approximate set membership), count-min sketch (for approximate member count in a multiset), and HyperLogLog (for approximate set cardinality). We’ll cover implementing these algorithms, using them for data analysis (and even machine learning), and provide some intuition for why they work at scale. Come with reading knowledge of Python and leave with some cool new options in your scalable data processing toolbox!

Note that the YouTube video for this talk is audio-only; the actual talk was delivered without slides due to projector malfunction.