As I mentioned earlier, I’ll be talking about feature engineering and outlier detection for infrastructure log data at Apache: Big Data next week. Consider this post a virtual handout for that talk. (I’ll also be presenting another talk on scalable log data analysis later this summer. That talk is also inspired by my recent work with logs but will focus on different parts of the problem, so stay tuned if you’re interested in the domain!)

Some general links:

  • You can download a PDF of my slide deck. I recognize that people often want to download slides, although I’d prefer you look at the rest of this post instead since my slides are not intended to stand alone without my presentation.
  • Check out my team’s Silex library, which is intended to extend the standard Spark library with high-quality, reusable components for real-world data science. The most recent release includes the self-organizing map implementation I mentioned in my talk.
  • Watch this short video presentation showing some of the feature engineering and dimensionality-reduction techniques I discussed in the talk.

The following blog posts provide a deeper dive into some of the topics I covered in the talk:

  • When I started using Spark and ElasticSearch, the upstream documentation was pretty sparse (it was especially confusing because it required some unidiomatic configuration steps). So I wrote up my experiences getting things working. This is an older post but may still be helpful.
  • If you’re interested in applying natural-language techniques to log data, you should consider your preprocessing pipeline. Here are the choices I made when I was evaluating word2vec on log messages.
  • Here’s a brief (and not-overly technical) overview of self-organizing maps, including static visual explanations and an animated demo.

  data science, logs, machine learning, spark, talks • You may reply to this post on Twitter or