I’m giving a talk this afternoon at Spark Summit EU on extending Spark with new machine learning algorithms. Here are some additional resources and links:

  • Our team’s Silex library is where I’ve published my ongoing work to develop a self-organizing map implementation for Spark and to extend it with support for data frames and ML pipelines
  • I gave a talk about using self-organizing maps in Spark last year at Spark Summit
  • If you like the idea of developing new ML techniques on Spark, you’ll also want to attend a session tomorrow in which my friend and teammate Erik Erlandson will be talking about using his parallel t-digest implementation to support feature importance and other applications.
  • Finally, if you’re doing anything where parallelism and scale matter, especially in a cloud-native environment, you should also check out Mike McCune’s talk on Spark monitoring and metrics.

I’m speaking this morning at the OpenShift Commons Gathering about my team’s experience running Apache Spark on Kubernetes and OpenShift. Here are some links to learn more:

I’ll be speaking about Spark on Kubernetes at Spark Summit EU this week. The main thesis of my talk is that the old way of running Spark in a dedicated cluster that is shared between applications makes sense when analytics is a separate workload. However, analytics is no longer a separate workload – instead, analytics is now an essential part of long-running data-driven applications. This realization motivated my team to switch from a shared Spark cluster to multiple logical clusters that are co-scheduled with the applications that depend on them.

I’m glad for the opportunity to get together with the Spark community and present on some of the cool work my team has done lately. Here are some links you can visit to learn more about our work and other topics related to running Spark on Kubernetes and OpenShift:

I’m delighted to have a chance to present at HTCondor Week this year and am looking forward to seeing some old friends and collaborators. The thesis of my talk is that HTCondor users who aren’t already leading data science initiatives are well-equipped to start doing so. The talk is brief and high-level, so here are a few quick links to learn more if you’re interested:

I also gave a quick overview of some of my team’s recent data science projects; visit these links to learn more:

As I mentioned earlier, I’ll be talking about feature engineering and outlier detection for infrastructure log data at Apache: Big Data next week. Consider this post a virtual handout for that talk. (I’ll also be presenting another talk on scalable log data analysis later this summer. That talk is also inspired by my recent work with logs but will focus on different parts of the problem, so stay tuned if you’re interested in the domain!)

Some general links:

  • You can download a PDF of my slide deck. I recognize that people often want to download slides, although I’d prefer you look at the rest of this post instead since my slides are not intended to stand alone without my presentation.
  • Check out my team’s Silex library, which is intended to extend the standard Spark library with high-quality, reusable components for real-world data science. The most recent release includes the self-organizing map implementation I mentioned in my talk.
  • Watch this short video presentation showing some of the feature engineering and dimensionality-reduction techniques I discussed in the talk.

The following blog posts provide a deeper dive into some of the topics I covered in the talk:

  • When I started using Spark and ElasticSearch, the upstream documentation was pretty sparse (it was especially confusing because it required some unidiomatic configuration steps). So I wrote up my experiences getting things working. This is an older post but may still be helpful.
  • If you’re interested in applying natural-language techniques to log data, you should consider your preprocessing pipeline. Here are the choices I made when I was evaluating word2vec on log messages.
  • Here’s a brief (and not-overly technical) overview of self-organizing maps, including static visual explanations and an animated demo.

If you’ll be at Apache: Big Data next week, you should definitely check out some talks from my teammates in Red Hat’s Emerging Technology group and our colleague Suneel Marthi from the CTO office:

Unfortunately, my talk is at the same time as Suneel’s, so I won’t be able to attend his, but these are all great talks and you should be sure to put as many as possible on your schedule if you’ll be in Vancouver!

Self-organizing maps are a useful technique for identifying structure in high-dimensional data sets. The map itself is a low-dimensional arrangement of cells, where each cell is an object comparable to the objects in the training set. The goal of self-organizing map training is to arrange a grid of cells so that nearby cells will be the best matches for similar objects. Once we’ve built up the map, we can identify clusters of similar objects (based on the cells that they map to) and even detect outliers (based on the distributions of map quality).

Here are a few snapshots of the training process on color data, which I developed as a test for a parallel implementation of self-organizing maps in Apache Spark. For this demo, I used angular similarity in the RGB color space (not Euclidean distance) as a measure of color similarity. This means that, for example, a darker color would be considered similar to a lighter color with a similar hue.

We start with a random map:

Matches made in the first training iteration essentially affect the whole map, producing a blurred, unsaturated, undifferentiated map:

Some structure begins to emerge pretty rapidly, though; after one quarter of our training iterations, we can already see clear clusters of colors:

The map begins to get more and more saturated as similar colors are grouped together. Here’s what it looks like after half of the training iterations:

…and three-quarters of the training iterations:

As training proceeds, it gradually affects smaller and smaller neighborhoods of the map until the very end, when each training match only affects a single cell (and thus the impact of darker colors becomes apparent, since they can cluster together in single cells that are not the best matching unit for any brighter colors):

In a future post, I’ll cover the training algorithm, introduce the code, and provide some tips for implementing similar techniques in Spark. For now, though, here is a demo video that shows an animation of the whole map training process:

Lately, I’ve been experimenting with Spark’s implementation of word2vec. Since most of the natural-language data I have sitting around these days are service and system logs from machines at work, I thought it would be fun to see how well word2vec worked if we trained it on the text of log messages. This is obviously pretty far from an ideal training corpus, but these brief, rich messages seem like they should have some minable content. In the rest of this post, I’ll show some interesting results from the model and also describe some concrete preprocessing steps to get more useful results for extracting words from the odd dialect of natural language that appears in log messages.


word2vec is a family of techniques for encoding words as relatively low-dimensional vectors that capture interesting semantic information. That is, words that are synonyms are likely to have vectors that are similar (by cosine similarity). Another really neat aspect of this encoding is that linear transformations of these vectors can expose semantic information like analogies: for example, given a model trained on news articles, adding the vectors for “Madrid” and “France” and subtracting the vector for “Spain” results in a vector very close to that for “Paris.”

Spark’s implementation of word2vec uses skip-grams, so the training objective is to produce a model that, given a word, predicts the context in which it is likely to appear.


Like the original implementation of word2vec, Spark’s implementation uses a window of ±5 surrounding words (this is not user-configurable) and defaults to discarding all words that appear fewer than 5 times (this threshold is user-configurable). Both of these assumptions seem sane for the sort of training “sentences” that appear in log messages, but they won’t be sufficient.

Spark doesn’t provide a lot of tools for tokenizing and preprocessing natural-language text.1 Simple string splitting is as ubiquitous in trivial language processing examples just as it is in trivial word count examples, but it’s not going to give us the best results. Fortunately, there are some minimal steps we can take to start getting useful tokens out of log messages. We’ll look at these steps and what see what motivates them now.

What is a word?

Let’s first consider what kinds of tokens might be interesting for analyzing the content of log messages. At the very least, we might care about:

  1. dictionary words,
  2. trademarks (which may or may not be dictionary words),
  3. technical jargon terms (which may or may not be dictionary words),
  4. service names (which may or may not be dictionary words),
  5. symbolic constant names (e.g., ENOENT and OPEN_MAX),
  6. pathnames (e.g., /dev/null), and
  7. programming-language identifiers (e.g., OutOfMemoryError and Kernel::exec).

For this application, we’re less interested in the following kinds of tokens, although it is possible to imagine other applications in which they might be important:

  1. hostnames,
  2. IPv4 and IPv6 addresses,
  3. MAC addresses,
  4. dates and times, and
  5. hex hash digests.

Preprocessing steps

If we’re going to convert sequences of lines to sequences of sequences of tokens, we’ll eventually be splitting strings. Before we split, we’ll collapse all runs of whitespace into single spaces so that we get more useful results when we do split. This isn’t strictly necessary – we could elect to split on runs of whitespace instead of single whitespace characters, or we could filter out empty strings from word sequences before training on them. But this makes for cleaner input and it makes the subsequent transformations a little simpler.

Here’s Scala code to collapse runs of whitespace into a single space:

def replace(r: scala.util.matching.Regex, s: String) = { (orig:String) => r.replaceAllIn(orig, s) }
val collapseSpaces = replace(new scala.util.matching.Regex("[\\s]+"), " ")

The next thing we’ll want to do is eliminate all punctuation from the ends of each word. An appropriate definition of “punctuation” will depend on the sorts of tokens we wind up deciding are interesting, but I considered punctuation characters to be anything except:

  1. alphanumeric characters,
  2. dashes, and
  3. underscores.

Whether or not we want to retain intratoken punctuation depends on the application; there are good arguments to be made for retaining colons and periods (MAC addresses, programming-language identifiers in stack traces, hostnames, etc.), slashes (paths), at-signs (email addresses), and other marks as well. I’ll be retaining these marks but stripping all others. After these transformations, we can split on whitespace and get a relatively sensible set of tokens.

Here’s Scala code to strip punctuation from lines:

val rejectedIntratokenPunctuation = new scala.util.matching.Regex("[^A-Za-z0-9-_./:@]")
val leadingPunctuation = new scala.util.matching.Regex("(\\s)[^\\sA-Za-z0-9-_/]+|()^[^\\sA-Za-z0-9-_/]+")
val trailingPunctuation = new scala.util.matching.Regex("[^\\sA-Za-z0-9-_/]+(\\s)|()[^\\sA-Za-z0-9-_/]+$")

val stripPunctuation: String => String = 
  replace(leadingPunctuation, "$1") compose 
  replace(trailingPunctuation, "$1") compose 
  replace(rejectedIntratokenPunctuation, "")

In order to filter out strings of numbers, we’ll reject all tokens that don’t contain at least one letter. (We could be stricter and reject all tokens that don’t contain at least one letter that isn’t a hex digit, but I decided to be permissive in order to avoid rejecting interesting words that only contain letters A-F.)

val oneletter = new scala.util.matching.Regex(".*([A-Za-z]).*")

Here’s what our preprocessing pipeline looks like, assuming an RDD of log messages called messages:

def tokens(s: String, post: String=>String = identity[String]): Seq[String] = 
    .split(" ")
    .map(s => post(stripPunctuation(s)))
    .collect { case token @ oneletter(_) => token } 

val tokenSeqs = messages.map(line => tokens(line))

Now we have a sequence of words for each log message and are ready to train a word2vec model.

import org.apache.spark.mllib.feature.Word2Vec
val w2v = new Word2Vec

val model = w2v.fit(tokenSeqs)

Note that there are a few things we could be doing in our preprocessing pipeline but aren’t, like using a whitelist (for dictionary words or service names), or rejecting stopwords. This approach is pretty basic, but it produces some interesting results in any case.

Results and conclusions

I evaluated the model by using it to find synonyms for (more or less) arbitrary words that appeared in log messages. Recall that word2vec basically models words by the contexts in which they might appear; informally, synonyms are thus words with similar contexts.

  • The top synonyms for nova (the OpenStack compute service) included vm, glance, containers, instances, and images – all of these are related to running OpenStack compute jobs.
  • The top synonyms for volume included update, cinder.scheduler.host_manager, and several UUIDs for actual volumes.
  • The top synonyms for tmpfs included type, dev, uses, initialized, and transition.
  • The top synonyms for sh included /usr/bin/bash, _AUDIT_SESSION, NetworkManager, _SYSTEMD_SESSION, postfixqmgr.
  • The top synonyms for password included publickey, Accepted, opened, IPMI, and the name of an internal project.

These results aren’t earth-shattering – indeed, they won’t even tell you where to get a decent burrito – but they’re interesting, they’re sensible, and they point to the effectiveness of word2vec even given a limited, unidiomatic corpus full of odd word-like tokens. Of course, one can imagine ways to make our preprocessing more robust. Similarly, there are certainly other ways to generate a training corpus for these words: perhaps using the set of all messages for a particular service and severity as a training sentence, using the documentation for the services involved, using format strings present in the source or binaries for the services themselves, or some combination of these.

Semantic modeling of terms in log messages is obviously useful for log analytics: it can be used as part of a pipeline to classify related log messages by topic, in feature engineering for anomaly detection, and to suggest alternate search terms for interactive queries. However, it is a pleasant surprise that we can train a competent word2vec model for understanding log messages from the uncharacteristic utterances that comprise log messages themselves.

  1. Spark does provide a stopword filter for English and there are external libraries to fill in some of its language-processing gaps. In particular, I’ve had good luck with the Porter stemmer implementation from Chalk