I’m speaking this morning at the OpenShift Commons Gathering about my team’s experience running Apache Spark on Kubernetes and OpenShift. Here are some links to learn more:

I’ll be speaking about Spark on Kubernetes at Spark Summit EU this week. The main thesis of my talk is that the old way of running Spark in a dedicated cluster that is shared between applications makes sense when analytics is a separate workload. However, analytics is no longer a separate workload — instead, analytics is now an essential part of long-running data-driven applications. This realization motivated my team to switch from a shared Spark cluster to multiple logical clusters that are co-scheduled with the applications that depend on them.

I’m glad for the opportunity to get together with the Spark community and present on some of the cool work my team has done lately. Here are some links you can visit to learn more about our work and other topics related to running Spark on Kubernetes and OpenShift:

I’m delighted to have a chance to present at HTCondor Week this year and am looking forward to seeing some old friends and collaborators. The thesis of my talk is that HTCondor users who aren’t already leading data science initiatives are well-equipped to start doing so. The talk is brief and high-level, so here are a few quick links to learn more if you’re interested:

I also gave a quick overview of some of my team’s recent data science projects; visit these links to learn more:

As I mentioned earlier, I’ll be talking about feature engineering and outlier detection for infrastructure log data at Apache: Big Data next week. Consider this post a virtual handout for that talk. (I’ll also be presenting another talk on scalable log data analysis later this summer. That talk is also inspired by my recent work with logs but will focus on different parts of the problem, so stay tuned if you’re interested in the domain!)

Some general links:

  • You can download a PDF of my slide deck. I recognize that people often want to download slides, although I’d prefer you look at the rest of this post instead since my slides are not intended to stand alone without my presentation.
  • Check out my team’s Silex library, which is intended to extend the standard Spark library with high-quality, reusable components for real-world data science. The most recent release includes the self-organizing map implementation I mentioned in my talk.
  • Watch this short video presentation showing some of the feature engineering and dimensionality-reduction techniques I discussed in the talk.

The following blog posts provide a deeper dive into some of the topics I covered in the talk:

  • When I started using Spark and ElasticSearch, the upstream documentation was pretty sparse (it was especially confusing because it required some unidiomatic configuration steps). So I wrote up my experiences getting things working. This is an older post but may still be helpful.
  • If you’re interested in applying natural-language techniques to log data, you should consider your preprocessing pipeline. Here are the choices I made when I was evaluating word2vec on log messages.
  • Here’s a brief (and not-overly technical) overview of self-organizing maps, including static visual explanations and an animated demo.

If you’ll be at Apache: Big Data next week, you should definitely check out some talks from my teammates in Red Hat’s Emerging Technology group and our colleague Suneel Marthi from the CTO office:

Unfortunately, my talk is at the same time as Suneel’s, so I won’t be able to attend his, but these are all great talks and you should be sure to put as many as possible on your schedule if you’ll be in Vancouver!

Self-organizing maps are a useful technique for identifying structure in high-dimensional data sets. The map itself is a low-dimensional arrangement of cells, where each cell is an object comparable to the objects in the training set. The goal of self-organizing map training is to arrange a grid of cells so that nearby cells will be the best matches for similar objects. Once we’ve built up the map, we can identify clusters of similar objects (based on the cells that they map to) and even detect outliers (based on the distributions of map quality).

Here are a few snapshots of the training process on color data, which I developed as a test for a parallel implementation of self-organizing maps in Apache Spark. For this demo, I used angular similarity in the RGB color space (not Euclidean distance) as a measure of color similarity. This means that, for example, a darker color would be considered similar to a lighter color with a similar hue.

We start with a random map:

Matches made in the first training iteration essentially affect the whole map, producing a blurred, unsaturated, undifferentiated map:

Some structure begins to emerge pretty rapidly, though; after one quarter of our training iterations, we can already see clear clusters of colors:

The map begins to get more and more saturated as similar colors are grouped together. Here’s what it looks like after half of the training iterations:

…and three-quarters of the training iterations:

As training proceeds, it gradually affects smaller and smaller neighborhoods of the map until the very end, when each training match only affects a single cell (and thus the impact of darker colors becomes apparent, since they can cluster together in single cells that are not the best matching unit for any brighter colors):

In a future post, I’ll cover the training algorithm, introduce the code, and provide some tips for implementing similar techniques in Spark. For now, though, here is a demo video that shows an animation of the whole map training process:

Lately, I’ve been experimenting with Spark’s implementation of word2vec. Since most of the natural-language data I have sitting around these days are service and system logs from machines at work, I thought it would be fun to see how well word2vec worked if we trained it on the text of log messages. This is obviously pretty far from an ideal training corpus, but these brief, rich messages seem like they should have some minable content. In the rest of this post, I’ll show some interesting results from the model and also describe some concrete preprocessing steps to get more useful results for extracting words from the odd dialect of natural language that appears in log messages.

Background

word2vec is a family of techniques for encoding words as relatively low-dimensional vectors that capture interesting semantic information. That is, words that are synonyms are likely to have vectors that are similar (by cosine similarity). Another really neat aspect of this encoding is that linear transformations of these vectors can expose semantic information like analogies: for example, given a model trained on news articles, adding the vectors for “Madrid” and “France” and subtracting the vector for “Spain” results in a vector very close to that for “Paris.”

Spark’s implementation of word2vec uses skip-grams, so the training objective is to produce a model that, given a word, predicts the context in which it is likely to appear.

Preliminaries

Like the original implementation of word2vec, Spark’s implementation uses a window of ±5 surrounding words (this is not user-configurable) and defaults to discarding all words that appear fewer than 5 times (this threshold is user-configurable). Both of these assumptions seem sane for the sort of training “sentences” that appear in log messages, but they won’t be sufficient.

Spark doesn’t provide a lot of tools for tokenizing and preprocessing natural-language text.1 Simple string splitting is as ubiquitous in trivial language processing examples just as it is in trivial word count examples, but it’s not going to give us the best results. Fortunately, there are some minimal steps we can take to start getting useful tokens out of log messages. We’ll look at these steps and what see what motivates them now.

What is a word?

Let’s first consider what kinds of tokens might be interesting for analyzing the content of log messages. At the very least, we might care about:

  1. dictionary words,
  2. trademarks (which may or may not be dictionary words),
  3. technical jargon terms (which may or may not be dictionary words),
  4. service names (which may or may not be dictionary words),
  5. symbolic constant names (e.g., ENOENT and OPEN_MAX),
  6. pathnames (e.g., /dev/null), and
  7. programming-language identifiers (e.g., OutOfMemoryError and Kernel::exec).

For this application, we’re less interested in the following kinds of tokens, although it is possible to imagine other applications in which they might be important:

  1. hostnames,
  2. IPv4 and IPv6 addresses,
  3. MAC addresses,
  4. dates and times, and
  5. hex hash digests.

Preprocessing steps

If we’re going to convert sequences of lines to sequences of sequences of tokens, we’ll eventually be splitting strings. Before we split, we’ll collapse all runs of whitespace into single spaces so that we get more useful results when we do split. This isn’t strictly necessary — we could elect to split on runs of whitespace instead of single whitespace characters, or we could filter out empty strings from word sequences before training on them. But this makes for cleaner input and it makes the subsequent transformations a little simpler.

Here’s Scala code to collapse runs of whitespace into a single space:

1
2
def replace(r: scala.util.matching.Regex, s: String) = { (orig:String) => r.replaceAllIn(orig, s) }
val collapseSpaces = replace(new scala.util.matching.Regex("[\\s]+"), " ")

The next thing we’ll want to do is eliminate all punctuation from the ends of each word. An appropriate definition of “punctuation” will depend on the sorts of tokens we wind up deciding are interesting, but I considered punctuation characters to be anything except:

  1. alphanumeric characters,
  2. dashes, and
  3. underscores.

Whether or not we want to retain intratoken punctuation depends on the application; there are good arguments to be made for retaining colons and periods (MAC addresses, programming-language identifiers in stack traces, hostnames, etc.), slashes (paths), at-signs (email addresses), and other marks as well. I’ll be retaining these marks but stripping all others. After these transformations, we can split on whitespace and get a relatively sensible set of tokens.

Here’s Scala code to strip punctuation from lines:

1
2
3
4
5
6
7
8
val rejectedIntratokenPunctuation = new scala.util.matching.Regex("[^A-Za-z0-9-_./:@]")
val leadingPunctuation = new scala.util.matching.Regex("(\\s)[^\\sA-Za-z0-9-_/]+|()^[^\\sA-Za-z0-9-_/]+")
val trailingPunctuation = new scala.util.matching.Regex("[^\\sA-Za-z0-9-_/]+(\\s)|()[^\\sA-Za-z0-9-_/]+$")

val stripPunctuation: String => String =
  replace(leadingPunctuation, "$1") compose
  replace(trailingPunctuation, "$1") compose
  replace(rejectedIntratokenPunctuation, "")

In order to filter out strings of numbers, we’ll reject all tokens that don’t contain at least one letter. (We could be stricter and reject all tokens that don’t contain at least one letter that isn’t a hex digit, but I decided to be permissive in order to avoid rejecting interesting words that only contain letters A-F.)

1
val oneletter = new scala.util.matching.Regex(".*([A-Za-z]).*")

Here’s what our preprocessing pipeline looks like, assuming an RDD of log messages called messages:

1
2
3
4
5
6
7
def tokens(s: String, post: String=>String = identity[String]): Seq[String] =
  collapseWhitespace(s)
    .split(" ")
    .map(s => post(stripPunctuation(s)))
    .collect { case token @ oneletter(_) => token }

val tokenSeqs = messages.map(line => tokens(line))

Now we have a sequence of words for each log message and are ready to train a word2vec model.

1
2
3
4
import org.apache.spark.mllib.feature.Word2Vec
val w2v = new Word2Vec

val model = w2v.fit(tokenSeqs)

Note that there are a few things we could be doing in our preprocessing pipeline but aren’t, like using a whitelist (for dictionary words or service names), or rejecting stopwords. This approach is pretty basic, but it produces some interesting results in any case.

Results and conclusions

I evaluated the model by using it to find synonyms for (more or less) arbitrary words that appeared in log messages. Recall that word2vec basically models words by the contexts in which they might appear; informally, synonyms are thus words with similar contexts.

  • The top synonyms for nova (the OpenStack compute service) included vm, glance, containers, instances, and images — all of these are related to running OpenStack compute jobs.
  • The top synonyms for volume included update, cinder.scheduler.host_manager, and several UUIDs for actual volumes.
  • The top synonyms for tmpfs included type, dev, uses, initialized, and transition.
  • The top synonyms for sh included /usr/bin/bash, _AUDIT_SESSION, NetworkManager, _SYSTEMD_SESSION, postfixqmgr.
  • The top synonyms for password included publickey, Accepted, opened, IPMI, and the name of an internal project.

These results aren’t earth-shattering — indeed, they won’t even tell you where to get a decent burrito — but they’re interesting, they’re sensible, and they point to the effectiveness of word2vec even given a limited, unidiomatic corpus full of odd word-like tokens. Of course, one can imagine ways to make our preprocessing more robust. Similarly, there are certainly other ways to generate a training corpus for these words: perhaps using the set of all messages for a particular service and severity as a training sentence, using the documentation for the services involved, using format strings present in the source or binaries for the services themselves, or some combination of these.

Semantic modeling of terms in log messages is obviously useful for log analytics: it can be used as part of a pipeline to classify related log messages by topic, in feature engineering for anomaly detection, and to suggest alternate search terms for interactive queries. However, it is a pleasant surprise that we can train a competent word2vec model for understanding log messages from the uncharacteristic utterances that comprise log messages themselves.


  1. Spark does provide a stopword filter for English and there are external libraries to fill in some of its language-processing gaps. In particular, I’ve had good luck with the Porter stemmer implementation from Chalk.

Consider the following hypothetical conference session abstract:

Much like major oral surgery, writing talk abstracts is universally acknowledged as difficult and painful. This has never been more true than it is now, in our age of ubiquitous containerization technology. Today’s aggressively overprovisioned, multi-track conferences provide high-throughput information transfer in minimal venue space, but do so at a cost: namely, they impose stingy abstract word limits. The increasing prevalence of novel “lightning talk” tracks presents new challenges for aspiring presenters. Indeed, the time it takes to read a lightning talk abstract may be a substantial fraction of the time it takes to deliver the talk! The confluence of these factors, inter alia, presents an increasingly hostile environment for conference talk submissions in late 2015. Your talk proposals must adapt to this changing landscape or face rejection. Is there a solution?

Hopefully, you recognize some key elements of subpar abstracts that you’ve seen, reviewed, or — maybe, alas — even submitted in this example.

To identify what’s fundamentally wrong with it, we should first consider what the primary rhetorical aims for an abstract are. In particular, an abstract needs to

  1. provide context so that a general audience can understand that the problem the talk addresses is interesting,
  2. summarize the content of a talk so that audiences and reviewers know what to expect, and
  3. motivate conference attendees to put the talk on their schedule (and, more immediately, motivate the program committee to accept the talk).

The abstract above does none of these things, for both stylistic and structural reasons.

The example abstract’s prose is generally clunky, but the main stylistic problem is its overuse of jargon and enthymemes. If you don’t already spend time in the same neighborhoods of the practice as the author, you probably don’t understand all of these terms to mean the same things that the author does or agree with his or her sense of what is “universally acknowledged.” It is easy to fall in to using jargon when you’re deep in a particular problem domain: after all, most of the people you interact with use these words and you all seem to understand each other. However, jargon terms are essentially content-free: they convey nothing new to specialists and are completely opaque to novices. By propping up your writing on these empty terms instead of explaining yourself, you are excluding the cohort of your audience who doesn’t already understand your problem and shamelessly pandering to the cohort that does.1

The main structural problem with the example abstract is that it doesn’t actually make an argument for why the talk is interesting or worth attending; instead, it focuses on emphasizing the problems faced by abstract writers and ends with a cliffhanger. (The cliffhanger strategy not only adds no additional content, it is also especially risky.) A surprising number of abstracts, even accepted ones, suffer because they focus on only one or two of an abstract’s responsibilities, but it is possible to set your abstract up for success by starting from a structure that is designed to cover all of the abstract’s responsibilities.

In 1993, Kent Beck appeared on a panel on how to get a paper accepted at OOPSLA. OOPSLA (now called SPLASH) was an academic conference on research and development related to object-oriented programming languages, systems and environments to support object-oriented programming, and applications developed using these technologies. This is a particularly broad mandate, and because OOPSLA attracted so many papers on a wide range of topics, it had an extremely low acceptance rate. (This is probably why they held a panel on getting papers accepted, but it also makes OOPSLA a good analogy for contemporary practice-focused technical conferences that often cross several areas of specialization, e.g., data processing, distributed computing, and machine learning.)

Beck’s advice is worth reading even if you aren’t writing an academic conference paper. In particular, he suggests that you start by identifying a single “startling sentence” that summarizes your work and can grab the attention of the program committee. From there, Beck advises that you adopt the following four-sentence model to structure your abstract:

  1. The first sentence is the problem you’re trying to solve,
  2. The second sentence provides context for the problem or explains its impact,
  3. The third sentence is the “startling sentence” that is the key insight or contribution of your work, and
  4. The fourth sentence shows how the key contribution of your work affects the problem.

I’ve used this template in almost every abstract I’ve written for many years, although I sometimes devote more than a single sentence to each step. It has successfully helped me refine abstracts for both industry conference talks and academic papers, and it more or less ensures that each abstract accomplishes what it needs to. (If you’re writing a talk abstract, as opposed to a paper abstract, it’s sometimes also a good idea to add a sentence or two covering what the audience should expect to take away from your talk and why you’re qualified to give it.) If I am sure to consider my audience — first, an overworked program committee member, and second, a jetlagged and overstimulated conference attendee — I am far more likely to explain things clearly and eschew jargon. As a bonus, starting from a fairly rigid structure frees me from wasting time worrying about how best to arrange my prose.

If we avoid jargon and start from Beck’s structure, we can transform the mediocre example abstract from the beginning of this post into something far more effective:

Contemporary multiple-track industry conferences attract speakers and attendees who specialize in distinct but related parts of the practice. Since many authors adopt ineffective patterns from other technical abstracts they’ve read, they may unwittingly submit talk proposals that are at best rhetorically impotent and at worst nonsensical to people who don’t share their specialization. By starting from a simple template, prospective speakers can dramatically improve their chances of being understood, accepted, and attended, while also streamlining the abstract-writing process. Excellent abstracts benefit the entire community, because more people will be motivated to learn about interesting work that is outside of their immediate area of expertise. In this talk, delivered by someone who has delivered many talks without any serious train wrecks and has also helped other people get talks accepted, you’ll learn a straightforward technique for designing abstracts that communicate effectively to a general audience, sell your talk to the program committee, and motivate your peers to attend your talk.