Lately, I’ve been experimenting with Spark’s implementation of word2vec. Since most of the natural-language data I have sitting around these days are service and system logs from machines at work, I thought it would be fun to see how well word2vec worked if we trained it on the text of log messages. This is obviously pretty far from an ideal training corpus, but these brief, rich messages seem like they should have some minable content. In the rest of this post, I’ll show some interesting results from the model and also describe some concrete preprocessing steps to get more useful results for extracting words from the odd dialect of natural language that appears in log messages.

### Background

word2vec is a family of techniques for encoding words as relatively low-dimensional vectors that capture interesting semantic information. That is, words that are synonyms are likely to have vectors that are similar (by cosine similarity). Another really neat aspect of this encoding is that linear transformations of these vectors can expose semantic information like analogies: for example, given a model trained on news articles, adding the vectors for “Madrid” and “France” and subtracting the vector for “Spain” results in a vector very close to that for “Paris.”

Spark’s implementation of word2vec uses skip-grams, so the training objective is to produce a model that, given a word, predicts the context in which it is likely to appear.

### Preliminaries

Like the original implementation of word2vec, Spark’s implementation uses a window of ±5 surrounding words (this is not user-configurable) and defaults to discarding all words that appear fewer than 5 times (this threshold is user-configurable). Both of these assumptions seem sane for the sort of training “sentences” that appear in log messages, but they won’t be sufficient.

Spark doesn’t provide a lot of tools for tokenizing and preprocessing natural-language text.1 Simple string splitting is as ubiquitous in trivial language processing examples just as it is in trivial word count examples, but it’s not going to give us the best results. Fortunately, there are some minimal steps we can take to start getting useful tokens out of log messages. We’ll look at these steps and what see what motivates them now.

#### What is a word?

Let’s first consider what kinds of tokens might be interesting for analyzing the content of log messages. At the very least, we might care about:

1. dictionary words,
2. trademarks (which may or may not be dictionary words),
3. technical jargon terms (which may or may not be dictionary words),
4. service names (which may or may not be dictionary words),
5. symbolic constant names (e.g., ENOENT and OPEN_MAX),
6. pathnames (e.g., /dev/null), and
7. programming-language identifiers (e.g., OutOfMemoryError and Kernel::exec).

For this application, we’re less interested in the following kinds of tokens, although it is possible to imagine other applications in which they might be important:

1. hostnames,
4. dates and times, and
5. hex hash digests.

#### Preprocessing steps

If we’re going to convert sequences of lines to sequences of sequences of tokens, we’ll eventually be splitting strings. Before we split, we’ll collapse all runs of whitespace into single spaces so that we get more useful results when we do split. This isn’t strictly necessary – we could elect to split on runs of whitespace instead of single whitespace characters, or we could filter out empty strings from word sequences before training on them. But this makes for cleaner input and it makes the subsequent transformations a little simpler.

Here’s Scala code to collapse runs of whitespace into a single space:

The next thing we’ll want to do is eliminate all punctuation from the ends of each word. An appropriate definition of “punctuation” will depend on the sorts of tokens we wind up deciding are interesting, but I considered punctuation characters to be anything except:

1. alphanumeric characters,
2. dashes, and
3. underscores.

Whether or not we want to retain intratoken punctuation depends on the application; there are good arguments to be made for retaining colons and periods (MAC addresses, programming-language identifiers in stack traces, hostnames, etc.), slashes (paths), at-signs (email addresses), and other marks as well. I’ll be retaining these marks but stripping all others. After these transformations, we can split on whitespace and get a relatively sensible set of tokens.

Here’s Scala code to strip punctuation from lines:

In order to filter out strings of numbers, we’ll reject all tokens that don’t contain at least one letter. (We could be stricter and reject all tokens that don’t contain at least one letter that isn’t a hex digit, but I decided to be permissive in order to avoid rejecting interesting words that only contain letters A-F.)

Here’s what our preprocessing pipeline looks like, assuming an RDD of log messages called messages:

Now we have a sequence of words for each log message and are ready to train a word2vec model.

Note that there are a few things we could be doing in our preprocessing pipeline but aren’t, like using a whitelist (for dictionary words or service names), or rejecting stopwords. This approach is pretty basic, but it produces some interesting results in any case.

### Results and conclusions

I evaluated the model by using it to find synonyms for (more or less) arbitrary words that appeared in log messages. Recall that word2vec basically models words by the contexts in which they might appear; informally, synonyms are thus words with similar contexts.

• The top synonyms for nova (the OpenStack compute service) included vm, glance, containers, instances, and images – all of these are related to running OpenStack compute jobs.
• The top synonyms for volume included update, cinder.scheduler.host_manager, and several UUIDs for actual volumes.
• The top synonyms for tmpfs included type, dev, uses, initialized, and transition.
• The top synonyms for sh included /usr/bin/bash, _AUDIT_SESSION, NetworkManager, _SYSTEMD_SESSION, postfixqmgr.
• The top synonyms for password included publickey, Accepted, opened, IPMI, and the name of an internal project.

These results aren’t earth-shattering – indeed, they won’t even tell you where to get a decent burrito – but they’re interesting, they’re sensible, and they point to the effectiveness of word2vec even given a limited, unidiomatic corpus full of odd word-like tokens. Of course, one can imagine ways to make our preprocessing more robust. Similarly, there are certainly other ways to generate a training corpus for these words: perhaps using the set of all messages for a particular service and severity as a training sentence, using the documentation for the services involved, using format strings present in the source or binaries for the services themselves, or some combination of these.

Semantic modeling of terms in log messages is obviously useful for log analytics: it can be used as part of a pipeline to classify related log messages by topic, in feature engineering for anomaly detection, and to suggest alternate search terms for interactive queries. However, it is a pleasant surprise that we can train a competent word2vec model for understanding log messages from the uncharacteristic utterances that comprise log messages themselves.

1. Spark does provide a stopword filter for English and there are external libraries to fill in some of its language-processing gaps. In particular, I’ve had good luck with the Porter stemmer implementation from Chalk

sparknlpword2vecmachine learning • You may reply to this post on Twitter or