The Custom House

Dublin is a charming city and a burgeoning technology hub, but it also has special significance for anyone whose work involves making sense of data, since William Sealy Gosset was working as a brewer at Guinness when he developed the t-statistic. Last week, Dublin had extra special significance for anyone whose work involves using Apache Spark for data processing. Our group at Red Hat gave three talks at Spark Summit EU this year, and videos of these are now online. You should check them out!

A lot of the work we discussed is available from radanalytics.io or from the Isarn project; if you’d like to see other talks about data science, distributed computing, and best practices for contemporary intelligent applications, you should see our team’s list of presentations.

I’m giving a talk this afternoon at Spark Summit EU on extending Spark with new machine learning algorithms. Here are some additional resources and links:

  • Our team’s Silex library is where I’ve published my ongoing work to develop a self-organizing map implementation for Spark and to extend it with support for DataFrames and ML pipelines (there’s a rough sketch of what a pipeline-friendly algorithm looks like just after this list).
  • I gave a talk about using self-organizing maps in Spark last year at Spark Summit.
  • If you like the idea of developing new ML techniques on Spark, you’ll also want to attend a session tomorrow in which my friend and teammate Erik Erlandson will be talking about using his parallel t-digest implementation to support feature importance and other applications.
  • Finally, if you’re doing anything where parallelism and scale matter, especially in a cloud-native environment, you should also check out Mike McCune’s talk on Spark monitoring and metrics.
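
If you’re curious what plugging a new algorithm into Spark’s ML Pipelines API involves, here is a rough sketch of the general shape of a pipeline-friendly estimator. The SOMEstimator and SOMModel names, and everything about their internals, are illustrative placeholders rather than the actual Silex API; see the library itself for the real thing.

```scala
import org.apache.spark.ml.{Estimator, Model}
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.types.StructType

// Hypothetical skeleton: a real model would also carry the learned map itself.
class SOMModel(override val uid: String) extends Model[SOMModel] {
  override def copy(extra: ParamMap): SOMModel = defaultCopy(extra)
  override def transformSchema(schema: StructType): StructType = schema
  override def transform(dataset: Dataset[_]): DataFrame =
    dataset.toDF() // a real model would append a column of best-matching-cell ids
}

class SOMEstimator(override val uid: String) extends Estimator[SOMModel] {
  def this() = this(Identifiable.randomUID("som"))
  override def copy(extra: ParamMap): SOMEstimator = defaultCopy(extra)
  override def transformSchema(schema: StructType): StructType = schema
  override def fit(dataset: Dataset[_]): SOMModel = {
    // training over a column of feature vectors would happen here
    new SOMModel(uid)
  }
}
```

The payoff for conforming to this interface is that a new technique composes with everything else in spark.ml: it can sit in a Pipeline next to feature transformers and be tuned with the standard model-selection tools.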

I’m speaking this morning at the OpenShift Commons Gathering about my team’s experience running Apache Spark on Kubernetes and OpenShift. Here are some links to learn more:

I’ll be speaking about Spark on Kubernetes at Spark Summit EU this week. The main thesis of my talk is that the old way of running Spark, in a dedicated cluster shared among applications, made sense when analytics was a separate workload. Analytics is no longer a separate workload, though: it is now an essential part of long-running, data-driven applications. This realization motivated my team to move from a shared Spark cluster to multiple logical clusters, each co-scheduled with the application that depends on it.
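
To make that concrete, here is a minimal sketch, with a hypothetical service name, of what the application side of this pattern can look like: instead of submitting work to a shared, long-lived cluster, the application connects to a small Spark cluster that is deployed (and torn down) alongside it.

```scala
import org.apache.spark.sql.SparkSession

// "my-app-spark" is a made-up name for a Spark master service that is
// co-scheduled with this application, not a cluster shared with other workloads.
val spark = SparkSession.builder()
  .appName("my-data-driven-app")
  .master("spark://my-app-spark:7077")
  .getOrCreate()
```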

I’m glad for the opportunity to get together with the Spark community and present on some of the cool work my team has done lately. Here are some links you can visit to learn more about our work and other topics related to running Spark on Kubernetes and OpenShift:

I’m delighted to have a chance to present at HTCondor Week this year and am looking forward to seeing some old friends and collaborators. The thesis of my talk is that HTCondor users who aren’t already leading data science initiatives are well-equipped to start doing so. The talk is brief and high-level, so here are a few quick links to learn more if you’re interested:

I also gave a quick overview of some of my team’s recent data science projects; visit these links to learn more:

As I mentioned earlier, I’ll be talking about feature engineering and outlier detection for infrastructure log data at Apache: Big Data next week. Consider this post a virtual handout for that talk. (I’ll also be presenting another talk on scalable log data analysis later this summer. That talk is also inspired by my recent work with logs but will focus on different parts of the problem, so stay tuned if you’re interested in the domain!)

Some general links:

  • You can download a PDF of my slide deck. I recognize that people often want to download slides, but I’d prefer that you look at the rest of this post instead, since my slides aren’t intended to stand on their own without the talk.
  • Check out my team’s Silex library, which is intended to extend the standard Spark library with high-quality, reusable components for real-world data science. The most recent release includes the self-organizing map implementation I mentioned in my talk.
  • Watch this short video presentation showing some of the feature engineering and dimensionality-reduction techniques I discussed in the talk.

The following blog posts provide a deeper dive into some of the topics I covered in the talk:

  • When I started using Spark and ElasticSearch, the upstream documentation was pretty sparse, and the integration was especially confusing because it required some unidiomatic configuration steps, so I wrote up my experiences getting things working. This is an older post but may still be helpful; there’s also a minimal sketch of that kind of read just after this list.
  • If you’re interested in applying natural-language techniques to log data, you should consider your preprocessing pipeline. Here are the choices I made when I was evaluating word2vec on log messages; a sketch of that sort of preprocessing also follows this list.
  • Here’s a brief (and not-overly technical) overview of self-organizing maps, including static visual explanations and an animated demo.
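
As a taste of the configuration issue mentioned in the first item above, here is a minimal sketch of reading an index into a DataFrame with the elasticsearch-hadoop connector. The index name and host are made up, and this is not the exact setup from the older post; the point is simply that the connection is driven by es.* options rather than anything resembling a URL or file path.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("es-read-sketch").getOrCreate()

// Read an Elasticsearch index (not a file path!) into a DataFrame.
val logs = spark.read
  .format("org.elasticsearch.spark.sql")
  .option("es.nodes", "elasticsearch.example.com")
  .option("es.port", "9200")
  .load("logstash-2016.05.01")
```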
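
Along the same lines, here is a small sketch of the sort of preprocessing choices that come up when evaluating word2vec on log messages. The normalization step here, collapsing runs of digits into a placeholder token before tokenizing, is one illustrative choice and not necessarily the pipeline from the linked post.

```scala
import org.apache.spark.ml.feature.{RegexTokenizer, Word2Vec}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.regexp_replace

val spark = SparkSession.builder().appName("log-w2v-sketch").getOrCreate()
import spark.implicits._

val logs = Seq(
  "systemd[1]: Started Session 42 of user root",
  "kernel: eth0: link becomes ready"
).toDF("message")

// Collapse highly variable tokens (here, runs of digits) so they don't
// swamp the vocabulary with one-off "words".
val normalized = logs.withColumn("normalized",
  regexp_replace($"message", "[0-9]+", "NUM"))

val tokens = new RegexTokenizer()
  .setInputCol("normalized").setOutputCol("tokens").setPattern("\\W+")
  .transform(normalized)

val w2v = new Word2Vec()
  .setInputCol("tokens").setOutputCol("features")
  .setVectorSize(50).setMinCount(1)

val model = w2v.fit(tokens)
```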

If you’ll be at Apache: Big Data next week, you should definitely check out some talks from my teammates in Red Hat’s Emerging Technology group and our colleague Suneel Marthi from the CTO office:

Unfortunately, my talk is at the same time as Suneel’s, so I won’t be able to attend his, but these are all great talks and you should be sure to put as many as possible on your schedule if you’ll be in Vancouver!

Self-organizing maps are a useful technique for identifying structure in high-dimensional data sets. The map itself is a low-dimensional arrangement of cells, where each cell is an object comparable to the objects in the training set. The goal of self-organizing map training is to arrange a grid of cells so that nearby cells will be the best matches for similar objects. Once we’ve built up the map, we can identify clusters of similar objects (based on the cells that they map to) and even detect outliers (based on the distribution of match quality, since outliers match their best cells far more poorly than typical objects do).
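
To make the training idea concrete, here is a minimal, serial sketch of a single training step: find the cell whose vector best matches an example, then pull that cell and its grid neighbors toward the example. This is only an illustration of the technique; the parallel implementation in Silex is structured quite differently.

```scala
// Squared Euclidean distance between an example and a cell's codebook vector.
def sqdist(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

// cells(i)(j) is the vector for the grid cell at row i, column j.
def trainStep(cells: Array[Array[Array[Double]]], example: Array[Double],
              learningRate: Double, radius: Double): Unit = {
  // 1. Find the best matching unit (BMU): the cell closest to the example.
  val (bi, bj) = (for {
    i <- cells.indices
    j <- cells(i).indices
  } yield ((i, j), sqdist(cells(i)(j), example))).minBy(_._2)._1

  // 2. Pull the BMU and its neighbors toward the example, with influence
  //    falling off with distance on the grid (not distance in feature space).
  for (i <- cells.indices; j <- cells(i).indices) {
    val gridDist2 = (i - bi) * (i - bi) + (j - bj) * (j - bj)
    val influence = math.exp(-gridDist2 / (2.0 * radius * radius))
    val cell = cells(i)(j)
    for (d <- cell.indices)
      cell(d) += learningRate * influence * (example(d) - cell(d))
  }
}
```

In practice, both the learning rate and the neighborhood radius decay over the course of training, which is why the snapshots below start out blurry and global and end with updates that affect only a single cell.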

Here are a few snapshots of the training process on color data, which I developed as a test for a parallel implementation of self-organizing maps in Apache Spark. For this demo, I used angular similarity in the RGB color space (not Euclidean distance) as a measure of color similarity. This means that, for example, a darker color would be considered similar to a lighter color with a similar hue.
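
Here is a tiny illustration (not code from the demo) of why angular similarity behaves this way: darkening a color scales its RGB vector by a constant, and scaling a vector does not change the angle it makes with other vectors.

```scala
// Cosine similarity between two colors treated as vectors in RGB space.
def cosineSimilarity(a: Array[Double], b: Array[Double]): Double = {
  val dot  = a.zip(b).map { case (x, y) => x * y }.sum
  val norm = (v: Array[Double]) => math.sqrt(v.map(x => x * x).sum)
  dot / (norm(a) * norm(b))
}

val red     = Array(200.0, 20.0, 20.0)
val darkRed = Array(100.0, 10.0, 10.0) // the same hue, at half the brightness
val blue    = Array(20.0, 20.0, 200.0)

println(cosineSimilarity(red, darkRed)) // 1.0: same direction, so maximally similar
println(cosineSimilarity(red, blue))    // about 0.2: very different directions
```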

We start with a random map:

Matches made in the first training iteration essentially affect the whole map, producing a blurred, unsaturated, undifferentiated map:

Some structure begins to emerge pretty rapidly, though; after one quarter of our training iterations, we can already see clear clusters of colors:

The map begins to get more and more saturated as similar colors are grouped together. Here’s what it looks like after half of the training iterations:

…and three-quarters of the training iterations:

As training proceeds, it gradually affects smaller and smaller neighborhoods of the map until the very end, when each training match only affects a single cell (and thus the impact of darker colors becomes apparent, since they can cluster together in single cells that are not the best matching unit for any brighter colors):

In a future post, I’ll cover the training algorithm, introduce the code, and provide some tips for implementing similar techniques in Spark. For now, though, here is a demo video that shows an animation of the whole map training process: