The fundamental value proposition of Kubernetes is that it can provide an abstraction layer for distributed applications on any infrastructure. This depends on the intersection of two partial myths:

  1. Kubernetes consistently provides every essential service that distributed applications need, and
  2. Kubernetes can run equally well on different cloud footprints, in a datacenter, or on a single node.

There’s a kernel of truth in each of these myths. Vanilla upstream Kubernetes provides many important primitives for distributed applications — but not everything — and individual Kubernetes distributions typically bundle services to address the gaps. While it is technically possible to run Kubernetes on a single node by installing a specialized distribution, most of these solutions are rough around the edges, and if you want true portability, you’ll need to run the same Kubernetes distribution on your workstation (or laptop) and in your datacenter.

Since I’m more interested in developing tools that could be ported to a variety of Kubernetes distributions than I am in developing an application that is absolutely reproducible across multiple footprints in a single organization without additional effort, I have the flexibility to choose any single-node distribution of Kubernetes for local use.

I’ve been impressed with the setup and user experience of microk8s for a long time and used to run it (in a VM) on my old MacBook. In this post, I’ll explain how I used microk8s to set up a data science development environment on my workstation, complete with GPU acceleration, Kubeflow, and a notebook image with RAPIDS.ai preloaded.

System setup

I started with a relatively fresh installation of Ubuntu 20.04,1 and installed CUDA 11.0 from the NVIDIA repository, following these instructions:

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/7fa2af80.pub
sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/ /"
sudo apt-get update
sudo apt-get -y install cuda-11-0

microk8s

I then installed microk8s with snap. There are a few options for versions and channels, but as of late March 2021 the most stable for me was Kubernetes 1.20 (more on this in a bit).

sudo snap install microk8s --classic --channel=1.20/stable
sudo microk8s status --wait-ready

microk8s ships with Calico enabled, and there is a longstanding bug in Calico that prevents it from finding network interfaces that contain lo anywhere in their names. Since the wireless interface in my workstation is called wlo2, I needed to change Calico’s environment to get it to work:

microk8s kubectl set env daemonset/calico-node -n kube-system IP_AUTODETECTION_METHOD=interface=wlo.\*

GPU support

With CUDA 11.0 installed, I was able to enable GPU support in microk8s:

microk8s enable dns
microk8s enable gpu

I could then verify that the pods started successfully:

kubectl describe pods -l name=nvidia-device-plugin-ds -n kube-system

and, once they had, that my node had been labeled properly:

microk8s kubectl get node -o jsonpath="{range .items[*]}{..allocatable}{'\n'}{end}"

You’ll want to see an allocatable resource of type nvidia.com/gpu in that output, like the following, where the value of nvidia.com/gpu is the number of GPUs installed in your workstation (two in my case):

{
  "cpu":"...",
  "ephemeral-storage":"...",
  "hugepages-1Gi":"...",
  "hugepages-2Mi":"...",
  "memory":"...",
  "nvidia.com/gpu":"2",
  "pods":"..."
}

I could then launch a simple job to verify that I was able to schedule pods to run on the GPU:

cat << EOF | microk8s kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      image: "k8s.gcr.io/cuda-vector-add:v0.1"
      resources:
        limits:
          nvidia.com/gpu: 1
EOF

Kubeflow and notebooks

Now I was able to install Kubeflow itself (feel free to specify your favorite password in these instructions):

microk8s enable ingress istio
microk8s enable kubeflow -- --password my-ultra-secure-password --bundle lite

Once Kubeflow was up, I created a persistent volume to enable shared storage between my notebook servers and the host system:

mkdir $HOME/k8s-share
cat << EOF | microk8s kubectl create -f -
apiVersion: v1
kind: PersistentVolume
metadata:
  name: $USER-share
  labels:
    type: local
spec:
  storageClassName: manual
  capacity:
    storage: 50Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: "$HOME/k8s-share"
EOF

In my case, this created a persistent volume called willb-share so that I could mount k8s-share from my home directory as a data volume on a Kubeflow notebook server.

The next step was to get RAPIDS.ai set up in a Kubeflow notebook image, since Kubeflow no longer ships a RAPIDS image. I could have installed individual libraries in a notebook container, but there was an easier option. Since the RAPIDS project publishes a variety of Docker images, I could pick one of those as a starting point and ensure that the resulting image would be usable by Kubeflow, as in this Dockerfile:

FROM rapidsai/rapidsai-core:0.18-cuda11.0-runtime-centos7-py3.7

ENV NB_PREFIX /

RUN ldconfig

CMD ["sh","-c", "jupyter notebook --notebook-dir=/home/jovyan --ip=0.0.0.0 --no-browser --allow-root --port=8888 --NotebookApp.token='' --NotebookApp.password='' --NotebookApp.allow_origin='*' --NotebookApp.base_url=${NB_PREFIX}"]

The only interesting part of that recipe is RUN ldconfig, which I found necessary so that the Python notebook kernels could find CUDA. I built that image locally and pushed it to an accessible repository so that I could use it while launching a new notebook server from the Kubeflow dashboard.

Remote access

I often prefer to access my workstation remotely even if I’m at my desk. For running a regular Jupyter notebook server from the command line, this is just a matter of binding to 0.0.0.0 or setting up an SSH tunnel.2 Accessing services running on a single-node Kubernetes deployment is slightly more complicated, though.

First up, all requests will need to go through a load balancer (in this case, Istio). We can access Istio through an external IP, and we can find out which IP that is by inspecting the ingress gateway service:

kubectl get svc istio-ingressgateway -n kubeflow -o jsonpath="{..loadBalancer..ip}{'\n'}"

However, simply connecting to my workstation and forwarding traffic to port 80 of Istio’s IP didn’t do me a lot of good; I found that I also needed to be able to connect to other cluster IPs in order to access the services Istio was exposing.3 To access all of these IPs remotely, we have a couple of options:

  1. Set up a dynamic proxy over SSH. By connecting with ssh -D 9999 workstation and then configuring a local proxy to point to localhost on port 9999, we can access anything that’s accessible from workstation. This is an easy way to smoke-test a deployment, but it isn’t an ideal long-term solution because it requires you to maintain an SSH connection to your workstation and it proxies everything through the workstation unless you explicitly configure the proxy.4 Relying on a dynamic proxy like this can also lead to confusing errors when the SSH connection is down and may be difficult or impossible to configure when using cellular data on a mobile device.
  2. Use a VPN-like service to relay traffic to given subnets. I use Tailscale for a personal VPN and configured my workstation as a relay node for traffic to cluster IP addresses. This means that if I can access cluster IPs from my workstation, I can also access them from any computer connected to my Tailscale account (whether or not I’ve connected over SSH first). This was very easy and it’s also possible to do with upstream WireGuard.

Once we’re correctly forwarding traffic, connecting to the Istio load balancer will show us the Kubeflow dashboard; clicking the different links on that page (e.g., to create a new notebook server) will send requests to the appropriate internal services.

Challenges and false starts

Sometimes knowing what didn’t work is more useful than knowing what did. In this section, I’ll briefly cover some problems I encountered along the way so you’ll know what to look out for.

Device plugin errors

With microk8s 1.19, I was unable to get the device plugin pod to run successfully and always got some variant of this error in my logs:

Loading NVML
Failed to initialize NVML: could not load NVML library.
If this is a GPU node, did you set the docker default runtime to nvidia?

This is a confusing error because microk8s uses containerd, not Docker. While many people seem to have run into this error online, none of the recommended solutions worked for me. (I also tried specifying a newer version of the device plugin container image, which didn’t work either.)

GPU operator errors

The beta release of microk8s 1.21 uses the NVIDIA GPU operator to manage GPU drivers. As of mid-March 2021, the GPU operator is not intended to work on nodes that already have GPU drivers installed, which makes it more suitable for provisioning new nodes or VMs and adding them to a cluster (its intended use case, to be fair) than for enabling GPU support for a single-node Kubernetes on a workstation.5

I was able to enable GPU support in microk8s 1.21 by first removing CUDA and GPU drivers from my system, but this was an unpalatable hack since I’d prefer to be able to manage system dependencies with a native package manager (and also to use the GPU and CUDA outside of Kubernetes). I also noticed that the GPU operator failed to start after I had rebooted my system, presumably because the drivers it had installed earlier were already loaded at boot.

Image pull failures

After installing microk8s 1.20, the Calico pod failed due to an image pull timeout. I was able to explicitly pull it using the bundled ctr tool before restarting the pods:

microk8s ctr images pull docker.io/calico/cni:v3.13.2

Conclusions

While I wouldn’t recommend single-node Kubernetes to most machine learning practitioners (it still requires a lot of interaction with Kubernetes proper to get to a productive state or troubleshoot problems), Kubernetes provides some useful primitives for managing resources, isolating jobs, and making work reproducible. Furthermore, developing ML tools on Kubernetes ensures that they’ll be consumable in multiple contexts. The combination of microk8s and Kubeflow provides a relatively painless way to get to a productive discovery environment with RAPIDS and GPUs. In future posts, I’d like to look at using my single-node Kubernetes deployment to orchestrate other machine-learning and data processing workloads.


  1. Be warned that I’m almost certainly doing some basic administration tasks suboptimally – while I used Debian at a consulting gig in the late 1990s and briefly used Ubuntu in the public cloud in graduate school, my main Linux distributions have been RPM-based for over 25 years. I chose Ubuntu for this application because it offered frictionless installation of GPU drivers – but the long support cycle vis-à-vis other community Linux distributions is also a plus. 

  2. I’m often connecting remotely from a tablet, and tunneling in is especially convenient from my favorite iOS Jupyter client

  3. This didn’t make a lot of sense to me, but individual Kubeflow dashboard components were exposed with wildcard DNS hostnames pointing to cluster IPs – and, if I couldn’t connect to the cluster IPs, it manifested as unusual “Page not found” errors from the Kubeflow dashboard. 

  4. If you wanted to use this solution longer-term, it’d make sense to define a Proxy Auto-Configuration File that deferred to the dynamic proxy only for wildcard DNS hostnames like xip.io. 

  5. This is a totally sensible design decision since a single-node Kubernetes deployment is not anywhere near the primary audience for a tool like the GPU operator. However, the upcoming release of the GPU operator will support this workstation use case by allowing users to skip driver installation; microk8s will incorporate this fix as well

(This post is also available as an interactive notebook.)

Apache Parquet is a great default choice for a data serialization format in data processing and machine learning pipelines, but just because it’s available in many environments doesn’t mean it has the same behavior everywhere. In the remaining discussion, we’ll look at how to work around some potential interoperability headaches when using Parquet to transfer data from a data engineering pipeline running in the JVM ecosystem to a machine learning pipeline running in the Python data ecosystem.1

We’ll start by looking at a Parquet file generated by Apache Spark with the output of an ETL job.

from pyspark.sql import SparkSession

session = SparkSession.builder.getOrCreate()

We can look at the schema for this file and inspect a few rows:

spark_df = session.read.parquet("colors.parquet")
spark_df.printSchema()
root
 |-- rowID: string (nullable = true)
 |-- YesNo: string (nullable = true)
 |-- Color: string (nullable = true)
 |-- Categorical: string (nullable = true)
spark_df.limit(10).toPandas()
rowID YesNo Color Categorical
0 00000267 No red 62
1 000004c2 No red ba
2 00002dcf No blue 75
3 000035be No green 2f
4 00005f19 No green 0a
5 00007c1e No blue 79
6 0000be2c No green 38
7 0000d29d No green 60
8 0000d313 Yes blue f7
9 0000d66c No blue 94

The “file” we’re reading from (colors.parquet) is a partitioned Parquet file, so it’s really a directory. We can inspect the Parquet metadata for each column using the parquet-tools utility from our shell:

parquet-tools meta colors.parquet 2>&1 | head -70 | grep SNAPPY
rowID:        BINARY SNAPPY DO:0 FPO:4 SZ:4931389/8438901/1.71 VC:703200 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 00000267, max: ffffc225, num_nulls: 0]
YesNo:        BINARY SNAPPY DO:0 FPO:4931393 SZ:105082/108599/1.03 VC:703200 ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY ST:[min: No, max: Yes, num_nulls: 0]
Color:        BINARY SNAPPY DO:0 FPO:5036475 SZ:177524/177487/1.00 VC:703200 ENC:BIT_PACKED,PLAIN_DICTIONARY ST:[min: blue, max: red, num_nulls: 0]
Categorical:  BINARY SNAPPY DO:0 FPO:5213999 SZ:705931/706389/1.00 VC:703200 ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY ST:[min: 00, max: ff, num_nulls: 0]

This output shows that many of our columns are compressed (SNAPPY) Unicode strings (BINARY) and that many of these columns are dictionary-encoded (ENC:...,PLAIN_DICTIONARY), which means that each distinct string is stored as an index into a dictionary rather than as a literal value. By storing values that may be repeated many times in this way, we save space and compute time.2

So far, so good! But what happens when we read these data into pandas? We can load Parquet files into pandas if we have PyArrow installed; let’s try it out.

import pandas as pd
pandas_df = pd.read_parquet("colors.parquet/")
pandas_df
rowID YesNo Color Categorical
0 00000267 No red 62
1 000004c2 No red ba
2 00002dcf No blue 75
3 000035be No green 2f
4 00005f19 No green 0a
... ... ... ... ...
703195 ffff69a9 No green 25
703196 ffff8037 No green 34
703197 ffffa49f No red 3a
703198 ffffa6ae No green 89
703199 ffffc225 Yes blue 40

The data look about like we’d expect them to. However, when we look at how pandas is representing our data, we’re in for a surprise: pandas has taken our efficiently dictionary-encoded strings and represented them with arbitrary Python objects!

pandas_df.dtypes
rowID          object
YesNo          object
Color          object
Categorical    object
dtype: object

We could convert each column to strings and then to categoricals, but this would be tedious and inefficient. (Note that if we’d created a pandas data frame with string- or category-typed columns and saved that to Parquet, the types would survive a round-trip to disk because they’d be stored in pandas-specific Parquet metadata.)
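For the record, here is roughly what that tedious route looks like (manual_df is just an illustrative copy of the data frame above): we have to know and maintain the list of categorical columns ourselves, and every string is still materialized as a Python object before conversion.

# manually cast each column we happen to know is categorical
manual_df = pandas_df.copy()
for col in ["YesNo", "Color", "Categorical"]:
    manual_df[col] = manual_df[col].astype("category")

manual_df.dtypes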

In this case, pandas is using the PyArrow Parquet backend; interestingly, if we use PyArrow directly to read into a pyarrow.Table, the string types are preserved:

import pyarrow.parquet as pq
arrow_table = pq.read_table("colors.parquet/")

…but once we convert that table to pandas, we’ve lost the type information.

arrow_table.to_pandas().dtypes
rowID          object
YesNo          object
Color          object
Categorical    object
dtype: object

However, we can force PyArrow to preserve the dictionary encoding even through the pandas conversion if we specify the read_dictionary option with a list of appropriate columns:

dict_arrow_table = \
    pq.read_table("colors.parquet/", read_dictionary=['YesNo', 'Color', 'Categorical'])

dict_arrow_table
pyarrow.Table
rowID: string
YesNo: dictionary<values=string, indices=int32, ordered=0>
Color: dictionary<values=string, indices=int32, ordered=0> not null
Categorical: dictionary<values=string, indices=int32, ordered=0>
dict_arrow_table.to_pandas().dtypes
rowID            object
YesNo          category
Color          category
Categorical    category
dtype: object

If we don’t know a priori what columns are dictionary-encoded (and thus might hold categoricals), we can find out by programmatically inspecting the Parquet metadata:

dictionary_cols = set([])

# get metadata for each partition
for piece in pq.ParquetDataset("colors.parquet", use_legacy_dataset=False).pieces:
    meta = piece.metadata

    # get (index, name) pairs for the columns; materialize the list so it
    # can be iterated once per row group
    cols = list(enumerate(meta.schema.names))

    # get column metadata for each row group
    for i in range(meta.num_row_groups):
        rg = meta.row_group(i)
        for col, colname in cols:
            if "PLAIN_DICTIONARY" in rg.column(col).encodings:
                dictionary_cols.add(colname)

dictionary_cols
{'Categorical', 'Color', 'YesNo'}
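With that set in hand, we can feed it straight back into read_table so the dictionary-encoded columns survive the trip into pandas without being hard-coded (auto_dict_table is just an illustrative name):

auto_dict_table = pq.read_table("colors.parquet/", read_dictionary=list(dictionary_cols))
auto_dict_table.to_pandas().dtypes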

Preserving column types when transferring data from a JVM-based ETL pipeline to a Python-based machine learning pipeline can save a lot of human effort and compute time – and eliminate an entire class of performance regressions and bugs as well. Fortunately, it just takes a little bit of care to ensure that our entire pipeline preserves the efficiency advantages of Parquet.


  1. There are certainly potential headaches going in the other direction as well (e.g., this and this), but it’s a less-common workflow to generate data in Python for further processing in Spark. 

  2. Parquet defaults to dictionary-encoding small-cardinality string columns, and we can assume that many of these will be treated as categoricals later in a data pipeline. 

It was a great honor to co-present at KubeCon with Sophie Watson on machine learning systems and MLOps today. Kubernetes is an obvious choice for building machine learning systems in 2020, but as you build these systems, you will be faced with several non-obvious choices. In this talk, we sought to distill many of the things we’ve learned while supporting machine learning systems and workflows on Kubernetes over the years and help to make the road ahead straighter and smoother for practitioners and operators who are just getting started.

We had a wonderfully engaged audience, and Q&A was a lot of fun, both during the talk and on Slack afterwards. Several attendees were interested in our slides, which are available here, and in the MLOps “tube map,” which is available here, along with a number of links to other useful resources.

Sophie and I have been collaborating in this space for quite a while and we’ve produced some really cool work. Here are links to some other materials of interest:

  • Our OSCON 2019 talk “Kubernetes for Machine Learning: Productivity over Primitives” shows how Kubernetes provides the basis for machine learning system solutions — and how to build solutions that ML practitioners will actually want to use,
  • our nachlass framework demonstrates how to publish pipeline services directly from unmodified Jupyter notebooks and CI/CD pipelines in Kubernetes using source-to-image builders, and
  • our two interactive workshops (here and here) show how to do entire end-to-end ML lifecycles – discovery, training, CI/CD, inference, and monitoring – all on Kubernetes.

The complexity of machine learning systems doesn’t reside in the complexity of individual components, but rather in their orchestration, connections, and interactions. We can thus think of machine learning systems as special cases of general distributed systems. Furthermore, it’s relatively easy to argue that machine learning systems can benefit from general-purpose distributed systems infrastructure, tools, and frameworks to the extent that these make it easier to understand, develop, and maintain distributed systems. As evidence that this argument is noncontroversial, consider that Kubernetes is today’s most popular framework for managing distributed applications and is increasingly seen as a sensible default choice for machine learning systems.1

Once we’ve accepted that most of the complexity of machine learning systems isn’t specific to machine learning, the really interesting remaining questions are about how machine learning systems can benefit from existing infrastructure, tools, and techniques – and about how to address the challenges that are unique to machine learning. For example:

  • what additional complexity comes from failure modes that are characterized more by the divergence of probability distributions than by failed assertions or crossing clear performance thresholds?
  • to what extent are traditional devops workflows and idioms appropriate for building and maintaining machine learning systems, and where do they break down?
  • how can we address managing the complexity of entire machine learning systems rather than individual components?
  • how can we make contemporary infrastructure more accessible to machine learning practitioners?2

In order to evaluate how proposed solutions actually address these questions, it can be valuable to map out several aspects of machine learning systems: the software components and their interactions, the human processes and workflows around them, and the teams and roles responsible for each part.

The fact that these maps apparently overlap to some extent can be a source of confusion. A team of data scientists may have a feature engineering specialist and a modeling specialist. A production training pipeline may have feature extraction and model training components. While these different parts may connect together in analogous ways, they are not the same; our maps of systems, human processes, and organizations should each reveal different details of how we should support machine learning systems and the humans who build and maintain them.

The value of maps is as much in what they omit as it is in what they include: a good map will show the important details to navigate a given situation without including irrelevant details that obscure the presentation. By looking at these maps and identifying what areas of each are addressed by given solutions, it becomes easier to understand the strengths and shortcomings of various approaches. Ideally, a solution should address both complete workflows (human processes and interactions) and complete systems (software components and interactions).

Some solutions only support particular workloads, like particular training or inference frameworks, but not entire systems. Perhaps an “end-to-end” framework only addresses part of the problem, like model operationalization or data versioning – this will be obvious if we ascribe aspects of the solution to features of our map. Some solutions offer impressive demos but don’t address the problems our organization actually faces3 – again, this will be obvious by placing the solutions on our maps. Some tools that are ostensibly targeted for one audience have user experience assumptions that strongly imply that the developer had a different audience in mind, like “data science” tools that expect near-prurient interest in the accidental details of infrastructure4 – this will be obvious if we consider the interfaces of the tools corresponding to different map features in light of the humans responsible for these parts of our map. Perhaps a particular comprehensive solution only makes sense if organizations adopt an idiosyncratic workflow – this will be obvious because the solution will include some features that our map doesn’t and omit some features that our map includes.

Transit maps, which typically show the connections between lines and stations as a stylized graph, rather than aiming for geographical accuracy, present a particularly useful framework for understanding machine learning systems. In addition to capturing the components (stations or stops) and kinds of interactions (lines), other details like the existence of transfer stations and fare zones can expose other interesting aspects of the problem space. Here’s such a map that I designed to capture typical machine learning systems:

A map of a typical machine learning system in the style of a transit map

This map supports the story that Sophie Watson and I will be telling in our session at KubeCon North America next month – we’ll begin with the premise that Kubernetes is the right place to start for managing machine learning systems and then talk about some of the challenges unique to machine learning workflows and systems that Kubernetes and popular machine learning frameworks targeting Kubernetes don’t address. I hope you’ll be able to (virtually) join us!


  1. You can see my version of the argument for machine learning on Kubernetes in this 2017 Berlin Buzzwords talk or in this 2020 IEEE Software article

  2. See “Kubernetes for machine learning: Productivity over primitives” for a detailed argument. 

  3. For more on this widespread issue, see my talk from Berlin Buzzwords 2019

  4. Many tools developed by and for the Kubernetes community are guilty of this shortcoming in that they assume, for example, that an end-user is as excited about the particular structure of a tool’s YAML files as its developers were. 

My article “Machine learning systems and intelligent applications” has recently been accepted for publication in IEEE Software and distills many of the arguments I’ve been making over the last few years about the intelligent applications concept, machine learning on Kubernetes, and about how we should structure machine learning systems. You can read an unedited preprint of my accepted manuscript or download the final version from IEEE Xplore. The rest of this post provides some brief motivation and context for the article.


What’s the difference between a machine learning workload and a machine learning system? Once we have a trained model, what else do we need to solve a business problem? How should we put machine learning into production on contemporary application infrastructure like Kubernetes?

In the past, machine learning (like business analytics more generally) has been a separate workload that runs asynchronously alongside the rest of a business, for example optimizing a supply chain once per quarter, informing the periodic arrangement of a physical retail store based on the prior month’s sales and upcoming product releases, identifying the characteristics of an ideal customer in a new market to inform ongoing product development, or even training a model to incorporate into an existing application.

Today, we often put machine learning into production in the context of an intelligent application. Intelligent applications continuously learn from data to support essential functionality and thus improve with longevity and popularity. Intelligent applications are interesting for many reasons, but especially because:

  • in many cases they couldn’t exist without machine learning,
  • they are developed by cross-functional teams including data engineers, data scientists, and application developers – and thus involve several engineering processes and lifecycles in parallel: the data management pipeline, the machine learning discovery workflow, the model lifecycle, and the conventional software development lifecycle, and
  • they are deployed not as separate workloads but as a single system consisting of compute, storage, streaming, and application components.

While the first and second points have serious implications for monitoring, validation, and automated retraining, the last point may be even more interesting: in contrast to legacy architectures, which had application infrastructure running in one place and a separate analytic database, compute scheduler, or colocated-storage-and-compute cluster elsewhere, intelligent applications schedule all components together in a single, logical application-specific cluster, as in the following figure.

This architecture is possible because Kubernetes is flexible enough to orchestrate all of these components, but it is necessary because much of the complexity of machine learning systems appears not in the components themselves but in their interactions. The intelligent applications concept helps tame this complexity by enabling us to manage and audit all intelligent application components — controllers and views, data pipelines, predictive models, and more — from a single control plane.

To learn more, check out “Machine learning systems and intelligent applications” (in preprint or final version) and please let me know what you think!

I’ve been using Altair (and thus Vega-Lite) for most of my data visualization work since early last year. In general, I appreciate the declarative approach to visualization, in which one starts with long-form tidy data and in which each column of a data frame can define some aspect of a visualization.

If each row represents an observation, and each column represents an attribute of that observation, then the attributes can map directly to visual properties of a plotted point corresponding to that observation.

When my teammates and I have taught others how to use Altair in the past, we’ve shown them how to tidy data with Pandas (or through some other preprocessing step), but it’s possible to tidy data directly in Altair. I developed an interactive notebook that starts by showing how to tidy data (both via preprocessing and directly in Altair) and then demonstrates some other intermediate Altair features like interactive plotting and choropleths. You can check it out on GitHub or run it on Binder!
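As a small taste of what the notebook covers, here’s a minimal sketch of tidying data directly in Altair with transform_fold rather than reshaping with pandas beforehand; the wide-form city-temperature table and its column names are made up for illustration.

import altair as alt
import pandas as pd

# hypothetical wide-form data: one row per year, one column per city
wide = pd.DataFrame({
    "year": [2018, 2019, 2020],
    "portland": [43, 45, 44],
    "seattle": [38, 39, 41],
})

# fold the city columns into tidy (city, temperature) pairs inside Altair,
# instead of melting the data frame in a preprocessing step
alt.Chart(wide).transform_fold(
    ["portland", "seattle"], as_=["city", "temperature"]
).mark_line().encode(
    x="year:O", y="temperature:Q", color="city:N"
)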

You probably already know that if you’re modeling multiple independent phenomena in a repeatable simulation, you want multiple independent pseudorandom number generators. But you may be surprised by a consequence of following this approach if you’re using the excellent probability distributions supplied by the scipy.stats package. Read on to learn what the problem is and how to solve it!

Two ways to sample

Say you’re simulating the operations of a large retailer and have modeled the number of customer arrivals in a particular timespan with a Poisson distribution with some parameter λ. There are at least two ways to get a dozen samples from that distribution using SciPy.

We could supply the distribution parameters and a random state in each sampling call:

import scipy.stats
import numpy as np
seed = 0x00c0ffee

mean = 5
rs = np.random.RandomState(seed)
samples = scipy.stats.poisson.rvs(mean, size=12, random_state=rs)

or we could use a distribution object, which allows us to specify the parameters (including a random seed) once:

import scipy.stats

mean = 5
seed = 0x00c0ffee

distribution = scipy.stats.poisson(mean)
distribution.random_state = seed

samples = distribution.rvs(size=12)

In the first example, we have twelve samples from a Poisson distribution with a λ of mean; we specify the shape parameter when we draw from the distribution. In the second example, we’re creating a distribution object with a fixed λ, backed by a private pseudorandom number generator, seeded with a supplied value.

Interfaces and implementations

The second approach has two advantages: Firstly, we have an object with fixed distribution parameters (depending on the distribution, there can be several, including location and scale), so we don’t have to worry about tracking these every time we want to sample from this distribution. Secondly, we have a way to make sampling from this distribution deterministic by seeding it but without passing the same RandomState for each independent stream of values.

The disadvantage of the second approach only becomes obvious when we have many distribution objects in a single program. To get a hint for what goes wrong, let’s run a little experiment. The following two functions, which simulate running a certain number of steps of a simulation that depends on a certain number of independent actors, should have identical behavior.

def experiment_one(agents, steps):
    def mkpoisson(l,seed):
        p = scipy.stats.poisson(l)
        p.random_state = seed
        return p

    seeds = np.random.randint(1<<32, size=agents)
    streams = [mkpoisson(12, seed) for seed in seeds]
    for p in streams:
        p.rvs(steps)

def experiment_two(agents, steps):
    seeds = np.random.randint(1<<32, size=agents)
    states = [np.random.RandomState(seed) for seed in seeds]
    for rs in states:
        scipy.stats.poisson.rvs(12, size=steps, random_state=rs)
        

If we run both of these functions, though, we’ll see how they behave differently: running experiment_one for a thousand steps with ten thousand agents takes roughly 14 seconds on my laptop, but running experiment_two with the same parameters takes roughly 3¼ seconds. (You can try it for yourself locally or on binder.)
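If you’d rather not set up a notebook, a quick and dirty timing harness like this one reproduces the comparison; the absolute numbers will of course differ on your machine.

import timeit

# time a single run of each experiment; expect experiment_one to be
# several times slower than experiment_two
print(timeit.timeit(lambda: experiment_one(10000, 1000), number=1))
print(timeit.timeit(lambda: experiment_two(10000, 1000), number=1))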

Explaining the performance difference

Why is the less-convenient API so much faster? To see why, let’s profile the first function:

import cProfile
import pstats
from pstats import SortKey

cProfile.run("experiment_one(10000,1000)", sort=SortKey.TIME)

This will show us the top function calls by exclusive time (i.e., not including time spent in callees). In my environment, the top function is docformat in doccer.py, which is called twice for each agent. In terms of exclusive time, it accounts for roughly 20% of the total execution of the experiment; in terms of inclusive time (i.e., including callees), it accounts for over half the time spent in the experiment.

What does docformat do? It reformats function docstrings and performs textual substitution on them. This makes sense in one context – building up a library of distribution classes from abstract bases and filling in documentation for all of the subclasses. In the context of creating an individual instance of a distribution object with particular parameters, it’s an interesting design decision indeed, especially since we’d be unlikely to examine the documentation for thousands of distribution objects that are internal to a simulation. (SciPy refers to this as “freezing” a distribution. The documentation briefly mentions that it’s convenient to fix the shape and parameters of a distribution instance but doesn’t mention the performance impact, although searching StackOverflow and GitHub shows that others have been bitten by this issue as well.)

Some solutions

Fortunately, there are a couple of ways to work around this problem. We could simply write code that looks like experiment_two, passing distribution parameters and a stateful random number generator to each function. This would be fast but clunky.

We could also sample from a uniform distribution and map those samples to samples of our target distribution by using the inverse cumulative distribution function (or percentage point function) of the target distribution, like this example that takes ten samples from a Poisson distribution:

prng = np.random.RandomState(seed=0x00c0ffee)
scipy.stats.poisson.ppf(prng.uniform(size=10), mu=12)

(Note that SciPy calls the λ parameter mu, presumably to avoid conflict with the Python keyword lambda.)

We can make either of these approaches somewhat cleaner by wrapping them in a Python generator, like this:

def mkpoisson(l, prng):
    while True:
        yield from scipy.stats.poisson.ppf(prng.uniform(size=1024), mu=l)

We can then use the iterators returned by this generator to repeatedly sample from the distribution:

p = mkpoisson(12, np.random.RandomState(seed=0x00c0ffee))

for i in range(10):
    print(next(p))

Postscript and sidebar

Of course, if we want a deterministic simulation involving a truly large number of independent phenomena, the properties of the pseudorandom number generation algorithm we use can become important. The RandomState class from NumPy, like the pseudorandom number generator in the Python standard library, uses the Mersenne Twister, which has an extremely long period but requires roughly 2kb of internal state, which you can inspect for yourself:

rs = np.random.RandomState(seed=0x00c0ffee)
rs.get_state()

The new NumPy RNG policy, which was implemented in NumPy 1.17, features a Generator class backed by an underlying source of bit-level randomness.1 The default bit-level source is Melissa O’Neill’s PCG, which requires only two 128-bit integers of state and has better statistical properties than the Mersenne Twister. Other approaches to bit-level generation may be worth investigating in the future due to the possibility of better performance.

You can use the new PCG implementation like this:

prng = np.random.default_rng(seed=0x00c0ffee)
scipy.stats.poisson.ppf(prng.uniform(size=10), mu=12)

If you’re maintaining a lot of Python functions that depend on pseudorandom number generation — as in a discrete-event simulation — you probably want a different random state for each consumer of randomness. As a concrete example, if you’re simulating the behavior of multiple users in a store and their arrival times and basket sizes are modeled by probability distributions, you probably want a separate source of randomness for each simulated user.

Using a global generator, like the one backing the module methods in numpy.random or Python’s random, makes it difficult to seed your simulation appropriately and can also introduce implicit dependencies between the global parameters of the simulation (e.g., how many users are involved in a run of the simulation) and the local behavior of any particular user.

Once you’ve decided you need multiple sources of randomness, you’ll probably have a lot of code that looks something like this:

import random
import numpy as np

def somefunc(seed=None):
  if seed is None:
    seed = random.randrange(1 << 32)
    
  prng = np.random.RandomState(seed)

  while True:
    step_result = None
    
    # use prng to do something interesting 
    # as part of the simulation and assign 
    # it to step_result (omitted here) ...
    
    yield step_result

Initializing random number generators at the beginning of each function is not only repetitive, it’s also ugly and error-prone. The aesthetic and moral costs of this sort of boilerplate were weighing heavily on my conscience while I was writing a simulation earlier this week, but an easy solution lifted my spirits.

Python decorators are a natural way to generate a wrapper for our simulation functions that can automatically initialize a pseudorandom number generator if a seed is supplied (or create a seed if one isn’t). Here’s an example of how you could use a decorator in this way:

def makeprng(func):
  def call_with_prng(*args, prng=None, seed=None, **kwargs):
    if prng is None:
      if seed is None:
        seed = random.randrange(1 << 32)
   
      prng = np.random.RandomState(seed)
    return func(*args, prng=prng, seed=seed, **kwargs)
    
  return call_with_prng

@makeprng
def somefunc(seed=None, prng=None):

  while True:
    step_result = None
    
    # use prng to do something interesting 
    # as part of the simulation and assign 
    # it to step_result (omitted here) ...
    
    yield step_result

With the @makeprng decorator, somefunc is replaced with the output of makeprng(somefunc): a function that constructs a prng and passes it to somefunc before calling it. So if you invoke somefunc(seed=1234), it’ll construct a pseudorandom number generator seeded with 1234. If you invoke somefunc(), it’ll construct a pseudorandom number generator with an arbitrary seed.
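Here’s what that looks like in use. The skeleton above only yields placeholder None values, so the point of this sketch is just the seeding behavior: a stream you can reproduce when you supply a seed, and an arbitrary one otherwise.

# reproducible: the same seed always drives the same prng
stream = somefunc(seed=1234)
print([next(stream) for _ in range(3)])

# arbitrary: the decorator picks a seed when none is supplied
print([next(somefunc()) for _ in range(3)])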

Decorators are a convenient, low-overhead way to provide default values that must be constructed on demand for function parameters — and they make code that needs to create multiple streams of pseudorandom numbers much less painful to write and maintain.

I had a lot of fun presenting a tutorial at Strata Data NYC with my teammate Sophie Watson yesterday. In just over three hours, we covered a variety of hash-based data structures for answering interesting queries about large data sets or streams. These structures all have the following properties:

  • they’re incremental, meaning that you can update a summary of a stream by adding a single observation to it,
  • they’re parallel, meaning that you can combine a summary of A and a summary of B to get a summary of the combination of A and B, and
  • they’re scalable, meaning that it’s possible to summarize an arbitrary number of observations in a fixed-size structure (the toy sketch after this list shows what these properties look like in code).
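To make those three properties concrete, here’s a toy Bloom filter sketch (not taken from the tutorial itself): adding an observation touches only a fixed-size bit set, two summaries merge with a bitwise OR, and the structure never grows no matter how many items it has seen.

import hashlib

class BloomFilter:
    """Fixed-size set-membership summary supporting incremental adds and merging."""
    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, 0

    def _positions(self, item):
        # derive several bit positions from independent hashes of the item
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        # incremental: fold in one observation at a time
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all(self.bits & (1 << pos) for pos in self._positions(item))

    def merge(self, other):
        # parallel: combine two summaries into a summary of the union
        merged = BloomFilter(self.size, self.hashes)
        merged.bits = self.bits | other.bits
        return merged

a, b = BloomFilter(), BloomFilter()
a.add("alice"); b.add("bob")
print("alice" in a.merge(b), "carol" in a.merge(b))  # True, (almost certainly) False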

I’ve been interested in these sorts of structures for a while and it was great to have a chance to develop a tutorial covering the magic of hashing and some fun applications like Sophie’s recent work on using MinHash for recommendation engines.

If you’re interested in the tutorial, you can run through our notebooks at your own pace.

In my last post, I showed some applications of source-to-image workflows for data scientists. In this post, I’ll show another: automatically generating a model serving microservice from a git repository containing a Jupyter notebook that trains a model. The prototype s2i builder I’ll be describing is available here as source or here as an image (check the blog-201810 tag).

Basic constraints

Obviously, practitioners can create notebooks that depend on any combination of packages or data, and that require any sort of oddball execution pattern you can imagine. For the purposes of this prototype, we’re going to be (somewhat) opinionated and impose a few requirements on the notebook:

  1. The notebook must work properly if all the cells execute in order.
  2. One of the notebook cells will declare the library dependencies for the notebook as a list of [name, version] pairs in a variable called requirements, e.g., requirements = [['numpy', '1.10']].
  3. The notebook must declare a function called predictor, which will return the result of scoring the model on a provided sample.
  4. The notebook may declare a function called validator, which takes a sample and will return True if the sample provided is of the correct type and False otherwise. The generated service will use this to check if a sample has the right shape before scoring it. (If no validator is provided, the generated service will do no error-checking on arguments.)

A running example

Consider a simple example notebook. This notebook has requirements specified:

requirements = [["numpy", "1.15"], ["scikit-learn", "0.19.2"], ["scipy", "1.0.1"]]

It also trains a model (in this case, simply optimizing 7 cluster centers for random data):

import numpy as np
from sklearn.cluster import KMeans
DIMENSIONS = 2
randos = np.random.random((40000,DIMENSIONS))
kmodel = KMeans(n_clusters=7).fit(randos)

Finally, the notebook also specifies predictor and validator methods. (Note that the validator method is particularly optimistic – you’d want to do something more robust in production.)

def predictor(x):
    return kmodel.predict([x])[0]

def validator(x):
    return len(x) == DIMENSIONS

What the builder does

Our goal with a source-to-image builder is to turn this (indeed, any notebook satisfying the constraints mentioned above) into a microservice automatically. This service will run a basic application skeleton that exposes the model trained by the notebook on a REST endpoint. Here’s a high-level overview of how my prototype builder accomplishes this:

  1. It preprocesses the input notebook twice, once to generate a script that produces a requirements file from the requirements variable in the notebook and once to generate a script that produces a serialized model from the contents of the notebook,
  2. It runs the first script, generating a requirements.txt file, which it then uses to install the dependencies of the notebook and the model service in a new virtual environment (which the model service will ultimately run under), and
  3. It runs the second script, which executes every cell of the notebook in order and then captures and serializes the predictor and validator functions to a file.

The model service itself is a very simple Flask application that runs in the virtual Python environment created from the notebook’s requirements and reads the serialized model generated after executing the notebook. In the case of our running example, it would take a JSON array POSTed to /predict and return the number of the closest cluster center.
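The actual generated service lives in the builder repository linked above, but a minimal sketch of its shape might look like the following. Here I assume the builder has serialized the predictor and validator functions into a hypothetical model.pkl file; the route name matches the /predict endpoint described above, while everything else (file name, port, error handling) is illustrative.

import pickle
from flask import Flask, request, jsonify

app = Flask(__name__)

# hypothetical artifact produced by executing the notebook; the real builder
# captures the predictor and validator functions after running every cell
with open("model.pkl", "rb") as f:
    predictor, validator = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    sample = request.get_json()
    if validator is not None and not validator(sample):
        return jsonify(error="invalid sample"), 400
    result = predictor(sample)
    # cast NumPy scalars to built-in types so they serialize cleanly
    return jsonify(prediction=result.item() if hasattr(result, "item") else result)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)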

Future work and improvements

The goal of the prototype service is to show that it is possible to automatically convert notebooks that train predictive models to services that expose those models to clients. There are several ways in which the prototype could be improved:

Deploying a more robust service: currently, the model is wrapped in a simple Flask application running in a standalone (or development) server. Wrapping a model in a Flask application is essentially a running joke in the machine learning community because it’s obviously imperfect but it’s ubiquitous in any case. While Flask itself offers an attractive set of tradeoffs for developing microservices, the Flask development server is not appropriate for production deployments; other options would be better.

Serving a single prediction at once with a HTTP roundtrip and JSON serialization may not meet the latency or throughput requirements of the most demanding intelligent applications. Providing multiple service backends can address this problem: a more sophisticated builder could use the same source notebook to generate several services, e.g., a batch scoring endpoint, a service that consumes samples from a messaging bus and writes predictions to another, or even a service that delivers a signed, serialized model for direct execution within another application component.

The current prototype builder image is built up from the Fedora 27 source-to-image base image; on this base, it then installs Python and a bunch of packages to make it possible to execute Jupyter notebooks. The generated service image also installs its extra requirements in a virtual environment, but it retains some baggage from the builder image.1 A multi-stage build would make it possible to jettison dependencies only necessary for actually executing the notebook and building the image (in particular, Jupyter itself) while retaining only those dependencies necessary to actually execute the model.

Finally, a multi-stage build would enable cleverer dependency handling. The requirements to run any notebook are a subset of the requirements to run a particular notebook from start to finish, but the requirements to evaluate a model scoring function or sample validation function likely do not include all of the packages necessary to run the whole notebook (or even all of the packages necessary to run any notebook at all). By identifying only the dependencies necessary for model serving – perhaps even automatically – the serving image can be smaller and simpler.


  1. The virtual environment is necessary so that the builder image can run without special privileges – that is, it need only write to the application directory to update the virtual environment. If we needed to update system packages, we’d need to run the builder image as root