Chapeau
Code on slides
speaking
writing
slides

Some quick tips for preparing slide presentations that include source code.

Mar 20, 2023
Wrapping up
speaking
writing
slides

There may be better ways to end your talk than (merely) thanking your audience.

Jul 5, 2021
Accelerated single-node Kubernetes with microk8s, Kubeflow, and RAPIDS.ai
kubernetes
microk8s
kubeflow
rapids

Some notes from setting up a local Kubernetes environment for accelerated data science.

Apr 6, 2021
Parquet interoperability across language ecosystems
parquet
arrow
spark
pyspark
java
big data

Parquet is available in many environments but you’ll need to keep some quirks in mind to realize the benefits of its ubiquity.

Jan 29, 2021
Cloud-native machine learning systems at day two and beyond
mlops
talks
A virtual handout for our KubeCon talk.
Nov 19, 2020
Mapping the territory for MLOps
mlops
A good map can reveal a lot about a problem and its solution.
Oct 23, 2020
Machine learning systems and intelligent applications
mlops
publications
My argument for a new way to think about machine learning systems in the cloud.
Apr 27, 2020
Going beyond the basics with Altair
data science
python
altair
visualization
Altair is one of my favorite plotting libraries. Here are some examples of how to use it for data prep, interactive plots, and geospatial data.
Apr 13, 2020
Efficiently sampling from many probability distributions in SciPy
data science
python
scipy
simulation
Avoid a performance pitfall when using SciPy’s probability distributions.
Mar 15, 2020
Repeatable simulations without repeated boilerplate
data science
python
scipy
simulation
If you’re maintaining a lot of Python functions that depend on having pseudorandom number generation — like in a discrete-event simulation — you probably want different…
Feb 8, 2020
Sketching data and other magic tricks
data science
sketching
python
Materials from a tutorial on some very cool data structures.
Sep 25, 2019
Automatically generating model services from Jupyter notebooks
mlops
data science
modelops
kubernetes
In my last post, I showed some applications of source-to-image workflows for data scientists. In this post, I’ll show another: automatically generating a model serving…
Oct 24, 2018
Materials for my Strata Data talk
containers
data science
reproducibility
I’m excited to be speaking at Strata Data in New York this Wednesday afternoon! My talk introduces the benefits of Linux containers and container application platforms for…
Sep 10, 2018
A parable about causality and abduction
philosophy of science
What do your results tell you about the world?
Aug 27, 2018
Ad-hoc, informally-specified, bug-ridden operating system distributions
containers
A world in which anyone can build a Linux container image is also a world in which everyone is maintaining their own Linux distribution, whether they want to or not.
Jul 23, 2018
Spark’s RDD API, variance, and typeclasses
spark
rdd
scala
implicit
typeclasses
This brief post is based on material that Erik and I didn’t have time to cover in our Spark+AI Summit talk; it will show you how to use Scala’s implicit parameter mechanism…
Jun 3, 2018
Virtual handout for my Red Hat Summit talk
ml
Some materials related to my quick overview of machine learning techniques for enterprise developers at Red Hat Summit 2018.
May 8, 2018
Virtual handout for my DevConf talk
ml
scale
parallel
sketches
Some materials and links related to my talk on probabilistic data structures.
Jan 26, 2018
Misfeature recovery and APIs
rest
api
python
trello
My team recently agreed that it would improve the usability of our main Trello board if we moved lists containing cards we’d completed in previous years to archival boards.…
Jan 11, 2018
Identifying Python module dependencies
python
jupyter
ml
machine learning
static analysis

It’s possible to approximate the module dependencies of Python code with lightweight static analysis. These approximations aren’t perfect, but they are useful.

Dec 17, 2017
Red Hat talks at Spark Summit EU 2017
Dublin is a charming city and a burgeoning technology hub, but it also has special significance for anyone whose work involves making sense of data, since William Sealy…
Oct 31, 2017
Building machine learning algorithms on Apache Spark
spark
ml
ai
machine learning
pipelines
I’m giving a talk this afternoon at Spark Summit EU on extending Spark with new machine learning algorithms. Here are some additional resources and links:
Oct 25, 2017
Apache Spark on OpenShift
spark
openshift
big-data
I’m speaking this morning at the OpenShift Commons Gathering about my team’s experience running Apache Spark on Kubernetes and OpenShift. Here are some links to learn more:
Nov 7, 2016
Spark on Kubernetes at Spark Summit EU
spark
kubernetes
openshift
I’ll be speaking about Spark on Kubernetes at Spark Summit EU this week. The main thesis of my talk is that the old way of running Spark in a dedicated cluster that is…
Oct 24, 2016
Notes for my HTCondor Week talk
htcondor
data science
spark
flink
I’m delighted to have a chance to present at HTCondor Week this year and am looking forward to seeing some old friends and collaborators. The thesis of my talk is that…
May 17, 2016
Silex 0.0.10
silex
spark
My team and I are pleased to announce the latest release of our Silex library, featuring cool new functionality from all of the core contributors. Silex is a library of…
May 7, 2016
Log analytics talk at Apache: Big Data
talks
spark
logs
machine learning
data science
As I mentioned earlier, I’ll be talking about feature engineering and outlier detection for infrastructure log data at Apache: Big Data next week. Consider this post a…
May 7, 2016
Red Hat Data Science talks at Apache: Big Data 2016
talks
If you’ll be at Apache: Big Data next week, you should definitely check out some talks from my teammates in Red Hat’s Emerging Technology group and our colleague Suneel…
May 5, 2016
Self-organizing maps in Spark
spark
machine learning
soms
Self-organizing maps are a useful technique for identifying structure in high-dimensional data sets. The map itself is a low-dimensional arrangement of cells, where each…
May 1, 2016
Dimensionality reduction in Spark
spark
pca
tsne
machine learning
logs
Here’s a quick video I put together introducing infrastructure log processing in Spark. At the end, there are a couple of nice graphs contrasting PCA and t-SNE for embedding…
Feb 16, 2016
Using word2vec on logs
spark
nlp
word2vec
machine learning
Lately, I’ve been experimenting with Spark’s implementation of word2vec. Since most of the natural-language data I have sitting around these days are service and system logs…
Dec 11, 2015
Concrete advice about abstracts
writing
speaking
talks
Consider the following hypothetical conference session abstract:
Nov 16, 2015
Pacing technical talks
speaking
If you resist the temptation to start too quickly, you can cover more ground.
Oct 21, 2015
Notes from Flink Forward
flink
I was in Berlin last week for Flink Forward, the inaugural Apache Flink conference. I’m still learning about Flink, and Flink Forward was a great place to learn more. In…
Oct 20, 2015
fedmsg talk at Spark Summit
fedora
spark
I’m speaking at Spark Summit today about using Spark to analyze operational data from the Fedora project. Here are some links to further resources related to my talk:
Jun 15, 2015
Using Spark ML Pipeline transformers
spark
sql
ml-pipelines
In this post, we’ll see how to make a simple transformer for Spark ML Pipelines. The transformer we’ll design will generate a sparse binary feature vector from an…
Jun 13, 2015
Bokeh plots from Spark
spark
plotting
scala
bokeh
This post will show you an extremely simple way to make quick-and-dirty Bokeh plots from data you’ve generated in Spark, but the basic technique is generally applicable to…
May 21, 2015
Planning your career like a racing season
professional development
Most people set personal and professional goals. If you work in software, your near-term professional goals might sound like this:
May 14, 2015
Elasticsearch and Spark 1.3
spark
spark sql
elasticsearch
silex
Elasticsearch has offered Hadoop InputFormat and OutputFormat implementations for quite some time. These made it possible to process Elasticsearch indices with Spark just as…
Apr 30, 2015
Effective continuous integration for Spark projects
spark
spark sql
Silex is a small library of helper code intended to make it easier to build real-world Spark applications; most of it is factored out from applications we’ve developed…
Apr 21, 2015
Natural join for data frames in Spark
spark
sql
data frames
Natural join is a useful special case of the relational join operation (and is extremely common when denormalizing data pulled in from a relational database). Spark’s…
Apr 8, 2015
Interactively using Spark SQL and DataFrames from sbt projects
spark
sql
dataframe
scala
reflection
One of the great things about Apache Spark is that you can experiment with new analyses interactively. In the past, I’ve used the sbt console to try out new data…
Apr 2, 2015
Your next favorite collaboration tool
collaboration
Last night I had a crazy realization: I could probably replace the majority of what my team hopes to accomplish with standup meetings, design documents, project management…
Feb 13, 2015
Caveat censor
writing
publishing
technical review
Over eight years ago, Richard WM Jones wrote a great but disheartening article about his experience serving as a technical reviewer for an infamous book about OCaml. The…
Dec 2, 2014
Spark performance talk at ApacheCon EU
spark
talks
apachecon
I’ll be speaking later this afternoon at ApacheCon EU. The title of my talk is “Iteratively Improving Spark Application Performance.” The great thing about Apache Spark is…
Nov 18, 2014
Algebraic types and schema inference
type systems
type theory
json
spark
sql
scala
My last post covered some considerations for using Spark SQL on a real-world JSON dataset. In particular, schema inference can suffer when you’re ingesting a dataset of…
Nov 2, 2014
fedmsg data and Spark SQL
fedora
fedmsg
spark
spark sql
json
scala
In this post, I’ll briefly introduce fedmsg, the federated message bus developed as part of the Fedora project’s infrastructure, and discuss how to ingest fedmsg data for…
Oct 31, 2014
Notes from Strata + Hadoop World 2014
big data
strata
I went to Strata + Hadoop World last week. This event targets a pretty broad audience and is an interesting mix of trade show, data science conference, and software…
Oct 21, 2014
Introducing Leitmotif
leitmotif
development
leitmotif is a very simple templating tool that generates directories from prototypes stored in git repositories. Its design prioritizes simplicity and a minimal set of…
Oct 8, 2014
Licensing morality
open source
licensing
I’ve written in the past about what a mistake it is to add behavioral incentives or morality clauses to the licenses for open-source projects. Briefly, these clauses are bad:
Sep 19, 2014
Replies via Twitter in Octopress
As of the early 2020s, this advice is only of historical interest: the versions of Octopress and Jekyll described have long since bit-rotted and many people — including this…
Sep 10, 2014
Improving Spark application performance
spark
scala
performance
big-data
sur-la-plaque
bicycling
One of my side projects this year has been using Apache Spark to make sense of my bike power meter data. There are a few well-understood approaches to bike power data…
Sep 9, 2014
Name this concept
combinatorics
Consider the collection of all contiguous subsequences of a sequence. If we’re talking about a stream of n observations, this could be the multiset of windows containing…
Sep 5, 2014
Automating the sbt REPL
sbt
scala
automation
If you’re like me, you often find yourself pasting transcripts into sbt console sessions in order to interactively test out new app functionality. A lot of times, these…
Sep 3, 2014
Implementing type translation
type systems
sql
hive
type coercions
In earlier posts we introduced the concepts of type widening and type translation, discussed support for these in existing database systems, and presented a general approach…
Aug 26, 2014
Implementing type widening
type systems
sql
hive
scala
type coercions
In this installment of our series on type coercions, we’re going to introduce a way to support type widening in a language interpreter. We’ll present a general approach…
Aug 19, 2014
Implicit type coercion support in existing database systems
type systems
sql
hive
type coercions
In my last post, I introduced two kinds of implicit type coercions that can appear in database query languages: type widenings, in which values are converted to wider types…
Aug 19, 2014
Type coercions for untyped query languages
type systems
sql
hive
type coercions
In this post, we’re going to introduce two kinds of implicit type conversions that are common in database query languages:
Aug 18, 2014
Sharing volumes to Docker as the right user
docker
development
Yesterday’s post provided a couple of minimal Docker images for spinning up containers to do Scala and Java builds and tests. The second of the images had a pre-loaded Ivy…
Aug 6, 2014
Upstream-friendly JVM development and testing with Docker
docker
development
sbt
maven
fedora
centos
I recently experienced some bizarre failures running Akka actors and sbt tests in forked mode on my Fedora laptop. As far as I can tell, the root of both problems was that…
Aug 5, 2014
Video of bike data analysis talk
spark
bicycling
mllib
I gave a talk at Spark Summit earlier this month about my work using Apache Spark to analyze my bike power meter data, and the conference videos are now online. You can…
Jul 20, 2014
Spark bike data analytics video
spark
bicycling
mllib
demo
I’m looking forward to giving a talk at Spark Summit this week about some of my recent work using Apache Spark to make sense of my bike data (see also previous posts here and …
Jun 29, 2014
Finding top bicycling efforts with Spark
spark
bicycling
mllib
In an earlier post, I showed how I had used Apache Spark to cluster points from GPS traces of bike rides and plot the convex hulls of each cluster, coloring each hull based…
May 27, 2014
Fitness data visualization with Apache Spark
spark
bicycling
mllib
This post contains embedded maps; you’ll need to view it in a browser with JavaScript support in order to see them.
Apr 1, 2014
sbt is in Fedora 20
sbt
scala
fedora
Longtime Chapeau readers may recall last summer’s lament about the state of the Scala ecosystem in Fedora. We’ve taken a lot of steps since then. After a rough patch for the…
Feb 20, 2014
One weird trick to eviscerate open source licenses
open source
legal
licensing
free software
fedora
The GNU project’s Four Freedoms present the essential components of Free software: freedom to use the program for any purpose, freedom to study and change the program’s…
Feb 19, 2014
A simple machine learning app with Spark
spark
mllib
fedora
I’m currently on my way back from the first-ever Spark Summit, where I presented a talk on some of my work with the Fedora Big Data SIG to package Apache Spark and its…
Dec 4, 2013
Apache Thrift in Fedora
thrift
fedora
You probably already know that Apache Thrift is a framework for developing distributed services and clients to access these in multiple languages. You probably also knew…
Oct 16, 2013
Making Fedora a better place for Scala
sbt
scala
fedora
Scala combines a lot of excellent features (functional-style pattern matching, an expressive type system, closures, etc.) with JVM compatibility and a very interesting…
Aug 5, 2013
Installing Spark on Fedora 18
spark
fedora
The Spark project is an actively-developed open-source engine for data analytics on clusters using Scala, Python, or Java. It offers map, filter, and reduce operations over…
Apr 11, 2013
Best practices for Wallaby’s default group
htcondor
mrg
wallaby
Recall that Wallaby applies partial configurations to groups of nodes. Groups can be either explicit — that is, a named subset of nodes created by the user, or special groups…
Nov 1, 2012
Configuring high-availability Condor central managers with Wallaby
htcondor
mrg
wallaby
Rob Rati and I gave a tutorial on highly-available job queues at Condor Week this year. While it was not a Wallaby-specific tutorial, we did point out that configuring…
Oct 22, 2012
Authorization for Wallaby clients
htcondor
mrg
wallaby
Wallaby 0.16.0, which updates the Wallaby API version to 20101031.6, includes support for authorizing broker users with various roles that can interact with Wallaby in…
Sep 12, 2012
Highly-available configuration data with Wallaby
htcondor
mrg
wallaby
Many Condor users are interested in high-availability (HA) services: they don’t want their compute resources to become unavailable due to the failure of a single machine…
Aug 29, 2012
Using Wallaby’s skeleton group
htcondor
mrg
wallaby
Wallaby 0.15.0 includes a new feature called the skeleton group. (This feature was available in earlier versions of Wallaby, too, but it was experimental and had some rough…
Jun 15, 2012
Troubleshooting Condor with Wallaby
htcondor
mrg
wallaby
Often, if you’re trying to reproduce a problem someone else is having with Condor, you’ll need their configuration. Likewise, if you’re trying to help someone reproduce a…
Jun 1, 2012
Boldly going forward — to Ruby 1.9
ruby
From the very beginning of the project, we’ve developed Wallaby and its stack in Ruby 1.8 and not paid much attention to Ruby 1.9. We had done so for a few reasons, but…
May 2, 2012
Exporting versioned Wallaby configurations
htcondor
mrg
wallaby
Wallaby stores versioned configurations in a database. Wallaby API clients can access older versions of a node’s configuration by supplying the version option to the Node#ge…
Nov 2, 2011
Wallaby paper at SC11
wallaby
rht
supercomputing
I’m pleased to announce that our paper “Wallaby: A Scalable Semantic Configuration Service for Grids and Clouds” will be presented at SC11 in the “State of the Practice”…
Sep 8, 2011
Write-once, defaultable constants in Ruby
ruby
metaprogramming
quiescing constants
Ruby constants are a nice place to put application configuration information, but they can be inflexible if you want to defer initialization until later — for example, if…
Sep 2, 2011
Gliss 0.2.0 release and a gliss example
git
gliss
annotation
Earlier today, I released version 0.2.0 of Gliss, a lightweight tool for inspecting and processing tagged annotations in git repositories. Since I last wrote about gliss…
Jul 21, 2011
When software-dependency philosophies collide
fedora
devops
ruby
sqlite
Earlier today, I released version 0.4.0 of Rhubarb, a little object-graph persistence library for Ruby built on top of SQLite. Rhubarb 0.4.0 adds no new features, unless you…
Jul 6, 2011
Node tagging in a Wallaby client library
htcondor
mrg
wallaby
In an earlier post, I presented a technique for adding node tagging to Wallaby without adding explicit tagging support to the Wallaby API. Node tags are useful for a…
Jun 20, 2011
RESTful manipulation of versioned data
rest
functional trees
versioning
In this post, I’ll sketch a half-baked plan for making an idiomatic RESTful service that handles versioned data in a sensible way. I’m not claiming that the pattern I’m…
Jun 7, 2011
Using Wallaby groups to implement node tagging
htcondor
mrg
wallaby
One of the great things about Wallaby is that it’s a platform, not merely a tool. Put another way, if it doesn’t do exactly what you want, you can use its API to build…
May 25, 2011
Wallaby user tutorial and live VM
htcondor
mrg
wallaby
Rob Rati and I presented a Wallaby user tutorial at Condor Week yesterday. Today, we have a tutorial you can follow along with at home, including a link to an EC2 AMI that…

May 4, 2011
Grepping for git glosses with gliss
git
gliss
annotation
I’m pleased to announce the first release of gliss, a tool to make it easier to track lightweight inline annotations in your git repositories. gliss is available as source…
Jan 12, 2011
Extending wallaby with a python client library
htcondor
mrg
wallaby
python
In my previous post, we saw how to extend wallaby by writing Ruby classes that use a client library to extend the wallaby shell. If you’re comfortable with Ruby, this is a…
Dec 16, 2010
Extending the wallaby shell
htcondor
mrg
wallaby
The most recent few releases of the Wallaby configuration management service have included some great new features: wallaby console can now be used as an interpreter for sheb…
Oct 21, 2010
Wallaby node inventory with constraints
htcondor
mrg
wallaby
wallaby inventory is a useful command for quickly checking up on the health of your pool and answering certain kinds of questions: Which nodes have checked in recently?…
Oct 18, 2010
Retrieving Wallaby node configurations over HTTP
htcondor
mrg
wallaby
In some environments, users may wish to use Wallaby to serve configurations to nodes that can’t reach the Qpid broker that the Wallaby agent is running against. Some users…
Oct 15, 2010
Flexible interaction with the Wallaby console
htcondor
mrg
wallaby
One of the main benefits of using Wallaby for configuration is the remote-access API. Because the API is comprehensive and usable from any language with a QMF binding…
Sep 28, 2010
Migrating legacy Condor configurations to Wallaby
htcondor
mrg
wallaby
Wallaby provides a great way to manage Condor configurations, and if you’re just starting out with Condor, it’s easy to do things the Wallaby way from the start. However…
Sep 27, 2010
Mostly-transparent memoization in Ruby
ruby
memoization
functional programming
Here’s an easy technique for automatically memoizing the results of method calls in Ruby. Let’s say that we’re interested in looking up instances of the Employee class by…
Aug 26, 2010
Updates to the wallaby API
htcondor
mrg
wallaby
If you’ve built tools on the Wallaby API, you may be interested in some recent changes to the API; these are currently in source control and will appear in the upcoming…
Jun 8, 2010
Markdown documentation of QMF APIs
qpid
qmf
Here’s a cheap and cheerful little script I threw together to automatically generate Markdown-formatted documentation for my QMF methods. I used this to make the Wallaby…
May 13, 2010
SPQR 0.3.0, now with event support
spqr
qpid
qmf
ruby
I’m pleased to announce yesterday’s release of SPQR 0.3.0, which is available as source, on github, or as a RubyGem. (SPQR is a library to make it painless to publish Ruby…
May 11, 2010
Notes on configuration
htcondor
mrg
wallaby
As most readers of this site know, I’ve been busy lately working on the Wallaby configuration service, which aims to make it painless to manage configurations for entire…
May 3, 2010
At Condor Week
htcondor
mrg
Apr 15, 2010
Introducing capricious
ruby
capricious
prng
Last week I needed a good random number generator to make repeatable stress tests for a Ruby project. Ruby’s standard library includes a good random number generator (the Me…
Mar 17, 2010
SPQR update
spqr
qpid
qmf
ruby
I released version 0.1.2 of SPQR this morning; it is available from gemcutter (as an installable gem package) or from fedorahosted.org (as source). This version contains…
Dec 21, 2009
Two brief SPQR updates
spqr
qpid
qmf
ruby
Here are two quick notes (and a bonus meta-note) about the quickly-evolving SPQR project:
Nov 26, 2009
Automatically generating QMF agents with spqr-gen
spqr
qpid
qmf
ruby
In a previous post, I introduced SPQR and presented a couple of examples of how one could use SPQR to publish Ruby objects over QMF. Sometimes, though, you aren’t starting…
Nov 24, 2009
Introducing SPQR
spqr
qpid
qmf
ruby
SPQR is a framework to make it almost painless to create QMF agents in the Ruby language, and thus to write Ruby applications that can be managed remotely. I built it to…
Nov 21, 2009
Hiding template parameters with dynamic dispatch
type systems
c++
template metaprogramming
In my last post, I mentioned a problem (choosing between one of several templated classes based on information that won’t be available until runtime) and also mentioned a…
Jul 1, 2009
A problem of dependent types
type systems
c++
Here’s an interesting problem involving C++ templates. Say you have a class that is parameterized on a term — we’ll use std::bitset<N> as a running example — and you’d like…
Jun 24, 2009
Virtualizing a physical Linux machine
kvm
lvm
linux
image
Due to some hardware trouble with my main work machine, I’m presently working in a virtual machine on my personal computer. After a few dim trails, I found a pretty…
Mar 30, 2009
About the author
meta
I’m Will Benton and I’ll be posting some short technical articles related to my work – specifically, programming languages, broadly construed, and high-throughput…
Mar 27, 2009

    Copyright (c) 2009–2023, William C. Benton