Chapeau

It’s possible to approximate the module dependencies of Python code with lightweight static analysis. These approximations aren’t perfect, but they are useful.

Dec 17, 2017

Red Hat talks at Spark Summit EU 2017

Dublin is a charming city and a burgeoning technology hub, but it also has special significance for anyone whose work involves making sense of data, since William Sealy…

Oct 31, 2017

Building machine learning algorithms on Apache Spark

spark

machine learning

pipelines

I’m giving a talk this afternoon at Spark Summit EU on extending Spark with new machine learning algorithms. Here are some additional resources and links:

Oct 25, 2017

Apache Spark on OpenShift

spark

openshift

big-data

I’m speaking this morning at the OpenShift Commons Gathering about my team’s experience running Apache Spark on Kubernetes and OpenShift. Here are some links to learn more:

Nov 7, 2016

Spark on Kubernetes at Spark Summit EU

spark

kubernetes

openshift

I’ll be speaking about Spark on Kubernetes at Spark Summit EU this week. The main thesis of my talk is that the old way of running Spark in a dedicated cluster that is…

Oct 24, 2016

Notes for my HTCondor Week talk

I’m delighted to have a chance to present at HTCondor Week this year and am looking forward to seeing some old friends and collaborators. The thesis of my talk is that…

May 17, 2016

Silex 0.0.10

silex

spark

My team and I are pleased to announce the latest release of our Silex library, featuring cool new functionality from all of the core contributors. Silex is a library of…

May 7, 2016

Log analytics talk at Apache: Big Data

As I mentioned earlier, I’ll be talking about feature engineering and outlier detection for infrastructure log data at Apache: Big Data next week. Consider this post a…

May 7, 2016

Red Hat Data Science talks at Apache: Big Data 2016

talks

If you’ll be at Apache: Big Data next week, you should definitely check out some talks from my teammates in Red Hat’s Emerging Technology group and our colleague Suneel…

May 5, 2016

Self-organizing maps in Spark

spark

machine learning

soms

Self-organizing maps are a useful technique for identifying structure in high-dimensional data sets. The map itself is a low-dimensional arrangement of cells, where each…

May 1, 2016

Dimensionality reduction in Spark

Here’s a quick video I put together introducing infrastructure log processing in Spark. At the end, there are a couple of nice graphs contrasting PCA and t-SNE for embedding…

Feb 16, 2016

Using word2vec on logs

Lately, I’ve been experimenting with Spark’s implementation of word2vec. Since most of the natural-language data I have sitting around these days are service and system logs…

Dec 11, 2015

Concrete advice about abstracts

writing

speaking

talks

Consider the following hypothetical conference session abstract:

Nov 16, 2015

Pacing technical talks

speaking

If you resist the temptation to start too quickly, you can cover more ground.

Oct 21, 2015

Notes from Flink Forward

flink

I was in Berlin last week for Flink Forward, the inaugural Apache Flink conference. I’m still learning about Flink, and Flink Forward was a great place to learn more. In…

Oct 20, 2015

fedmsg talk at Spark Summit

fedora

spark

I’m speaking at Spark Summit today about using Spark to analyze operational data from the Fedora project. Here are some links to further resources related to my talk:

Jun 15, 2015

Using Spark ML Pipeline transformers

spark

sql

ml-pipelines

In this post, we’ll see how to make a simple transformer for Spark ML Pipelines. The transformer we’ll design will generate a sparse binary feature vector from an…

Jun 13, 2015

Bokeh plots from Spark

This post will show you an extremely simple way to make quick-and-dirty Bokeh plots from data you’ve generated in Spark, but the basic technique is generally applicable to…

May 21, 2015

Planning your career like a racing season

professional development

Most people set personal and professional goals. If you work in software, your near-term professional goals might sound like this:

May 14, 2015

Elasticsearch and Spark 1.3

Elasticsearch has offered Hadoop InputFormat and OutputFormat implementations for quite some time. These made it possible to process Elasticsearch indices with Spark just as…

Apr 30, 2015

Effective continuous integration for Spark projects

spark

spark sql

Silex is a small library of helper code intended to make it easier to build real-world Spark applications;¹ most of it is factored out from applications we’ve developed…

Apr 21, 2015

Natural join for data frames in Spark

spark

sql

data frames

Natural join is a useful special case of the relational join operation (and is extremely common when denormalizing data pulled in from a relational database). Spark’s…

Apr 8, 2015

Interactively using Spark SQL and DataFrames from sbt projects

One of the great things about Apache Spark is that you can experiment with new analyses interactively. In the past, I’ve used the sbt console to try out new data…

Apr 2, 2015

Your next favorite collaboration tool

collaboration

Last night I had a crazy realization: I could probably replace the majority of what my team hopes to accomplish with standup meetings, design documents, project management…

Feb 13, 2015

Caveat censor

writing

publishing

technical review

Over eight years ago, Richard WM Jones wrote a great but disheartening article about his experience serving as a technical reviewer for an infamous book about OCaml. The…

Dec 2, 2014

Spark performance talk at ApacheCon EU

spark

talks

apachecon

I’ll be speaking later this afternoon at ApacheCon EU. The title of my talk is “Iteratively Improving Spark Application Performance.” The great thing about Apache Spark is…

Nov 18, 2014

Algebraic types and schema inference

My last post covered some considerations for using Spark SQL on a real-world JSON dataset. In particular, schema inference can suffer when you’re ingesting a dataset of…

Nov 2, 2014

fedmsg data and Spark SQL

In this post, I’ll briefly introduce fedmsg, the federated message bus developed as part of the Fedora project’s infrastructure, and discuss how to ingest fedmsg data for…

Oct 31, 2014

Notes from Strata + Hadoop World 2014

big data

strata

I went to Strata + Hadoop World last week. This event targets a pretty broad audience and is an interesting mix of trade show, data science conference, and software…

Oct 21, 2014

Introducing Leitmotif

leitmotif

development

leitmotif is a very simple templating tool that generates directories from prototypes stored in git repositories. Its design prioritizes simplicity and a minimal set of…

Oct 8, 2014

Licensing morality

open source

licensing

I’ve written in the past about what a mistake it is to add behavioral incentives or morality clauses to the licenses for open-source projects. Briefly, these clauses are bad:

Sep 19, 2014

Replies via Twitter in Octopress

As of the early 2020s, this advice is only of historical interest: the versions of Octopress and Jekyll described have long since bit-rotted and many people — including this…

Sep 10, 2014

Improving Spark application performance

spark

scala

performance

big-data

sur-la-plaque

bicycling

One of my side projects this year has been using Apache Spark to make sense of my bike power meter data. There are a few well-understood approaches to bike power data…

Sep 9, 2014

Name this concept

combinatorics

Consider the collection¹ of all contiguous subsequences of a sequence. If we’re talking about a stream of n observations, this could be the multiset of windows containing…

Sep 5, 2014

Automating the sbt REPL

sbt

scala

automation

If you’re like me, you often find yourself pasting transcripts into sbt console sessions in order to interactively test out new app functionality. A lot of times, these…

Sep 3, 2014

Implementing type translation

In earlier posts we introduced the concepts of type widening and type translation, discussed support for these in existing database systems, and presented a general approach…

Aug 26, 2014

Implementing type widening

In this installment of our series on type coercions, we’re going to introduce a way to support type widening in a language interpreter. We’ll present a general approach…

Aug 19, 2014

Implicit type coercion support in existing database systems

In my last post, I introduced two kinds of implicit type coercions that can appear in database query languages: type widenings, in which values are converted to wider types…

Aug 19, 2014

Type coercions for untyped query languages

In this post, we’re going to introduce two kinds of implicit type conversions that are common in database query languages:

Aug 18, 2014

Sharing volumes to Docker as the right user

docker

development

Yesterday’s post provided a couple of minimal Docker images for spinning up containers to do Scala and Java builds and tests. The second of the images had a pre-loaded Ivy…

Aug 6, 2014

Upstream-friendly JVM development and testing with Docker

I recently experienced some bizarre failures running Akka actors and sbt tests in forked mode on my Fedora laptop. As far as I can tell, the root of both problems was that…

Aug 5, 2014

Video of bike data analysis talk

spark

bicycling

mllib

I gave a talk at Spark Summit earlier this month about my work using Apache Spark to analyze my bike power meter data, and the conference videos are now online. You can…

Jul 20, 2014

Spark bike data analytics video

I’m looking forward to giving a talk at Spark Summit this week about some of my recent work using Apache Spark to make sense of my bike data (see also previous posts here and …

Jun 29, 2014

Finding top bicycling efforts with Spark

spark

bicycling

mllib

In an earlier post, I showed how I had used Apache Spark to cluster points from GPS traces of bike rides and plot the convex hulls of each cluster, coloring each hull based…

May 27, 2014

Fitness data visualization with Apache Spark

spark

bicycling

mllib

This post contains embedded maps; you’ll need to view it in a browser with JavaScript support in order to see them.

Apr 1, 2014

sbt is in Fedora 20

sbt

scala

fedora

Longtime Chapeau readers may recall last summer’s lament about the state of the Scala ecosystem in Fedora. We’ve taken a lot of steps since then. After a rough patch for the…

Feb 20, 2014

One weird trick to eviscerate open source licenses

The GNU project’s Four Freedoms present the essential components of Free software: freedom to use the program for any purpose, freedom to study and change the program’s…

Feb 19, 2014

A simple machine learning app with Spark

spark

mllib

fedora

I’m currently on my way back from the first-ever Spark Summit, where I presented a talk on some of my work with the Fedora Big Data SIG to package Apache Spark and its…

Dec 4, 2013

Apache Thrift in Fedora

thrift

fedora

You probably already know that Apache Thrift is a framework for developing distributed services and clients to access these in multiple languages. You probably also knew…

Oct 16, 2013

Making Fedora a better place for Scala

sbt

scala

fedora

Scala combines a lot of excellent features (functional-style pattern matching, an expressive type system, closures, etc.) with JVM compatibility and a very interesting…

Aug 5, 2013

Installing Spark on Fedora 18

spark

fedora

The Spark project is an actively-developed open-source engine for data analytics on clusters using Scala, Python, or Java. It offers map, filter, and reduce operations over…

Apr 11, 2013

Best practices for Wallaby’s default group

htcondor

mrg

wallaby

Recall that Wallaby applies partial configurations to groups of nodes. Groups can be either explicit — that is, a named subset of nodes created by the user, or special groups…

Nov 1, 2012

Configuring high-availability Condor central managers with Wallaby

htcondor

mrg

wallaby

Rob Rati and I gave a tutorial on highly-available job queues at Condor Week this year. While it was not a Wallaby-specific tutorial, we did point out that configuring…

Oct 22, 2012

Authorization for Wallaby clients

htcondor

mrg

wallaby

Wallaby 0.16.0, which updates the Wallaby API version to 20101031.6, includes support for authorizing broker users with various roles that can interact with Wallaby in…

Sep 12, 2012

Highly-available configuration data with Wallaby

htcondor

mrg

wallaby

Many Condor users are interested in high-availability (HA) services: they don’t want their compute resources to become unavailable due to the failure of a single machine…

Aug 29, 2012

Using Wallaby’s skeleton group

htcondor

mrg

wallaby

Wallaby 0.15.0 includes a new feature called the skeleton group. (This feature was available in earlier versions of Wallaby, too, but it was experimental and had some rough…

Jun 15, 2012

Troubleshooting Condor with Wallaby

htcondor

mrg

wallaby

Often, if you’re trying to reproduce a problem someone else is having with Condor, you’ll need their configuration. Likewise, if you’re trying to help someone reproduce a…

Jun 1, 2012

Boldly going forward — to Ruby 1.9

ruby

From the very beginning of the project, we’ve developed Wallaby and its stack in Ruby 1.8 and not paid much attention to Ruby 1.9. We had done so for a few reasons, but…

May 2, 2012

Exporting versioned Wallaby configurations

htcondor

mrg

wallaby

Wallaby stores versioned configurations in a database. Wallaby API clients can access older versions of a node’s configuration by supplying the version option to the Node#ge…

Nov 2, 2011

Wallaby paper at SC11

wallaby

rht

supercomputing

I’m pleased to announce that our paper “Wallaby: A Scalable Semantic Configuration Service for Grids and Clouds” will be presented at SC11 in the “State of the Practice”…

Sep 8, 2011

Write-once, defaultable constants in Ruby

ruby

metaprogramming

quiescing constants

Ruby constants are a nice place to put application configuration information, but they can be inflexible if you want to defer initialization until later — for example, if…

Sep 2, 2011

Gliss 0.2.0 release and a gliss example

git

gliss

annotation

Earlier today, I released version 0.2.0 of Gliss, a lightweight tool for inspecting and processing tagged annotations in git repositories. Since I last wrote about gliss…

Jul 21, 2011

When software-dependency philosophies collide

Earlier today, I released version 0.4.0 of Rhubarb, a little object-graph persistence library for Ruby built on top of SQLite. Rhubarb 0.4.0 adds no new features, unless you…

Jul 6, 2011

Node tagging in a Wallaby client library

htcondor

mrg

wallaby

In an earlier post, I presented a technique for adding node tagging to Wallaby without adding explicit tagging support to the Wallaby API. Node tags are useful for a…

Jun 20, 2011

RESTful manipulation of versioned data

rest

functional trees

versioning

In this post, I’ll sketch a half-baked plan for making an idiomatic RESTful service that handles versioned data in a sensible way. I’m not claiming that the pattern I’m…

Jun 7, 2011

Using Wallaby groups to implement node tagging

htcondor

mrg

wallaby

One of the great things about Wallaby is that it’s a platform, not merely a tool. Put another way, if it doesn’t do exactly what you want, you can use its API to build…

May 25, 2011

Wallaby user tutorial and live VM

htcondor

mrg

wallaby

Rob Rati and I presented a Wallaby user tutorial at Condor Week yesterday. Today, we have a tutorial you can follow along with at home, including a link to an EC2 AMI that…

May 4, 2011

Grepping for git glosses with gliss

git

gliss

annotation

I’m pleased to announce the first release of gliss, a tool to make it easier to track lightweight inline annotations in your git repositories. gliss is available as source…

Jan 12, 2011

Extending wallaby with a python client library

In my previous post, we saw how to extend wallaby by writing Ruby classes that use a client library to extend the wallaby shell. If you’re comfortable with Ruby, this is a…

Dec 16, 2010

Extending the wallaby shell

htcondor

mrg

wallaby

The most recent few releases of the Wallaby configuration management service have included some great new features: wallaby console can now be used as an interpreter for sheb…

Oct 21, 2010

Wallaby node inventory with constraints

htcondor

mrg

wallaby

wallaby inventory is a useful command for quickly checking up on the health of your pool and answering certain kinds of questions: Which nodes have checked in recently?…

Oct 18, 2010

Retrieving Wallaby node configurations over HTTP

htcondor

mrg

wallaby

In some environments, users may wish to use Wallaby to serve configurations to nodes that can’t reach the Qpid broker that the Wallaby agent is running against. Some users…

Oct 15, 2010

Flexible interaction with the Wallaby console

htcondor

mrg

wallaby

One of the main benefits of using Wallaby for configuration is the remote-access API. Because the API is comprehensive and usable from any language with a QMF binding…

Sep 28, 2010

Migrating legacy Condor configurations to Wallaby

htcondor

mrg

wallaby

Wallaby provides a great way to manage Condor configurations, and if you’re just starting out with Condor, it’s easy to do things the Wallaby way from the start. However…

Sep 27, 2010

Mostly-transparent memoization in Ruby

ruby

memoization

functional programming

Here’s an easy technique for automatically memoizing the results of method calls in Ruby. Let’s say that we’re interested in looking up instances of the Employee class by…

Aug 26, 2010

Updates to the wallaby API

htcondor

mrg

wallaby

If you’ve built tools on the Wallaby API, you may be interested in some recent changes to the API; these are currently in source control and will appear in the upcoming…

Jun 8, 2010

Markdown documentation of QMF APIs

qpid

qmf

Here’s a cheap and cheerful little script I threw together to automatically generate Markdown-formatted documentation for my QMF methods. I used this to make the Wallaby…

May 13, 2010

SPQR 0.3.0, now with event support

I’m pleased to announce yesterday’s release of SPQR 0.3.0, which is available as source, on github, or as a RubyGem. (SPQR is a library to make it painless to publish Ruby…

May 11, 2010

Notes on configuration

htcondor

mrg

wallaby

As most readers of this site know, I’ve been busy lately working on the Wallaby configuration service, which aims to make it painless to manage configurations for entire…

May 3, 2010

At Condor Week

htcondor

mrg

Apr 15, 2010

Introducing capricious

ruby

capricious

prng

Last week I needed a good random number generator to make repeatable stress tests for a Ruby project. Ruby’s standard library includes a good random number generator (the Me…

Mar 17, 2010

SPQR update

I released version 0.1.2 of SPQR this morning; it is available from gemcutter (as an installable gem package) or from fedorahosted.org (as source). This version contains…

Dec 21, 2009

Two brief SPQR updates

Here are two quick notes (and a bonus meta-note) about the quickly-evolving SPQR project:

Nov 26, 2009

Automatically generating QMF agents with spqr-gen

In a previous post, I introduced SPQR and presented a couple of examples of how one could use SPQR to publish Ruby objects over QMF. Sometimes, though, you aren’t starting…

Nov 24, 2009

Introducing SPQR

SPQR is a framework to make it almost painless to create QMF agents in the Ruby language, and thus to write Ruby applications that can be managed remotely. I built it to…

Nov 21, 2009

Hiding template parameters with dynamic dispatch

type systems

c++

template metaprogramming

In my last post, I mentioned a problem (choosing between one of several templated classes based on information that won’t be available until runtime) and also mentioned a…

Jul 1, 2009

A problem of dependent types

type systems

c++

Here’s an interesting problem involving C++ templates. Say you have a class that is parameterized on a term — we’ll use std::bitset<N> as a running example — and you’d like…

Jun 24, 2009

Virtualizing a physical Linux machine

Due to some hardware trouble with my main work machine, I’m presently working in a virtual machine on my personal computer. After a few dim trails, I found a pretty…

Mar 30, 2009

Code on slides
