The fundamental value proposition of Kubernetes is that it can provide an abstraction layer for distributed applications on any infrastructure. This depends on the intersection of two partial myths:

  1. Kubernetes consistently provides every essential service that distributed applications need, and
  2. Kubernetes can run equally well on different cloud footprints, in a datacenter, or on a single node.

There’s a kernel of truth in each of these myths. Vanilla upstream Kubernetes provides many important primitives for distributed applications — but not everything — and individual Kubernetes distributions typically bundle services to address the gaps. While it is technically possible to run Kubernetes on a single node by installing a specialized distribution, most of these solutions are rough around the edges, and if you want true portability, you’ll need to run the same Kubernetes distribution on your workstation (or laptop) and in your datacenter.

Since I’m more interested in developing tools that could be ported to a variety of Kubernetes distributions than I am in developing an application that is absolutely reproducible across multiple footprints in a single organization without additional effort, I have the flexibility to choose any single-node distribution of Kubernetes for local use.

I’ve been impressed with the setup and user experience of microk8s for a long time and used to run it (in a VM) on my old MacBook. In this post, I’ll explain how I used microk8s to set up a data science development environment on my workstation, complete with GPU acceleration, Kubeflow, and a notebook image with RAPIDS.ai preloaded.

System setup

I started with a relatively fresh installation of Ubuntu 20.04,1 and installed CUDA 11.0 from the NVIDIA repository, following these instructions:

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/7fa2af80.pub
sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/ /"
sudo apt-get update
sudo apt-get -y install cuda-11-0

microk8s

I then installed microk8s with snap. There are a few options for versions and channels, but as of late March 2021 the most stable for me was Kubernetes 1.20 (more on this in a bit).

sudo snap install microk8s --classic --channel=1.20/stable
sudo microk8s status --wait-ready

microk8s ships with Calico enabled, and there is a longstanding bug in Calico that prevents it from finding network interfaces that contain lo anywhere in their names. Since the wireless interface in my workstation is called wlo2, I needed to change Calico’s environment to get it to work:

microk8s kubectl set env daemonset/calico-node -n kube-system IP_AUTODETECTION_METHOD=interface=wlo.\*

GPU support

With CUDA 11.0 installed, I was able to enable GPU support in microk8s:

microk8s enable dns
microk8s enable gpu

I could then verify that the pods started successfully:

kubectl describe pods -l name=nvidia-device-plugin-ds -n kube-system

and, once they had, that my node had been labeled properly:

microk8s kubectl get node -o jsonpath="{range .items[*]}{..allocatable}{'\n'}{end}"

You’ll want to see an allocatable resource of type nvidia.com/gpu in that output, like the following, where the value of nvidia.com/gpu is the number of GPUs installed in your workstation (two in my case):

{
  "cpu":"...",
  "ephemeral-storage":"...",
  "hugepages-1Gi":"...",
  "hugepages-2Mi":"...",
  "memory":"...",
  "nvidia.com/gpu":"2",
  "pods":"..."
}

I could then launch a simple job to verify that I was able to schedule pods to run on the GPU:

cat << EOF | microk8s kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      image: "k8s.gcr.io/cuda-vector-add:v0.1"
      resources:
        limits:
          nvidia.com/gpu: 1
EOF

Kubeflow and notebooks

Now I was able to install Kubeflow itself (feel free to specify your favorite password in these instructions):

microk8s enable ingress istio
microk8s enable kubeflow -- --password my-ultra-secure-password --bundle lite

Once Kubeflow was up, I created a persistent volume to enable shared storage between my notebook servers and the host system:

mkdir $HOME/k8s-share
cat << EOF | microk8s kubectl create -f -
apiVersion: v1
kind: PersistentVolume
metadata:
  name: $USER-share
  labels:
    type: local
spec:
  storageClassName: manual
  capacity:
    storage: 50Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: "$HOME/k8s-share"
EOF

In my case, this created a persistent volume called willb-share so that I could mount k8s-share from my home directory as a data volume on a Kubeflow notebook server.

The next step was to get RAPIDS.ai set up in a Kubeflow notebook image, since Kubeflow no longer ships a RAPIDS image. I could have installed individual libraries in a notebook container, but there was an easier option. Since the RAPIDS project publishes a variety of Docker images, I could pick one of those as a starting point and ensure that the resulting image would be usable by Kubeflow, as in this Dockerfile:

FROM rapidsai/rapidsai-core:0.18-cuda11.0-runtime-centos7-py3.7

ENV NB_PREFIX /

RUN ldconfig

CMD ["sh","-c", "jupyter notebook --notebook-dir=/home/jovyan --ip=0.0.0.0 --no-browser --allow-root --port=8888 --NotebookApp.token='' --NotebookApp.password='' --NotebookApp.allow_origin='*' --NotebookApp.base_url=${NB_PREFIX}"]

The only interesting part of that recipe is RUN ldconfig, which I found necessary so that the Python notebook kernels could find CUDA. I built that image locally and pushed it to an accessible repository so that I could use it while launching a new notebook server from the Kubeflow dashboard.

Remote access

I often prefer to access my workstation remotely even if I’m at my desk. For running a regular Jupyter notebook server from the command line, this is just a matter of binding to 0.0.0.0 or setting up an SSH tunnel.2 Accessing services running on a single-node Kubernetes deployment is slightly more complicated, though.

First up, all requests will need to go through a load balancer (in this case, Istio). We can access Istio through an external IP, and we can find out which IP that is by inspecting the ingress gateway service:

kubectl get svc istio-ingressgateway -n kubeflow -o jsonpath="{..loadBalancer..ip}{'\n'}"

However, simply connecting to my workstation and forwarding traffic to port 80 of Istio’s IP didn’t do me a lot of good; I found that I also needed to be able to connect to other cluster IPs in order to access the services Istio was exposing.3 To access all of these IPs remotely, we have a couple of options:

  1. Set up a dynamic proxy over SSH. By connecting with ssh -D 9999 workstation and then configuring a local proxy to point to localhost on port 9999, we can access anything that’s accessible from workstation. This is an easy way to smoke-test a deployment, but it isn’t an ideal long-term solution because it requires you to maintain an SSH connection to your workstation and it proxies everything through the workstation unless you explicitly configure the proxy.4 Relying on a dynamic proxy like this can also lead to confusing errors when the SSH connection is down and may be difficult or impossible to configure when using cellular data on a mobile device.
  2. Use a VPN-like service to relay traffic to given subnets. I use Tailscale for a personal VPN and configured my workstation as a relay node for traffic to cluster IP addresses. This means that if I can access cluster IPs from my workstation, I can also access them from any computer connected to my Tailscale account (whether or not I’ve connected over SSH first). This was very easy and it’s also possible to do with upstream WireGuard.

Once we’re correctly forwarding traffic, connecting to the Istio load balancer will show us the Kubeflow dashboard; clicking the different links on that page (e.g., to create a new notebook server) will send requests to the appropriate internal services.

Challenges and false starts

Sometimes knowing what didn’t work is more useful than knowing what did. In this section, I’ll briefly cover some problems I encountered along the way so you’ll know what to look out for.

Device plugin errors

With microk8s 1.19, I was unable to get the device plugin pod to run successfully and always got some variant of this error in my logs:

Loading NVML
Failed to initialize NVML: could not load NVML library.
If this is a GPU node, did you set the docker default runtime to nvidia?

This is a confusing error because microk8s uses containerd, not Docker. While many people seem to have run into this error online, none of the recommended solutions worked for me. (I also tried specifying a newer version of the device plugin container image, which didn’t work either.)

GPU operator errors

The beta release of microk8s 1.21 uses the NVIDIA GPU operator to manage GPU drivers. As of mid-March 2021, the GPU operator is not intended to work on nodes that already have GPU drivers installed, which makes it more suitable for provisioning new nodes or VMs and adding them to a cluster (its intended use case, to be fair) than for enabling GPU support for a single-node Kubernetes on a workstation.5

I was able to enable GPU support in microk8s 1.21 by first removing CUDA and GPU drivers from my system, but this was an unpalatable hack since I’d prefer to be able to manage system dependencies with a native package manager (and also to use the GPU and CUDA outside of Kubernetes). I also noticed that the GPU operator failed to start after I had rebooted my system, presumably because the drivers it had installed earlier were already loaded at boot.

Image pull failures

After installing microk8s 1.20, the Calico pod failed due to an image pull timeout. I was able to explicitly pull it using the bundled ctr tool before restarting the pods:

microk8s ctr images pull docker.io/calico/cni:v3.13.2

Conclusions

While I wouldn’t recommend single-node Kubernetes to most machine learning practitioners (it still requires a lot of interaction with Kubernetes proper to get to a productive state or troubleshoot problems), Kubernetes provides some useful primitives for managing resources, isolating jobs, and making work reproducible. Furthermore, developing ML tools on Kubernetes ensures that they’ll be consumable in multiple contexts. The combination of microk8s and Kubeflow provides a relatively painless way to get to a productive discovery environment with RAPIDS and GPUs. In future posts, I’d like to look at using my single-node Kubernetes deployment to orchestrate other machine-learning and data processing workloads.


  1. Be warned that I’m almost certainly doing some basic administration tasks suboptimally – while I used Debian at a consulting gig in the late 1990s and briefly used Ubuntu in the public cloud in graduate school, my main Linux distributions have been RPM-based for over 25 years. I chose Ubuntu for this application because it offered frictionless installation of GPU drivers – but the long support cycle vis-à-vis other community Linux distributions is also a plus. 

  2. I’m often connecting remotely from a tablet, and tunneling in is especially convenient from my favorite iOS Jupyter client

  3. This didn’t make a lot of sense to me, but individual Kubeflow dashboard components were exposed with wildcard DNS hostnames pointing to cluster IPs – and, if I couldn’t connect to the cluster IPs, it manifested as unusual “Page not found” errors from the Kubeflow dashboard. 

  4. If you wanted to use this solution longer-term, it’d make sense to define a Proxy Auto-Configuration File that deferred to the dynamic proxy only for wildcard DNS hostnames like xip.io. 

  5. This is a totally sensible design decision since a single-node Kubernetes deployment is not anywhere near the primary audience for a tool like the GPU operator. However, the upcoming release of the GPU operator will support this workstation use case by allowing users to skip driver installation; microk8s will incorporate this fix as well

(This post is also available as an interactive notebook.)

Apache Parquet is a great default choice for a data serialization format in data processing and machine learning pipelines, but just because it’s available in many environments doesn’t mean it has the same behavior everywhere. In the remaining discussion, we’ll look at how to work around some potential interoperability headaches when using Parquet to transfer data from a data engineering pipeline running in the JVM ecosystem to a machine learning pipeline running in the Python data ecosystem.1

We’ll start by looking at a Parquet file generated by Apache Spark with the output of an ETL job.

from pyspark.sql import SparkSession

session = SparkSession.builder.getOrCreate()

We can look at the schema for this file and inspect a few rows:

spark_df = session.read.parquet("colors.parquet")
spark_df.printSchema()
root
 |-- rowID: string (nullable = true)
 |-- YesNo: string (nullable = true)
 |-- Color: string (nullable = true)
 |-- Categorical: string (nullable = true)
spark_df.limit(10).toPandas()
rowID YesNo Color Categorical
0 00000267 No red 62
1 000004c2 No red ba
2 00002dcf No blue 75
3 000035be No green 2f
4 00005f19 No green 0a
5 00007c1e No blue 79
6 0000be2c No green 38
7 0000d29d No green 60
8 0000d313 Yes blue f7
9 0000d66c No blue 94

The “file” we’re reading from (colors.parquet) is a partitioned Parquet file, so it’s really a directory. We can inspect the Parquet metadata for each column using the parquet-tools utility from our shell:

parquet-tools meta colors.parquet 2>&1 | head -70 | grep SNAPPY
rowID:        BINARY SNAPPY DO:0 FPO:4 SZ:4931389/8438901/1.71 VC:703200 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 00000267, max: ffffc225, num_nulls: 0]
YesNo:        BINARY SNAPPY DO:0 FPO:4931393 SZ:105082/108599/1.03 VC:703200 ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY ST:[min: No, max: Yes, num_nulls: 0]
Color:        BINARY SNAPPY DO:0 FPO:5036475 SZ:177524/177487/1.00 VC:703200 ENC:BIT_PACKED,PLAIN_DICTIONARY ST:[min: blue, max: red, num_nulls: 0]
Categorical:  BINARY SNAPPY DO:0 FPO:5213999 SZ:705931/706389/1.00 VC:703200 ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY ST:[min: 00, max: ff, num_nulls: 0]

This output shows that many of our columns are compressed (SNAPPY) Unicode strings (BINARY) and that many of these columns are dictionary-encoded (ENC:...,PLAIN_DICTIONARY), which means that each distinct string is stored as an index into a dictionary rather than as a literal value. By storing values that may be repeated many times in this way, we save space and compute time.2

So far, so good! But what happens when we read these data into pandas? We can load Parquet files into pandas if we have PyArrow installed; let’s try it out.

import pandas as pd
pandas_df = pd.read_parquet("colors.parquet/")
pandas_df
rowID YesNo Color Categorical
0 00000267 No red 62
1 000004c2 No red ba
2 00002dcf No blue 75
3 000035be No green 2f
4 00005f19 No green 0a
... ... ... ... ...
703195 ffff69a9 No green 25
703196 ffff8037 No green 34
703197 ffffa49f No red 3a
703198 ffffa6ae No green 89
703199 ffffc225 Yes blue 40

The data look about like we’d expect them to. However, when we look at how pandas is representing our data, we’re in for a surprise: pandas has taken our efficiently dictionary-encoded strings and represented them with arbitrary Python objects!

pandas_df.dtypes
rowID          object
YesNo          object
Color          object
Categorical    object
dtype: object

We could convert each column to strings and then to categoricals, but this would be tedious and inefficient. (Note that if we’d created a pandas data frame with string- or category-typed columns and saved that to Parquet, the types would survive a round-trip to disk because they’d be stored in pandas-specific Parquet metadata.)
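For the record, here is roughly what that tedious route looks like (manual_df is just an illustrative copy of the data frame above): we have to know and maintain the list of categorical columns ourselves, and every string is still materialized as a Python object before conversion.

# manually cast each column we happen to know is categorical
manual_df = pandas_df.copy()
for col in ["YesNo", "Color", "Categorical"]:
    manual_df[col] = manual_df[col].astype("category")

manual_df.dtypes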

In this case, pandas is using the PyArrow Parquet backend; interestingly, if we use PyArrow directly to read into a pyarrow.Table, the string types are preserved:

import pyarrow.parquet as pq
arrow_table = pq.read_table("colors.parquet/")

…but once we convert that table to pandas, we’ve lost the type information.

arrow_table.to_pandas().dtypes
rowID          object
YesNo          object
Color          object
Categorical    object
dtype: object

However, we can force PyArrow to preserve the dictionary encoding even through the pandas conversion if we specify the read_dictionary option with a list of appropriate columns:

dict_arrow_table = \
    pq.read_table("colors.parquet/", read_dictionary=['YesNo', 'Color', 'Categorical'])

dict_arrow_table
pyarrow.Table
rowID: string
YesNo: dictionary<values=string, indices=int32, ordered=0>
Color: dictionary<values=string, indices=int32, ordered=0> not null
Categorical: dictionary<values=string, indices=int32, ordered=0>
dict_arrow_table.to_pandas().dtypes
rowID            object
YesNo          category
Color          category
Categorical    category
dtype: object

If we don’t know a priori what columns are dictionary-encoded (and thus might hold categoricals), we can find out by programmatically inspecting the Parquet metadata:

dictionary_cols = set([])

# get metadata for each partition
for piece in pq.ParquetDataset("colors.parquet", use_legacy_dataset=False).pieces:
    meta = piece.metadata

    # get (index, name) pairs for the columns; materialize the list so it
    # can be iterated once per row group
    cols = list(enumerate(meta.schema.names))

    # get column metadata for each row group
    for i in range(meta.num_row_groups):
        rg = meta.row_group(i)
        for col, colname in cols:
            if "PLAIN_DICTIONARY" in rg.column(col).encodings:
                dictionary_cols.add(colname)

dictionary_cols
{'Categorical', 'Color', 'YesNo'}
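With that set in hand, we can feed it straight back into read_table so the dictionary-encoded columns survive the trip into pandas without being hard-coded (auto_dict_table is just an illustrative name):

auto_dict_table = pq.read_table("colors.parquet/", read_dictionary=list(dictionary_cols))
auto_dict_table.to_pandas().dtypes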

Preserving column types when transferring data from a JVM-based ETL pipeline to a Python-based machine learning pipeline can save a lot of human effort and compute time – and eliminate an entire class of performance regressions and bugs as well. Fortunately, it just takes a little bit of care to ensure that our entire pipeline preserves the efficiency advantages of Parquet.


  1. There are certainly potential headaches going in the other direction as well (e.g., this and this), but it’s a less-common workflow to generate data in Python for further processing in Spark. 

  2. Parquet defaults to dictionary-encoding small-cardinality string columns, and we can assume that many of these will be treated as categoricals later in a data pipeline. 

It was a great honor to co-present at KubeCon with Sophie Watson on machine learning systems and MLOps today. Kubernetes is an obvious choice for building machine learning systems in 2020, but as you build these systems, you will be faced with several non-obvious choices. In this talk, we sought to distill many of the things we’ve learned while supporting machine learning systems and workflows on Kubernetes over the years and help to make the road ahead straighter and smoother for practitioners and operators who are just getting started.

We had a wonderfully engaged audience, and Q&A was a lot of fun, both during the talk and on Slack afterwards. Several attendees were interested in our slides, which are available here, and in the MLOps “tube map,” which is available here, along with a number of links to other useful resources.

Sophie and I have been collaborating in this space for quite a while and we’ve produced some really cool work. Here are links to some other materials of interest:

  • Our OSCON 2019 talk “Kubernetes for Machine Learning: Productivity over Primitives” shows how Kubernetes provides the basis for machine learning system solutions — and how to build solutions that ML practitioners will actually want to use,
  • our nachlass framework demonstrates how to publish pipeline services directly from unmodified Jupyter notebooks and CI/CD pipelines in Kubernetes using source-to-image builders, and
  • our two interactive workshops (here and here) show how to do entire end-to-end ML lifecycles – discovery, training, CI/CD, inference, and monitoring – all on Kubernetes.

The complexity of machine learning systems doesn’t reside in the complexity of individual components, but rather in their orchestration, connections, and interactions. We can thus think of machine learning systems as special cases of general distributed systems. Furthermore, it’s relatively easy to argue that machine learning systems can benefit from general-purpose distributed systems infrastructure, tools, and frameworks to the extent that these make it easier to understand, develop, and maintain distributed systems. As evidence that this argument is noncontroversial, consider that Kubernetes is today’s most popular framework for managing distributed applications and is increasingly seen as a sensible default choice for machine learning systems.1

Once we’ve accepted that most of the complexity of machine learning systems isn’t specific to machine learning, the really interesting remaining questions are about how machine learning systems can benefit from existing infrastructure, tools, and techniques – and about how to address the challenges that are unique to machine learning. For example:

  • what additional complexity comes from failure modes that are characterized more by the divergence of probability distributions than by failed assertions or crossing clear performance thresholds?
  • to what extent are traditional devops workflows and idioms appropriate for building and maintaining machine learning systems, and where do they break down?
  • how can we address managing the complexity of entire machine learning systems rather than individual components?
  • how can we make contemporary infrastructure more accessible to machine learning practitioners?2

In order to evaluate how proposed solutions actually address these questions, it can be valuable to map out several aspects of machine learning systems: the software components and their interactions, the human processes and workflows around them, and the teams and roles responsible for each part.

The fact that these maps apparently overlap to some extent can be a source of confusion. A team of data scientists may have a feature engineering specialist and a modeling specialist. A production training pipeline may have feature extraction and model training components. While these different parts may connect together in analogous ways, they are not the same; our maps of systems, human processes, and organizations should each reveal different details of how we should support machine learning systems and the humans who build and maintain them.

The value of maps is as much in what they omit as it is in what they include: a good map will show the important details to navigate a given situation without including irrelevant details that obscure the presentation. By looking at these maps and identifying what areas of each are addressed by given solutions, it becomes easier to understand the strengths and shortcomings of various approaches. Ideally, a solution should address both complete workflows (human processes and interactions) and complete systems (software components and interactions).

Some solutions only support particular workloads, like particular training or inference frameworks, but not entire systems. Perhaps an “end-to-end” framework only addresses part of the problem, like model operationalization or data versioning – this will be obvious if we ascribe aspects of the solution to features of our map. Some solutions offer impressive demos but don’t address the problems our organization actually faces3 – again, this will be obvious by placing the solutions on our maps. Some tools that are ostensibly targeted for one audience have user experience assumptions that strongly imply that the developer had a different audience in mind, like “data science” tools that expect near-prurient interest in the accidental details of infrastructure4 – this will be obvious if we consider the interfaces of the tools corresponding to different map features in light of the humans responsible for these parts of our map. Perhaps a particular comprehensive solution only makes sense if organizations adopt an idiosyncratic workflow – this will be obvious because the solution will include some features that our map doesn’t and omit some features that our map includes.

Transit maps, which typically show the connections between lines and stations as a stylized graph, rather than aiming for geographical accuracy, present a particularly useful framework for understanding machine learning systems. In addition to capturing the components (stations or stops) and kinds of interactions (lines), other details like the existence of transfer stations and fare zones can expose other interesting aspects of the problem space. Here’s such a map that I designed to capture typical machine learning systems:

A map of a typical machine learning system in the style of a transit map

This map supports the story that Sophie Watson and I will be telling in our session at KubeCon North America next month – we’ll begin with the premise that Kubernetes is the right place to start for managing machine learning systems and then talk about some of the challenges unique to machine learning workflows and systems that Kubernetes and popular machine learning frameworks targeting Kubernetes don’t address. I hope you’ll be able to (virtually) join us!


  1. You can see my version of the argument for machine learning on Kubernetes in this 2017 Berlin Buzzwords talk or in this 2020 IEEE Software article

  2. See “Kubernetes for machine learning: Productivity over primitives” for a detailed argument. 

  3. For more on this widespread issue, see my talk from Berlin Buzzwords 2019

  4. Many tools developed by and for the Kubernetes community are guilty of this shortcoming in that they assume, for example, that an end-user is as excited about the particular structure of a tool’s YAML files as its developers were. 

My article “Machine learning systems and intelligent applications” has recently been accepted for publication in IEEE Software and distills many of the arguments I’ve been making over the last few years about the intelligent applications concept, machine learning on Kubernetes, and about how we should structure machine learning systems. You can read an unedited preprint of my accepted manuscript or download the final version from IEEE Xplore. The rest of this post provides some brief motivation and context for the article.


What’s the difference between a machine learning workload and a machine learning system? Once we have a trained model, what else do we need to solve a business problem? How should we put machine learning into production on contemporary application infrastructure like Kubernetes?

In the past, machine learning (like business analytics more generally) has been a separate workload that runs asynchronously alongside the rest of a business, for example optimizing a supply chain once per quarter, informing the periodic arrangement of a physical retail store based on the prior month’s sales and upcoming product releases, identifying the characteristics of an ideal customer in a new market to inform ongoing product development, or even training a model to incorporate into an existing application.

Today, we often put machine learning into production in the context of an intelligent application. Intelligent applications continuously learn from data to support essential functionality and thus improve with longevity and popularity. Intelligent applications are interesting for many reasons, but especially because:

  • in many cases they couldn’t exist without machine learning,
  • they are developed by cross-functional teams including data engineers, data scientists, and application developers – and thus involve several engineering processes and lifecycles in parallel: the data management pipeline, the machine learning discovery workflow, the model lifecycle, and the conventional software development lifecycle, and
  • they are deployed not as separate workloads but as a single system consisting of compute, storage, streaming, and application components.

While the first and second points have serious implications for monitoring, validation, and automated retraining, the last point may be even more interesting: in contrast to legacy architectures, which had application infrastructure running in one place and a separate analytic database, compute scheduler, or colocated-storage-and-compute cluster elsewhere, intelligent applications schedule all components together in a single, logical application-specific cluster, as in the following figure.

This architecture is possible because Kubernetes is flexible enough to orchestrate all of these components, but it is necessary because much of the complexity of machine learning systems appears not in the components themselves but in their interactions. The intelligent applications concept helps tame this complexity by enabling us to manage and audit all intelligent application components — controllers and views, data pipelines, predictive models, and more — from a single control plane.

To learn more, check out “Machine learning systems and intelligent applications” (in preprint or final version) and please let me know what you think!

I’ve been using Altair (and thus Vega-Lite) for most of my data visualization work since early last year. In general, I appreciate the declarative approach to visualization, in which one starts with long-form tidy data and in which each column of a data frame can define some aspect of a visualization.

If each row represents an observation, and each column represents an attribute of that observation, then the attributes can map directly to visual properties of a plotted point corresponding to that observation.

When my teammates and I have taught others how to use Altair in the past, we’ve shown them how to tidy data with Pandas (or through some other preprocessing step), but it’s possible to tidy data directly in Altair. I developed an interactive notebook that starts by showing how to tidy data (both via preprocessing and directly in Altair) and then demonstrates some other intermediate Altair features like interactive plotting and choropleths. You can check it out on GitHub or run it on Binder!
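As a small taste of what the notebook covers, here’s a minimal sketch of tidying data directly in Altair with transform_fold rather than reshaping with pandas beforehand; the wide-form city-temperature table and its column names are made up for illustration.

import altair as alt
import pandas as pd

# hypothetical wide-form data: one row per year, one column per city
wide = pd.DataFrame({
    "year": [2018, 2019, 2020],
    "portland": [43, 45, 44],
    "seattle": [38, 39, 41],
})

# fold the city columns into tidy (city, temperature) pairs inside Altair,
# instead of melting the data frame in a preprocessing step
alt.Chart(wide).transform_fold(
    ["portland", "seattle"], as_=["city", "temperature"]
).mark_line().encode(
    x="year:O", y="temperature:Q", color="city:N"
)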

You probably already know that if you’re modeling multiple independent phenomena in a repeatable simulation, you want multiple independent pseudorandom number generators. But you may be surprised by a consequence of following this approach if you’re using the excellent probability distributions supplied by the scipy.stats package. Read on to learn what the problem is and how to solve it!

Two ways to sample

Say you’re simulating the operations of a large retailer and have modeled the number of customer arrivals in a particular timespan with a Poisson distribution with some parameter λ. There are at least two ways to get a dozen samples from that distribution using SciPy.

We could supply the distribution parameters and a random state in each sampling call:

import scipy.stats
import numpy as np
seed = 0x00c0ffee

mean = 5
rs = np.random.RandomState(seed)
samples = scipy.stats.poisson.rvs(mean, size=12, random_state=rs)

or we could use a distribution object, which allows us to specify the parameters (including a random seed) once:

import scipy.stats

mean = 5
seed = 0x00c0ffee

distribution = scipy.stats.poisson(mean)
distribution.random_state = seed

samples = distribution.rvs(size=12)

In the first example, we have twelve samples from a Poisson distribution with a λ of mean; we specify the shape parameter when we draw from the distribution. In the second example, we’re creating a distribution object with a fixed λ, backed by a private pseudorandom number generator, seeded with a supplied value.

Interfaces and implementations

The second approach has two advantages: Firstly, we have an object with fixed distribution parameters (depending on the distribution, there can be several, including location and scale), so we don’t have to worry about tracking these every time we want to sample from this distribution. Secondly, we have a way to make sampling from this distribution deterministic by seeding it but without passing the same RandomState for each independent stream of values.

The disadvantage of the second approach only becomes obvious when we have many distribution objects in a single program. To get a hint for what goes wrong, let’s run a little experiment. The following two functions, which simulate running a certain number of steps of a simulation that depends on a certain number of independent actors, should have identical behavior.

def experiment_one(agents, steps):
    def mkpoisson(l,seed):
        p = scipy.stats.poisson(l)
        p.random_state = seed
        return p

    seeds = np.random.randint(1<<32, size=agents)
    streams = [mkpoisson(12, seed) for seed in seeds]
    for p in streams:
        p.rvs(steps)

def experiment_two(agents, steps):
    seeds = np.random.randint(1<<32, size=agents)
    states = [np.random.RandomState(seed) for seed in seeds]
    for rs in states:
        scipy.stats.poisson.rvs(12, size=steps, random_state=rs)
        

If we run both of these functions, though, we’ll see how they behave differently: running experiment_one for a thousand steps with ten thousand agents takes roughly 14 seconds on my laptop, but running experiment_two with the same parameters takes roughly 3¼ seconds. (You can try it for yourself locally or on binder.)
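If you’d rather not set up a notebook, a quick and dirty timing harness like this one reproduces the comparison; the absolute numbers will of course differ on your machine.

import timeit

# time a single run of each experiment; expect experiment_one to be
# several times slower than experiment_two
print(timeit.timeit(lambda: experiment_one(10000, 1000), number=1))
print(timeit.timeit(lambda: experiment_two(10000, 1000), number=1))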

Explaining the performance difference

Why is the less-convenient API so much faster? To see why, let’s profile the first function:

import cProfile
import pstats
from pstats import SortKey

cProfile.run("experiment_one(10000,1000)", sort=SortKey.TIME)

This will show us the top function calls by exclusive time (i.e., not including time spent in callees). In my environment, the top function is docformat in doccer.py, which is called twice for each agent. In terms of exclusive time, it accounts for roughly 20% of the total execution of the experiment; in terms of inclusive time (i.e., including callees), it accounts for over half the time spent in the experiment.

What does docformat do? It reformats function docstrings and performs textual substitution on them. This makes sense in one context – building up a library of distribution classes from abstract bases and filling in documentation for all of the subclasses. In the context of creating an individual instance of a distribution object with particular parameters, it’s an interesting design decision indeed, especially since we’d be unlikely to examine the documentation for thousands of distribution objects that are internal to a simulation. (SciPy refers to this as “freezing” a distribution. The documentation briefly mentions that it’s convenient to fix the shape and parameters of a distribution instance but doesn’t mention the performance impact, although searching StackOverflow and GitHub shows that others have been bitten by this issue as well.)

Some solutions

Fortunately, there are a couple of ways to work around this problem. We could simply write code that looks like experiment_two, passing distribution parameters and a stateful random number generator to each function. This would be fast but clunky.

We could also sample from a uniform distribution and map those samples to samples of our target distribution by using the inverse cumulative distribution function (or percentage point function) of the target distribution, like this example that takes ten samples from a Poisson distribution:

prng = np.random.RandomState(seed=0x00c0ffee)
scipy.stats.poisson.ppf(prng.uniform(size=10), mu=12)

(Note that SciPy calls the λ parameter mu, presumably to avoid conflict with the Python keyword lambda.)

We can make either of these approaches somewhat cleaner by wrapping them in a Python generator, like this:

def mkpoisson(l, prng):
    while True:
        yield from scipy.stats.poisson.ppf(prng.uniform(size=1024), mu=l)

We can then use the iterators returned by this generator to repeatedly sample from the distribution:

p = mkpoisson(12, np.random.RandomState(seed=0x00c0ffee))

for i in range(10):
    print(next(p))

Postscript and sidebar

Of course, if we want a deterministic simulation involving a truly large number of independent phenomena, the properties of the pseudorandom number generation algorithm we use can become important. The RandomState class from NumPy, like the pseudorandom number generator in the Python standard library, uses the Mersenne Twister, which has an extremely long period but requires roughly 2kb of internal state, which you can inspect for yourself:

rs = np.random.RandomState(seed=0x00c0ffee)
rs.get_state()

The new NumPy RNG policy, which was implemented in NumPy 1.17, features a Generator class backed by an underlying source of bit-level randomness.1 The default bit-level source is Melissa O’Neill’s PCG, which requires only two 128-bit integers of state and has better statistical properties than the Mersenne Twister. Other approaches to bit-level generation may be worth investigating in the future due to the possibility of better performance.

You can use the new PCG implementation like this:

prng = np.random.default_rng(seed=0x00c0ffee)
scipy.stats.poisson.ppf(prng.uniform(size=10), mu=12)

If you’re maintaining a lot of Python functions that depend on pseudorandom number generation — as in a discrete-event simulation — you probably want a different random state for each consumer of randomness. As a concrete example, if you’re simulating the behavior of multiple users in a store and their arrival times and basket sizes are modeled by probability distributions, you probably want a separate source of randomness for each simulated user.

Using a global generator, like the one backing the module methods in numpy.random or Python’s random, makes it difficult to seed your simulation appropriately and can also introduce implicit dependencies between the global parameters of the simulation (e.g., how many users are involved in a run of the simulation) and the local behavior of any particular user.

Once you’ve decided you need multiple sources of randomness, you’ll probably have a lot of code that looks something like this:

import random
import numpy as np

def somefunc(seed=None):
  if seed is None:
    seed = random.randrange(1 << 32)
    
  prng = np.random.RandomState(seed)

  while True:
    step_result = None
    
    # use prng to do something interesting 
    # as part of the simulation and assign 
    # it to step_result (omitted here) ...
    
    yield step_result

Initializing random number generators at the beginning of each function is not only repetitive, it’s also ugly and error-prone. The aesthetic and moral costs of this sort of boilerplate were weighing heavily on my conscience while I was writing a simulation earlier this week, but an easy solution lifted my spirits.

Python decorators are a natural way to generate a wrapper for our simulation functions that can automatically initialize a pseudorandom number generator if a seed is supplied (or create a seed if one isn’t). Here’s an example of how you could use a decorator in this way:

def makeprng(func):
  def call_with_prng(*args, prng=None, seed=None, **kwargs):
    if prng is None:
      if seed is None:
        seed = random.randrange(1 << 32)
   
      prng = np.random.RandomState(seed)
    return func(*args, prng=prng, seed=seed, **kwargs)
    
  return call_with_prng

@makeprng
def somefunc(seed=None, prng=None):

  while True:
    step_result = None
    
    # use prng to do something interesting 
    # as part of the simulation and assign 
    # it to step_result (omitted here) ...
    
    yield step_result

With the @makeprng decorator, somefunc is replaced with the output of makeprng(somefunc): a function that constructs a prng and passes it to somefunc before calling it. So if you invoke somefunc(seed=1234), it’ll construct a pseudorandom number generator seeded with 1234. If you invoke somefunc(), it’ll construct a pseudorandom number generator with an arbitrary seed.
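Here’s what that looks like in use. The skeleton above only yields placeholder None values, so the point of this sketch is just the seeding behavior: a stream you can reproduce when you supply a seed, and an arbitrary one otherwise.

# reproducible: the same seed always drives the same prng
stream = somefunc(seed=1234)
print([next(stream) for _ in range(3)])

# arbitrary: the decorator picks a seed when none is supplied
print([next(somefunc()) for _ in range(3)])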

Decorators are a convenient, low-overhead way to provide default values that must be constructed on demand for function parameters — and they make code that needs to create multiple streams of pseudorandom numbers much less painful to write and maintain.

I had a lot of fun presenting a tutorial at Strata Data NYC with my teammate Sophie Watson yesterday. In just over three hours, we covered a variety of hash-based data structures for answering interesting queries about large data sets or streams. These structures all have the following properties:

  • they’re incremental, meaning that you can update a summary of a stream by adding a single observation to it,
  • they’re parallel, meaning that you can combine a summary of A and a summary of B to get a summary of the combination of A and B, and
  • they’re scalable, meaning that it’s possible to summarize an arbitrary number of observations in a fixed-size structure (the toy sketch after this list shows what these properties look like in code).
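To make those three properties concrete, here’s a toy Bloom filter sketch (not taken from the tutorial itself): adding an observation touches only a fixed-size bit set, two summaries merge with a bitwise OR, and the structure never grows no matter how many items it has seen.

import hashlib

class BloomFilter:
    """Fixed-size set-membership summary supporting incremental adds and merging."""
    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, 0

    def _positions(self, item):
        # derive several bit positions from independent hashes of the item
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        # incremental: fold in one observation at a time
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all(self.bits & (1 << pos) for pos in self._positions(item))

    def merge(self, other):
        # parallel: combine two summaries into a summary of the union
        merged = BloomFilter(self.size, self.hashes)
        merged.bits = self.bits | other.bits
        return merged

a, b = BloomFilter(), BloomFilter()
a.add("alice"); b.add("bob")
print("alice" in a.merge(b), "carol" in a.merge(b))  # True, (almost certainly) False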

I’ve been interested in these sorts of structures for a while and it was great to have a chance to develop a tutorial covering the magic of hashing and some fun applications like Sophie’s recent work on using MinHash for recommendation engines.

If you’re interested in the tutorial, you can run through our notebooks at your own pace.

In my last post, I showed some applications of source-to-image workflows for data scientists. In this post, I’ll show another: automatically generating a model serving microservice from a git repository containing a Jupyter notebook that trains a model. The prototype s2i builder I’ll be describing is available here as source or here as an image (check the blog-201810 tag).

Basic constraints

Obviously, practitioners can create notebooks that depend on any combination of packages or data, and that require any sort of oddball execution pattern you can imagine. For the purposes of this prototype, we’re going to be (somewhat) opinionated and impose a few requirements on the notebook:

  1. The notebook must work properly if all the cells execute in order.
  2. One of the notebook cells will declare the library dependencies for the notebook as a list of [name, version] pairs in a variable called requirements, e.g., requirements = [['numpy', '1.10']].
  3. The notebook must declare a function called predictor, which will return the result of scoring the model on a provided sample.
  4. The notebook may declare a function called validator, which takes a sample and will return True if the sample provided is of the correct type and False otherwise. The generated service will use this to check if a sample has the right shape before scoring it. (If no validator is provided, the generated service will do no error-checking on arguments.)

A running example

Consider a simple example notebook. This notebook has requirements specified:

requirements = [["numpy", "1.15"], ["scikit-learn", "0.19.2"], ["scipy", "1.0.1"]]

It also trains a model (in this case, simply optimizing 7 cluster centers for random data):

import numpy as np
from sklearn.cluster import KMeans
DIMENSIONS = 2
randos = np.random.random((40000,DIMENSIONS))
kmodel = KMeans(n_clusters=7).fit(randos)

Finally, the notebook also specifies predictor and validator methods. (Note that the validator method is particularly optimistic – you’d want to do something more robust in production.)

def predictor(x):
    return kmodel.predict([x])[0]

def validator(x):
    return len(x) == DIMENSIONS

What the builder does

Our goal with a source-to-image builder is to turn this (indeed, any notebook satisfying the constraints mentioned above) into a microservice automatically. This service will run a basic application skeleton that exposes the model trained by the notebook on a REST endpoint. Here’s a high-level overview of how my prototype builder accomplishes this:

  1. It preprocesses the input notebook twice, once to generate a script that produces a requirements file from the requirements variable in the notebook and once to generate a script that produces a serialized model from the contents of the notebook,
  2. It runs the first script, generating a requirements.txt file, which it then uses to install the dependencies of the notebook and the model service in a new virtual environment (which the model service will ultimately run under), and
  3. It runs the second script, which executes every cell of the notebook in order and then captures and serializes the predictor and validator functions to a file.

The model service itself is a very simple Flask application that runs in the virtual Python environment created from the notebook’s requirements and reads the serialized model generated after executing the notebook. In the case of our running example, it would take a JSON array POSTed to /predict and return the number of the closest cluster center.
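The actual generated service lives in the builder repository linked above, but a minimal sketch of its shape might look like the following. Here I assume the builder has serialized the predictor and validator functions into a hypothetical model.pkl file; the route name matches the /predict endpoint described above, while everything else (file name, port, error handling) is illustrative.

import pickle
from flask import Flask, request, jsonify

app = Flask(__name__)

# hypothetical artifact produced by executing the notebook; the real builder
# captures the predictor and validator functions after running every cell
with open("model.pkl", "rb") as f:
    predictor, validator = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    sample = request.get_json()
    if validator is not None and not validator(sample):
        return jsonify(error="invalid sample"), 400
    result = predictor(sample)
    # cast NumPy scalars to built-in types so they serialize cleanly
    return jsonify(prediction=result.item() if hasattr(result, "item") else result)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)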

Future work and improvements

The goal of the prototype service is to show that it is possible to automatically convert notebooks that train predictive models to services that expose those models to clients. There are several ways in which the prototype could be improved:

Deploying a more robust service: currently, the model is wrapped in a simple Flask application running in a standalone (or development) server. Wrapping a model in a Flask application is essentially a running joke in the machine learning community because it’s obviously imperfect but it’s ubiquitous in any case. While Flask itself offers an attractive set of tradeoffs for developing microservices, the Flask development server is not appropriate for production deployments; other options would be better.

Serving a single prediction at once with a HTTP roundtrip and JSON serialization may not meet the latency or throughput requirements of the most demanding intelligent applications. Providing multiple service backends can address this problem: a more sophisticated builder could use the same source notebook to generate several services, e.g., a batch scoring endpoint, a service that consumes samples from a messaging bus and writes predictions to another, or even a service that delivers a signed, serialized model for direct execution within another application component.

The current prototype builder image is built up from the Fedora 27 source-to-image base image; on this base, it then installs Python and a bunch of packages to make it possible to execute Jupyter notebooks. The generated service image also installs its extra requirements in a virtual environment, but it retains some baggage from the builder image.1 A multi-stage build would make it possible to jettison dependencies only necessary for actually executing the notebook and building the image (in particular, Jupyter itself) while retaining only those dependencies necessary to actually execute the model.

Finally, a multi-stage build would enable cleverer dependency handling. The requirements to run any notebook are a subset of the requirements to run a particular notebook from start to finish, but the requirements to evaluate a model scoring function or sample validation function likely do not include all of the packages necessary to run the whole notebook (or even all of the packages necessary to run any notebook at all). By identifying only the dependencies necessary for model serving – perhaps even automatically – the serving image can be smaller and simpler.


  1. The virtual environment is necessary so that the builder image can run without special privileges – that is, it need only write to the application directory to update the virtual environment. If we needed to update system packages, we’d need to run the builder image as root