The complexity of machine learning systems doesn’t subsist in the complexity of individual components, but rather in their orchestration, connections, and interactions. We can thus think of machine learning systems as special cases of general distributed systems. Furthermore, it’s relatively easy to argue that machine learning systems can benefit from general-purpose distributed systems infrastructure, tools, and frameworks to the extent that these make it easier to understand, develop, and maintain distributed systems. As evidence that this argument is noncontroversial, consider that Kubernetes is today’s most popular framework for managing distributed applications and is increasingly seen as a sensible default choice for machine learning systems.1
Once we’ve accepted that most of the complexity of machine learning systems isn’t specific to machine learning, the really interesting remaining questions are about how machine learning systems can benefit from existing infrastructure, tools, and techniques – and about how to address the challenges that are unique to machine learning. For example:
- what additional complexity comes from failure modes that are characterized more by the divergence of probability distributions rather than by failed assertions or crossing clear performance thresholds?
- to what extent are traditional devops workflows and idioms appropriate for building and maintaining machine learning systems, and where do they break down?
- how can we address managing the complexity of entire machine learning systems rather than individual components?
- how can we make contemporary infrastructure more accessible to machine learning practitioners?2
In order to evaluate how proposed solutions actually address these questions, it can be valuable to map out several aspects of machine learning systems:
- what human processes are involved in solving real problems with machine learning techniques?
- how do the diverse teams who build machine learning systems work and coordinate?
- what components comprise production machine learning systems, and how do they interact, both at a high level of abstraction (globally) and in fine detail (locally)?
The fact that these maps apparently overlap to some extent can be a source of confusion. A team of data scientists may have a feature engineering specialist and a modeling specialist. A production training pipeline may have feature extraction and model training components. While these different parts may connect together in analogous ways, they are not the same; our maps of systems, human processes, and organizations should each reveal different details of how we should support machine learning systems and the humans who build and maintain them.
The value of maps is as much in what they omit as it is in what they include: a good map will show the important details to navigate a given situation without including irrelevant details that obscure the presentation. By looking at these maps and identifying what areas of each are addressed by given solutions, it becomes easier to understand the strengths and shortcomings of various approaches. Ideally, a solution should address both complete workflows (human processes and interactions) and complete systems (software components and interactions).
Some solutions only support particular workloads, like particular training or inference frameworks, but not entire systems. Perhaps an “end-to-end” framework only addresses part of the problem, like model operationalization or data versioning – this will be obvious if we ascribe aspects of the solution to features of our map. Some solutions offer impressive demos but don’t address the problems our organization actually faces3 – again, this will be obvious by placing the solutions on our maps. Some tools that are ostensibly targeted for one audience have user experience assumptions that strongly imply that the developer had a different audience in mind, like “data science” tools that expect near-prurient interest in the accidental details of infrastructure4 – this will be obvious if we consider the interfaces of the tools corresponding to different map features in light of the humans responsible for these parts of our map. Perhaps a particular comprehensive solution only makes sense if organizations adopt an idiosyncratic workflow – this will be obvious because the solution will include some features that our map doesn’t and omit some features that our map includes.
Transit maps, which typically show the connections between lines and stations as a stylized graph, rather than aiming for geographical accuracy, present a particularly useful framework for understanding machine learning systems. In addition to capturing the components (stations or stops) and kinds of interactions (lines), other details like the existence of transfer stations and fare zones can expose other interesting aspects of the problem space. Here’s such a map that I designed to capture typical machine learning systems:
This map supports the story that Sophie Watson and I will be telling in our session at KubeCon North America next month – we’ll begin with the premise that Kubernetes is the right place to start for managing machine learning systems and then talk about some of the challenges unique to machine learning workflows and systems that Kubernetes and popular machine learning frameworks targeting Kubernetes don’t address. I hope you’ll be able to (virtually) join us!
See “Kubernetes for machine learning: Productivity over primitives” for a detailed argument. ↩
Many tools developed by and for the Kubernetes community are guilty of this shortcoming in that they assume, for example, that an end-user is as excited about the particular structure of a tool’s YAML files as its developers were. ↩
My article “Machine learning systems and intelligent applications” has recently been accepted for publication in IEEE Software and distills many of the arguments I’ve been making over the last few years about the intelligent applications concept, machine learning on Kubernetes, and about how we should structure machine learning systems. You can read an unedited preprint of my accepted manuscript or download the final version from IEEE Xplore. The rest of this post provides some brief motivation and context for the article.
What’s the difference between a machine learning workload and a machine learning system? Once we have a trained model, what else do we need to solve a business problem? How should we put machine learning into production on contemporary application infrastructure like Kubernetes?
In the past, machine learning (like business analytics more generally) has been a separate workload that runs asynchronously alongside the rest of a business, for example optimizing a supply chain once per quarter, informing the periodic arrangement of a physical retail store based on the prior month’s sales and upcoming product releases, identifying the characteristics of an ideal customer in a new market to inform ongoing product development, or even training a model to incorporate into an existing application.
Today, we often put machine learning in to production in the context of an intelligent application. Intelligent applications continuously learn from data to support essential functionality and thus improve with longevity and popularity. Intelligent applications are interesting for many reasons, but especially because:
- in many cases they couldn’t exist without machine learning,
- they are developed by cross-functional teams including data engineers, data scientists, and application developers – and thus involve several engineering processes and lifecycles in parallel: the data management pipeline, the machine learning discovery workflow, the model lifecycle, and the conventional software development lifecycle, and
- they are deployed not as separate workloads but as a single system consisting of compute, storage, streaming, and application components.
While the first and second points have serious implications for monitoring, validation, and automated retraining, the last point may be even more interesting: in contrast to legacy architectures, which had application infastructure running in one place and a separate analytic database, compute scheduler, or colocated-storage-and-compute cluster elsewhere, intelligent applications schedule all components together in a single, logical application-specific cluster, as in the following figure.
This architecture is possible because Kubernetes is flexible enough to orchestrate all of these components, but it is necessary because much of the complexity of machine learning systems appears not in the components themselves but in their interactions. The intelligent applications concept helps tame this complexity by enabling us to manage and audit all intelligent application components — controllers and views, data pipelines, predictive models, and more — from a single control plane.
I’ve been using Altair (and thus Vega-Lite) for most of my data visualization work since early last year. In general, I appreciate the declarative approach to visualization, in which one starts with long-form tidy data and in which each column of a data frame can define some aspect of a visualization.
If each row represents an observation, and each column represents an attribute of that observation, then the attributes can map directly to visual properties of a plotted point corresponding to that observation.
When my teammates and I have taught others how to use Altair in the past, we’ve shown them how to tidy data with Pandas (or through some other preprocessing step), but it’s possible to tidy data directly in Altair. I developed an interactive notebook that starts by showing how to tidy data (both via preprocessing and directly in Altair) and then demonstrates some other intermediate Altair features like interactive plotting and choropleths. You can check it out on GitHub or run it on Binder!
You probably already know that if you’re modeling multiple independent phenomena in a repeatable simulation, you want multiple independent pseudorandom number generators. But you may be surprised by a consequence of following this approach if you’re using the excellent probability distributions supplied by the
scipy.stats package. Read on to learn what the problem is and how to solve it!
Two ways to sample
Say you’re simulating the operations of a large retailer and have modeled the number of customer arrivals in a particular timespan with a Poisson distribution with some parameter λ. There are at least two ways to get a dozen samples from that distribution using SciPy.
We could supply the distribution parameters and a random state in each sampling call:
or we could use a distribution object, which allows us to specify the parameters (including a random seed) once:
In the first example, we have twelve samples from a Poisson distribution with a λ of
mean; we specify the shape parameter when we draw from the distribution. In the second example, we’re creating a distribution object with a fixed λ, backed by a private pseudorandom number generator, seeded with a supplied value.
Interfaces and implementations
The second approach has two advantages: Firstly, we have an object with fixed distribution parameters (depending on the distribution, there can be several, including location and scale), so we don’t have to worry about tracking these every time we want to sample from this distribution. Secondly, we have a way to make sampling from this distribution deterministic by seeding it but without passing the same
RandomState for each independent stream of values.
The disadvantage of the second approach only becomes obvious when we have many distribution objects in a single program. To get a hint for what goes wrong, let’s run a little experiment. The following two functions, which simulate running a certain number of steps of a simulation that depends on a certain number of independent actors, should have identical behavior.
If we run both of these functions, though, we’ll see how they behave differently: running
experiment_one for a thousand steps with ten thousand agents takes roughly 14 seconds on my laptop, but running
experiment_two with the same parameters takes roughly 3¼ seconds. (You can try it for yourself locally or on binder.)
Explaining the performance difference
Why is the less-convenient API so much faster? To see why, let’s profile the first function:
This will show us the top function calls by exclusive time (i.e., not including time spent in callees). In my environment, the top function is
doccer.py, which is called twice for each agent. In terms of exclusive time, it accounts for roughly 20% of the total execution of the experiment; in terms of inclusive time (i.e., including callees), it accounts for over half the time spent in the experiment.
docformat do? It reformats function docstrings and performs textual substitution on them. This makes sense in one context – building up a library of distribution classes from abstract bases and filling in documentation for all of the subclasses. In the context of creating an individual instance of a distribution object with particular parameters, it’s an interesting design decision indeed, especially since we’d be unlikely to examine the documentation for thousands of distribution objects that are internal to a simulation. (SciPy refers to this as “freezing” a distribution. The documentation briefly mentions that it’s convenient to fix the shape and parameters of a distribution instance but doesn’t mention the performance impact, although searching StackOverflow and GitHub shows that others have been bitten by this issue as well.)
Fortunately, there are a couple of ways to work around this problem. We could simply write code that looks like
experiment_two, passing distribution parameters and a stateful random number generator to each function. This would be fast but clunky.
We could also sample from a uniform distribution and map those samples to samples of our target distribution by using the inverse cumulative distribution function (or percentage point function) of the target distribution, like this example that takes ten samples from a Poisson distribution:
(Note that SciPy calls the λ parameter
mu, presumably to avoid conflict with the Python keyword
We can make either of these approaches somewhat cleaner by wrapping them in a Python generator, like this:
We can then use the iterators returned by this generator to repeatedly sample from the distribution:
Postscript and sidebar
Of course, if we want a deterministic simulation involving a truly large number of independent phenomena, the properties of the pseudorandom number generation algorithm we use can become important. The
RandomState class from NumPy, like the pseudorandom number generator in the Python standard library, uses the Mersenne Twister, which has an extremely long period but requires roughly 2kb of internal state, which you can inspect for yourself:
The new NumPy RNG policy, which was implemented in NumPy 1.17, features a
Generator class backed by an underlying source of bit-level randomness.1 The default bit-level source is Melissa O’Neill’s PCG, which requires only two 128-bit integers of state and has better statistical properties than the Mersenne Twister. Other approaches to bit-level generation may be worth investigating in the future due to the possibility of better performance.
You can use the new PCG implementation like this:
If you’re maintaining a lot of Python functions that depend on having pseudorandom number generation — like in a discrete-event simulation — you probably want different random states for each consumer of randomness. As a concrete example, if you’re simulating the behavior of multiple users in a store and their arrival times and basket sizes can be modeled by certain probability distributions, you probably want a separate source of randomness for each simulated user.
Using a global generator, like the one backing the module methods in
numpy.random or Python’s
random, makes it difficult to seed your simulation appropriately and can also introduce implicit dependencies between the global parameters of the simulation (e.g., how many users are involved in a run of the simulation) and the local behavior of any particular user.
Once you’ve decided you need multiple sources of randomness, you’ll probably have a lot of code that looks something like this:
Initializing random number generators at the beginning of each function is not only repetitive, it’s also ugly and error-prone. The aesthetic and moral costs of this sort of boilerplate were weighing heavily on my conscience while I was writing a simulation earlier this week, but an easy solution lifted my spirits.
Python decorators are a natural way to generate a wrapper for our simulation functions that can automatically initialize a pseudorandom number generator if a seed is supplied (or create a seed if one isn’t). Here’s an example of how you could use a decorator in this way:
somefunc will be replaced with the output of
makeprng(somefunc), which is a function that generates a
prng and passes it to
somefunc before calling it. So if you invoke
somefunc(seed=1234), it’ll construct a pseudorandom number generator seeded with
1234. If you invoke
somefunc(), it’ll construct a pseudorandom number generator with an arbitrary seed.
Decorators are a convenient, low-overhead way to provide default values that must be constructed on demand for function parameters — and they make code that needs to create multiple streams of pseudorandom numbers much less painful to write and maintain.
I had a lot of fun presenting a tutorial at Strata Data NYC with my teammate Sophie Watson yesterday. In just over three hours, we covered a variety of hash-based data structures for answering interesting queries about large data sets or streams. These structures all have the following properties:
- they’re incremental, meaning that you can update a summary of a stream by adding a single observation to it,
- they’re parallel, meaning that you can combine a summary of A and a summary of B to get a summary of the combination of A and B.
- they’re scalable, meaning that it’s possible to summarize an arbitrary number of observations in a fixed-size structure.
I’ve been interested in these sorts of structures for a while and it was great to have a chance to develop a tutorial covering the magic of hashing and some fun applications like Sophie’s recent work on using MinHash for recommendation engines.
If you’re interested in the tutorial, you can run through our notebooks at your own pace.
In my last post, I showed some applications of source-to-image workflows for data scientists. In this post, I’ll show another: automatically generating a model serving microservice from a
git repository containing a Jupyter notebook that trains a model. The prototype s2i builder I’ll be describing is available here as source or here as an image (check the
Obviously, practitioners can create notebooks that depend on any combination of packages or data, and that require any sort of oddball execution pattern you can imagine. For the purposes of this prototype, we’re going to be (somewhat) opinionated and impose a few requirements on the notebook:
- The notebook must work properly if all the cells execute in order.
- One of the notebook cells will declare the library dependencies for the notebook as a list of name, version lists called
requirements = [['numpy', '1.10']]
- The notebook must declare a function called
predictor, which will return the result of scoring the model on a provided sample.
- The notebook may declare a function called
validator, which takes a sample and will return
Trueif the sample provided is of the correct type and
Falseotherwise. The generated service will use this to check if a sample has the right shape before scoring it. (If no
validatoris provided, the generated service will do no error-checking on arguments.)
A running example
Consider a simple example notebook. This notebook has
It also trains a model (in this case, simply optimizing 7 cluster centers for random data):
Finally, the notebook also specifies
validator methods. (Note that the
validator method is particularly optimistic – you’d want to do something more robust in production.)
What the builder does
Our goal with a source-to-image builder is to turn this (indeed, any notebook satisfying the constraints mentioned above) into a microservice automatically. This service will run a basic application skeleton that exposes the model trained by the notebook on a REST endpoint. Here’s a high-level overview of how my prototype builder accomplishes this:
- It preprocesses the input notebook twice, once to generate a script that produces a requirements file from the
requirementsvariable in the notebook and once to generate a script that produces a serialized model from the contents of the notebook,
- It runs the first script, generating a
requirements.txtfile, which it then uses to install the dependencies of the notebook and the model service in a new virtual environment (which the model service will ultimately run under), and
- It runs the second script, which executes every cell of the notebook in order and then captures and serializes the
validatorfunctions to a file.
The model service itself is a very simple Flask application that runs in the virtual Python environment created from the notebook’s requirements and reads the serialized model generated after executing the notebook. In the case of our running example, it would take a JSON array
/predict and return the number of the closest cluster center.
Future work and improvements
The goal of the prototype service is to show that it is possible to automatically convert notebooks that train predictive models to services that expose those models to clients. There are several ways in which the prototype could be improved:
Deploying a more robust service: currently, the model is wrapped in a simple Flask application running in a standalone (or development) server. Wrapping a model in a Flask application is essentially a running joke in the machine learning community because it’s obviously imperfect but it’s ubiquitous in any case. While Flask itself offers an attractive set of tradeoffs for developing microservices, the Flask development server is not appropriate for production deployments; other options would be better.
Serving a single prediction at once with a HTTP roundtrip and JSON serialization may not meet the latency or throughput requirements of the most demanding intelligent applications. Providing multiple service backends can address this problem: a more sophisticated builder could use the same source notebook to generate several services, e.g., a batch scoring endpoint, a service that consumes samples from a messaging bus and writes predictions to another, or even a service that delivers a signed, serialized model for direct execution within another application component.
The current prototype builder image is built up from the Fedora 27 source-to-image base image; on this base, it then installs Python and a bunch of packages to make it possible to execute Jupyter notebooks. The generated service image also installs its extra requirements in a virtual environment, but it retains some baggage from the builder image.1 A multi-stage build would make it possible to jettison dependencies only necessary for actually executing the notebook and building the image (in particular, Jupyter itself) while retaining only those dependencies necessary to actually execute the model.
Finally, a multi-stage build would enable cleverer dependency handling. The requirements to run any notebook are a subset of the requirements to run a particular notebook from start to finish, but the requirements to evaluate a model scoring function or sample validation function likely do not include all of the packages necessary to run the whole notebook (or even all of the packages necessary to run any notebook at all). By identifying only the dependencies necessary for model serving – perhaps even automatically – the serving image can be smaller and simpler.
The virtual environment is necessary so that the builder image can run without special privileges – that is, it need only write to the application directory to update the virtual environment. If we needed to update system packages, we’d need to run the builder image as
I’m excited to be speaking at Strata Data in New York this Wednesday afternoon! My talk introduces the benefits of Linux containers and container application platforms for data science workflows.
There are a lot of introductory tutorials about Linux containers, some of which are even ostensibly targeted to data scientists. However, most of these assume that readers in general (and data scientists in particular) really want to get their hands dirty right away packaging software in containers: “here’s a container recipe, here’s a YAML file, now change these to meet your requirements and you’re ready to go.”
I’ve spent a lot of time packaging software and, while I’m not bad at it, there are definitely things I’d rather be doing. Unfortunately, the ubiquity of container tooling has democratized software packaging without making the hard parts any easier; in the worst case, container tooling just makes it really easy to produce bad or unsafe binary packages. So, instead of showing my audience how to make container recipes, I wanted to focus on a few high-level tools that can enable anyone to enjoy the benefits of containers without having to become a packaging expert.
In the remainder of this post, I’ll share some more information about the tools and communities I mentioned.
The first tool I discussed is Binder, which is a service that takes a link to a Git repository with iPython notebooks and a Python requirements file and will build and start up a Jupyter server in a container to serve those notebooks. The example I showed was [this notebook repository] (https://github.com/willb/probabilistic-structures/) from my DevConf.us talk, which you can run under Binder by clicking here. Finally, like all of the tools I’ll mention, Binder is open-source if you want to run your own or contribute.
If you want a little more flexibility to build container images from source repositories without dealing with the hassles of packaging, the source-to-image tool developed by the OpenShift team at Red Hat is a great place to get started. The source-to-image tooling lets developers or data scientists focus on code while leaving the details of building container images to a packaging expert who develops a particular builder image. In my talk, I showed how to use
s2i to build the same notebook I’d served with Docker, using Graham Dumpleton’s excellent notebook s2i builder image and then deployed this image with OKD running on my laptop to get much the same result as I would with Binder; watch the embedded video to see what it looked like:
You aren’t restricted to reimplementing notebook serving with s2i, though; any time you want a repeatable way to create a container from a source repository is a candidate for a source-to-image build. Here are two especially cool examples:
- Seldon are using s2i to make it easier to deploy trained models.
- The radanalytics.io community have developed a source-to-image builder that deploys an intelligent application on Kubernetes along with its own Apache Spark cluster.
It’s also possible to set up source-to-image builds to trigger automatically when your git repository is updated – check the OpenShift architecture documentation and the OpenShift developer documentation for more details.
radanalytics.io and Kubeflow
The radanalytics.io community is focused on enabling intelligent applications on Kubernetes and OpenShift. The community has produced a containerized distribution of Apache Spark, source-to-image builders (as mentioned above), container images for Jupyter notebooks, and TensorFlow training and serving containers, as well as a source-to-image builder to generate custom TensorFlow binaries optimized for any machine. If your work involves bridging the gap between prototypes and production, or if you work with a cross-functional team to build applications that depend on machine learning, check it out!
Kubeflow is a community effort to package a variety of machine learning libraries and environments for Kubernetes, so that data scientists can work against the same environments locally that their organizations will ultimately deploy in production. So far, the community has packaged JupyterHub, TensorFlow, Seldon, Katib, PyTorch, and other frameworks.
Both of these communities are extremely friendly to newcomers, so if you want to get started building tools to make it easier to use containers for data science or machine learning, they’re great places to start!
This is a lightly-edited excerpt from a post on my long-defunct personal blog. Careful readers will note applications to engineering leadership, mentoring junior researchers, and public policy, among other domains.
When I was in the sixth grade, I entered the school science fair. I wrote a BASIC program to calculate what lengths of steel conduit would vibrate at certain frequencies and used its output to build an equal-tempered glockenspiel.1 Across the aisle from me was a poster for Pyramid Power, which is perhaps the greatest science fair project I’ve ever seen.
The greatness of this project started with an elaborate hand-drawn logo, which could have passed for that of a rock band with ample pinch harmonics and complicated hair-care protocols had it been etched on a desk or inked on the rear panel of a denim jacket. Beyond the exceptional logo, the poster contained all of the typical elementary-school science fair details: hypothesis, experimental method, equipment, results, and conclusion. The hypothesis was simple and implied the necessary equipment: if one built a pyramid out of cardboard and covered it with electrical tape, then one could run a wire from this pyramid to the soil of a potted plant. The plant would then flourish, the young scientist hypothesized, thanks to Pyramid Power.2
To demonstrate Pyramid Power, the student had executed a controlled experiment by raising two plants in nearly identical conditions, except that one plant would have the wire in its soil and benefit from Pyramid Power, while the control would not. Unfortunately, the experiment ended unexpectedly: the control group plant had flourished, but the experimental plant had withered and died almost immediately. However, as the researcher concluded, this apparently-confounding finding did not challenge the validity of the Pyramid Power hypothesis.
“Clearly, we needed a bigger pyramid.”
Greenspun’s tenth rule of programming states that
Any sufficiently complicated C or Fortran program contains an ad-hoc, informally-specified, bug-ridden, slow implementation of half of Common Lisp.
Expressive high-level languages with powerful runtimes are far more common now than they were in 1993, but the general insight behind Greenspun’s rule remains undeniable – lower-level environments may seem desirable because they’re unfettered by certain kinds of complexity and lack the (percieved) baggage of richer ones, but this baggage often turns out to be necessary to get real work done and winds up getting reinvented poorly.1
Linux containers present the illusion of a relatively baggage-free environment for software distribution, and it’s wonderful that people have built workflows to let you go from a commit passing CI to an immutable deployment. But the fantastic developer experience that container tooling offers has also inspired a lot of people to do unsafe things in production, because there’s effectively no barrier to entry; building containers essentially turns everyone into a Linux distribution vendor; and being a Linux distribution vendor is not a part of most people’s skill set.2
Even if we just consider security (and ignore issues of legality and stability, among others), there are many places that these ad-hoc distributions can go off the rails. Just think of how many
Dockerfiles (or similar image recipes) do things like
- running services as
- pulling down random binaries or tarballs from the public internet,
- building static binaries against an environment defined by an unvetted image downloaded from a public registry,
- building static binaries without any machine-readable or human-auditable representation of their dependencies, or
- relying on alternative C library implementations that are designed to save code size and are only ever deployed in containers.
I’ve had many conversations in the last five years in which someone has asserted that container tooling obviates other packaging mechanisms.3 But this assumes that the hard part of packaging, e.g., an RPM for Fedora is in using the Fedora release tooling to get binaries into an RPM-shaped container. The hard part, of course, is in satisfying the guidelines that the Fedora project has put in place to make it more likely that Fedora will be stable, secure, legal, and usable. Since the issue is not the shape of the package but rather what it contains, saying that you don’t need to know how to make, e.g., an RPM if you have containers misses the point: it’s like saying “I know how to encode an audio stream as an MP3 file, so I could have produced this MP3 of Palestrina’s ‘Sicut cervus.’”4
Container tooling makes it very easy to produce ad-hoc systems software distributions that don’t offer any of the value of traditional systems software distributions but still have many of their potential liabilities. Indeed, one might say that any containerized software distribution of sufficient complexity includes an ad-hoc, informally-specified, bug-ridden, and probably legally dubious implementation of half of the Fedora packaging guidelines.
(I’ve been meaning to write this post for a while; thanks to Paul Snively for inspiring me to finally get it done!)
Indeed, the concerns of distributing systems software aren’t even particularly obvious to people who haven’t spent time in this world. ↩
This conversation has even happened with people who work in the business of making open-source software consumable and supportable (and should probably know better). ↩
The analogy with Palestrina’s contrapuntal style, governed as it is by rules and constraints, is deliberate. ↩