Automatically generating model services from Jupyter notebooks
In my last post, I showed some applications of source-to-image workflows for data scientists. In this post, I’ll show another: automatically generating a model serving microservice from a git
repository containing a Jupyter notebook that trains a model. The prototype s2i builder I’ll be describing is available here as source or here as an image (check the blog-201810
tag).
Basic constraints
Obviously, practitioners can create notebooks that depend on any combination of packages or data, and that require any sort of oddball execution pattern you can imagine. For the purposes of this prototype, we’re going to be (somewhat) opinionated and impose a few requirements on the notebook:
- The notebook must work properly if all the cells execute in order.
- One of the notebook cells will declare the library dependencies for the notebook as a list of name, version lists called requirements, e.g., requirements = [['numpy', '1.10']].
- The notebook must declare a function called predictor, which will return the result of scoring the model on a provided sample.
- The notebook may declare a function called validator, which takes a sample and will return True if the sample provided is of the correct type and False otherwise. The generated service will use this to check if a sample has the right shape before scoring it. (If no validator is provided, the generated service will do no error-checking on arguments.)
A running example
Consider a simple example notebook. This notebook has requirements specified:
requirements = [["numpy", "1.15"], ["scikit-learn", "0.19.2"], ["scipy", "1.0.1"]]
It also trains a model (in this case, simply optimizing 7 cluster centers for random data):
import numpy as np
from sklearn.cluster import KMeans
DIMENSIONS = 2
randos = np.random.random((40000,DIMENSIONS))
kmodel = KMeans(n_clusters=7).fit(randos)
Finally, the notebook also specifies predictor and validator methods. (Note that the validator method is particularly optimistic – you’d want to do something more robust in production.)
def predictor(x):
return kmodel.predict([x])[0]
def validator(x):
return len(x) == DIMENSIONS
What the builder does
Our goal with a source-to-image builder is to turn this (indeed, any notebook satisfying the constraints mentioned above) into a microservice automatically. This service will run a basic application skeleton that exposes the model trained by the notebook on a REST endpoint. Here’s a high-level overview of how my prototype builder accomplishes this:
- It preprocesses the input notebook twice, once to generate a script that produces a requirements file from the requirements variable in the notebook and once to generate a script that produces a serialized model from the contents of the notebook,
- it runs the first script, generating a requirements.txt file (see the sketch after this list), which it then uses to install the dependencies of the notebook and the model service in a new virtual environment (which the model service will ultimately run under), and
- it runs the second script, which executes every cell of the notebook in order and then captures and serializes the predictor and validator functions to a file.
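Here's a minimal sketch of what that first generated script boils down to – turning the notebook's requirements variable into pip-style specifiers. (This is an illustration only, not the prototype's actual code; the real builder may format or pin versions differently.)

# Sketch: convert the notebook's declared requirements into a requirements file.
requirements = [["numpy", "1.15"], ["scikit-learn", "0.19.2"], ["scipy", "1.0.1"]]

with open("requirements.txt", "w") as f:
    for name, version in requirements:
        # write one "name==version" line per declared dependency
        f.write("%s==%s\n" % (name, version))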
The model service itself is a very simple Flask application that runs in the virtual Python environment created from the notebook’s requirements and reads the serialized model generated after executing the notebook. In the case of our running example, it would take a JSON array POSTed to /predict and return the number of the closest cluster center.
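To make the shape of that service concrete, here's a hedged sketch of a Flask skeleton along those lines. The filename model.pkl, the response format, and the assumption that the captured predictor and validator were serialized as a pickle-compatible pair are all illustrative, not the prototype's actual interface.

import pickle
from flask import Flask, jsonify, request

app = Flask(__name__)

# Assumption for this sketch: the builder pickled the captured (predictor, validator)
# pair to a placeholder file called model.pkl.
with open("model.pkl", "rb") as f:
    predictor, validator = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    sample = request.get_json(force=True)
    if validator is not None and not validator(sample):
        return jsonify(error="sample failed validation"), 400
    result = predictor(sample)
    # numpy scalars aren't JSON-serializable; unwrap them if necessary
    if hasattr(result, "item"):
        result = result.item()
    return jsonify(prediction=result)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)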
Future work and improvements
The goal of the prototype service is to show that it is possible to automatically convert notebooks that train predictive models to services that expose those models to clients. There are several ways in which the prototype could be improved:
Deploying a more robust service: currently, the model is wrapped in a simple Flask application running in a standalone (or development) server. Wrapping a model in a Flask application is essentially a running joke in the machine learning community because it’s obviously imperfect but ubiquitous nonetheless. While Flask itself offers an attractive set of tradeoffs for developing microservices, the Flask development server is not appropriate for production deployments; a production-grade WSGI server or a purpose-built model server would be a better option.
Serving a single prediction at a time with an HTTP roundtrip and JSON serialization may not meet the latency or throughput requirements of the most demanding intelligent applications. Providing multiple service backends can address this problem: a more sophisticated builder could use the same source notebook to generate several services, e.g., a batch scoring endpoint, a service that consumes samples from a messaging bus and writes predictions to another, or even a service that delivers a signed, serialized model for direct execution within another application component.
The current prototype builder image is built up from the Fedora 27 source-to-image base image; on this base, it then installs Python and a bunch of packages to make it possible to execute Jupyter notebooks. The generated service image also installs its extra requirements in a virtual environment, but it retains some baggage from the builder image.1 A multi-stage build would make it possible to jettison dependencies only necessary for actually executing the notebook and building the image (in particular, Jupyter itself) while retaining only those dependencies necessary to actually execute the model.
Finally, a multi-stage build would enable cleverer dependency handling. The requirements to run any notebook are a subset of the requirements to run a particular notebook from start to finish, but the requirements to evaluate a model scoring function or sample validation function likely do not include all of the packages necessary to run the whole notebook (or even all of the packages necessary to run any notebook at all). By identifying only the dependencies necessary for model serving – perhaps even automatically – the serving image can be smaller and simpler.
-
The virtual environment is necessary so that the builder image can run without special privileges – that is, it need only write to the application directory to update the virtual environment. If we needed to update system packages, we’d need to run the builder image as root. ↩
Materials for my Strata Data talk
I’m excited to be speaking at Strata Data in New York this Wednesday afternoon! My talk introduces the benefits of Linux containers and container application platforms for data science workflows.
There are a lot of introductory tutorials about Linux containers, some of which are even ostensibly targeted to data scientists. However, most of these assume that readers in general (and data scientists in particular) really want to get their hands dirty right away packaging software in containers: “here’s a container recipe, here’s a YAML file, now change these to meet your requirements and you’re ready to go.”
I’ve spent a lot of time packaging software and, while I’m not bad at it, there are definitely things I’d rather be doing. Unfortunately, the ubiquity of container tooling has democratized software packaging without making the hard parts any easier; in the worst case, container tooling just makes it really easy to produce bad or unsafe binary packages. So, instead of showing my audience how to make container recipes, I wanted to focus on a few high-level tools that can enable anyone to enjoy the benefits of containers without having to become a packaging expert.
In the remainder of this post, I’ll share some more information about the tools and communities I mentioned.
Binder
The first tool I discussed is Binder, which is a service that takes a link to a Git repository with IPython notebooks and a Python requirements file and will build and start up a Jupyter server in a container to serve those notebooks. The example I showed was this notebook repository (https://github.com/willb/probabilistic-structures/) from my DevConf.us talk, which you can run under Binder by clicking here. Finally, like all of the tools I’ll mention, Binder is open-source if you want to run your own or contribute.
source-to-image
If you want a little more flexibility to build container images from source repositories without dealing with the hassles of packaging, the source-to-image tool developed by the OpenShift team at Red Hat is a great place to get started. The source-to-image tooling lets developers or data scientists focus on code while leaving the details of building container images to a packaging expert who develops a particular builder image. In my talk, I showed how to use s2i to build the same notebook I’d served with Docker, using Graham Dumpleton’s excellent notebook s2i builder image, and then deployed this image with OKD running on my laptop to get much the same result as I would with Binder; watch the embedded video to see what it looked like:
You aren’t restricted to reimplementing notebook serving with s2i, though; any time you want a repeatable way to create a container from a source repository is a candidate for a source-to-image build. Here are two especially cool examples:
- Seldon are using s2i to make it easier to deploy trained models.
- The radanalytics.io community have developed a source-to-image builder that deploys an intelligent application on Kubernetes along with its own Apache Spark cluster.
It’s also possible to set up source-to-image builds to trigger automatically when your git repository is updated – check the OpenShift architecture documentation and the OpenShift developer documentation for more details.
radanalytics.io and Kubeflow
The radanalytics.io community is focused on enabling intelligent applications on Kubernetes and OpenShift. The community has produced a containerized distribution of Apache Spark, source-to-image builders (as mentioned above), container images for Jupyter notebooks, and TensorFlow training and serving containers, as well as a source-to-image builder to generate custom TensorFlow binaries optimized for any machine. If your work involves bridging the gap between prototypes and production, or if you work with a cross-functional team to build applications that depend on machine learning, check it out!
Kubeflow is a community effort to package a variety of machine learning libraries and environments for Kubernetes, so that data scientists can work against the same environments locally that their organizations will ultimately deploy in production. So far, the community has packaged JupyterHub, TensorFlow, Seldon, Katib, PyTorch, and other frameworks.
Both of these communities are extremely friendly to newcomers, so if you want to get started building tools to make it easier to use containers for data science or machine learning, they’re great places to start!
A parable about causality and abduction
This is a lightly-edited excerpt from a post on my long-defunct personal blog. Careful readers will note applications to engineering leadership, mentoring junior researchers, and public policy, among other domains.
When I was in the sixth grade, I entered the school science fair. I wrote a BASIC program to calculate what lengths of steel conduit would vibrate at certain frequencies and used its output to build an equal-tempered glockenspiel.1 Across the aisle from me was a poster for Pyramid Power, which is perhaps the greatest science fair project I’ve ever seen.
The greatness of this project started with an elaborate hand-drawn logo, which could have passed for that of a rock band with ample pinch harmonics and complicated hair-care protocols had it been etched on a desk or inked on the rear panel of a denim jacket. Beyond the exceptional logo, the poster contained all of the typical elementary-school science fair details: hypothesis, experimental method, equipment, results, and conclusion. The hypothesis was simple and implied the necessary equipment: if one built a pyramid out of cardboard and covered it with electrical tape, then one could run a wire from this pyramid to the soil of a potted plant. The plant would then flourish, the young scientist hypothesized, thanks to Pyramid Power.2
To demonstrate Pyramid Power, the student had executed a controlled experiment by raising two plants in nearly identical conditions, except that one plant would have the wire in its soil and benefit from Pyramid Power, while the control would not. Unfortunately, the experiment ended unexpectedly: the control group plant had flourished, but the experimental plant had withered and died almost immediately. However, as the researcher concluded, this apparently-confounding finding did not challenge the validity of the Pyramid Power hypothesis.
“Clearly, we needed a bigger pyramid.”
Ad-hoc, informally-specified, bug-ridden operating system distributions
Greenspun’s tenth rule of programming states that
Any sufficiently complicated C or Fortran program contains an ad-hoc, informally-specified, bug-ridden, slow implementation of half of Common Lisp.
Expressive high-level languages with powerful runtimes are far more common now than they were in 1993, but the general insight behind Greenspun’s rule remains undeniable – lower-level environments may seem desirable because they’re unfettered by certain kinds of complexity and lack the (perceived) baggage of richer ones, but this baggage often turns out to be necessary to get real work done and winds up getting reinvented poorly.1
Linux containers present the illusion of a relatively baggage-free environment for software distribution, and it’s wonderful that people have built workflows to let you go from a commit passing CI to an immutable deployment. But the fantastic developer experience that container tooling offers has also inspired a lot of people to do unsafe things in production, because there’s effectively no barrier to entry; building containers essentially turns everyone into a Linux distribution vendor; and being a Linux distribution vendor is not a part of most people’s skill set.2
Even if we just consider security (and ignore issues of legality and stability, among others), there are many places that these ad-hoc distributions can go off the rails. Just think of how many Dockerfiles (or similar image recipes) do things like
- running services as root,
- pulling down random binaries or tarballs from the public internet,
- building static binaries against an environment defined by an unvetted image downloaded from a public registry,
- building static binaries without any machine-readable or human-auditable representation of their dependencies, or
- relying on alternative C library implementations that are designed to save code size and are only ever deployed in containers.
I’ve had many conversations in the last five years in which someone has asserted that container tooling obviates other packaging mechanisms.3 But this assumes that the hard part of packaging, e.g., an RPM for Fedora, is in using the Fedora release tooling to get binaries into an RPM-shaped container. The hard part, of course, is in satisfying the guidelines that the Fedora project has put in place to make it more likely that Fedora will be stable, secure, legal, and usable. Since the issue is not the shape of the package but rather what it contains, saying that you don’t need to know how to make, e.g., an RPM if you have containers misses the point: it’s like saying “I know how to encode an audio stream as an MP3 file, so I could have produced this MP3 of Palestrina’s ‘Sicut cervus.’”4
Container tooling makes it very easy to produce ad-hoc systems software distributions that don’t offer any of the value of traditional systems software distributions but still have many of their potential liabilities. Indeed, one might say that any containerized software distribution of sufficient complexity includes an ad-hoc, informally-specified, bug-ridden, and probably legally dubious implementation of half of the Fedora packaging guidelines.
(I’ve been meaning to write this post for a while; thanks to Paul Snively for inspiring me to finally get it done!)
-
There’s a corollary to Greenspun’s rule for distributed systems and Erlang, naturally. ↩
-
Indeed, the concerns of distributing systems software aren’t even particularly obvious to people who haven’t spent time in this world. ↩
-
This conversation has even happened with people who work in the business of making open-source software consumable and supportable (and should probably know better). ↩
-
The analogy with Palestrina’s contrapuntal style, governed as it is by rules and constraints, is deliberate. ↩
Spark's RDD API, variance, and typeclasses
This brief post is based on material that Erik and I didn’t have time to cover in our Spark+AI Summit talk; it will show you how to use Scala’s implicit parameter mechanism to work around an aspect of the RDD API that can make it difficult to write generic functions. This post will be especially useful for experienced Spark users who are relatively new to Scala.
If you’ve written reusable code that uses Spark’s RDD API, you might have run into headaches related to variance. The RDD is an invariant API, meaning that RDD[T] and RDD[U] are unrelated types if T and U are different types – even if there is a subtyping relation between T and U.
Let’s say you had a Scala trait and some concrete class extending that trait, like these:
trait HasUserId { val userid: Int }
case class Transaction(override val userid: Int,
timestamp: Int,
amount: Double)
extends HasUserId {}
You might then want to write a function operating on an RDD of any type that is a subtype of your HasUserId trait, like this:
def badKeyByUserId(r: RDD[HasUserId]) = r.map(x => (x.userid, x))
Unfortunately, this code isn’t that useful, because RDDs are invariant. Let’s apply it to a concrete RDD of some type that is a subtype of HasUserId:
val xacts = spark.parallelize(Array(
Transaction(1, 1, 1.0),
Transaction(2, 2, 1.0)
))
badKeyByUserId(xacts)
This will fail to compile due to the type mismatch: we’ve supplied an org.apache.spark.rdd.RDD[Transaction] but the function required an org.apache.spark.rdd.RDD[HasUserId]. Since there is no subtyping relation between these two, we cannot supply the former in place of the latter. We could explicitly cast our RDD or its elements and get our code to compile and run:
/* cast the collection all at once */
badKeyByUserId(xacts.asInstanceOf[RDD[HasUserId]])
/* or cast each element */
badKeyByUserId(xacts.map(x => x.asInstanceOf[HasUserId]))
Explicit casts are clunky, though, and they also cost us precision: once we’ve cast up to RDD[(Int, HasUserId)], we have no safe way to get back to an RDD[(Int, Transaction)].
A better approach is to use Scala’s generic types in conjunction with implicit parameters to write a generic function that only accepts RDDs of some concrete type that is a subtype of HasUserId, like this:
def keyByUserId[T: ClassTag](r: RDD[T])(implicit bid: T => HasUserId) =
r.map(x => (bid(x).userid, x))
Let’s walk through what’s happening here. When we invoke keyByUserId with an RDD of some type T, the Scala compiler will first make sure there is a function in scope mapping from T to HasUserId.1 Put another way, the implicit formal parameter imposes a constraint on T – if there is a function that supplies evidence that T satisfies the constraint, the code will compile. This function will exist for any concrete subtype of HasUserId. We’ll then use that function to get a HasUserId-typed reference for each element of the collection so we can safely access the userid field. We’ll not only be able to apply that function to an RDD of Transaction objects, but it will return a result with a specific type: RDD[(Int, Transaction)].
It’s worth noting that we could also define a conversion from instances of some type unrelated to HasUserId to instances of HasUserId, meaning we aren’t restricted by the subtyping relationship. You can see a similar approach in action in my explanation of implementing database-style type translation in Scala’s type system.
It should be clear that using generics in this way can capture most of what we’d like to capture with a covariant collection (that is, a collection C such that C[T] <: C[U] iff T <: U). However, the general technique is more powerful than simply simulating covariance: what we’re doing here is using Scala’s implicit resolution to implement typeclasses so we can support typesafe ad hoc polymorphism. To see an example of how this affords us additional flexibility, let’s look at a generic method operating on RDDs of numeric values:
def multiply[T: ClassTag](r: RDD[T], v: T)(implicit ev: Numeric[T]) =
r.map(x => ev.times(x, v))
multiply(spark.parallelize(Array(1,2,3,4)),4).collect()
// => Array(4, 8, 12, 16)
multiply(spark.parallelize(Array(1.0,2.0,3.0,4.0)),4.0).collect()
// => Array(4.0, 8.0, 12.0, 16.0)
As you can see, the same multiply method works for integers and doubles; indeed, it will work on any of Scala’s numeric types as well as any type T for which you define an implicit instance of Numeric[T].
In conclusion, the RDD is invariant, but you can still do useful generic programming with it as long as you’re willing to use Scala’s implicit conversions.
-
It is also possible to supply one explicitly, in case there are several possible options. We can use implicitly to simulate the Scala compiler’s implicit resolution, so we could invoke our function the way that the Scala compiler does, like this: keyByUserId(xacts)(implicitly[Transaction => HasUserId]) ↩
Virtual handout for my Red Hat Summit talk
It’s an honor to present at Red Hat Summit again this year! I’m giving a brief introduction to machine learning concepts for developers. Of course, one can’t do justice to such a broad topic in a forty-minute session, but I have some materials for people who’d like to experiment with some fundamental ML techniques on their own time.
These materials are all presented as Jupyter notebooks, which combine code, narrative explanations, and output. These notebooks mean that you can inspect code, run it, change it, and experiment with it. The main thing to know about Jupyter is that notebooks are made up of cells, and pressing shift+enter will run the cell you’re currently on and move to the next one. If you get stuck, you can go up to the “Kernel” menu, and select “Restart and clear output.”
First up, this notebook can be run directly in your browser through the mybinder.org service – it presents an introduction to the scalable analytic techniques I mentioned in the beginning of the session.
If you’d like to dive deeper into specific machine learning techniques, you’ll need to fire up OpenShift:
- log in to an OpenShift cluster, or create a temporary one on your local machine with oc cluster up,
- create a pod to serve some more example notebooks with oc new-app radanalyticsio/workshop-notebook -e JUPYTER_NOTEBOOK_PASSWORD=developer, and
- expose a route to that pod with oc expose svc/workshop-notebook.
When you visit the route for the Jupyter pod, you’ll need to log in. The password is developer. After you log in, you’ll be presented with a list of notebook files. Here’s what each of them contains:
- ml-basics.ipynb contains visual explanations and examples of clustering, classification, and regression using Apache Spark,
- pyspark.ipynb introduces data engineering and data cleaning using Apache Spark and shows you how to train a natural language model on a data set from an open-source project, and
- var.ipynb shows you how to model data and run Monte Carlo simulations with Apache Spark using an example from the financial domain.
Finally, be sure to visit radanalytics.io to see examples of intelligent applications on OpenShift and strimzi.io to learn how to enable Apache Kafka on OpenShift.
You’re at the beginning of a really exciting journey! I hope these resources are helpful as you get started.
Virtual handout for my DevConf talk
I’m thrilled to be in Brno for DevConf again! This year I’m speaking about structures and techniques for scalable data processing, and this post is a virtual handout for my talk.
- Here’s a Jupyter notebook containing all of the code I discussed in the talk, complete with a few exercises and some links to other resources.
- There are lots of implementations of these techniques that you can use in production. Apache Spark, for example, uses some of these structures to support library operations like aggregates in structured queries. Algebird provides scalable and parallel implementations of all of the techniques I discussed and it is especially nice if you’re an algebraically-inclined fan of functional programming (guilty as charged).
- A very cool data structure that I didn’t have time to talk about is the t-digest, which calculates approximate cumulative distributions (so that you can take a stream of metric observations and ask, e.g., what’s the median latency, or the latency at the 99th percentile?). My friend Erik Erlandson has a really elegant Scala implementation of the t-digest and has also given several great talks on the t-digest and some really clever applications for scalable cumulative distribution estimates. Start with this one.
- radanalytics.io is a community effort to enable scalable data processing and intelligent applications on OpenShift, including tooling to manage compute resources in intelligent applications and a distribution of Apache Spark for OpenShift.
My slides are available here.
Misfeature recovery and APIs
My team recently agreed that it would improve the usability of our main Trello board if we moved lists containing cards we’d completed in previous years to archival boards. The idea was that those lists and cards would still be searchable and accessible but that they wouldn’t be cluttering our view of our current work. I moved very old lists late in November, moved all of our lists from 2017 at the beginning of this week, and prepared to bask in a web page containing only recent virtual index cards.
My basking ended abruptly, as baskings are wont to do. In this case, the abrupt end was occasioned by an offhand question from a colleague:
“By the way, what’s the deal with me getting removed from all of my old cards?”
I looked at the Trello board I’d created to archive activity from 2017 and saw that the only cards that had a member attached were my cards. Even though I’d made the archive board visible to the whole team, every other person on the team was removed from her cards when I moved the lists.
Now, I’m not a Trello expert and hope I’ll never become one. It may be that removing users from cards on boards they don’t belong to is actually the correct behavior. However, having such a drastic side effect occur without warning is absolutely user-hostile.1
Since software is rarely content to injure without also insulting, Trello also insinuated that I had explicitly removed my colleagues from their cards, like this:
So not only had I screwed up our team’s task history, but I looked like a jerk with too much free time.
Fortunately, all of those spurious explicit removals gave me a way to start unwinding the mess. Those member removals were captured in the actions log for each card as removeMemberFromCard actions; I was able to see them by exporting cards as JSON:2
"actions": [
{
/* ... */
"type": "removeMemberFromCard",
"date": "2018-01-11T15:57:16.652Z",
"member": {
"id": /* ... */,
"avatarHash": /* ... */,
"fullName": "Erik Erlandson",
"initials": "EJE",
"username": /* ... */
},
"memberCreator": {
"id": /* ... */,
"avatarHash": /* ... */,
"fullName": "William Benton",
"initials": "WB",
"username": /* ... */
}
}
],
Trello provides a pretty decent API, so I got to work. (The official Trello Python client appears to lack support for Python 3; I used py-trello instead.) My basic approach was to look for removeMemberFromCard actions that had happened since just before I moved the lists, identify the removed members from each card, and then add them back to the card.
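To illustrate the shape of that approach, here's a hedged sketch using the raw Trello REST API via requests rather than py-trello. The board ID, key, token, and cutoff date are placeholders, and the endpoint details reflect my reading of the Trello API documentation rather than the actual recovery script.

import requests

BASE = "https://api.trello.com/1"
AUTH = {"key": "YOUR-API-KEY", "token": "YOUR-OAUTH-TOKEN"}   # placeholders
CUTOFF = "2017-11-01T00:00:00.000Z"                           # just before the lists were moved

# every card on the archive board (placeholder board ID)
cards = requests.get("%s/boards/%s/cards" % (BASE, "ARCHIVE-BOARD-ID"), params=AUTH).json()

for card in cards:
    # ask explicitly for member-removal actions; the default filter omits them
    params = dict(AUTH, filter="removeMemberFromCard", since=CUTOFF)
    actions = requests.get("%s/cards/%s/actions" % (BASE, card["id"]), params=params).json()
    for action in actions:
        member_id = action["member"]["id"]
        if member_id not in card["idMembers"]:
            # put the spuriously removed member back on the card
            requests.post("%s/cards/%s/idMembers" % (BASE, card["id"]),
                          params=dict(AUTH, value=member_id))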
I was able to get our history restored pretty quickly. Here are some of the minor snags I hit with the Trello API and how I worked around them:
- By default, querying for actions on cards only returns card-creation actions and comments. You will need to specify an explicit action type filter to the API (e.g., removeMemberFromCard or all) in order to get all relevant actions.
- Even though I cached results and thought I took adequate care to avoid Trello’s rate limits, I found myself regularly getting rate-limited at the /1/members endpoint while resolving from member IDs to py-trello Member objects to pass to the add_member function on a card. I was able to work around this by converting the dict corresponding to the member in the action to a namedtuple instance, which acted enough like a Member object to do the trick.3 (See the sketch after this list.)
- Some cards didn’t have the removeMemberFromCard actions. This actually seems like a Trello bug, but I was able to work around it by adding everyone who had ever been added to a card but wasn’t currently on it. This means that there may be some people spuriously ascribed to cards now (i.e., people who should have been explicitly removed from cards), but I think it’s better to have slightly lower precision than virtually zero recall in this application. (Also, our team’s practice is to only add members to cards when they’re in progress or complete, which minimizes the potential impact here.)
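Here's a hedged sketch of that namedtuple workaround; member_dict stands in for the member dict taken from a removeMemberFromCard action, with the field values elided.

from collections import namedtuple

# member dict pulled from a removeMemberFromCard action (values elided)
member_dict = {"id": "...", "avatarHash": "...", "fullName": "Erik Erlandson",
               "initials": "EJE", "username": "..."}

# Build a lightweight stand-in with the same attributes as a py-trello Member,
# which avoids another round trip to the rate-limited /1/members endpoint.
FakeMember = namedtuple("FakeMember", member_dict.keys())
member = FakeMember(**member_dict)

# card.add_member(member)   # the stand-in quacks enough like a Member for this call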
My code, which is quick, dirty, and profoundly underengineered, is available for your review. To use it, you’ll need a Trello API key, OAuth secret, and token, all of which you can get from Trello’s developer site.
The code is certainly not that broadly useful, but hopefully the takeaway lesson is: you can recover from a lot of application bugs and misfeatures if your data model explicitly tracks state changes.4 It may even be worth going to a data representation that explicitly allows rollback in some cases. Finally, if you expose a way to inspect history with your API, users can even recover from your bugs without your help.
-
Could Trello have asked if I wanted to invite users to the board? Told me I’d be removing member ascriptions from the cards before moving? It certainly seems like it should have. ↩
-
I’ve redacted unnecessary information, including usernames, member IDs, and other IDs. ↩
-
Hooray for the Wild West of untyped languages, eh? ↩
-
It’s almost as if those wacky functional programming zealots have a point about their persistent data structures. ↩
Identifying Python module dependencies
This post is also available as an interactive notebook.
Background
Consider the following problem: you’d like to enable users to automatically extract a function from a Jupyter notebook and publish it as a service. Actually serializing a closure from a function in a given environment is not difficult, given the cloudpickle module. But merely constructing a serialized closure isn’t enough, since in general this function may require other modules to be available to run in another context.
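As a quick illustration of the easy part (a hedged sketch, not code from the rest of this post): cloudpickle can capture a function along with the values it refers to, and the result can be loaded with stock pickle in another process, provided that process has the right modules installed.

import pickle
import cloudpickle

factor = 3
def scale(x):
    return factor * x

payload = cloudpickle.dumps(scale)   # captures scale and the value of factor
restored = pickle.loads(payload)     # stock pickle can deserialize the result
assert restored(2) == 6

The catch is everything the restored function needs to have importable at load and call time.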
Therefore, we need some way to identify the modules required by a function (and, ultimately, the packages that provide these modules). Since engineering time is limited, it’s probably better to have an optimistic-but-incomplete (or unsound) estimate and allow users to override it (by supplying additional dependencies when they publish a function) than it is to have a sound and conservative module list.1
The modulefinder module in the Python standard library might initially seem like an attractive option, but it is unsuitable because it operates at the level of scripts. In order to use modulefinder on a single function from a notebook, we’d either have an imprecise module list (due to running the whole notebook) or we’d need to essentially duplicate a lot of its effort in order to slice backwards from the function invocation so we could extract a suitably pruned script.
Fortunately, you can interrogate nearly any property of any object in a Python program, including functions. If we could inspect the captured variables in a closure, we could identify the ones that are functions and figure out which modules they were declared in. That would look something like this:
import inspect
def module_frontier(f):
worklist = [f]
seen = set()
mods = set()
for fn in worklist:
cvs = inspect.getclosurevars(fn)
gvars = cvs.globals
for k, v in gvars.items():
if inspect.ismodule(v):
mods.add(v.__name__)
elif inspect.isfunction(v) and id(v) not in seen:
seen.add(id(v))
mods.add(v.__module__)
worklist.append(v)
elif hasattr(v, "__module__"):
mods.add(v.__module__)
return list(mods)
The inspect module provides a friendly interface to inspecting object metadata. In the above function, we’re constructing a worklist of all of the captured variables in a given function’s closure. We’re then constructing a set of all of the modules directly or transitively referred to by those captured variables, whether these are modules referred to directly, modules declaring functions referred to by captured variables, or modules declaring other values referred to by captured variables (e.g., native functions). Note that we add any functions we find to the worklist (although we don’t handle eval or other techniques), so we’ll capture at least some of the transitive call graph in this case.
This approach seems to work pretty sensibly on simple examples:
import numpy as np
from numpy import dot
def f(a, b):
return np.dot(a, b)
def g(a, b):
return dot(a, b)
def h(a, b):
return f(a, b)
{k.__name__ : module_frontier(k) for k in [f,g,h]}
{'f': ['numpy.core.multiarray', 'numpy'],
'g': ['numpy.core.multiarray'],
'h': ['numpy.core.multiarray', 'numpy', '__main__']}
It also works on itself, which is a relief:
module_frontier(module_frontier)
['inspect']
Problem cases
While these initial experiments are promising, we shouldn’t expect that a simple approach will cover everything we might want to do. Let’s look at a (slightly) more involved example to see if it breaks down.
We’ll use the k-means clustering implementation from scikit-learn to optimize some cluster centers in a model object. We’ll then capture that model object in a closure and analyze it to see what we might need to import to run it elsewhere.
from sklearn.cluster import KMeans
import numpy as np
data = np.random.rand(1000, 2)
model = KMeans(random_state=0).fit(data)
def km_predict_one(sample):
sample = np.array(sample).reshape(1,-1)
return model.predict(sample)[0]
km_predict_one([0.5, 0.5])
7
module_frontier(km_predict_one)
['sklearn.cluster.k_means_', 'numpy']
List comprehensions
So far, so good. Let’s say we want to publish this simple model as a lighter-weight service (without a scikit-learn dependency). We can get that by reimplementing the predict method from the k-means model:
centers = model.cluster_centers_
from numpy.linalg import norm
def km_predict_two(sample):
_, idx = min([(norm(sample - center), idx) for idx, center in enumerate(centers)])
return idx
km_predict_two([0.5, 0.5])
7
What do we get if we analyze the second method?
module_frontier(km_predict_two)
[]
This is a problem! We’d expect that norm would be a captured variable in the body of km_predict_two (and thus that numpy.linalg would be listed in its module frontier), but that isn’t the case. We can inspect the closure variables:
inspect.getclosurevars(km_predict_two)
ClosureVars(nonlocals={}, globals={'centers': array([[ 0.15441674, 0.15065163],
[ 0.47375581, 0.78146907],
[ 0.83512659, 0.19018115],
[ 0.16262154, 0.86710792],
[ 0.83007508, 0.83832402],
[ 0.16133578, 0.49974156],
[ 0.49490377, 0.22475294],
[ 0.75499895, 0.51576093]])}, builtins={'min': <built-in function min>, 'enumerate': <class 'enumerate'>}, unbound=set())
We can see the cluster centers as well as the min function and the enumerate type. But norm isn’t in the list. Let’s dive deeper. We can use the dis module (and some functionality that was introduced in Python 3.4) to inspect the Python bytecode for a given function:
from dis import Bytecode
for inst in Bytecode(km_predict_two):
print("%d: %s(%s)" % (inst.offset, inst.opname, inst.argrepr))
The list comprehension in km_predict_two is compiled into a nested code object, which shows up among the function's constants rather than among its closure variables:
km_predict_two.__code__.co_consts
(None,
<code object <listcomp> at 0x116dffd20, file "<ipython-input-8-5a350184a257>", line 5>,
'km_predict_two.<locals>.<listcomp>')
for inst in Bytecode(km_predict_two.__code__.co_consts[1]):
print("%d: %s(%s)" % (inst.offset, inst.opname, inst.argrepr))
The norm function is nevertheless reachable through the function's globals, even though getclosurevars didn't report it:
km_predict_two.__globals__["norm"]
<function numpy.linalg.linalg.norm>
_.__module__
'numpy.linalg.linalg'
import sys
def km_predict_three(sample):
# unnecessary nested function
def find_best(sample):
(n, i) = (sys.float_info.max, -1)
for idx, center in enumerate(centers):
(n, i) = min((n, i), (norm(sample - center), idx))
return i
return find_best(sample)
km_predict_three([0.5, 0.5])
7
module_frontier(km_predict_three)
[]
from dis import Bytecode
for inst in Bytecode(km_predict_three):
print("%d: %s(%s)" % (inst.offset, inst.opname, inst.argrepr))
from dis import Bytecode
for inst in [op for op in Bytecode(km_predict_three.__code__.co_consts[1]) if "LOAD_" in op.opname]:
print("%d: %s(%s)" % (inst.offset, inst.opname, inst.argrepr))
def km_predict_four(sample):
_, idx = min(map(lambda tup: (norm(sample - tup[1]), tup[0]), enumerate(centers)))
return idx
km_predict_four([0.5, 0.5])
7
module_frontier(km_predict_four)
[]
def km_predict_five(sample):
import numpy
import sys
from numpy.linalg import norm
from sys.float_info import max as MAX_FLOAT
(n, i) = (MAX_FLOAT, -1)
for idx, center in enumerate(centers):
(n, i) = min((n, i), (norm(sample - center), idx))
return i
module_frontier(km_predict_five)
['numpy.core.numeric',
'builtins',
'numpy.core.umath',
'numpy.linalg.linalg',
'numpy.core.multiarray',
'numpy.core.numerictypes',
'numpy',
'sys',
'numpy.core._methods',
'numpy.core.fromnumeric',
'numpy.lib.type_check',
'numpy.linalg._umath_linalg']
from dis import Bytecode
for inst in Bytecode(km_predict_five):
print("%d: %s(%r)" % (inst.offset, inst.opname, inst.argval))
def example_six():
import json
return json.loads("{'this-sure-is': 'confusing'}")
module_frontier(example_six)
[]
inspect.getclosurevars(example_six)
ClosureVars(nonlocals={}, globals={}, builtins={}, unbound={'loads', 'json'})
import json
module_frontier(example_six)
['json']
inspect.getclosurevars(example_six)
ClosureVars(nonlocals={}, globals={'json': <module 'json' from '/Users/willb/anaconda/lib/python3.6/json/__init__.py'>}, builtins={}, unbound={'loads'})
from dis import Bytecode
for inst in Bytecode(example_six):
print("%d: %s(%r)" % (inst.offset, inst.opname, inst.argval))
def interesting(inst):
from types import CodeType, FunctionType, ModuleType
from importlib import import_module
from functools import reduce
if inst.opname == "IMPORT_NAME":
path = inst.argval.split(".")
path[0] = [import_module(path[0])]
result = reduce(lambda x, a: x + [getattr(x[-1], a)], path)
return ("modules", result)
if inst.opname == "LOAD_GLOBAL":
if inst.argval in globals() and type(globals()[inst.argval]) in [CodeType, FunctionType]:
return ("code", globals()[inst.argval])
if inst.argval in globals() and type(globals()[inst.argval]) == ModuleType:
return ("modules", [globals()[inst.argval]])
else:
return None
if "LOAD_" in inst.opname and type(inst.argval) in [CodeType, FunctionType]:
return ("code", inst.argval)
return None
def mf_revised(f):
worklist = [f]
seen = set()
mods = set()
for fn in worklist:
codeworklist = [fn]
cvs = inspect.getclosurevars(fn)
gvars = cvs.globals
for k, v in gvars.items():
if inspect.ismodule(v):
mods.add(v.__name__)
elif inspect.isfunction(v) and id(v) not in seen:
seen.add(id(v))
mods.add(v.__module__)
worklist.append(v)
elif hasattr(v, "__module__"):
mods.add(v.__module__)
for block in codeworklist:
for (k, v) in [interesting(inst) for inst in Bytecode(block) if interesting(inst)]:
if k == "modules":
newmods = [mod.__name__ for mod in v if hasattr(mod, "__name__")]
mods.update(set(newmods))
elif k == "code" and id(v) not in seen:
seen.add(id(v))
if hasattr(v, "__module__"):
mods.add(v.__module__)
if(inspect.isfunction(v)):
worklist.append(v)
elif(inspect.iscode(v)):
codeworklist.append(v)
result = list(mods)
result.sort()
return result
mf_revised(km_predict_one)
['numpy', 'sklearn.cluster.k_means_']
mf_revised(km_predict_two)
['builtins',
'numpy',
'numpy.core._methods',
'numpy.core.fromnumeric',
'numpy.core.multiarray',
'numpy.core.numeric',
'numpy.core.numerictypes',
'numpy.core.umath',
'numpy.lib.type_check',
'numpy.linalg._umath_linalg',
'numpy.linalg.linalg']
mf_revised(km_predict_three)
['builtins',
'numpy',
'numpy.core._methods',
'numpy.core.fromnumeric',
'numpy.core.multiarray',
'numpy.core.numeric',
'numpy.core.numerictypes',
'numpy.core.umath',
'numpy.lib.type_check',
'numpy.linalg._umath_linalg',
'numpy.linalg.linalg',
'sys']
mf_revised(km_predict_four)
['builtins',
'numpy',
'numpy.core._methods',
'numpy.core.fromnumeric',
'numpy.core.multiarray',
'numpy.core.numeric',
'numpy.core.numerictypes',
'numpy.core.umath',
'numpy.lib.type_check',
'numpy.linalg._umath_linalg',
'numpy.linalg.linalg']
mf_revised(km_predict_five)
['builtins',
'numpy',
'numpy.core._methods',
'numpy.core.fromnumeric',
'numpy.core.multiarray',
'numpy.core.numeric',
'numpy.core.numerictypes',
'numpy.core.umath',
'numpy.lib.type_check',
'numpy.linalg',
'numpy.linalg._umath_linalg',
'numpy.linalg.linalg',
'sys']
-
Sound program analyses present conservative overapproximations of program behavior. Consider a may-alias analysis, which determines if two reference variables may refer to the same location in memory. Precise may-alias analysis is undecidable, but certain kinds of imprecision are acceptable. Often we’re interested in sound analyses to support verification or semantics-preserving program transformations, so false positives are acceptable but false negatives are not. Put another way, the worst that can come of spuriously identifying a pair of variables as potentially-aliasing is that we’d miss an opportunity to optimize our program; the worst that can come of not identifying a pair of potentially-aliasing variables as such is a program transformation that introduces a behavior change. By contrast, unsound analyses are imprecise but not conservative: both false positives and false negatives are possible. These analyses can still be useful for program understanding (e.g., in linters or static bug detectors) even if they are not sufficient to support safe program transformations. ↩
Red Hat talks at Spark Summit EU 2017
Dublin is a charming city and a burgeoning technology hub, but it also has special significance for anyone whose work involves making sense of data, since William Sealy Gosset was working as the head brewer at Guinness when he developed the t-statistic. Last week, Dublin had extra special significance for anyone whose work involves using Apache Spark for data processing. Our group at Red Hat gave three talks at Spark Summit EU this year, and videos of these are now online. You should check them out!
- One Pass Data Science In Apache Spark With Generative T-Digests, by Erik Erlandson,
- Fire in the Sky: An Introduction to Monitoring Apache Spark in the Cloud, by Mike McCune, and
- Building Machine Learning Algorithms on Apache Spark, by Will Benton.
A lot of the work we discussed is available from radanalytics.io or from the Isarn project; if you’d like to see other talks about data science, distributed computing, and best practices for contemporary intelligent applications, you should see our team’s list of presentations.