Identifying Python module dependencies
This post is also available as an interactive notebook.
Background
Consider the following problem: you’d like to enable users to automatically extract a function from a Jupyter notebook and publish it as a service. Actually serializing a closure from a function in a given environment is not difficult, given the cloudpickle
module. But merely constructing a serialized closure isn’t enough, since in general this function may require other modules to be available to run in another context.
Therefore, we need some way to identify the modules required by a function (and, ultimately, the packages that provide these modules). Since engineering time is limited, it’s probably better to have an optimistic-but-incomplete (or unsound) estimate and allow users to override it (by supplying additional dependencies when they publish a function) than it is to have a sound and conservative module list.1
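The serialization side really is the easy part. As a rough illustration (a minimal sketch, assuming cloudpickle is installed; make_greeter is just a made-up example), capturing and reconstituting a closure looks something like this:

import pickle
import cloudpickle

def make_greeter(greeting):
    # returns a closure that captures the value of greeting
    def greet(name):
        return "%s, %s!" % (greeting, name)
    return greet

# serialize the closure, together with the values it captures, to bytes...
payload = cloudpickle.dumps(make_greeter("Hello"))

# ...and reconstitute it (e.g., in another process) with the ordinary pickle module
greet = pickle.loads(payload)
greet("world")

Unpickling this particular closure needs nothing beyond the standard library, but in general the receiving environment has to be able to import whatever modules the function uses, and that's the problem at hand.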
The modulefinder
module in the Python standard library might initially seem like an attractive option, but it is unsuitable because it operates at the level of scripts. In order to use modulefinder
on a single function from a notebook, we’d either have an imprecise module list (due to running the whole notebook) or we’d need to essentially duplicate a lot of its effort in order to slice backwards from the function invocation so we could extract a suitably pruned script.
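For comparison, here's roughly what using modulefinder looks like (a sketch; the script path is hypothetical). Its entry point takes a path to a script file, which is exactly the mismatch described above:

from modulefinder import ModuleFinder

finder = ModuleFinder()
# modulefinder analyzes an entire script, not an individual function object
finder.run_script("exported_notebook.py")   # hypothetical path to an exported notebook
for name in sorted(finder.modules):
    print(name)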
Fortunately, you can interrogate nearly any property of any object in a Python program, including functions. If we could inspect the captured variables in a closure, we could identify the ones that are functions and figure out which modules they were declared in. That would look something like this:
import inspect

def module_frontier(f):
    worklist = [f]
    seen = set()
    mods = set()
    for fn in worklist:
        cvs = inspect.getclosurevars(fn)
        gvars = cvs.globals
        for k, v in gvars.items():
            if inspect.ismodule(v):
                mods.add(v.__name__)
            elif inspect.isfunction(v) and id(v) not in seen:
                seen.add(id(v))
                mods.add(v.__module__)
                worklist.append(v)
            elif hasattr(v, "__module__"):
                mods.add(v.__module__)
    return list(mods)
The inspect module provides a friendly interface for inspecting object metadata. In the above function, we construct a worklist of all of the captured variables in a given function's closure. We then build a set of all of the modules directly or transitively referred to by those captured variables: modules referred to directly, modules declaring functions referred to by captured variables, and modules declaring other values referred to by captured variables (e.g., native functions). Note that we add any functions we find to the worklist (although we don't handle eval or other dynamic tricks), so we'll capture at least some of the transitive call graph.
This approach seems to work pretty sensibly on simple examples:
import numpy as np
from numpy import dot

def f(a, b):
    return np.dot(a, b)

def g(a, b):
    return dot(a, b)

def h(a, b):
    return f(a, b)

{k.__name__ : module_frontier(k) for k in [f,g,h]}
{'f': ['numpy.core.multiarray', 'numpy'],
'g': ['numpy.core.multiarray'],
'h': ['numpy.core.multiarray', 'numpy', '__main__']}
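To see where h's results come from, note that f appears as a global function in h's closure variables, so the analysis records f's module (__main__) and then follows f itself, which is where numpy and numpy.core.multiarray come from. As a quick sanity check (the ClosureVars below is what I'd expect on this setup):

inspect.getclosurevars(h)
# expected: ClosureVars(nonlocals={}, globals={'f': <function f at 0x...>},
#                       builtins={}, unbound=set())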
It also works on itself, which is a relief:
module_frontier(module_frontier)
['inspect']
Problem cases
While these initial experiments are promising, we shouldn’t expect that a simple approach will cover everything we might want to do. Let’s look at a (slightly) more involved example to see if it breaks down.
We’ll use the k-means clustering implementation from scikit-learn to optimize some cluster centers in a model object. We’ll then capture that model object in a closure and analyze it to see what we might need to import to run it elsewhere.
from sklearn.cluster import KMeans
import numpy as np
data = np.random.rand(1000, 2)
model = KMeans(random_state=0).fit(data)
def km_predict_one(sample):
    sample = np.array(sample).reshape(1,-1)
    return model.predict(sample)[0]
km_predict_one([0.5, 0.5])
7
module_frontier(km_predict_one)
['sklearn.cluster.k_means_', 'numpy']
List comprehensions
So far, so good. Let’s say we want to publish this simple model as a lighter-weight service (without a scikit-learn dependency). We can get that by reimplementing the predict
method from the k-means model:
centers = model.cluster_centers_
from numpy.linalg import norm
def km_predict_two(sample):
    _, idx = min([(norm(sample - center), idx) for idx, center in enumerate(centers)])
    return idx
km_predict_two([0.5, 0.5])
7
What do we get if we analyze the second method?
module_frontier(km_predict_two)
[]
This is a problem! We’d expect that norm
would be a captured variable in the body of km_predict_two
(and thus that numpy.linalg
would be listed in its module frontier), but that isn’t the case. We can inspect the closure variables:
inspect.getclosurevars(km_predict_two)
ClosureVars(nonlocals={}, globals={'centers': array([[ 0.15441674, 0.15065163],
[ 0.47375581, 0.78146907],
[ 0.83512659, 0.19018115],
[ 0.16262154, 0.86710792],
[ 0.83007508, 0.83832402],
[ 0.16133578, 0.49974156],
[ 0.49490377, 0.22475294],
[ 0.75499895, 0.51576093]])}, builtins={'min': <built-in function min>, 'enumerate': <class 'enumerate'>}, unbound=set())
We can see the cluster centers as well as the min
function and the enumerate
type. But norm
isn’t in the list. Let’s dive deeper. We can use the dis
module (and some functionality that was introduced in Python 3.4) to inspect the Python bytecode for a given function:
from dis import Bytecode
for inst in Bytecode(km_predict_two):
    print("%d: %s(%s)" % (inst.offset, inst.opname, inst.argrepr))
The disassembly shows a LOAD_CONST of a nested code object: the list comprehension is compiled into its own code object, which we can see among the function's constants:

km_predict_two.__code__.co_consts
(None,
<code object <listcomp> at 0x116dffd20, file "<ipython-input-8-5a350184a257>", line 5>,
'km_predict_two.<locals>.<listcomp>')
Disassembling that nested code object shows where the reference to norm actually lives:

for inst in Bytecode(km_predict_two.__code__.co_consts[1]):
    print("%d: %s(%s)" % (inst.offset, inst.opname, inst.argrepr))
Because inspect.getclosurevars only examines the names referenced by the function's own code object, it never sees norm, even though the name resolves perfectly well in the function's global namespace:

km_predict_two.__globals__["norm"]
<function numpy.linalg.linalg.norm>
_.__module__
'numpy.linalg.linalg'
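We can confirm the mechanism by comparing the names referenced by the outer code object with the names referenced by the comprehension's code object (the tuples below are what I see on the Python 3.6 setup used here; the details vary by interpreter version):

km_predict_two.__code__.co_names
# ('min', 'enumerate', 'centers')

km_predict_two.__code__.co_consts[1].co_names
# ('norm',)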
Nested functions
Functions defined inside other functions present the same problem, since the inner function's body is also compiled into a nested code object:

import sys

def km_predict_three(sample):
    # unnecessary nested function
    def find_best(sample):
        (n, i) = (sys.float_info.max, -1)
        for idx, center in enumerate(centers):
            (n, i) = min((n, i), (norm(sample - center), idx))
        return i
    return find_best(sample)
km_predict_three([0.5, 0.5])
7
module_frontier(km_predict_three)
[]
The bytecode tells the same story as before: km_predict_three does little more than build a function from a constant code object and call it:

from dis import Bytecode
for inst in Bytecode(km_predict_three):
    print("%d: %s(%s)" % (inst.offset, inst.opname, inst.argrepr))
...and the interesting loads, including norm and sys, appear only inside the nested code object:

from dis import Bytecode
for inst in [op for op in Bytecode(km_predict_three.__code__.co_consts[1]) if "LOAD_" in op.opname]:
    print("%d: %s(%s)" % (inst.offset, inst.opname, inst.argrepr))
Lambda expressions
We see the same problem if we hide the norm call inside a lambda expression, which is also compiled to its own code object:

def km_predict_four(sample):
    _, idx = min(map(lambda tup: (norm(sample - tup[1]), tup[0]), enumerate(centers)))
    return idx
km_predict_four([0.5, 0.5])
7
module_frontier(km_predict_four)
[]
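As with the list comprehension, the lambda's body hides among km_predict_four's constants as a nested code object:

km_predict_four.__code__.co_consts
# (None, <code object <lambda> at 0x..., file "...", line ...>,
#  'km_predict_four.<locals>.<lambda>')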
Explicit imports
What if the function explicitly imports everything it needs inside its own body?

def km_predict_five(sample):
    import numpy
    import sys
    from numpy.linalg import norm
    from sys.float_info import max as MAX_FLOAT
    (n, i) = (MAX_FLOAT, -1)
    for idx, center in enumerate(centers):
        (n, i) = min((n, i), (norm(sample - center), idx))
    return i
module_frontier(km_predict_five)
['numpy.core.numeric',
'builtins',
'numpy.core.umath',
'numpy.linalg.linalg',
'numpy.core.multiarray',
'numpy.core.numerictypes',
'numpy',
'sys',
'numpy.core._methods',
'numpy.core.fromnumeric',
'numpy.lib.type_check',
'numpy.linalg._umath_linalg']
This looks like just what we wanted, but it's more fragile than it appears: the analysis finds norm and sys only because those names also happen to be bound in the notebook's global namespace by earlier cells, not because it understands the import statements in the function body. The import statements are right there in the bytecode, though:

from dis import Bytecode
for inst in Bytecode(km_predict_five):
    print("%d: %s(%r)" % (inst.offset, inst.opname, inst.argval))
To see just how accidental that success is, consider a function that imports a module inside its body when that module hasn't been imported at the top level of the notebook:

def example_six():
    import json
    return json.loads("{'this-sure-is': 'confusing'}")
module_frontier(example_six)
[]
inspect.getclosurevars(example_six)
ClosureVars(nonlocals={}, globals={}, builtins={}, unbound={'loads', 'json'})
Both json and loads are unbound as far as getclosurevars is concerned, so the dependency is invisible. But if we happen to import json at the top level of the notebook, everything changes:

import json
module_frontier(example_six)
['json']
inspect.getclosurevars(example_six)
ClosureVars(nonlocals={}, globals={'json': <module 'json' from '/Users/willb/anaconda/lib/python3.6/json/__init__.py'>}, builtins={}, unbound={'loads'})
The bytecode, on the other hand, records the import regardless of what happens to be bound in the global namespace:

from dis import Bytecode
for inst in Bytecode(example_six):
    print("%d: %s(%r)" % (inst.offset, inst.opname, inst.argval))
A better approach
We can make the analysis more robust by walking the bytecode as well as the closure variables. Two kinds of instructions interest us: IMPORT_NAME instructions, which tell us about modules imported inside a function body, and LOAD_ instructions whose arguments are functions, modules, or nested code objects, which give us more code to analyze. The following function classifies an instruction as referring to modules, to code, or to nothing we care about:

def interesting(inst):
    from types import CodeType, FunctionType, ModuleType
    from importlib import import_module
    from functools import reduce
    if inst.opname == "IMPORT_NAME":
        # resolve the (possibly dotted) module path to a list of module objects
        path = inst.argval.split(".")
        path[0] = [import_module(path[0])]
        result = reduce(lambda x, a: x + [getattr(x[-1], a)], path)
        return ("modules", result)
    if inst.opname == "LOAD_GLOBAL":
        # global loads of functions give us more code to analyze;
        # global loads of modules give us dependencies directly
        if inst.argval in globals() and type(globals()[inst.argval]) in [CodeType, FunctionType]:
            return ("code", globals()[inst.argval])
        if inst.argval in globals() and type(globals()[inst.argval]) == ModuleType:
            return ("modules", [globals()[inst.argval]])
        else:
            return None
    if "LOAD_" in inst.opname and type(inst.argval) in [CodeType, FunctionType]:
        # nested code objects: comprehensions, lambdas, and inner functions
        return ("code", inst.argval)
    return None
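As a quick sanity check, we can run it over the instructions of example_six; the IMPORT_NAME for json is the only instruction it flags (the output below is roughly what I'd expect):

[interesting(inst) for inst in Bytecode(example_six) if interesting(inst)]
# roughly: [('modules', [<module 'json' from '...'>])]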
We can then rework module_frontier to process both the closure variables and the interesting bytecode instructions for every function it encounters, including the instructions in nested code objects:

def mf_revised(f):
    worklist = [f]
    seen = set()
    mods = set()
    for fn in worklist:
        codeworklist = [fn]
        cvs = inspect.getclosurevars(fn)
        gvars = cvs.globals
        for k, v in gvars.items():
            if inspect.ismodule(v):
                mods.add(v.__name__)
            elif inspect.isfunction(v) and id(v) not in seen:
                seen.add(id(v))
                mods.add(v.__module__)
                worklist.append(v)
            elif hasattr(v, "__module__"):
                mods.add(v.__module__)
        for block in codeworklist:
            for (k, v) in [interesting(inst) for inst in Bytecode(block) if interesting(inst)]:
                if k == "modules":
                    newmods = [mod.__name__ for mod in v if hasattr(mod, "__name__")]
                    mods.update(set(newmods))
                elif k == "code" and id(v) not in seen:
                    seen.add(id(v))
                    if hasattr(v, "__module__"):
                        mods.add(v.__module__)
                    if inspect.isfunction(v):
                        worklist.append(v)
                    elif inspect.iscode(v):
                        codeworklist.append(v)
    result = list(mods)
    result.sort()
    return result
Let's see how the revised analysis fares on our examples:

mf_revised(km_predict_one)
['numpy', 'sklearn.cluster.k_means_']
mf_revised(km_predict_two)
['builtins',
'numpy',
'numpy.core._methods',
'numpy.core.fromnumeric',
'numpy.core.multiarray',
'numpy.core.numeric',
'numpy.core.numerictypes',
'numpy.core.umath',
'numpy.lib.type_check',
'numpy.linalg._umath_linalg',
'numpy.linalg.linalg']
mf_revised(km_predict_three)
['builtins',
'numpy',
'numpy.core._methods',
'numpy.core.fromnumeric',
'numpy.core.multiarray',
'numpy.core.numeric',
'numpy.core.numerictypes',
'numpy.core.umath',
'numpy.lib.type_check',
'numpy.linalg._umath_linalg',
'numpy.linalg.linalg',
'sys']
mf_revised(km_predict_four)
['builtins',
'numpy',
'numpy.core._methods',
'numpy.core.fromnumeric',
'numpy.core.multiarray',
'numpy.core.numeric',
'numpy.core.numerictypes',
'numpy.core.umath',
'numpy.lib.type_check',
'numpy.linalg._umath_linalg',
'numpy.linalg.linalg']
mf_revised(km_predict_five)
['builtins',
'numpy',
'numpy.core._methods',
'numpy.core.fromnumeric',
'numpy.core.multiarray',
'numpy.core.numeric',
'numpy.core.numerictypes',
'numpy.core.umath',
'numpy.lib.type_check',
'numpy.linalg',
'numpy.linalg._umath_linalg',
'numpy.linalg.linalg',
'sys']
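If what we're ultimately after is a list of packages to install, a crude first pass is to collapse each module name down to its top-level package (mapping those names to actual distributions, e.g. via pip or conda metadata, is a separate problem). A minimal sketch, with a made-up helper name:

def top_level_packages(f):
    # 'numpy.linalg.linalg' -> 'numpy', etc.; drop builtins, which need no install
    return sorted({m.split(".")[0] for m in mf_revised(f) if m != "builtins"})

top_level_packages(km_predict_five)
# ['numpy', 'sys']  (sys, of course, ships with the interpreter)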
1. Sound program analyses present conservative overapproximations of program behavior. Consider a may-alias analysis, which determines whether two reference variables may refer to the same location in memory. Precise may-alias analysis is undecidable, but certain kinds of imprecision are acceptable. Often we're interested in sound analyses to support verification or semantics-preserving program transformations, so false positives are acceptable but false negatives are not. Put another way, the worst that can come of spuriously identifying a pair of variables as potentially-aliasing is that we'd miss an opportunity to optimize our program; the worst that can come of not identifying a pair of potentially-aliasing variables as such is a program transformation that introduces a behavior change. By contrast, unsound analyses are imprecise but not conservative: both false positives and false negatives are possible. These analyses can still be useful for program understanding (e.g., in linters or static bug detectors) even if they are not sufficient to support safe program transformations. ↩