I’m thrilled to be in Brno for DevConf again! This year I’m speaking about structures and techniques for scalable data processing, and this post is a virtual handout for my talk.

  • Here’s a Jupyter notebook containing all of the code I discussed in the talk, complete with a few exercises and some links to other resources.
  • There are lots of implementations of these techniques that you can use in production. Apache Spark, for example, uses some of these structures to support library operations like aggregates in structured queries. Algebird provides scalable and parallel implementations of all of the techniques I discussed and it is especially nice if you’re an algebraically-inclined fan of functional programming (guilty as charged).
  • A very cool data structure that I didn’t have time to talk about is the t-digest, which calculates approximate cumulative distributions (so that you can take a stream of metric observations and ask, e.g., what’s the median latency, or the latency at the 99th percentile?). My friend Erik Erlandson has a really elegant Scala implementation of the t-digest and has also given several great talks on the t-digest and some really clever applications for scalable cumulative distribution estimates. Start with this one. (There’s a small Python sketch of the quantile-query idea after this list.)
  • radanalytics.io is a community effort to enable scalable data processing and intelligent applications on OpenShift, including tooling to manage compute resources in intelligent applications and a distribution of Apache Spark for OpenShift.
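
If you want to play with the t-digest idea in Python before diving into the Scala implementation, here is a minimal sketch of the quantile-query workflow. It assumes the third-party tdigest package from PyPI, whose API may differ slightly between versions:

    # Approximate quantile queries over a stream of observations, assuming the
    # third-party "tdigest" package (pip install tdigest).
    import random
    from tdigest import TDigest

    digest = TDigest()
    for _ in range(100000):
        # pretend these are observed request latencies, in milliseconds
        digest.update(random.lognormvariate(3, 0.5))

    print("median latency:", digest.percentile(50))
    print("99th-percentile latency:", digest.percentile(99))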

My slides are available here.

My team recently agreed that it would improve the usability of our main Trello board if we moved lists containing cards we’d completed in previous years to archival boards. The idea was that those lists and cards would still be searchable and accessible but that they wouldn’t be cluttering our view of our current work. I moved very old lists late in November, moved all of our lists from 2017 at the beginning of this week, and prepared to bask in a web page containing only recent virtual index cards.

My basking ended abruptly, as baskings are wont to do. In this case, the abrupt end was occasioned by an offhand question from a colleague:

“By the way, what’s the deal with me getting removed from all of my old cards?”

I looked at the Trello board I’d created to archive activity from 2017 and saw that the only cards that had a member attached were my cards. Even though I’d made the archive board visible to the whole team, every other person on the team was removed from her cards when I moved the lists.

Now, I’m not a Trello expert and hope I’ll never become one. It may be that removing users from cards on boards they don’t belong to is actually the correct behavior. However, having such a drastic side effect occur without warning is absolutely user-hostile.1

Since software is rarely content to injure without also insulting, Trello also insinuated (in each card’s activity log) that I had explicitly removed my colleagues from their cards.

So not only had I screwed up our team’s task history, but I looked like a jerk with too much free time.

Fortunately, all of those spurious explicit removals gave me a way to start unwinding the mess. Those member removals were captured in the actions log for each card as removeMemberFromCard actions; I was able to see them by exporting cards as JSON:2

    "actions": [
      {
        /* ... */
        "type": "removeMemberFromCard",
        "date": "2018-01-11T15:57:16.652Z",
        "member": {
          "id":  /* ... */,
          "avatarHash":  /* ... */,
          "fullName": "Erik Erlandson",
          "initials": "EJE",
          "username":  /* ... */
        },
        "memberCreator": {
          "id":  /* ... */,
          "avatarHash":  /* ... */,
          "fullName": "William Benton",
          "initials": "WB",
          "username":  /* ... */
        }
      }
    ],

Trello provides a pretty decent API, so I got to work. (The official Trello Python client appears to lack support for Python 3; I used py-trello instead.) My basic approach was to look for removeMemberFromCard actions that had happened since just before I moved the lists, identify the removed members from each card, and then add them back to the card.
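
In case it’s useful to anyone cleaning up a similar mess, here’s a rough sketch of that approach against the raw Trello REST API (rather than py-trello); the key, token, board ID, and since timestamp are placeholders you’d need to supply:

    # Rough sketch of the repair:  find recent removeMemberFromCard actions on
    # each card and re-add the removed members.  Uses the raw REST API rather
    # than py-trello; KEY, TOKEN, and the board ID are placeholders.
    import requests

    API = "https://api.trello.com/1"
    AUTH = {"key": "KEY", "token": "TOKEN"}

    def restore_members(board_id, since):
        # "since" is an ISO-8601 timestamp from just before the lists were moved
        cards = requests.get("%s/boards/%s/cards" % (API, board_id), params=AUTH).json()
        for card in cards:
            params = dict(AUTH, filter="removeMemberFromCard", since=since)
            actions = requests.get("%s/cards/%s/actions" % (API, card["id"]), params=params).json()
            removed = {a["member"]["id"] for a in actions}
            for member_id in removed - set(card["idMembers"]):
                # POST /1/cards/{id}/idMembers adds a member back to a card
                requests.post("%s/cards/%s/idMembers" % (API, card["id"]),
                              params=dict(AUTH, value=member_id))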

I was able to get our history restored pretty quickly. Here are some of the minor snags I hit with the Trello API and how I worked around them:

  • By default, querying for actions on cards only returns card-creation actions and comments. You will need to specify an explicit action type filter to the API (e.g., removeMemberFromCard or all) in order to get all relevant actions.
  • Even though I cached results and thought I took adequate care to avoid Trello’s rate limits, I found myself regularly getting rate-limited at the /1/members endpoint while resolving member IDs to py-trello Member objects to pass to the add_member function on a card. I was able to work around this by converting the dict corresponding to the member in the action to a namedtuple instance, which acted enough like a Member object to do the trick (there’s a sketch of this workaround after the list).3
  • Some cards didn’t have the removeMemberFromCard actions. This actually seems like a Trello bug, but I was able to work around it by adding everyone who had ever been added to a card but wasn’t currently on it. This means that there may be some people spuriously ascribed to cards now (i.e., people who should have been explicitly removed from cards), but I think it’s better to have slightly lower precision than virtually zero recall in this application. (Also, our team’s practice is to only add members to cards when they’re in progress or complete, which minimizes the potential impact here.)
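
Here’s roughly what that namedtuple workaround looked like; the exact fields are whatever py-trello’s add_member actually reads off a Member object (the id, at minimum), so treat the field list as a guess:

    from collections import namedtuple

    # A stand-in for py-trello's Member that quacks enough like the real thing
    # to pass to Card.add_member without another round-trip to /1/members.
    FakeMember = namedtuple("FakeMember", ["id", "full_name", "username"])

    def member_from_action(action):
        m = action["member"]
        return FakeMember(id=m["id"], full_name=m["fullName"], username=m["username"])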

My code, which is quick, dirty, and profoundly underengineered, is available for your review. To use it, you’ll need a Trello API key, OAuth secret, and token, all of which you can get from Trello’s developer site.

The code is certainly not that broadly useful, but hopefully the takeaway lesson is: you can recover from a lot of application bugs and misfeatures if your data model explicitly tracks state changes.4 It may even be worth going to a data representation that explicitly allows rollback in some cases. Finally, if you expose a way to inspect history with your API, users can even recover from your bugs without your help.


  1. Could Trello have asked if I wanted to invite users to the board? Told me I’d be removing member ascriptions from the cards before moving? It certainly seems like it should have. 

  2. I’ve redacted unnecessary information, including usernames, member IDs, and other IDs. 

  3. Hooray for the Wild West of untyped languages, eh? 

  4. It’s almost as if those wacky functional programming zealots have a point about their persistent data structures. 

This post is also available as an interactive notebook.

Background

Consider the following problem: you’d like to enable users to automatically extract a function from a Jupyter notebook and publish it as a service. Actually serializing a closure from a function in a given environment is not difficult, given the cloudpickle module. But merely constructing a serialized closure isn’t enough, since in general this function may require other modules to be available to run in another context.
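
The serialization part, for what it’s worth, is about this involved. Here’s a small sketch of a cloudpickle round trip; the greet function and its captured global are made up for illustration:

    # A sketch of closure serialization with cloudpickle; the function being
    # serialized is a made-up example.
    import pickle
    import cloudpickle

    greeting = "hello from another process"

    def greet(name):
        # captures the global "greeting" in its closure environment
        return "%s, %s!" % (greeting, name)

    payload = cloudpickle.dumps(greet)   # bytes you could ship elsewhere
    restored = pickle.loads(payload)     # ordinary pickle can load it
    print(restored("world"))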

Therefore, we need some way to identify the modules required by a function (and, ultimately, the packages that provide these modules). Since engineering time is limited, it’s probably better to have an optimistic-but-incomplete (or unsound) estimate and allow users to override it (by supplying additional dependencies when they publish a function) than it is to have a sound and conservative module list.1

The modulefinder module in the Python standard library might initially seem like an attractive option, but it is unsuitable because it operates at the level of scripts. In order to use modulefinder on a single function from a notebook, we’d either have an imprecise module list (due to running the whole notebook) or we’d need to essentially duplicate a lot of its effort in order to slice backwards from the function invocation so we could extract a suitably pruned script.
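
For comparison, here’s roughly what modulefinder usage looks like; note that it wants a whole script on disk rather than a function object (the script name below is a placeholder):

    # modulefinder operates on scripts, not on individual function objects,
    # which is why it isn't a good fit here.  "some_script.py" is a placeholder.
    from modulefinder import ModuleFinder

    finder = ModuleFinder()
    finder.run_script("some_script.py")

    # every module the whole script pulls in, not just what one function needs
    print(sorted(finder.modules.keys()))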

Fortunately, you can interrogate nearly any property of any object in a Python program, including functions. If we could inspect the captured variables in a closure, we could identify the ones that are functions and figure out which modules they were declared in. That would look something like this:

In: 1
import inspect

def module_frontier(f):
  worklist = [f]
  seen = set()
  mods = set()
  for fn in worklist:
    cvs = inspect.getclosurevars(fn)
    gvars = cvs.globals
    for k, v in gvars.items():
      if inspect.ismodule(v):
        mods.add(v.__name__)
      elif inspect.isfunction(v) and id(v) not in seen:
        seen.add(id(v))
        mods.add(v.__module__)
        worklist.append(v)
      elif hasattr(v, "__module__"):
        mods.add(v.__module__)
  return list(mods)

The inspect module provides a friendly interface to inspecting object metadata. In the above function, we’re constructing a worklist of all of the captured variables in a given function’s closure. We’re then constructing a set of all of the modules directly or transitively referred to by those captured variables, whether these are modules referred to directly, modules declaring functions referred to by captured variables, or modules declaring other values referred to by captured variables (e.g., native functions). Note that we add any functions we find to the worklist (although we don’t handle eval or other techniques), so we’ll capture at least some of the transitive call graph in this case.

This approach seems to work pretty sensibly on simple examples:

In: 2
import numpy as np
from numpy import dot

def f(a, b):
    return np.dot(a, b)

def g(a, b):
    return dot(a, b)

def h(a, b):
    return f(a, b)
In: 3
{k.__name__ : module_frontier(k) for k in [f,g,h]}
Out: 3
{'f': ['numpy.core.multiarray', 'numpy'],
 'g': ['numpy.core.multiarray'],
 'h': ['numpy.core.multiarray', 'numpy', '__main__']}

It also works on itself, which is a relief:

In: 4
module_frontier(module_frontier)
Out: 4
['inspect']

Problem cases

While these initial experiments are promising, we shouldn’t expect that a simple approach will cover everything we might want to do. Let’s look at a (slightly) more involved example to see if it breaks down.

We’ll use the k-means clustering implementation from scikit-learn to optimize some cluster centers in a model object. We’ll then capture that model object in a closure and analyze it to see what we might need to import to run it elsewhere.

In: 5
from sklearn.cluster import KMeans
import numpy as np

data = np.random.rand(1000, 2)

model = KMeans(random_state=0).fit(data)

def km_predict_one(sample):
    sample = np.array(sample).reshape(1,-1)
    return model.predict(sample)[0]
In: 6
km_predict_one([0.5, 0.5])
Out: 6
7
In: 7
module_frontier(km_predict_one)
Out: 7
['sklearn.cluster.k_means_', 'numpy']

List comprehensions

So far, so good. Let’s say we want to publish this simple model as a lighter-weight service (without a scikit-learn dependency). We can get that by reimplementing the predict method from the k-means model:

In: 8
centers = model.cluster_centers_
from numpy.linalg import norm

def km_predict_two(sample):
    _, idx = min([(norm(sample - center), idx) for idx, center in enumerate(centers)])
    return idx
In: 9
km_predict_two([0.5, 0.5])
Out: 9
7

What do we get if we analyze the second method?

In: 10
module_frontier(km_predict_two)
Out: 10
[]

This is a problem! We’d expect that norm would be a captured variable in the body of km_predict_two (and thus that numpy.linalg would be listed in its module frontier), but that isn’t the case. We can inspect the closure variables:

In: 11
inspect.getclosurevars(km_predict_two)
Out: 11
ClosureVars(nonlocals={}, globals={'centers': array([[ 0.15441674,  0.15065163],
       [ 0.47375581,  0.78146907],
       [ 0.83512659,  0.19018115],
       [ 0.16262154,  0.86710792],
       [ 0.83007508,  0.83832402],
       [ 0.16133578,  0.49974156],
       [ 0.49490377,  0.22475294],
       [ 0.75499895,  0.51576093]])}, builtins={'min': <built-in function min>, 'enumerate': <class 'enumerate'>}, unbound=set())

We can see the cluster centers as well as the min function and the enumerate type. But norm isn’t in the list. Let’s dive deeper. We can use the dis module (and some functionality that was introduced in Python 3.4) to inspect the Python bytecode for a given function:

In: 12
from dis import Bytecode
for inst in Bytecode(km_predict_two):
    print("%d: %s(%s)" % (inst.offset, inst.opname, inst.argrepr))
Out: 12
0: LOAD_GLOBAL(min)
2: LOAD_CLOSURE(sample)
4: BUILD_TUPLE()
6: LOAD_CONST(<code object <listcomp> at 0x116dffd20, file "<ipython-input-8-5a350184a257>", line 5>)
8: LOAD_CONST('km_predict_two.<locals>.<listcomp>')
10: MAKE_FUNCTION()
12: LOAD_GLOBAL(enumerate)
14: LOAD_GLOBAL(centers)
16: CALL_FUNCTION()
18: GET_ITER()
20: CALL_FUNCTION()
22: CALL_FUNCTION()
24: UNPACK_SEQUENCE()
26: STORE_FAST(_)
28: STORE_FAST(idx)
30: LOAD_FAST(idx)
32: RETURN_VALUE()

Ah ha! The body of our list comprehension, which contains the call to norm, is a separate code object that has been stored in a constant. Let’s look at the constants for our function:

In: 13
km_predict_two.__code__.co_consts
Out: 13
(None,
 <code object <listcomp> at 0x116dffd20, file "<ipython-input-8-5a350184a257>", line 5>,
 'km_predict_two.<locals>.<listcomp>')

We can see the code object in the constant list and use dis to disassemble it as well:

In: 14
for inst in Bytecode(km_predict_two.__code__.co_consts[1]):
    print("%d: %s(%s)" % (inst.offset, inst.opname, inst.argrepr))
Out: 14
0: BUILD_LIST()
2: LOAD_FAST(.0)
4: FOR_ITER(to 30)
6: UNPACK_SEQUENCE()
8: STORE_FAST(idx)
10: STORE_FAST(center)
12: LOAD_GLOBAL(norm)
14: LOAD_DEREF(sample)
16: LOAD_FAST(center)
18: BINARY_SUBTRACT()
20: CALL_FUNCTION()
22: LOAD_FAST(idx)
24: BUILD_TUPLE()
26: LIST_APPEND()
28: JUMP_ABSOLUTE()
30: RETURN_VALUE()

Once we’ve done so, we can see that the list comprehension has loaded norm from a global, which we can then resolve and inspect:

In: 15
km_predict_two.__globals__["norm"]
Out: 15
<function numpy.linalg.linalg.norm>
In: 16
_.__module__
Out: 16
'numpy.linalg.linalg'

Nested functions and lambda expressions

We can see a similar problem if we look at a function with local definitions (note that there is no need for the nesting in this example other than to expose a limitation of our technique):

In: 17
import sys

def km_predict_three(sample):
    # unnecessary nested function
    def find_best(sample):
        (n, i) = (sys.float_info.max, -1)
        for idx, center in enumerate(centers):
            (n, i) = min((n, i), (norm(sample - center), idx))
        return i
    return find_best(sample)

km_predict_three([0.5, 0.5])
Out: 17
7
In: 18
module_frontier(km_predict_three)
Out: 18
[]

In this case, we can see that Python compiles these nested functions in essentially the same way it compiles the bodies of list comprehensions:

In: 19
from dis import Bytecode
for inst in Bytecode(km_predict_three):
    print("%d: %s(%s)" % (inst.offset, inst.opname, inst.argrepr))
Out: 19
0: LOAD_CONST(<code object find_best at 0x116e0a390, file "<ipython-input-17-e19a1ac37885>", line 5>)
2: LOAD_CONST('km_predict_three.<locals>.find_best')
4: MAKE_FUNCTION()
6: STORE_FAST(find_best)
8: LOAD_FAST(find_best)
10: LOAD_FAST(sample)
12: CALL_FUNCTION()
14: RETURN_VALUE()

But we can inspect the nested function just as we did the list comprehension. Let’s do that but just look at the load instructions; we’ll see that we’ve loaded sys and norm as we’d expect.

In: 20
from dis import Bytecode
for inst in [op for op in Bytecode(km_predict_three.__code__.co_consts[1]) if "LOAD_" in op.opname]:
    print("%d: %s(%s)" % (inst.offset, inst.opname, inst.argrepr))
Out: 20
0: LOAD_GLOBAL(sys)
2: LOAD_ATTR(float_info)
4: LOAD_ATTR(max)
6: LOAD_CONST(-1)
16: LOAD_GLOBAL(enumerate)
18: LOAD_GLOBAL(centers)
32: LOAD_GLOBAL(min)
34: LOAD_FAST(n)
36: LOAD_FAST(i)
40: LOAD_GLOBAL(norm)
42: LOAD_FAST(sample)
44: LOAD_FAST(center)
50: LOAD_FAST(idx)
66: LOAD_FAST(i)

Predictably, we can also see a similar problem if we analyze a function with lambda expressions:

In: 21
def km_predict_four(sample):
    _, idx = min(map(lambda tup: (norm(sample - tup[1]), tup[0]), enumerate(centers)))
    return idx

km_predict_four([0.5, 0.5])
Out: 21
7
In: 22
module_frontier(km_predict_four)
Out: 22
[]

Explicit imports

Let’s look at what happens when we import modules inside the function we’re analyzing. Because of the semantics of import in Python, the module dependency list for this function will depend on whether or not we’ve already imported numpy and sys in the global namespace. If we have, we’ll get a reasonable module list; if we haven’t, we’ll get an empty module list. (If you’re running this code in a notebook, you can try it out by restarting the kernel, re-executing the cell with the definition of module_frontier, and then executing this cell.)

In: 23
def km_predict_five(sample):
    import numpy
    import sys
    from numpy.linalg import norm
    from sys.float_info import max as MAX_FLOAT
    
    (n, i) = (MAX_FLOAT, -1)
    for idx, center in enumerate(centers):
        (n, i) = min((n, i), (norm(sample - center), idx))
    return i

module_frontier(km_predict_five)
Out: 23
['numpy.core.numeric',
 'builtins',
 'numpy.core.umath',
 'numpy.linalg.linalg',
 'numpy.core.multiarray',
 'numpy.core.numerictypes',
 'numpy',
 'sys',
 'numpy.core._methods',
 'numpy.core.fromnumeric',
 'numpy.lib.type_check',
 'numpy.linalg._umath_linalg']
In: 24
from dis import Bytecode
for inst in Bytecode(km_predict_five):
    print("%d: %s(%r)" % (inst.offset, inst.opname, inst.argval))
Out: 24
0: LOAD_CONST(0)
2: LOAD_CONST(None)
4: IMPORT_NAME('numpy')
6: STORE_FAST('numpy')
8: LOAD_CONST(0)
10: LOAD_CONST(None)
12: IMPORT_NAME('sys')
14: STORE_FAST('sys')
16: LOAD_CONST(0)
18: LOAD_CONST(('norm',))
20: IMPORT_NAME('numpy.linalg')
22: IMPORT_FROM('norm')
24: STORE_FAST('norm')
26: POP_TOP(None)
28: LOAD_CONST(0)
30: LOAD_CONST(('max',))
32: IMPORT_NAME('sys.float_info')
34: IMPORT_FROM('max')
36: STORE_FAST('MAX_FLOAT')
38: POP_TOP(None)
40: LOAD_FAST('MAX_FLOAT')
42: LOAD_CONST(-1)
44: ROT_TWO(None)
46: STORE_FAST('n')
48: STORE_FAST('i')
50: SETUP_LOOP(102)
52: LOAD_GLOBAL('enumerate')
54: LOAD_GLOBAL('centers')
56: CALL_FUNCTION(1)
58: GET_ITER(None)
60: FOR_ITER(100)
62: UNPACK_SEQUENCE(2)
64: STORE_FAST('idx')
66: STORE_FAST('center')
68: LOAD_GLOBAL('min')
70: LOAD_FAST('n')
72: LOAD_FAST('i')
74: BUILD_TUPLE(2)
76: LOAD_FAST('norm')
78: LOAD_FAST('sample')
80: LOAD_FAST('center')
82: BINARY_SUBTRACT(None)
84: CALL_FUNCTION(1)
86: LOAD_FAST('idx')
88: BUILD_TUPLE(2)
90: CALL_FUNCTION(2)
92: UNPACK_SEQUENCE(2)
94: STORE_FAST('n')
96: STORE_FAST('i')
98: JUMP_ABSOLUTE(60)
100: POP_BLOCK(None)
102: LOAD_FAST('i')
104: RETURN_VALUE(None)

We can see this more clearly by importing a module in a function’s scope that we haven’t imported into the global namespace:

In: 25
def example_six():
    import json
    return json.loads("{'this-sure-is': 'confusing'}")

module_frontier(example_six)
Out: 25
[]
In: 26
inspect.getclosurevars(example_six)
Out: 26
ClosureVars(nonlocals={}, globals={}, builtins={}, unbound={'loads', 'json'})

json is an unbound variable (since it isn’t bound in the enclosing environment of the closure). If it were bound in the global namespace, however, the json we’re referring to in example_six would be captured as a global variable:

In: 27
import json

module_frontier(example_six)
Out: 27
['json']
In: 28
inspect.getclosurevars(example_six)
Out: 28
ClosureVars(nonlocals={}, globals={'json': <module 'json' from '/Users/willb/anaconda/lib/python3.6/json/__init__.py'>}, builtins={}, unbound={'loads'})

Obviously, we’d like to return the same module-dependency results for functions that import modules locally independently of whether those modules have been imported into the global namespace. We can look at the bytecode for this function to see what instructions might be relevant:

In: 29
from dis import Bytecode
for inst in Bytecode(example_six):
    print("%d: %s(%r)" % (inst.offset, inst.opname, inst.argval))
Out: 29
0: LOAD_CONST(0)
2: LOAD_CONST(None)
4: IMPORT_NAME('json')
6: STORE_FAST('json')
8: LOAD_FAST('json')
10: LOAD_ATTR('loads')
12: LOAD_CONST("{'this-sure-is': 'confusing'}")
14: CALL_FUNCTION(1)
16: RETURN_VALUE(None)

Solving problems

To address the cases that the closure-inspecting approach misses, we can inspect the bytecode of each function. (We could also inspect abstract syntax trees, using the ast module, but in general it’s easier to do this sort of work with a lower-level, more regular representation. ASTs have more cases to treat than bytecode.)

Python bytecode is stack-based, meaning that each instruction may take one or more arguments from the stack (in addition to explicit arguments encoded in the instruction). For a more involved analysis, we’d probably want to convert Python bytecode to a representation with explicit operands (like three-address code; see section 2 of Vallée-Rai et al. for a reasonable approach), but let’s see how far we can get by just operating on bytecode.
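
As a tiny illustration of what “stack-based” means here, disassembling a two-argument add shows the operands being pushed by LOAD_FAST instructions and consumed implicitly by the add instruction rather than being named in it:

    import dis

    def add(a, b):
        return a + b

    # LOAD_FAST pushes each argument onto the evaluation stack; the add
    # instruction (BINARY_ADD on the Python used in this post, BINARY_OP on
    # newer versions) pops both operands and pushes the result, which
    # RETURN_VALUE then pops.
    dis.dis(add)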

Identifying interesting bytecodes

We know from the problem cases we examined earlier that we need to worry about a few different kinds of bytecode instructions to find some of the modules that our inspect-based approach missed:

  • LOAD_CONST instructions that load code objects (e.g., list comprehension bodies, lambda expressions, or nested functions);
  • other LOAD_ instructions that might do the same; and
  • IMPORT_NAME instructions that import a module into a function’s namespace.

Let’s extend our technique to also inspect relevant bytecodes. We’ll see how far we can get by just looking at bytecodes in isolation (without modeling the stack or value flow). First, we’ll identify the “interesting” bytecodes and return the modules, functions, or code blocks that they implicate:

In: 30
def interesting(inst):
    from types import CodeType, FunctionType, ModuleType
    from importlib import import_module
    from functools import reduce
    
    if inst.opname == "IMPORT_NAME":
        path = inst.argval.split(".")
        path[0] = [import_module(path[0])]
        result = reduce(lambda x, a: x + [getattr(x[-1], a)], path)
        return ("modules", result)
    if inst.opname == "LOAD_GLOBAL":
        if inst.argval in globals() and type(globals()[inst.argval]) in [CodeType, FunctionType]:
            return ("code", globals()[inst.argval])
        if inst.argval in globals() and type(globals()[inst.argval]) == ModuleType:
            return ("modules", [globals()[inst.argval]])
        else:
            return None
    if "LOAD_" in inst.opname and type(inst.argval) in [CodeType, FunctionType]:
        return ("code", inst.argval)
    return None

Now we can make a revised version of our module_frontier function. This starts with the same basic approach as the initial function but it also:

  • processes the bytecode for each code block transitively referred to in each function, and
  • processes any modules explicitly imported in code.
In: 31
def mf_revised(f):
  worklist = [f]
  seen = set()
  mods = set()
  for fn in worklist:
    codeworklist = [fn]
    cvs = inspect.getclosurevars(fn)
    gvars = cvs.globals
    for k, v in gvars.items():
        if inspect.ismodule(v):
            mods.add(v.__name__)
        elif inspect.isfunction(v) and id(v) not in seen:
            seen.add(id(v))
            mods.add(v.__module__)
            worklist.append(v)
        elif hasattr(v, "__module__"):
            mods.add(v.__module__)
    for block in codeworklist:
        for (k, v) in [interesting(inst) for inst in Bytecode(block) if interesting(inst)]:
            if k == "modules":
                newmods = [mod.__name__ for mod in v if hasattr(mod, "__name__")]
                mods.update(set(newmods))
            elif k == "code" and id(v) not in seen:
                seen.add(id(v))
                if hasattr(v, "__module__"):
                    mods.add(v.__module__)
            if(inspect.isfunction(v)):
                worklist.append(v)
            elif(inspect.iscode(v)):
                codeworklist.append(v)
   
  result = list(mods)
  result.sort()
  return result

As you can see, this new approach produces sensible results for all of our examples, including the ones that had confounded the closure-variable approach.

In: 32
mf_revised(km_predict_one)
Out: 32
['numpy', 'sklearn.cluster.k_means_']
In: 33
mf_revised(km_predict_two)
Out: 33
['builtins',
 'numpy',
 'numpy.core._methods',
 'numpy.core.fromnumeric',
 'numpy.core.multiarray',
 'numpy.core.numeric',
 'numpy.core.numerictypes',
 'numpy.core.umath',
 'numpy.lib.type_check',
 'numpy.linalg._umath_linalg',
 'numpy.linalg.linalg']
In: 34
mf_revised(km_predict_three)
Out: 34
['builtins',
 'numpy',
 'numpy.core._methods',
 'numpy.core.fromnumeric',
 'numpy.core.multiarray',
 'numpy.core.numeric',
 'numpy.core.numerictypes',
 'numpy.core.umath',
 'numpy.lib.type_check',
 'numpy.linalg._umath_linalg',
 'numpy.linalg.linalg',
 'sys']
In: 35
mf_revised(km_predict_four)
Out: 35
['builtins',
 'numpy',
 'numpy.core._methods',
 'numpy.core.fromnumeric',
 'numpy.core.multiarray',
 'numpy.core.numeric',
 'numpy.core.numerictypes',
 'numpy.core.umath',
 'numpy.lib.type_check',
 'numpy.linalg._umath_linalg',
 'numpy.linalg.linalg']
In: 36
mf_revised(km_predict_five)
Out: 36
['builtins',
 'numpy',
 'numpy.core._methods',
 'numpy.core.fromnumeric',
 'numpy.core.multiarray',
 'numpy.core.numeric',
 'numpy.core.numerictypes',
 'numpy.core.umath',
 'numpy.lib.type_check',
 'numpy.linalg',
 'numpy.linalg._umath_linalg',
 'numpy.linalg.linalg',
 'sys']

This is ongoing work and I hope to cover refinements to and extensions of this technique in future posts; as I mentioned at the beginning of the post, the ultimate goal is a tool to publish functions in notebook cells as self-contained services. It has been fun to learn about the power (and limitations) of the inspect module – as well as a little more about how Python compiles code blocks and nested functions.

Thanks to my friend and colleague Erik Erlandson for suggesting improvements to the presentation of this post.


  1. Sound program analyses present conservative overapproximations of program behavior. Consider a may-alias analysis, which determines if two reference variables may refer to the same location in memory. Precise may-alias analysis is undecidable, but certain kinds of imprecision are acceptable. Often we’re interested in sound analyses to support verification or semantics-preserving program transformations, so false positives are acceptable but false negatives are not. Put another way, the worst that can come of spuriously identifying a pair of variables as potentially-aliasing is that we’d miss an opportunity to optimize our program; the worst that can come of not identifying a pair of potentially-aliasing variables as such is a program transformation that introduces a behavior change. By contrast, unsound analyses are imprecise but not conservative: both false positives and false negatives are possible. These analyses can still be useful for program understanding (e.g., in linters or static bug detectors) even if they are not sufficient to support safe program transformations. 

The Custom House

Dublin is a charming city and a burgeoning technology hub, but it also has special significance for anyone whose work involves making sense of data, since William Sealy Gosset was working as a brewer at Guinness when he developed the t-statistic. Last week, Dublin had extra special significance for anyone whose work involves using Apache Spark for data processing. Our group at Red Hat gave three talks at Spark Summit EU this year, and videos of these are now online. You should check them out!

A lot of the work we discussed is available from radanalytics.io or from the Isarn project; if you’d like to see other talks about data science, distributed computing, and best practices for contemporary intelligent applications, you should see our team’s list of presentations.

I’m giving a talk this afternoon at Spark Summit EU on extending Spark with new machine learning algorithms. Here are some additional resources and links:

  • Our team’s Silex library is where I’ve published my ongoing work to develop a self-organizing map implementation for Spark and to extend it with support for data frames and ML pipelines
  • I gave a talk about using self-organizing maps in Spark last year at Spark Summit
  • If you like the idea of developing new ML techniques on Spark, you’ll also want to attend a session tomorrow in which my friend and teammate Erik Erlandson will be talking about using his parallel t-digest implementation to support feature importance and other applications.
  • Finally, if you’re doing anything where parallelism and scale matter, especially in a cloud-native environment, you should also check out Mike McCune’s talk on Spark monitoring and metrics.

I’m speaking this morning at the OpenShift Commons Gathering about my team’s experience running Apache Spark on Kubernetes and OpenShift. Here are some links to learn more:

I’ll be speaking about Spark on Kubernetes at Spark Summit EU this week. The main thesis of my talk is that the old way of running Spark in a dedicated cluster that is shared between applications makes sense when analytics is a separate workload. However, analytics is no longer a separate workload – instead, analytics is now an essential part of long-running data-driven applications. This realization motivated my team to switch from a shared Spark cluster to multiple logical clusters that are co-scheduled with the applications that depend on them.

I’m glad for the opportunity to get together with the Spark community and present on some of the cool work my team has done lately. Here are some links you can visit to learn more about our work and other topics related to running Spark on Kubernetes and OpenShift:

I’m delighted to have a chance to present at HTCondor Week this year and am looking forward to seeing some old friends and collaborators. The thesis of my talk is that HTCondor users who aren’t already leading data science initiatives are well-equipped to start doing so. The talk is brief and high-level, so here are a few quick links to learn more if you’re interested:

I also gave a quick overview of some of my team’s recent data science projects; visit these links to learn more:

As I mentioned earlier, I’ll be talking about feature engineering and outlier detection for infrastructure log data at Apache: Big Data next week. Consider this post a virtual handout for that talk. (I’ll also be presenting another talk on scalable log data analysis later this summer. That talk is also inspired by my recent work with logs but will focus on different parts of the problem, so stay tuned if you’re interested in the domain!)

Some general links:

  • You can download a PDF of my slide deck. I recognize that people often want to download slides, although I’d prefer you look at the rest of this post instead since my slides are not intended to stand alone without my presentation.
  • Check out my team’s Silex library, which is intended to extend the standard Spark library with high-quality, reusable components for real-world data science. The most recent release includes the self-organizing map implementation I mentioned in my talk.
  • Watch this short video presentation showing some of the feature engineering and dimensionality-reduction techniques I discussed in the talk.

The following blog posts provide a deeper dive into some of the topics I covered in the talk:

  • When I started using Spark and ElasticSearch, the upstream documentation was pretty sparse (it was especially confusing because it required some unidiomatic configuration steps). So I wrote up my experiences getting things working. This is an older post but may still be helpful.
  • If you’re interested in applying natural-language techniques to log data, you should consider your preprocessing pipeline. Here are the choices I made when I was evaluating word2vec on log messages.
  • Here’s a brief (and not-overly technical) overview of self-organizing maps, including static visual explanations and an animated demo.