Efficiently sampling from many probability distributions in SciPy

Avoid a performance pitfall when using SciPy’s probability distributions.
data science
python
scipy
simulation
Published

March 15, 2020

You probably already know that if you’re modeling multiple independent phenomena in a repeatable simulation, you want multiple independent pseudorandom number generators. But you may be surprised by a consequence of following this approach if you’re using the excellent probability distributions supplied by the scipy.stats package. Read on to learn what the problem is and how to solve it!

Two ways to sample

Say you’re simulating the operations of a large retailer and have modeled the number of customer arrivals in a particular timespan with a Poisson distribution with some parameter λ. There are at least two ways to get a dozen samples from that distribution using SciPy.

We could supply the distribution parameters and a random state in each sampling call:


import scipy.stats
import numpy as np
seed = 0x00c0ffee

mean = 5
rs = np.random.RandomState(seed)
samples = scipy.stats.poisson.rvs(mean, size=12, random_state=rs)

or we could use a distribution object, which allows us to specify the parameters (including a random seed) once:


import scipy.stats

mean = 5
seed = 0x00c0ffee

distribution = scipy.stats.poisson(mean)
distribution.random_state = seed

samples = distribution.rvs(size=12)

In the first example, we have twelve samples from a Poisson distribution with a λ of mean; we specify the shape parameter when we draw from the distribution. In the second example, we’re creating a distribution object with a fixed λ, backed by a private pseudorandom number generator, seeded with a supplied value.

Interfaces and implementations

The second approach has two advantages: Firstly, we have an object with fixed distribution parameters (depending on the distribution, there can be several, including location and scale), so we don’t have to worry about tracking these every time we want to sample from this distribution. Secondly, we have a way to make sampling from this distribution deterministic by seeding it but without passing the same RandomState for each independent stream of values.

The disadvantage of the second approach only becomes obvious when we have many distribution objects in a single program. To get a hint for what goes wrong, let’s run a little experiment. The following two functions, which simulate running a certain number of steps of a simulation that depends on a certain number of independent actors, should have identical behavior.

def experiment_one(agents, steps):
    def mkpoisson(l,seed):
        p = scipy.stats.poisson(l)
        p.random_state = seed
        return p

    seeds = np.random.randint(1<<32, size=agents)
    streams = [mkpoisson(12, seed) for seed in seeds]
    for p in streams:
        p.rvs(steps)

def experiment_two(agents, steps):
    seeds = np.random.randint(1<<32, size=agents)
    states = [np.random.RandomState(seed) for seed in seeds]
    for rs in states:
        scipy.stats.poisson.rvs(12, size=steps, random_state=rs)
        

If we run both of these functions, though, we’ll see how they behave differently: running experiment_one for a thousand steps with ten thousand agents takes roughly 14 seconds on my laptop, but running experiment_two with the same parameters takes roughly 3¼ seconds. (You can try it for yourself locally or on binder.)

Explaining the performance difference

Why is the less-convenient API so much faster? To see why, let’s profile the first function:

import cProfile
import pstats
from pstats import SortKey

cProfile.run("experiment_one(10000,1000)", sort=SortKey.TIME)

This will show us the top function calls by exclusive time (i.e., not including time spent in callees). In my environment, the top function is docformat in doccer.py, which is called twice for each agent. In terms of exclusive time, it accounts for roughly 20% of the total execution of the experiment; in terms of inclusive time (i.e., including callees), it accounts for over half the time spent in the experiment.

What does docformat do? It reformats function docstrings and performs textual substitution on them. This makes sense in one context – building up a library of distribution classes from abstract bases and filling in documentation for all of the subclasses. In the context of creating an individual instance of a distribution object with particular parameters, it’s an interesting design decision indeed, especially since we’d be unlikely to examine the documentation for thousands of distribution objects that are internal to a simulation. (SciPy refers to this as “freezing” a distribution. The documentation briefly mentions that it’s convenient to fix the shape and parameters of a distribution instance but doesn’t mention the performance impact, although searching StackOverflow and GitHub shows that others have been bitten by this issue as well.)

Some solutions

Fortunately, there are a couple of ways to work around this problem. We could simply write code that looks like experiment_two, passing distribution parameters and a stateful random number generator to each function. This would be fast but clunky.

We could also sample from a uniform distribution and map those samples to samples of our target distribution by using the inverse cumulative distribution function (or percentage point function) of the target distribution, like this example that takes ten samples from a Poisson distribution:

prng = np.random.RandomState(seed=0x00c0ffee)
scipy.stats.poisson.ppf(prng.uniform(size=10), mu=12)

(Note that SciPy calls the λ parameter mu, presumably to avoid conflict with the Python keyword lambda.)

We can make either of these approaches somewhat cleaner by wrapping them in a Python generator, like this:

def mkpoisson(l, prng):
    while True:
        yield from scipy.stats.poisson.ppf(prng.uniform(size=1024), mu=l)

We can then use the iterators returned by this generator to repeatedly sample from the distribution:

p = mkpoisson(12, np.random.RandomState(seed=0x00c0ffee))

for i in range(10):
    print(p.next())

Postscript and sidebar

Of course, if we want a deterministic simulation involving a truly large number of independent phenomena, the properties of the pseudorandom number generation algorithm we use can become important. The RandomState class from NumPy, like the pseudorandom number generator in the Python standard library, uses the Mersenne Twister, which has an extremely long period but requires roughly 2kb of internal state, which you can inspect for yourself:

rs = np.random.RandomState(seed=0x00c0ffee)
rs.get_state()

The new NumPy RNG policy, which was implemented in NumPy 1.17, features a Generator class backed by an underlying source of bit-level randomness.1 The default bit-level source is Melissa O’Neill’s PCG, which requires only two 128-bit integers of state and has better statistical properties than the Mersenne Twister. Other approaches to bit-level generation may be worth investigating in the future due to the possibility of better performance.

You can use the new PCG implementation like this:

prng = np.random.default_rng(seed=0x00c0ffee)
scipy.stats.poisson.ppf(prng.uniform(size=10), mu=12)

Footnotes

  1. That’s an eminently sensible approach!↩︎