Efficiently sampling from many probability distributions in SciPy
You probably already know that if you’re modeling multiple independent phenomena in a repeatable simulation, you want multiple independent pseudorandom number generators. But you may be surprised by a consequence of following this approach if you’re using the excellent probability distributions supplied by the scipy.stats
package. Read on to learn what the problem is and how to solve it!
Two ways to sample
Say you’re simulating the operations of a large retailer and have modeled the number of customer arrivals in a particular timespan with a Poisson distribution with some parameter λ. There are at least two ways to get a dozen samples from that distribution using SciPy.
We could supply the distribution parameters and a random state in each sampling call:
or we could use a distribution object, which allows us to specify the parameters (including a random seed) once:
In the first example, we have twelve samples from a Poisson distribution with a λ of mean
; we specify the shape parameter when we draw from the distribution. In the second example, we’re creating a distribution object with a fixed λ, backed by a private pseudorandom number generator, seeded with a supplied value.
Interfaces and implementations
The second approach has two advantages: Firstly, we have an object with fixed distribution parameters (depending on the distribution, there can be several, including location and scale), so we don’t have to worry about tracking these every time we want to sample from this distribution. Secondly, we have a way to make sampling from this distribution deterministic by seeding it but without passing the same RandomState
for each independent stream of values.
The disadvantage of the second approach only becomes obvious when we have many distribution objects in a single program. To get a hint for what goes wrong, let’s run a little experiment. The following two functions, which simulate running a certain number of steps of a simulation that depends on a certain number of independent actors, should have identical behavior.
If we run both of these functions, though, we’ll see how they behave differently: running experiment_one
for a thousand steps with ten thousand agents takes roughly 14 seconds on my laptop, but running experiment_two
with the same parameters takes roughly 3¼ seconds. (You can try it for yourself locally or on binder.)
Explaining the performance difference
Why is the lessconvenient API so much faster? To see why, let’s profile the first function:
This will show us the top function calls by exclusive time (i.e., not including time spent in callees). In my environment, the top function is docformat
in doccer.py
, which is called twice for each agent. In terms of exclusive time, it accounts for roughly 20% of the total execution of the experiment; in terms of inclusive time (i.e., including callees), it accounts for over half the time spent in the experiment.
What does docformat
do? It reformats function docstrings and performs textual substitution on them. This makes sense in one context – building up a library of distribution classes from abstract bases and filling in documentation for all of the subclasses. In the context of creating an individual instance of a distribution object with particular parameters, it’s an interesting design decision indeed, especially since we’d be unlikely to examine the documentation for thousands of distribution objects that are internal to a simulation. (SciPy refers to this as “freezing” a distribution. The documentation briefly mentions that it’s convenient to fix the shape and parameters of a distribution instance but doesn’t mention the performance impact, although searching StackOverflow and GitHub shows that others have been bitten by this issue as well.)
Some solutions
Fortunately, there are a couple of ways to work around this problem. We could simply write code that looks like experiment_two
, passing distribution parameters and a stateful random number generator to each function. This would be fast but clunky.
We could also sample from a uniform distribution and map those samples to samples of our target distribution by using the inverse cumulative distribution function (or percentage point function) of the target distribution, like this example that takes ten samples from a Poisson distribution:
(Note that SciPy calls the λ parameter mu
, presumably to avoid conflict with the Python keyword lambda
.)
We can make either of these approaches somewhat cleaner by wrapping them in a Python generator, like this:
We can then use the iterators returned by this generator to repeatedly sample from the distribution:
Postscript and sidebar
Of course, if we want a deterministic simulation involving a truly large number of independent phenomena, the properties of the pseudorandom number generation algorithm we use can become important. The RandomState
class from NumPy, like the pseudorandom number generator in the Python standard library, uses the Mersenne Twister, which has an extremely long period but requires roughly 2kb of internal state, which you can inspect for yourself:
The new NumPy RNG policy, which was implemented in NumPy 1.17, features a Generator
class backed by an underlying source of bitlevel randomness.^{1} The default bitlevel source is Melissa O’Neill’s PCG, which requires only two 128bit integers of state and has better statistical properties than the Mersenne Twister. Other approaches to bitlevel generation may be worth investigating in the future due to the possibility of better performance.
You can use the new PCG implementation like this:

That’s an eminently sensible approach! ↩