You probably already know that if you’re modeling multiple independent phenomena in a repeatable simulation, you want multiple independent pseudorandom number generators. But you may be surprised by a consequence of following this approach if you’re using the excellent probability distributions supplied by the scipy.stats
package. Read on to learn what the problem is and how to solve it!
Two ways to sample
Say you’re simulating the operations of a large retailer and have modeled the number of customer arrivals in a particular timespan with a Poisson distribution with some parameter λ. There are at least two ways to get a dozen samples from that distribution using SciPy.
We could supply the distribution parameters and a random state in each sampling call:
import scipy.stats
import numpy as np
= 0x00c0ffee
seed
= 5
mean = np.random.RandomState(seed)
rs = scipy.stats.poisson.rvs(mean, size=12, random_state=rs) samples
or we could use a distribution object, which allows us to specify the parameters (including a random seed) once:
import scipy.stats
= 5
mean = 0x00c0ffee
seed
= scipy.stats.poisson(mean)
distribution = seed
distribution.random_state
= distribution.rvs(size=12) samples
In the first example, we have twelve samples from a Poisson distribution with a λ of mean
; we specify the shape parameter when we draw from the distribution. In the second example, we’re creating a distribution object with a fixed λ, backed by a private pseudorandom number generator, seeded with a supplied value.
Interfaces and implementations
The second approach has two advantages: Firstly, we have an object with fixed distribution parameters (depending on the distribution, there can be several, including location and scale), so we don’t have to worry about tracking these every time we want to sample from this distribution. Secondly, we have a way to make sampling from this distribution deterministic by seeding it but without passing the same RandomState
for each independent stream of values.
The disadvantage of the second approach only becomes obvious when we have many distribution objects in a single program. To get a hint for what goes wrong, let’s run a little experiment. The following two functions, which simulate running a certain number of steps of a simulation that depends on a certain number of independent actors, should have identical behavior.
def experiment_one(agents, steps):
def mkpoisson(l,seed):
= scipy.stats.poisson(l)
p = seed
p.random_state return p
= np.random.randint(1<<32, size=agents)
seeds = [mkpoisson(12, seed) for seed in seeds]
streams for p in streams:
p.rvs(steps)
def experiment_two(agents, steps):
= np.random.randint(1<<32, size=agents)
seeds = [np.random.RandomState(seed) for seed in seeds]
states for rs in states:
12, size=steps, random_state=rs)
scipy.stats.poisson.rvs(
If we run both of these functions, though, we’ll see how they behave differently: running experiment_one
for a thousand steps with ten thousand agents takes roughly 14 seconds on my laptop, but running experiment_two
with the same parameters takes roughly 3¼ seconds. (You can try it for yourself locally or on binder.)
Explaining the performance difference
Why is the less-convenient API so much faster? To see why, let’s profile the first function:
import cProfile
import pstats
from pstats import SortKey
"experiment_one(10000,1000)", sort=SortKey.TIME) cProfile.run(
This will show us the top function calls by exclusive time (i.e., not including time spent in callees). In my environment, the top function is docformat
in doccer.py
, which is called twice for each agent. In terms of exclusive time, it accounts for roughly 20% of the total execution of the experiment; in terms of inclusive time (i.e., including callees), it accounts for over half the time spent in the experiment.
What does docformat
do? It reformats function docstrings and performs textual substitution on them. This makes sense in one context – building up a library of distribution classes from abstract bases and filling in documentation for all of the subclasses. In the context of creating an individual instance of a distribution object with particular parameters, it’s an interesting design decision indeed, especially since we’d be unlikely to examine the documentation for thousands of distribution objects that are internal to a simulation. (SciPy refers to this as “freezing” a distribution. The documentation briefly mentions that it’s convenient to fix the shape and parameters of a distribution instance but doesn’t mention the performance impact, although searching StackOverflow and GitHub shows that others have been bitten by this issue as well.)
Some solutions
Fortunately, there are a couple of ways to work around this problem. We could simply write code that looks like experiment_two
, passing distribution parameters and a stateful random number generator to each function. This would be fast but clunky.
We could also sample from a uniform distribution and map those samples to samples of our target distribution by using the inverse cumulative distribution function (or percentage point function) of the target distribution, like this example that takes ten samples from a Poisson distribution:
= np.random.RandomState(seed=0x00c0ffee)
prng =10), mu=12) scipy.stats.poisson.ppf(prng.uniform(size
(Note that SciPy calls the λ parameter mu
, presumably to avoid conflict with the Python keyword lambda
.)
We can make either of these approaches somewhat cleaner by wrapping them in a Python generator, like this:
def mkpoisson(l, prng):
while True:
yield from scipy.stats.poisson.ppf(prng.uniform(size=1024), mu=l)
We can then use the iterators returned by this generator to repeatedly sample from the distribution:
= mkpoisson(12, np.random.RandomState(seed=0x00c0ffee))
p
for i in range(10):
print(p.next())
Footnotes
That’s an eminently sensible approach!↩︎