experiment with ways to mess up reservoir sampling algorithms L and R
test cases for reservoir sampling


browse log



Experiment with detecting errors in reservoir sampling algorithms L and R.

The iter_reservoir.py module implements three ways to sample k randomly selected and randomly ordered elements from an iterable:

  • read into a list and use Python's random.sample()
  • use algorithm R
  • use algorithm L

In addition, there are a few buggy versions of the latter two algorithms.

The tests checks each algorithm for a small input set, run N times to check all possible subsets appear. (N >> the number of expected subsets.)

The tests also print out the counts for each test, so people can see any skew. However, they don't currently make use of that distribution. (Mostly because I don't know how.)

My experiments showed that if there's a simple bug in the reservoir sampling algorithms then they are pretty clear to pick up by eye in the current tests.

The only bugs that got through are:

  • asking the RNG for a low-quality values
  • adding a soft bias in Algorithm L

While the first is perhaps possible without ill-intent, I don't see how the latter can be done by accident or through oversight.

In any case, both should be easily catchable in code review.