Welcome! Please see the About page for a little more info on how this works.

+1 vote
in data.generators by

An off-by-one error in the function reservoir-sample means that it can never return the first ct members of the collection. This makes it noticeably biased against members at the start of the collection, especially for small values of ct.

4 Answers

0 votes
by

Comment made by: alexmiller

This example (which returns 7) seems in contradiction to the statement in the description:

user=> (g/reservoir-sample 10 (range 100)) [24 59 2 93 81 90 70 7 22 49]

The patch doesn't really make sense to me either in tandem with the description.

0 votes
by

Comment made by: thomasathorne

Hi Alex,

Sorry, my description was a little terse and 'never return the first
ct members of the collection' is misleading---I meant it can never
return exactly those members (ie. your example would never return
(link: 0 1 2 3 4 5 6 7 8 9), though it should have the possibility---very
unlikely of course---of doing that).

The bias I mentioned is statistical so to observe it properly you
should run the function many times---here is the example I've been
using to test this:

(sort-by first (frequencies (into [] cat (repeatedly 10000 #(g/reservoir-sample 5 (range 10)))))) => ([0 4534] [1 4388] [2 4371] [3 4417] [4 4426] [5 5607] [6 5609] [7 5600] [8 5436] [9 5612])

As you can see, the first half of the vector is under-represented and
the second half is over-represented. A more extreme example is the
fact that the expression

(g/reservoir-sample 1 (range 2))

will always return (link: 1) and never (link: 0).

The patch has the effect of slightly lowering the probability of
swapping out one of the result vector at each iteration of the loop.
The new version is correct in that the probability of each element
being sampled is the same. (You can see this with a simple proof by
induction, I will post that here if you wish.)

0 votes
by

Comment made by: thomasathorne

@alexmiller @stu: I just remembered this issue, which I notice has been sitting here for almost a year now. Is there any particular reason that the patch hasn't been merged?

I hope my explanation above satisfies your query, though if there is anything remaining that is unclear, please feel free to ask and I will attempt to clarify!

0 votes
by
Reference: https://clojure.atlassian.net/browse/DGEN-5 (reported by alex+import)
...