Floyd's Sampling Algorithm

I love sampling algorithms. Here's the sampling algorithm that I find most magical. We want to generate a uniformly random subset of {1, 2, ..., n} of size k.
```python
import random

def floyd(n, k):
    s = set()
    for i in range(n - k + 1, n + 1):
        t = random.randint(1, i)
        if t in s:
            # t was already chosen, so add i instead.
            s.add(i)
        else:
            s.add(t)
    return s
```
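To get a feel for why this is surprising, here's a quick empirical sanity check (my own, not part of the original post) that every 2-subset of {1, ..., 5} shows up about equally often. It repeats the definition above so it runs standalone:

```python
import random
from collections import Counter

def floyd(n, k):
    # Floyd's sampling algorithm, as in the post.
    s = set()
    for i in range(n - k + 1, n + 1):
        t = random.randint(1, i)
        if t in s:
            s.add(i)
        else:
            s.add(t)
    return s

random.seed(0)
counts = Counter(frozenset(floyd(5, 2)) for _ in range(20000))

# All C(5, 2) = 10 subsets appear, with near-equal frequency.
assert len(counts) == 10
assert max(counts.values()) / min(counts.values()) < 1.3
```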
I learned about this algorithm the canonical way all good algorithm lore is imparted: one of my old coworkers wrote this comment in some code I was looking at once:
```go
// Use Robert Floyd's algorithm to generate numSplits distinct integers
// between 0 and len(splitKeys), just because it's so cool!
chosen := make(map[int]struct{})
for j := len(splitKeys) - numSplits; j < len(splitKeys); j++ {
	t := fit.rng.Intn(j + 1)
	if _, alreadyChosen := chosen[t]; !alreadyChosen {
		// Insert T.
		chosen[t] = struct{}{}
	} else {
		// Insert J.
		chosen[j] = struct{}{}
	}
}
```
Unlike, say, reservoir sampling, this algorithm resists an easy intuition for me. The branch makes it weird: it gets in the way of the usual combinatorial or visual argument, and I find even a casual intuition hard to grasp. Here are two possible ways to think about what makes it work.
First Perspective
Here's one way to think about it: each iteration of the algorithm takes a uniformly distributed k-subset of [1..n] together with a uniform sample from [1..n+1], and produces a (k+1)-subset of [1..n+1]. If we can show that this map from (subset, sample) pairs to (k+1)-subsets is (k+1)-to-1, then we've shown that the resulting set is also uniformly distributed.
So, we need to show that each possible (k+1)-subset S of [1..n+1] has exactly k+1 inputs that map to it:

- sets containing n+1 come from the k-subset without n+1, paired with any of the k+1 elements of S as the sample (sampling an element already in the subset, or n+1 itself, adds n+1), and
- sets not containing n+1 come from starting with one of its k+1 k-subsets and sampling the missing element.

And that's it; the result is uniformly distributed.
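The (k+1)-to-1 claim can be checked by brute force for small n and k (a sanity-check sketch of my own): enumerate every pair of a k-subset of [1..n] and a sample t in [1..n+1], apply one step of the algorithm, and count the preimages of each resulting (k+1)-subset.

```python
from itertools import combinations
from collections import Counter

def step(subset, t, i):
    # One iteration of Floyd's algorithm: add i if t was
    # already chosen, otherwise add t.
    return subset | {i} if t in subset else subset | {t}

n, k = 5, 2
preimages = Counter()
for subset in combinations(range(1, n + 1), k):
    for t in range(1, n + 2):
        preimages[frozenset(step(set(subset), t, n + 1))] += 1

# Every (k+1)-subset of [1..n+1] has exactly k+1 preimages,
# and every such subset is reachable.
assert all(v == k + 1 for v in preimages.values())
assert len(preimages) == len(list(combinations(range(1, n + 2), k + 1)))
```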
This is not so bad; it basically makes sense to me. But I also learned of another way of thinking about this algorithm while trying to write this post.
Second Perspective
This one comes from Paul, and it notes a connection between this algorithm and the Fisher-Yates shuffle for permuting an array uniformly at random. Specifically, the "upward" variation that looks like this:
```python
import random

def shuffle(a):
    for i in range(0, len(a)):
        j = random.randint(0, i)
        a[i], a[j] = a[j], a[i]
```
This builds up a permutation incrementally: after each step we have a uniformly random permutation of a[0..i], which the next step extends to a permutation of a[0..i+1].
If you pluck off the last k elements of an array you just shuffled, you get a uniformly random sample of k elements. The observation here is that Floyd's algorithm is just doing the last k swaps, directly.
If you call the last k positions of the array the "tail," then at each tail position i, either we:

- swap i with a j that's already in the tail, in which case i stays in the tail forever and is part of the sample, or
- swap it with a j not in the tail, in which case j lives in the tail forever and is part of the sample.

These two cases correspond directly to the two branches of Floyd's algorithm, and following either procedure we wind up with the same distribution over samples (up to some reordering).
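One way to convince yourself of the correspondence (a brute-force sketch of my own, not from the post): enumerate every possible sequence of random draws, feed the same draws to Floyd's algorithm and to the last k swaps of an upward Fisher-Yates on the identity array, and check that the two distributions over k-element sets are identical, and uniform.

```python
from itertools import product
from collections import Counter

def floyd_set(n, k, draws):
    # Floyd's algorithm (0-based), with the random draws supplied.
    s = set()
    for i, j in zip(range(n - k, n), draws):
        s.add(i if j in s else j)
    return frozenset(s)

def shuffle_tail(n, k, draws):
    # Last k swaps of an "upward" Fisher-Yates on the identity
    # array; the tail (last k positions) is the sample.
    a = list(range(n))
    for i, j in zip(range(n - k, n), draws):
        a[i], a[j] = a[j], a[i]
    return frozenset(a[n - k:])

n, k = 5, 2
# At step i the draw j is uniform on [0..i].
all_draws = product(*(range(i + 1) for i in range(n - k, n)))
floyd_counts, shuffle_counts = Counter(), Counter()
for draws in all_draws:
    floyd_counts[floyd_set(n, k, draws)] += 1
    shuffle_counts[shuffle_tail(n, k, draws)] += 1

# Identical distributions, uniform over all C(5, 2) subsets.
assert floyd_counts == shuffle_counts
assert len(set(floyd_counts.values())) == 1
```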
---
That's a really elegant algorithm, which I hadn't heard of before. Thanks for sharing it!
I wondered whether it's commonly used, so I checked the implementation of Python's random.sample() function, but found it does something different: it either performs the first k iterations of a Fisher-Yates shuffle (when k is close to n) [1], or it draws repeatedly, rejecting duplicates (when k is small relative to n) [2].
The second algorithm looks suboptimal, in that it generates more than k random numbers, and at first glance it seems like it might be possible to fix this by changing `while j in selected: j = randbelow(n)` into `if j in selected: j = n` to turn it into Floyd's algorithm. Unfortunately this is not permissible, because random.sample() returns a list, not a set, and the documentation specifies that the result is a uniformly random permutation [3], which Floyd's algorithm as presented doesn't provide.
Separately, I found a discussion of the algorithm in ACM's Programming Pearls column [4], which gives some background on its invention. You might find it interesting because it gives a third type of proof, based on a more general algorithm that returns a randomly permuted array (rather than a set) of k distinct elements. Note that this provides exactly the behavior required by the Python documentation, which was lost in the more succinct version returning a set.
Just for fun I implemented the general algorithm in Python [5]. It's easy to see how the two versions are related, though the additional bookkeeping makes the second algorithm not quite as elegant as the first. In theory, floyd2() could replace the Python standard library implementation, since it always produces the result in O(k) time and space with exactly k calls to randbelow(), which is slightly better than the existing implementation. In practice, however, I suspect it's slightly slower. Still, it was fun to explore the possibilities.
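For a flavor of the permuted variant, here's my own sketch of the sequence-building algorithm described in [4] (not the code in [5]): a fresh draw t is prefixed to the sequence, while a repeated draw means we insert the loop index j immediately after t instead.

```python
import random

def floyd_permuted(n, k):
    # Uniformly random *ordered* sample of k distinct values from
    # 1..n; a sketch of the sequence-building variant of Floyd's
    # algorithm.
    seq = []
    for j in range(n - k + 1, n + 1):
        t = random.randint(1, j)
        if t not in seq:
            seq.insert(0, t)                 # prefix t
        else:
            seq.insert(seq.index(t) + 1, j)  # insert j after t
    return seq
```

The list `index`/`insert` calls make this sketch O(k^2); it's only meant to show the shape of the algorithm, not to be an efficient implementation like floyd2().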
[1] https://github.com/python/cpython/blob/95cbd4a232d9a68680059a0cb42e3f0d2994e206/Lib/random.py#L442-L448
[2] https://github.com/python/cpython/blob/95cbd4a232d9a68680059a0cb42e3f0d2994e206/Lib/random.py#L450-L457
[3] https://docs.python.org/3/library/random.html#random.sample
[4] https://dl.acm.org/doi/pdf/10.1145/30401.315746
[5] https://gist.github.com/maksverver/e1fefb559da2133fdea3c5804683c725