module StreamSampler

Constants

VERSION

Public Class Methods

reservoir_sample(iterable, n, iteration_method = :each) click to toggle source

Uses Vitter's reservoir sampling algorithm to take n items without replacement, from a iterable of items of unknown, arbitrary length.

The algorithm is O(n) in the length of the iterable. The original paper is at:

www.cs.umd.edu/~samir/498/vitter.pdf

# File lib/stream_sampler.rb, line 12
def self.reservoir_sample(iterable, n, iteration_method = :each)
  out = []
  return out if n.to_i == 0
  raise ArgumentError, "n must be >= 0" unless n > 0

  iterable.send(iteration_method).with_index do |i, idx|
    if idx < n
      out << i
    else
      j = rand(idx + 1) 
      out[j] = i if j < n
    end
  end

  out
end