class Frequent::Algorithm

`Frequent::Algorithm` is the Ruby implementation of the Demaine et al. FREQUENT algorithm for calculating top-k items in a stream.

The aims of this algorithm are:

Attributes

b[R]

@return [Integer] the number of items in a basic window

delta[R]

@return [Integer] minimum threshold for membership in top-k items

k[R]

@return [Integer] the number of top item categories to track

n[R]

@return [Integer] the number of items in the main window

queue[R]

@return [Array<Hash<Object,Integer>>] global queue for basic window summaries

statistics[R]

@return [Hash<Object,Integer>] global mapping of items and counts

topk[R]

@return [Hash<Object,Integer>] latest top k elements and their counts

window[R]

@return [Array] the window of elements of size b

Public Class Methods

new(n, b, k=1) click to toggle source

Initializes this top-k frequency-calculating instance.

@param [Integer] n number of items in the main window @param [Integer] b number of items in a basic window @param [Integer] k number of top item categories to track @raise [ArgumentError] if n is not greater than 0 @raise [ArgumentError] if b is not greater than 0 @raise [ArgumentError] if k is not greater than 0 @raise [ArgumentError] if n/b is not greater than 1

# File lib/frequent/algorithm.rb, line 47
def initialize(n, b, k=1)
  @lock = Mutex.new

  if n <= 0
    raise ArgumentError.new('n must be greater than 0')
  end
  if b <= 0
    raise ArgumentError.new('b must be greater than 0')
  end
  if k <= 0
    raise ArgumentError.new('k must be greater than 0')
  end
  if n/b < 1
    raise ArgumentError.new('n/b must be greater than 1')
  end
  @n = n
  @b = b
  @k = k

  @queue = [] 
  @statistics = {}
  @delta = 0
  @topk = {}
  @window = []
end

Public Instance Methods

process(element) click to toggle source

Processes a single basic window of b items, by first adding a summary of this basic window in the internal global queue; and then updating the global statistics accordingly.

@param [Object] an object from a data stream

# File lib/frequent/algorithm.rb, line 78
def process(element)
  @lock.synchronize do
    @window << element
    if @window.length == @b 

      # Step 1
      summary = {}
      @window.each do |e|
        if summary.key? e
          summary[e] += 1
        else
          summary[e] = 1
        end
      end
      @window.clear   #current window cleared

      # Step 2
      @queue << summary

      # Step 3
      # Done, implicitly

      # Step 4
      summary.each do |k,v|
        if @statistics.key? k
          @statistics[k] += v
        else
          @statistics[k] = v
        end
      end

      # Step 5
      @delta += kth_largest(summary.values, @k)

      # Step 6 - sizeOf(Q) > N/b
      if @queue.length > @n/@b
        # a
        summary_p = @queue.shift
        @delta -= kth_largest(summary_p.values, @k)

        # b
        summary_p.each { |k,v| @statistics[k] -= v }
        @statistics.delete_if { |k,v| v <= 0 }

        #c
        @topk = @statistics.select { |k,v| v > @delta }
      end
    end
  end
end
report() click to toggle source

Return the latest Tok K elements

@return [Hash<Object,Integer>] a hash which contains the current top K elements and their counts

# File lib/frequent/algorithm.rb, line 132
def report
  @topk
end
version() click to toggle source

Returns the version for this gem.

@return [String] the version for this gem.

# File lib/frequent/algorithm.rb, line 139
def version
  Frequent::VERSION
end

Private Instance Methods

kth_largest(list, k) click to toggle source

Given a list of numbers and a number k which should be between 1 and the length of the given list, return the element x in the list that is larger than exactly k-1 other elements in the list.

@param [Array] list of integers @return [Integer] the kth largest element in list

# File lib/frequent/algorithm.rb, line 151
def kth_largest(list, k)
  raise ArgumentError.new(ERR_BADLIST) if list.nil? or list.empty?
  raise ArgumentError.new(ERR_BADK) if k < 1

  ulist = list.uniq
  k = ulist.size if k > ulist.size

  def quickselect(aset, k)
    p = rand(aset.size)

    lower = aset.select { |e| e < aset[p] }
    upper = aset.select { |e| e > aset[p] }

    if k <= lower.size
      quickselect(lower, k)
    elsif k > aset.size - upper.size
      quickselect(upper, k - (aset.size - upper.size))
    else
      aset[p]
    end
  end
  quickselect(ulist, ulist.size+1-k)
end
quickselect(aset, k) click to toggle source
# File lib/frequent/algorithm.rb, line 158
def quickselect(aset, k)
  p = rand(aset.size)

  lower = aset.select { |e| e < aset[p] }
  upper = aset.select { |e| e > aset[p] }

  if k <= lower.size
    quickselect(lower, k)
  elsif k > aset.size - upper.size
    quickselect(upper, k - (aset.size - upper.size))
  else
    aset[p]
  end
end