class Frequent::Algorithm
`Frequent::Algorithm` is the Ruby implementation of the Demaine et al. FREQUENT algorithm for calculating top-k items in a stream.
The aims of this algorithm are:
-
use limited memory
-
require constant processing time per item
-
require a single-pass only
Attributes
@return [Integer] the number of items in a basic window
@return [Integer] minimum threshold for membership in top-k items
@return [Integer] the number of top item categories to track
@return [Integer] the number of items in the main window
@return [Array<Hash<Object,Integer>>] global queue for basic window summaries
@return [Hash<Object,Integer>] global mapping of items and counts
@return [Hash<Object,Integer>] latest top k elements and their counts
@return [Array] the window of elements of size b
Public Class Methods
Initializes this top-k frequency-calculating instance.
@param [Integer] n number of items in the main window @param [Integer] b number of items in a basic window @param [Integer] k number of top item categories to track @raise [ArgumentError] if n is not greater than 0 @raise [ArgumentError] if b is not greater than 0 @raise [ArgumentError] if k is not greater than 0 @raise [ArgumentError] if n/b is not greater than 1
# File lib/frequent/algorithm.rb, line 47 def initialize(n, b, k=1) @lock = Mutex.new if n <= 0 raise ArgumentError.new('n must be greater than 0') end if b <= 0 raise ArgumentError.new('b must be greater than 0') end if k <= 0 raise ArgumentError.new('k must be greater than 0') end if n/b < 1 raise ArgumentError.new('n/b must be greater than 1') end @n = n @b = b @k = k @queue = [] @statistics = {} @delta = 0 @topk = {} @window = [] end
Public Instance Methods
Processes a single basic window of b items, by first adding a summary of this basic window in the internal global queue; and then updating the global statistics accordingly.
@param [Object] an object from a data stream
# File lib/frequent/algorithm.rb, line 78 def process(element) @lock.synchronize do @window << element if @window.length == @b # Step 1 summary = {} @window.each do |e| if summary.key? e summary[e] += 1 else summary[e] = 1 end end @window.clear #current window cleared # Step 2 @queue << summary # Step 3 # Done, implicitly # Step 4 summary.each do |k,v| if @statistics.key? k @statistics[k] += v else @statistics[k] = v end end # Step 5 @delta += kth_largest(summary.values, @k) # Step 6 - sizeOf(Q) > N/b if @queue.length > @n/@b # a summary_p = @queue.shift @delta -= kth_largest(summary_p.values, @k) # b summary_p.each { |k,v| @statistics[k] -= v } @statistics.delete_if { |k,v| v <= 0 } #c @topk = @statistics.select { |k,v| v > @delta } end end end end
Return the latest Tok K elements
@return [Hash<Object,Integer>] a hash which contains the current top K elements and their counts
# File lib/frequent/algorithm.rb, line 132 def report @topk end
Returns the version for this gem.
@return [String] the version for this gem.
# File lib/frequent/algorithm.rb, line 139 def version Frequent::VERSION end
Private Instance Methods
Given a list of numbers and a number k which should be between 1 and the length of the given list, return the element x in the list that is larger than exactly k-1 other elements in the list.
@param [Array] list of integers @return [Integer] the kth largest element in list
# File lib/frequent/algorithm.rb, line 151 def kth_largest(list, k) raise ArgumentError.new(ERR_BADLIST) if list.nil? or list.empty? raise ArgumentError.new(ERR_BADK) if k < 1 ulist = list.uniq k = ulist.size if k > ulist.size def quickselect(aset, k) p = rand(aset.size) lower = aset.select { |e| e < aset[p] } upper = aset.select { |e| e > aset[p] } if k <= lower.size quickselect(lower, k) elsif k > aset.size - upper.size quickselect(upper, k - (aset.size - upper.size)) else aset[p] end end quickselect(ulist, ulist.size+1-k) end
# File lib/frequent/algorithm.rb, line 158 def quickselect(aset, k) p = rand(aset.size) lower = aset.select { |e| e < aset[p] } upper = aset.select { |e| e > aset[p] } if k <= lower.size quickselect(lower, k) elsif k > aset.size - upper.size quickselect(upper, k - (aset.size - upper.size)) else aset[p] end end