class EncodingEstimator::Detector

Performs encoding detection on strings.

Attributes

conversions[R]
languages[R]
num_processes[R]
penalty[R]

Public Class Methods

new( conversions, languages, penalty = 0.01, num_processes = nil )

Create a new instance from a configuration consisting of a list of conversions, a list of languages, a base penalty and the number of processes.

@param [Array<EncodingEstimator::Conversion>] conversions Conversions to perform/test on the inputs.

@param [Array<EncodingEstimator::LanguageModel>] languages Languages to consider when evaluating the input. Array of two-letter codes.

@param [Float] penalty Base penalty subtracted from each char's score.

@param [Integer] num_processes Number of processes the detection will run on (true parallelism through the parallel gem). When nil, the detection runs on a single thread.

# File lib/encoding_estimator/detector.rb, line 80
def initialize( conversions, languages, penalty = 0.01, num_processes = nil )
  @conversions   = conversions
  @languages     = languages
  @num_processes = num_processes
  @penalty       = penalty
end

Public Instance Methods

detect( str )

Detect the encoding using the current configuration given an input string

@param [String] str Input string the detection will be performed on

@return [EncodingEstimator::Detection] Result of the detection process

# File lib/encoding_estimator/detector.rb, line 92
def detect( str )
  sums    = {}
  results = (num_processes.nil? or !EncodingEstimator::ParallelSupport.supported?) ?
                detect_st( str, combinations ) : detect_mt( str, combinations )

  results.each do |result|
    sums[result.key] = sums.fetch(result.key, 0.0) + result.score
  end

  range = EncodingEstimator::RangeScale.new( sums.values.min, sums.values.max )

  scaled_scores = {}
  sums.each do |k,s|
    scaled_scores[ k ] = range.scale s
  end

  EncodingEstimator::Detection.new( scaled_scores, @conversions )
end
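
The aggregation step in detect can be tried in isolation. The snippet below is a minimal sketch, not the library's code: Result is a hypothetical stand-in for EncodingEstimator::SingleDetectionResult, the conversion keys are made up, and scale_scores assumes that EncodingEstimator::RangeScale maps the smallest sum to 0.0 and the largest to 1.0 linearly, which the source shown here does not confirm.

```ruby
# Hypothetical stand-in for EncodingEstimator::SingleDetectionResult.
Result = Struct.new(:key, :score)

# Sum the scores of all results that share a conversion key,
# mirroring the sums loop in detect.
def sum_by_key(results)
  results.each_with_object(Hash.new(0.0)) do |result, sums|
    sums[result.key] += result.score
  end
end

# Assumed behaviour of RangeScale: linear min-max scaling to 0.0..1.0.
# When all sums are equal, every key gets 0.0 to avoid dividing by zero.
def scale_scores(sums)
  min, max = sums.values.minmax
  span = max - min
  sums.transform_values { |s| span.zero? ? 0.0 : (s - min) / span }
end

# Two languages evaluated for each of three (made-up) conversion keys.
results = [
  Result.new(:keep, 1.0),       Result.new(:keep, 3.0),
  Result.new(:dec_utf8, 1.0),   Result.new(:dec_utf8, 0.0),
  Result.new(:enc_latin1, 1.5), Result.new(:enc_latin1, 1.0)
]

sums   = sum_by_key(results)
scaled = scale_scores(sums)
```

Summing per key before scaling means a conversion is rewarded when it scores well across all configured languages, and the scaling makes the scores of one detection comparable to each other.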

Private Instance Methods

combinations()

Calculate the list of all combinations of languages and conversions

@return [Array<EncodingEstimator::CDCombination>] Conversion-Distribution-Combinations of the current config

# File lib/encoding_estimator/detector.rb, line 142
def combinations
  @languages.map {
      |l| @conversions.map { |c| EncodingEstimator::CDCombination.new( c, l.distribution ) }
  }.flatten
end
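
To make the shape of the result concrete, here is a small sketch of the same cross product using stand-in types. Language and CDCombination are hypothetical Structs defined only for this example, and the symbols used as conversions and distributions are placeholders.

```ruby
# Hypothetical stand-ins; the real classes live in the EncodingEstimator namespace.
Language      = Struct.new(:name, :distribution)
CDCombination = Struct.new(:conversion, :distribution)

# Pair every language's distribution with every conversion.
# flat_map builds the same flat list as map { ... }.flatten does here.
def combinations(languages, conversions)
  languages.flat_map do |language|
    conversions.map { |conversion| CDCombination.new(conversion, language.distribution) }
  end
end

languages   = [Language.new(:en, :en_dist), Language.new(:de, :de_dist)]
conversions = [:keep, :dec_utf8, :enc_latin1]

combos = combinations(languages, conversions)
```

With two languages and three conversions the list has six entries, grouped by language in the order the languages were given.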

detect_mt( str, matrix )

Compute the scores of all combinations of languages and conversions on multiple processes. See num_processes.

@param [String] str Input string to compute the encoding on

@param [Array<EncodingEstimator::CDCombination>] matrix List of Conversion-Distribution-Combinations

@return [Array<EncodingEstimator::SingleDetectionResult>] One result per combination: key is the key of the conversion, score is the result of the evaluation for the input string

# File lib/encoding_estimator/detector.rb, line 133
def detect_mt( str, matrix )
  Parallel.map( matrix, in_processes: num_processes ) do |combination|
    detect_single str, combination
  end
end
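
The choice between the single-threaded and the multi-process path can be sketched as one helper. This is an illustration under assumptions: run_all is a hypothetical name, and the check defined?(Parallel) is a simplified stand-in for EncodingEstimator::ParallelSupport.supported?, whose actual logic is not shown here. The only parallel gem call used is Parallel.map with the in_processes option, as in detect_mt.

```ruby
# The parallel gem is optional in this sketch; fall back quietly if it is missing.
begin
  require 'parallel'
rescue LoadError
  # Parallel stays undefined; run_all then uses a plain map.
end

# Hypothetical helper combining the choice made in detect with the two
# strategies detect_st and detect_mt.
def run_all(items, num_processes = nil, &work)
  if num_processes.nil? || !defined?(Parallel)
    items.map(&work)                                        # single thread
  else
    Parallel.map(items, in_processes: num_processes, &work) # worker processes
  end
end

lengths_st = run_all(%w[a bb ccc]) { |s| s.length }
lengths_mt = run_all(%w[a bb ccc], 2) { |s| s.length }
```

Both paths return the results in input order, so the caller does not need to know which one ran.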

detect_single( str, combination )

Perform the evaluation of a Conversion-Distribution-Combination on an input string

@param [String] str Input to evaluate

@param [EncodingEstimator::CDCombination] combination Distribution/Conversion to evaluate on the input

@return [EncodingEstimator::SingleDetectionResult] Result of the evaluation of the given combination on the input

# File lib/encoding_estimator/detector.rb, line 154
def detect_single( str, combination )
  EncodingEstimator::SingleDetectionResult.new(
      combination.conversion.key,
      combination.distribution.evaluate( combination.conversion.perform(str), @penalty )
  )
end
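
The convert-then-evaluate flow of detect_single can be illustrated with stand-in objects. Everything below is hypothetical: Conversion and Distribution are minimal Structs, the key :latin1_to_utf8 is made up, and the scoring rule (per-character weight minus the penalty, summed over the string) is only an assumption about how a distribution might use the penalty.

```ruby
# Hypothetical stand-in for EncodingEstimator::SingleDetectionResult.
Result = Struct.new(:key, :score)

# Hypothetical conversion: a key plus a callable that transforms the input.
Conversion = Struct.new(:key, :transform) do
  def perform(str)
    transform.call(str)
  end
end

# Hypothetical distribution: per-character weights. The scoring rule
# (weight minus penalty, summed over all characters) is an assumption.
Distribution = Struct.new(:weights) do
  def evaluate(str, penalty)
    str.each_char.sum { |char| weights.fetch(char, 0.0) - penalty }
  end
end

# Apply the conversion first, then score the converted string.
def detect_single(str, conversion, distribution, penalty)
  Result.new(conversion.key, distribution.evaluate(conversion.perform(str), penalty))
end

# Undo a typical mix-up: UTF-8 bytes that were read as ISO-8859-1.
fix_mojibake = Conversion.new(
  :latin1_to_utf8,
  ->(s) { s.encode('ISO-8859-1').force_encoding('UTF-8') }
)
dist = Distribution.new({ 'c' => 0.5, 'a' => 0.5, 'f' => 0.5, "\u00E9" => 1.0 })

garbled = "caf\u00C3\u00A9" # the word "caf\u00E9" after the mix-up
result  = detect_single(garbled, fix_mojibake, dist, 0.25)
```

Here the conversion restores the accented character, so every character of the converted string has a weight and the score comes out higher than it would for the garbled input.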

detect_st( str, matrix )

Compute the scores of all combinations of languages and conversions on a single thread.

@param [String] str Input string to compute the encoding on

@param [Array<EncodingEstimator::CDCombination>] matrix List of Conversion-Distribution-Combinations

@return [Array<EncodingEstimator::SingleDetectionResult>] One result per combination: key is the key of the conversion, score is the result of the evaluation for the input string

# File lib/encoding_estimator/detector.rb, line 120
def detect_st( str, matrix )
  matrix.map do |combination|
    detect_single str, combination
  end
end