module Edits::JaroWinkler
Implements the Jaro-Winkler similarity algorithm.
Constants
- WINKLER_PREFIX_WEIGHT
Prefix scaling factor for the Jaro-Winkler metric. Default is 0.1. Should not exceed 0.25, or the metric will leave the range 0..1.
- WINKLER_THRESHOLD
Threshold for boosting Jaro with the Winkler prefix multiplier. Default is 0.7.
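The 0.25 limit on the prefix weight follows from the maximum prefix length of 4: at `l = 4` and `p = 0.25` the boost term equals `1 - Sj`, so the result is exactly 1. The sketch below illustrates this with an arbitrary Jaro score of 0.9; `boosted` is a hypothetical helper written for this example, not part of the Edits API.

```ruby
# Hypothetical helper for illustration; not part of the Edits API.
# Applies the Winkler prefix boost to an already computed Jaro similarity.
def boosted(sj, prefix_length, weight)
  sj + (prefix_length * weight * (1 - sj))
end

sj = 0.9 # arbitrary Jaro similarity, chosen above the default threshold

# Compare the default weight, the documented limit, and a weight past the limit.
[0.1, 0.25, 0.3].each do |weight|
  puts "weight=#{weight}: #{boosted(sj, 4, weight).round(3)}"
end
```

With a weight of 0.3 the boosted value rises above 1.0, which is why larger weights break the 0..1 range.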
Public Class Methods
Calculate Jaro-Winkler distance
@note Not a true distance metric, fails to satisfy triangle inequality.
@example
Edits::JaroWinkler.distance("information", "informant") # => 0.05858585858585863
@param (see similarity)
@return [Float] distance, from 0.0 (identical) to 1.0 (distant)
# File lib/edits/jaro_winkler.rb, line 63
def self.distance(
  seq1, seq2,
  threshold: WINKLER_THRESHOLD, weight: WINKLER_PREFIX_WEIGHT
)
  1.0 - similarity(seq1, seq2, threshold: threshold, weight: weight)
end
Calculate Jaro-Winkler similarity of given strings
Adds weight to the Jaro similarity according to the length of a common prefix of up to 4 letters, where one exists. The additional weighting is only applied when the original similarity passes a threshold.
`Sw = Sj + (l * p * (1 - Sj))`
Where `Sw` is the Jaro-Winkler similarity, `Sj` is the Jaro similarity, `l` is the prefix length, and `p` is the prefix weight.
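As a worked instance of the formula above, the sketch below applies the boost by hand for "information" and "informant". It assumes the Jaro similarity of that pair comes from 9 matching characters and 1 transposition (worked out by hand here, not taken from the library), and uses a hypothetical `winkler_boost` helper rather than the Edits API.

```ruby
# Hypothetical helper for illustration; not part of the Edits API.
# Applies the Winkler prefix boost only when sj passes the threshold.
def winkler_boost(sj, prefix_length, weight: 0.1, threshold: 0.7)
  return sj unless sj > threshold

  l = [prefix_length, 4].min # the prefix contribution is capped at 4
  sj + (l * weight * (1 - sj))
end

# Assumed Jaro similarity for "information" (11 chars) vs "informant" (9 chars):
# m = 9 matches, t = 1 transposition, Sj = (m/11 + m/9 + (m - t)/m) / 3
sj = (9.0 / 11 + 9.0 / 9 + 8.0 / 9) / 3

# The common prefix "info" gives l = 4.
sw = winkler_boost(sj, 4)
puts sw.round(4)
```

Under these assumptions the result rounds to 0.9414, in line with the `similarity` example for the same pair.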
@example
Edits::JaroWinkler.similarity("information", "informant") # => 0.9414141414141414
@param seq1 [String, Array]
@param seq2 [String, Array]
@param threshold [Float] threshold for applying Winkler prefix weighting
@param weight [Float] weighting for common prefix, should not exceed 0.25
@return [Float] similarity, from 0.0 (none) to 1.0 (identical)
# File lib/edits/jaro_winkler.rb, line 35
def self.similarity(
  seq1, seq2,
  threshold: WINKLER_THRESHOLD, weight: WINKLER_PREFIX_WEIGHT
)
  sj = Jaro.similarity(seq1, seq2)
  return sj unless sj > threshold

  # size of common prefix, max 4
  max_bound = seq1.length > seq2.length ? seq2.length : seq1.length
  max_bound = 4 if max_bound > 4

  l = 0
  l += 1 until seq1[l] != seq2[l] || l >= max_bound

  l < 1 ? sj : sj + (l * weight * (1 - sj))
end
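The common-prefix step in the listing above can also be read in isolation. The following is a minimal sketch of the same idea, written as a hypothetical standalone helper (`common_prefix_length` is not a method of the library); it accepts Strings or Arrays, as the documented parameters do.

```ruby
# Hypothetical helper for illustration; not part of the Edits API.
# Counts leading elements shared by both sequences, up to `max`.
def common_prefix_length(seq1, seq2, max = 4)
  bound = [seq1.length, seq2.length, max].min
  (0...bound).take_while { |i| seq1[i] == seq2[i] }.length
end

puts common_prefix_length("information", "informant") # shares "inform", capped at 4
puts common_prefix_length(%w[a b c], %w[a b d])       # Arrays work the same way
```

Capping the index range first means the comparison never reads past the end of the shorter sequence.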