class KMeansPP
Cluster
data with the k-means++, k-means and Lloyd algorithm.
Constants
- VERSION
Version number, happy now?
Attributes
Centroid
points
@return [Array<Centroid>]
Source data set of points.
@return [Array<Point>]
Public Class Methods
Computed points are a flat structure so this nests each point in an array.
@param centroid [Centroid] Centroid
of the cluster.
@return [Cluster]
# File lib/k_means_pp.rb, line 47 def self.cluster_for_centroid(centroid, points, &block) cluster_points = points.select { |p| p.group == centroid } if block cluster_points.map!(&:original) else cluster_points.map! { |p| [p.x, p.y] } end Cluster.new(centroid, cluster_points) end
Take an array of things and group them into K clusters.
If no block was given, an array of arrays (of two numbers) is expected. At the end an array of +Cluster+s is returned, each wrapping an array or arrays (of two numbers).
If a block was given, the points
is likely an array of other things like hashes or objects. The block is expected to return an array of two numbers. At the end an array of +Cluster+s is returned, each wrapping an array or original objects.
@param points [Array] Source data set of points. @param clusters_count [Fixnum] Number of clusters (“k”). @yieldreturn [Array<Numeric>]
@return [Array<Cluster>]
# File lib/k_means_pp.rb, line 33 def self.clusters(points, clusters_count, &block) instance = new(points, clusters_count, &block) instance.group_points instance.centroids.map do |centroid| cluster_for_centroid(centroid, instance.points, &block) end end
Find nearest centroid for a given point in given centroids.
@param point [Point] Measure distance of this point @param centroids [Array<Centroid>] to those cluster centers
@return [Centroid]
# File lib/k_means_pp.rb, line 65 def self.find_nearest_centroid(point, centroids) find_nearest_centroid_and_distance(point, centroids)[0] end
Find the nearest centroid in given centroids.
@param point [Point] Measure distance of this point @param centroids [Array<Centroid>] to those cluster centers
@return [Array]
# File lib/k_means_pp.rb, line 85 def self.find_nearest_centroid_and_distance(point, centroids) # Assume the current centroid is the closest. nearest_centroid = point.group nearest_distance = Float::INFINITY centroids.each do |centroid| distance = centroid.squared_distance_to(point) next if distance >= nearest_distance nearest_distance = distance nearest_centroid = centroid end [nearest_centroid, nearest_distance] end
Find distance to the nearest centroid for a given point in given centroids.
@param point [Point] Measure distance of this point @param centroids [Array<Centroid>] to those cluster centers
@return [Float]
# File lib/k_means_pp.rb, line 75 def self.find_nearest_centroid_distance(point, centroids) find_nearest_centroid_and_distance(point, centroids)[1] end
Take an array of things and group them into K clusters.
If no block was given, an array of arrays (of two numbers) is expected. Internally we map them with Point
objects.
If a block was given, the points
is likely an array of other things like hashes or objects. In this case we will keep the original object in a property and once we are done, we will swap those objects. The block is expected to retun an array of two numbers.
@param points [Array] Source data set of points. @param clusters_count [Fixnum] Number of clusters (“k”). @yieldreturn [Array<Numeric>]
# File lib/k_means_pp.rb, line 115 def initialize(points, clusters_count) # Do not mutate original structure points = points.dup if block_given? points.map! do |point_obj| point_ary = yield(point_obj) point = Point.new(point_ary[0], point_ary[1]) point.original = point_obj point end else points.map! do |point_ary| Point.new(point_ary[0], point_ary[1]) end end self.points = points self.centroids = Array.new(clusters_count) end
Public Instance Methods
Group points into clusters.
# File lib/k_means_pp.rb, line 137 def group_points define_initial_clusters fine_tune_clusters end
Protected Instance Methods
For each cell calculate its center. This is done by averaging X and Y coordinates.
# File lib/k_means_pp.rb, line 213 def calculate_new_centroids # Clear centroids. centroids.each(&:reset) # Sum all X and Y coords into each point's centroid. points.each do |point| centroid = point.group centroid.add(point) end # And then average it to find a center. centroids.each(&:average) end
K-means++ algorithm.
Find initial centroids and assign points to their nearest centroid, forming cells.
# File lib/k_means_pp.rb, line 148 def define_initial_clusters # Randomly choose a point as the first centroid. centroids[0] = Centroid.new(points.sample) # Initialize an array of distances of every point. distances = points.size.times.map { 0.0 } centroids.each_with_index do |_, centroid_i| # Skip the first centroid as it's already picked but keep the index. next if centroid_i == 0 # Sum points' distances to their nearest centroid distances_sum = 0.0 points.each_with_index do |point, point_i| distance = self.class.find_nearest_centroid_distance( point, centroids[0...centroid_i] ) distances[point_i] = distance distances_sum += distance end # Randomly cut it. distances_sum *= rand # Keep subtracting those distances until we hit a zero (or lower) # in which case we found a new centroid. distances.each_with_index do |distance, point_i| distances_sum -= distance next if distances_sum > 0 centroids[centroid_i] = Centroid.new(points[point_i]) break end end # Assign each point its nearest centroid. points.each do |point| point.group = self.class.find_nearest_centroid(point, centroids) end end
This is Lloyd’s algorithm en.wikipedia.org/wiki/Lloyd%27s_algorithm
At this point we have our points already assigned into cells.
-
We calculate a new center for each cell.
-
For each point find its nearest center and re-assign it if it changed.
-
Repeat until a threshold has been reached.
# File lib/k_means_pp.rb, line 198 def fine_tune_clusters # When a number of changed points reaches this number, we are done. changed_threshold = points.size >> 10 loop do calculate_new_centroids changed = reassign_points # Stop when 99.9% of points are good break if changed <= changed_threshold end end
Loop through all the points and find their nearest centroid. If it’s a different one than current, change it ande take a note.
@return [Fixnum] Number of changed points.
# File lib/k_means_pp.rb, line 231 def reassign_points changed = 0 points.each do |point| centroid = self.class.find_nearest_centroid(point, centroids) next if centroid == point.group changed += 1 point.group = centroid end changed end