
[SPARK-22119][FOLLOWUP][ML] Use spherical KMeans with cosine distance #20518

Closed

Conversation

mgaido91 (Contributor) commented Feb 6, 2018

What changes were proposed in this pull request?

In #19340, some comments suggested using spherical KMeans when the cosine distance measure is specified, as Matlab does, instead of the previous implementation, which followed the behavior of other tools/libraries (RapidMiner, NLTK, ELKI), i.e. computing the centroids as the plain mean of all the points in the cluster.

This PR introduces the approach used in spherical KMeans. It has the nice property of minimizing the within-cluster cosine distance.
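
For illustration, here is a minimal standalone sketch of the spherical centroid computation (a hypothetical helper, not code from this patch): each point is normalized to unit length before being added to the cluster sum, and the sum itself is re-normalized, so the centroid is the unit-length mean direction of the cluster.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}

// Minimal sketch, not the merged implementation: spherical KMeans sums
// unit-normalized points and re-normalizes the result, which is what
// minimizes the within-cluster cosine distance.
def sphericalCentroid(points: Seq[Vector]): Vector = {
  val dim = points.head.size
  val sum = Array.fill(dim)(0.0)
  for (p <- points) {
    val norm = Vectors.norm(p, 2)
    require(norm > 0, "Cosine distance is not defined for zero vectors")
    for (i <- 0 until dim) sum(i) += p(i) / norm
  }
  val centroidNorm = math.sqrt(sum.map(x => x * x).sum)
  Vectors.dense(sum.map(_ / centroidNorm))
}
```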

How was this patch tested?

existing/improved UTs



SparkQA commented Feb 6, 2018

Test build #87111 has finished for PR 20518 at commit ba73fc8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```scala
override def centroid(sum: Vector, count: Long): VectorWithNorm = {
  // Average the points, then rescale the result to unit length.
  scal(1.0 / count, sum)
  val norm = Vectors.norm(sum, 2)
  scal(1.0 / norm, sum)
  new VectorWithNorm(sum, 1)
}
```
Member:
Rather than scale sum twice, can you just compute its norm and then scale by 1 / (norm * count * count)?

Contributor Author:
Do you think the performance improvement would be significant, since we are doing this only on k vectors per run? I think the code is clearer this way, do you agree?

Member:

I don't feel strongly about it, yeah. It won't matter much either way.
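
For reference, the fused variant would look roughly like this (a sketch, not the merged code): since ||sum / count|| = ||sum|| / count, the division by count cancels against the normalization, so a single scal by 1 / ||sum|| produces the same unit-length centroid.

```scala
// Sketch of the single-scal alternative discussed above, assuming the same
// scal/Vectors.norm helpers as the surrounding class. Because
// ||sum / count|| = ||sum|| / count, dividing by count and then normalizing
// is equivalent to normalizing the raw sum directly.
override def centroid(sum: Vector, count: Long): VectorWithNorm = {
  scal(1.0 / Vectors.norm(sum, 2), sum)
  new VectorWithNorm(sum, 1)
}
```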

```scala
/** @param sum the `sum` for a cluster to be updated */
override def updateClusterSum(point: VectorWithNorm, sum: Vector): Unit = {
  axpy(1.0 / point.norm, point.vector, sum)
}
```
Contributor:
Do we need to ignore zero points here?

Contributor Author:
The cosine similarity/distance is not defined for zero vectors: if there were any zero vectors, we would have hit failures earlier, while computing any cosine distance involving them.

Contributor:
In Scala, 1.0 / 0.0 generates Infinity; what about directly throwing an exception instead?

Contributor Author:
Thanks, I agree. I added an assertion before computing the cosine distance, and a test case for this situation.
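
The added guard would look roughly like the following (a sketch of the idea; the exact merged assertion may differ):

```scala
// Hypothetical sketch: fail fast on zero-length vectors, for which cosine
// distance is undefined, instead of letting 1.0 / 0.0 silently produce
// Infinity downstream.
override def distance(v1: VectorWithNorm, v2: VectorWithNorm): Double = {
  assert(v1.norm > 0 && v2.norm > 0,
    "Cosine distance is not defined for zero-length vectors.")
  1 - dot(v1.vector, v2.vector) / v1.norm / v2.norm
}
```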


SparkQA commented Feb 11, 2018

Test build #87309 has finished for PR 20518 at commit 4b41213.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


srowen commented Feb 12, 2018

Merged to master

asfgit closed this in c0c902a on Feb 12, 2018
robert3005 pushed a commit to palantir/spark that referenced this pull request Feb 12, 2018

Author: Marco Gaido <marcogaido91@gmail.com>

Closes apache#20518 from mgaido91/SPARK-22119_followup.