
[SPARK-22119][FOLLOWUP][ML] Use spherical KMeans with cosine distance #20518

Closed

Conversation

mgaido91 (Contributor) commented Feb 6, 2018

What changes were proposed in this pull request?

In #19340, some comments suggested using spherical KMeans when the cosine distance measure is specified, as Matlab does, instead of the previous implementation, which followed the behavior of other tools/libraries (RapidMiner, NLTK, ELKI), i.e. computing the centroids as the plain mean of all the points in the cluster.

This PR introduces the approach used in spherical KMeans. It has the nice property of minimizing the within-cluster cosine distance.
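
For illustration, here is a minimal standalone sketch of the spherical centroid computation (a hypothetical helper, not code from this patch): each point is normalized to unit length before being added to the cluster sum, and the sum itself is re-normalized, so the centroid is the unit-length mean direction of the cluster.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}

// Minimal sketch, not the merged implementation: spherical KMeans sums
// unit-normalized points and re-normalizes the result, which is what
// minimizes the within-cluster cosine distance.
def sphericalCentroid(points: Seq[Vector]): Vector = {
  val dim = points.head.size
  val sum = Array.fill(dim)(0.0)
  for (p <- points) {
    val norm = Vectors.norm(p, 2)
    require(norm > 0, "Cosine distance is not defined for zero vectors")
    for (i <- 0 until dim) sum(i) += p(i) / norm
  }
  val centroidNorm = math.sqrt(sum.map(x => x * x).sum)
  Vectors.dense(sum.map(_ / centroidNorm))
}
```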

How was this patch tested?

existing/improved UTs



SparkQA commented Feb 6, 2018

Test build #87111 has finished for PR 20518 at commit ba73fc8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```scala
override def centroid(sum: Vector, count: Long): VectorWithNorm = {
  // Average the points, then rescale the result to unit length.
  scal(1.0 / count, sum)
  val norm = Vectors.norm(sum, 2)
  scal(1.0 / norm, sum)
  new VectorWithNorm(sum, 1)
}
```
Member:
Rather than scale sum twice, can you just compute its norm and then scale by 1 / (norm * count * count)?

Contributor Author:
Do you think the performance improvement would be significant, since we are doing this only on k vectors per run? I think the code is clearer this way, do you agree?

Member:

I don't feel strongly about it, yeah. It won't matter much either way.
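
For reference, the fused variant would look roughly like this (a sketch, not the merged code): since ||sum / count|| = ||sum|| / count, the division by count cancels against the normalization, so a single scal by 1 / ||sum|| produces the same unit-length centroid.

```scala
// Sketch of the single-scal alternative discussed above, assuming the same
// scal/Vectors.norm helpers as the surrounding class. Because
// ||sum / count|| = ||sum|| / count, dividing by count and then normalizing
// is equivalent to normalizing the raw sum directly.
override def centroid(sum: Vector, count: Long): VectorWithNorm = {
  scal(1.0 / Vectors.norm(sum, 2), sum)
  new VectorWithNorm(sum, 1)
}
```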

```scala
/** @param sum the `sum` for a cluster to be updated */
override def updateClusterSum(point: VectorWithNorm, sum: Vector): Unit = {
  axpy(1.0 / point.norm, point.vector, sum)
}
```
Contributor:
Do we need to ignore zero points here?

Contributor Author:
The cosine similarity/distance is not defined for zero vectors: if there were any zero vectors, we would have hit failures earlier, while computing any cosine distance involving them.

Contributor:
In Scala, 1.0 / 0.0 generates Infinity; what about directly throwing an exception instead?

Contributor Author:
Thanks, I agree. I added an assertion before computing the cosine distance, and a test case for this situation.
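
The added guard would look roughly like the following (a sketch of the idea; the exact merged assertion may differ):

```scala
// Hypothetical sketch: fail fast on zero-length vectors, for which cosine
// distance is undefined, instead of letting 1.0 / 0.0 silently produce
// Infinity downstream.
override def distance(v1: VectorWithNorm, v2: VectorWithNorm): Double = {
  assert(v1.norm > 0 && v2.norm > 0,
    "Cosine distance is not defined for zero-length vectors.")
  1 - dot(v1.vector, v2.vector) / v1.norm / v2.norm
}
```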


SparkQA commented Feb 11, 2018

Test build #87309 has finished for PR 20518 at commit 4b41213.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


srowen commented Feb 12, 2018

Merged to master

asfgit closed this in c0c902a on Feb 12, 2018
robert3005 pushed a commit to palantir/spark that referenced this pull request Feb 12, 2018

Author: Marco Gaido <marcogaido91@gmail.com>

Closes apache#20518 from mgaido91/SPARK-22119_followup.