[SPARK-22119][FOLLOWUP][ML] Use spherical KMeans with cosine distance #20518
Conversation
Test build #87111 has finished for PR 20518 at commit
   */
  override def centroid(sum: Vector, count: Long): VectorWithNorm = {
    scal(1.0 / count, sum)  // scale the sum by 1 / count to get the mean
    val norm = Vectors.norm(sum, 2)
    scal(1.0 / norm, sum)   // normalize the mean onto the unit sphere
    new VectorWithNorm(sum, 1.0)
  }
Rather than scale sum twice, can you just compute its norm and then scale by 1 / (norm * count * count)?
do you think that the performance improvement would be significant since we are doing it only on k vectors per run? I think the code is clearer in this way, do you agree?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't feel strongly about it, yeah. It won't matter much either way.
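For context, here is a minimal, self-contained sketch of the equivalence being discussed, using plain Scala arrays instead of Spark's private BLAS helpers (all names are illustrative, not the PR's code). Since norm(sum / count) == norm(sum) / count, the count factor cancels out and a single scaling produces the same unit vector, so the two variants differ only in the extra pass over k vectors:

object CentroidEquivalenceSketch {
  def norm(v: Array[Double]): Double = math.sqrt(v.map(x => x * x).sum)

  // Two scalings, as in the PR: compute the mean, then normalize it.
  def centroidTwoScales(sum: Array[Double], count: Long): Array[Double] = {
    val mean = sum.map(_ / count)
    val n = norm(mean)
    mean.map(_ / n)
  }

  // One scaling: normalizing the raw sum yields the same unit vector,
  // because the 1 / count factor cancels during normalization.
  def centroidOneScale(sum: Array[Double]): Array[Double] = {
    val n = norm(sum)
    sum.map(_ / n)
  }
}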
   * @param sum the `sum` for a cluster to be updated
   */
  override def updateClusterSum(point: VectorWithNorm, sum: Vector): Unit = {
    axpy(1.0 / point.norm, point.vector, sum)  // add the point scaled to unit norm
  }
do we need to ignore zero points here?
the cosine similarity/distance is not defined for zero vectors: if there were any, we would have hit earlier failures while computing any cosine distance involving them.
In Scala, `1.0 / 0.0` evaluates to `Infinity`; what about directly throwing an exception instead?
Thanks, I agree. I added an assertion before computing the cosine distance, and a test case for this situation.
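A sketch of what such a guard can look like (illustrative names, plain arrays in place of Spark's Vector type; the actual PR places the assertion inside its cosine distance computation):

// Fail fast on zero-norm vectors instead of letting 1.0 / 0.0
// silently produce Infinity in the distance.
def cosineDistance(v1: Array[Double], v2: Array[Double]): Double = {
  def norm(v: Array[Double]): Double = math.sqrt(v.map(x => x * x).sum)
  val (n1, n2) = (norm(v1), norm(v2))
  assert(n1 > 0 && n2 > 0, "Cosine distance is not defined for zero-length vectors.")
  val dot = v1.zip(v2).map { case (a, b) => a * b }.sum
  1.0 - dot / (n1 * n2)
}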
Test build #87309 has finished for PR 20518 at commit
Merged to master
What changes were proposed in this pull request?
In #19340, some comments argued that spherical KMeans should be used when the cosine distance measure is specified, as Matlab does, instead of an implementation based on the behavior of other tools/libraries like RapidMiner, nltk and ELKI, i.e. computing the centroids as the mean of all the points in the clusters.
This PR introduces the approach used in spherical KMeans. This behavior has the nice property of minimizing the within-cluster cosine distance.
How was this patch tested?
Existing and improved unit tests.
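Putting the two pieces together, a minimal self-contained sketch of the spherical update described above (plain arrays in place of Spark's Vector and VectorWithNorm types; all names illustrative, not the PR's code):

object SphericalKMeansSketch {
  def norm(v: Array[Double]): Double = math.sqrt(v.map(x => x * x).sum)

  // Each point is added to its cluster's sum scaled to unit norm, so every
  // point contributes a direction rather than a magnitude.
  def updateClusterSum(point: Array[Double], sum: Array[Double]): Unit = {
    val n = norm(point)
    for (i <- sum.indices) sum(i) += point(i) / n
  }

  // The centroid is the normalized mean of the (normalized) points, which
  // minimizes the within-cluster cosine distance.
  def centroid(sum: Array[Double], count: Long): Array[Double] = {
    val mean = sum.map(_ / count)
    val n = norm(mean)
    mean.map(_ / n)
  }

  def main(args: Array[String]): Unit = {
    val sum = Array(0.0, 0.0)
    updateClusterSum(Array(2.0, 0.0), sum) // direction (1, 0)
    updateClusterSum(Array(0.0, 5.0), sum) // direction (0, 1)
    val c = centroid(sum, count = 2)
    println(c.mkString(", "))              // ~ (0.707, 0.707), a unit vector
  }
}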