-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-36553][ML] KMeans avoid compute auxiliary statistics for large K #35457
Conversation
e6a7d9a
to
d399c4f
Compare
@zhengruifeng Thanks for picking this one up! |
cc @srowen |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't love the two code paths but maybe this is the easiest fix. It un-optimizes large k though. Is the idea to un-pack that triangular array not viable?
@@ -117,6 +117,17 @@ private[spark] abstract class DistanceMeasure extends Serializable { | |||
packedValues | |||
} | |||
|
|||
def findClosest( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this overload used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is used in both training and prediction. statistics
is optional in it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK this is a new method but I don't see it called, maybe I'm missing something
since the matrix is symmetric, if we un-pack it, then we will get a even bigger matrix of size |
Sorry, I guess I mean make it into an array of arrays, not one big array. |
I think I made it too complex. according to @anders-rydbirk your description in the ticket:
maybe we just need to chang
@srowen yes, using arrays of sizes (1, 2, ..., k) is another choice |
Array sizes can't be long so if it doesn't fit in an int it won't work |
there are two limits: 1, array size, must be less than 2, its size should fit in memory for initialization and broadcasting. with --driver-memory=8G, I can not create an array of 1250025000 doubles. If we switch to arrays of arrays, I am afraid it's prone to OOM for large K. |
I can switch to array[array[double] if you perfer it, I am netural on it. my main concern is, this optional statistics may be too large. In this case, k=50,000, it is much larger than the clustering centers (dim=3). |
Your current design is fine, I trust your judgment |
} else { | ||
findClosest(centers, point) | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we add the function description like the other existing def findClosest
functions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point. I will update this PR
val k = clusterCenters.length | ||
val numFeatures = clusterCenters.head.size | ||
if (DistanceMeasure.shouldComputeStatistics(k) && | ||
DistanceMeasure.shouldComputeStatisticsLocally(k, numFeatures)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indentation?
I think this should also be back-ported to 3.1/3.2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's adding complexity but I think for a reasonable reason, to fix a perf regression
@@ -117,6 +117,17 @@ private[spark] abstract class DistanceMeasure extends Serializable { | |||
packedValues | |||
} | |||
|
|||
def findClosest( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK this is a new method but I don't see it called, maybe I'm missing something
@srowen It is used in both training (in the .ml side) and prediction (in the .mllib side), the switch is done by just changing the type of |
Do existing call sites bind to the new method? I can't see how a new method is called when nothing new calls it, but if you understand it and it works, nevermind |
NO. existing two methods are used in but |
|
||
// Execute iterations of Lloyd's algorithm until converged | ||
while (iteration < maxIterations && !converged) { | ||
val bcCenters = sc.broadcast(centers) | ||
val stats = if (shouldDistributed) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
previous stats is a Array[Double]
|
||
// Execute iterations of Lloyd's algorithm until converged | ||
while (iteration < maxIterations && !converged) { | ||
val bcCenters = sc.broadcast(centers) | ||
val stats = if (shouldDistributed) { | ||
distanceMeasureInstance.computeStatisticsDistributedly(sc, bcCenters) | ||
val stats = if (shouldComputeStats) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now, it is a Option[Array[Double]]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the following val (bestCenter, cost) = distanceMeasureInstance.findClosest(centers, stats, point)
will call this new method, without code change in the call sites.
I will merge this tomorrow if there are no further comments. |
### What changes were proposed in this pull request? SPARK-31007 introduce an auxiliary statistics to speed up computation in KMeasn. However, it needs a array of size `k * (k + 1) / 2`, which may cause overflow or OOM when k is too large. So we should skip this optimization in this case. ### Why are the changes needed? avoid overflow or OOM when k is too large (like 50,000) ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? existing testsuites Closes #35457 from zhengruifeng/kmean_k_limit. Authored-by: Ruifeng Zheng <ruifengz@foxmail.com> Signed-off-by: huaxingao <huaxin_gao@apple.com> (cherry picked from commit ad5427e) Signed-off-by: huaxingao <huaxin_gao@apple.com>
### What changes were proposed in this pull request? SPARK-31007 introduce an auxiliary statistics to speed up computation in KMeasn. However, it needs a array of size `k * (k + 1) / 2`, which may cause overflow or OOM when k is too large. So we should skip this optimization in this case. ### Why are the changes needed? avoid overflow or OOM when k is too large (like 50,000) ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? existing testsuites Closes #35457 from zhengruifeng/kmean_k_limit. Authored-by: Ruifeng Zheng <ruifengz@foxmail.com> Signed-off-by: huaxingao <huaxin_gao@apple.com> (cherry picked from commit ad5427e) Signed-off-by: huaxingao <huaxin_gao@apple.com>
Merged to master/3.2/3.1. Thanks! |
@huaxingao @srowen @dongjoon-hyun Thanks for reviewing! |
### What changes were proposed in this pull request? SPARK-31007 introduce an auxiliary statistics to speed up computation in KMeasn. However, it needs a array of size `k * (k + 1) / 2`, which may cause overflow or OOM when k is too large. So we should skip this optimization in this case. ### Why are the changes needed? avoid overflow or OOM when k is too large (like 50,000) ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? existing testsuites Closes apache#35457 from zhengruifeng/kmean_k_limit. Authored-by: Ruifeng Zheng <ruifengz@foxmail.com> Signed-off-by: huaxingao <huaxin_gao@apple.com> (cherry picked from commit ad5427e) Signed-off-by: huaxingao <huaxin_gao@apple.com> (cherry picked from commit d5e90cf) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
What changes were proposed in this pull request?
SPARK-31007 introduce an auxiliary statistics to speed up computation in KMeasn.
However, it needs a array of size
k * (k + 1) / 2
, which may cause overflow or OOM when k is too large.So we should skip this optimization in this case.
Why are the changes needed?
avoid overflow or OOM when k is too large (like 50,000)
Does this PR introduce any user-facing change?
No
How was this patch tested?
existing testsuites