
[SPARK-36553][ML] KMeans avoid compute auxiliary statistics for large K #35457

Closed
wants to merge 3 commits

Conversation

zhengruifeng
Contributor

What changes were proposed in this pull request?

SPARK-31007 introduced auxiliary statistics to speed up computation in KMeans.

However, it needs an array of size k * (k + 1) / 2, which may cause integer overflow or OOM when k is too large.

So we should skip this optimization in that case.
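As a rough sketch of the approach (the object, method names, and threshold here are illustrative, not Spark's actual code), the guard computes the packed-array size in Long to avoid Int overflow, and skips the optimization when it is too large:

```scala
// Illustrative sketch of the size guard; real Spark names and thresholds differ.
object KMeansStatsGuard {
  // The packed upper-triangular statistics array needs k * (k + 1) / 2 entries.
  // Compute the size in Long first so large k does not overflow Int.
  def packedSize(k: Int): Long = k.toLong * (k + 1) / 2

  // Skip the auxiliary-statistics optimization when the packed array would be
  // too large to allocate and broadcast (this threshold is hypothetical).
  val MaxEntries: Long = 1000000L

  def shouldComputeStatistics(k: Int): Boolean = packedSize(k) <= MaxEntries
}
```

For k = 50,000, packedSize returns 1,250,025,000 entries, so under a guard like this the optimization would be skipped and the plain distance search used instead.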

Why are the changes needed?

Avoid overflow or OOM when k is too large (e.g. 50,000).

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing test suites.

@github-actions github-actions bot added the MLLIB label Feb 9, 2022
init

init
@anders-rydbirk

@zhengruifeng Thanks for picking this one up!

@zhengruifeng
Contributor Author

cc @srowen

Member

@srowen srowen left a comment

I don't love the two code paths but maybe this is the easiest fix. It un-optimizes large k though. Is the idea to un-pack that triangular array not viable?

@@ -117,6 +117,17 @@ private[spark] abstract class DistanceMeasure extends Serializable {
packedValues
}

def findClosest(
Member:

Is this overload used?

Contributor Author:

It is used in both training and prediction; the statistics argument is optional in it.

Member:

OK this is a new method but I don't see it called, maybe I'm missing something

@zhengruifeng
Contributor Author

zhengruifeng commented Feb 10, 2022

Since the matrix is symmetric, if we un-pack it, then we will get an even bigger matrix of size k * k.

#27758 (comment)

@srowen
Member

srowen commented Feb 10, 2022

Sorry, I guess I mean make it into an array of arrays, not one big array.

@zhengruifeng
Contributor Author

I think I made it too complex.

According to your description in the ticket, @anders-rydbirk:

Possible workaround:

    Roll back to Spark 3.0.0 since a KMeansModel generated with 3.0.0 cannot be loaded in 3.1.1.
    Reduce K. Currently trying with 45000.

maybe we just need to change k * (k + 1) / 2 to (k.toLong * (k + 1) / 2).toInt?

scala> val k = 50000
val k: Int = 50000

scala> k * (k + 1) / 2
val res8: Int = -897458648

scala> (k.toLong * (k + 1) / 2).toInt
val res9: Int = 1250025000

scala> val k = 45000
val k: Int = 45000

scala> k * (k + 1) / 2
val res10: Int = 1012522500

scala> (k.toLong * (k + 1) / 2).toInt
val res11: Int = 1012522500

Sorry, I guess I mean make it into an array of arrays, not one big array.

@srowen yes, using arrays of sizes (1, 2, ..., k) is another choice
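That arrays-of-arrays alternative could be sketched like this (purely illustrative; this is not what the PR ends up implementing):

```scala
// Illustrative sketch: store the lower triangle as an array of row arrays of
// lengths 1, 2, ..., k, instead of one huge flat allocation.
def buildTriangular(k: Int): Array[Array[Double]] =
  Array.tabulate(k)(i => new Array[Double](i + 1))

// Entry (i, j) with j <= i lives at tri(i)(j); total entries are still
// k * (k + 1) / 2, just split across k smaller allocations.
val tri = buildTriangular(4)
// tri.map(_.length) is Array(1, 2, 3, 4); sum is 10 = 4 * 5 / 2
```

This sidesteps the single-array Int.MaxValue length limit, but as noted below it does not reduce the total memory needed.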

@srowen
Member

srowen commented Feb 10, 2022

Array sizes can't be Long, so if it doesn't fit in an Int it won't work.

@zhengruifeng
Contributor Author

There are two limits:

1. the array size must be less than Int.MaxValue;

2. the array must fit in memory for initialization and broadcasting.

With --driver-memory=8G, I cannot create an array of 1,250,025,000 doubles. If we switch to arrays of arrays, I am afraid it is still prone to OOM for large K.
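A back-of-the-envelope estimate shows why that allocation fails on an 8G driver:

```scala
// Rough memory cost of the flat packed statistics array for k = 50,000.
val k = 50000
val entries = k.toLong * (k + 1) / 2   // 1,250,025,000 entries
val bytes   = entries * 8L             // 8 bytes per Double => 10,000,200,000 bytes
val gib     = bytes.toDouble / (1L << 30)
// gib is about 9.3 GiB for the raw doubles alone, more than an 8G driver heap.
```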

@zhengruifeng
Contributor Author

@srowen

I can switch to Array[Array[Double]] if you prefer it; I am neutral on it.

My main concern is that this optional statistics array may be too large. In this case (k=50,000), it is much larger than the cluster centers themselves (dim=3).

@srowen
Member

srowen commented Feb 15, 2022

Your current design is fine, I trust your judgment

} else {
findClosest(centers, point)
}
}
Member:

Shall we add the function description like the other existing def findClosest functions?

Contributor Author:

good point. I will update this PR

val k = clusterCenters.length
val numFeatures = clusterCenters.head.size
if (DistanceMeasure.shouldComputeStatistics(k) &&
DistanceMeasure.shouldComputeStatisticsLocally(k, numFeatures)) {
Member:

indentation?

@zhengruifeng
Contributor Author

I think this should also be back-ported to 3.1/3.2

Member

@srowen srowen left a comment

It's adding complexity but I think for a reasonable reason, to fix a perf regression

@@ -117,6 +117,17 @@ private[spark] abstract class DistanceMeasure extends Serializable {
packedValues
}

def findClosest(
Member:

OK this is a new method but I don't see it called, maybe I'm missing something

@zhengruifeng
Contributor Author


@srowen It is used in both training (on the .ml side) and prediction (on the .mllib side); the switch is done by just changing the type of stats in distanceMeasureInstance.findClosest(centers, stats, point) from Array[Double] to Option[Array[Double]].

@srowen
Member

srowen commented Feb 21, 2022

Do existing call sites bind to the new method? I can't see how a new method is called when nothing new calls it, but if you understand it and it works, never mind.

@zhengruifeng
Contributor Author

Do existing call sites bind to the new method?

No.

The existing two methods are used in DistanceMeasure and DistanceMeasureSuite;

but def findClosest(centers: Array[VectorWithNorm], point: VectorWithNorm) is also used in the KMeans initialization algorithm initKMeansParallel and in BisectingKMeans.


// Execute iterations of Lloyd's algorithm until converged
while (iteration < maxIterations && !converged) {
val bcCenters = sc.broadcast(centers)
val stats = if (shouldDistributed) {
Contributor Author:

Previously, stats was an Array[Double].


// Execute iterations of Lloyd's algorithm until converged
while (iteration < maxIterations && !converged) {
val bcCenters = sc.broadcast(centers)
val stats = if (shouldDistributed) {
distanceMeasureInstance.computeStatisticsDistributedly(sc, bcCenters)
val stats = if (shouldComputeStats) {
Contributor Author:

Now it is an Option[Array[Double]].

Contributor Author:

So the following val (bestCenter, cost) = distanceMeasureInstance.findClosest(centers, stats, point) will call this new method, without any code change at the call sites.
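The compile-time dispatch being described can be illustrated with a toy example (hypothetical signatures, not Spark's actual ones): changing the declared type of stats is enough to make the unchanged call expression bind to the new overload.

```scala
// Toy illustration of Scala overload resolution on the `stats` parameter type.
object Dist {
  def findClosest(stats: Array[Double]): String =
    "old overload"

  def findClosest(stats: Option[Array[Double]]): String = stats match {
    case Some(_) => "new overload, with statistics"
    case None    => "new overload, plain search"
  }
}

val stats: Option[Array[Double]] = None
// The call-site text is unchanged; the static type of `stats` selects the overload.
Dist.findClosest(stats)  // "new overload, plain search"
```

Because overloads are resolved statically, no caller needs to be edited: once stats is declared as Option[Array[Double]], every existing findClosest(centers, stats, point) expression recompiles against the new signature.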

@huaxingao
Contributor

I will merge this tomorrow if there are no further comments.

@huaxingao huaxingao closed this in ad5427e Mar 2, 2022
huaxingao pushed a commit that referenced this pull request Mar 2, 2022
### What changes were proposed in this pull request?

SPARK-31007 introduced auxiliary statistics to speed up computation in KMeans.

However, it needs an array of size `k * (k + 1) / 2`, which may cause integer overflow or OOM when k is too large.

So we should skip this optimization in that case.

### Why are the changes needed?

Avoid overflow or OOM when k is too large (e.g. 50,000).

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing test suites.

Closes #35457 from zhengruifeng/kmean_k_limit.

Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: huaxingao <huaxin_gao@apple.com>
(cherry picked from commit ad5427e)
Signed-off-by: huaxingao <huaxin_gao@apple.com>
huaxingao pushed a commit that referenced this pull request Mar 2, 2022
(same commit message as above, cherry picked from commit ad5427e)
Signed-off-by: huaxingao <huaxin_gao@apple.com>
@huaxingao
Contributor

Merged to master/3.2/3.1. Thanks!

@zhengruifeng zhengruifeng deleted the kmean_k_limit branch March 3, 2022 03:39
@zhengruifeng
Contributor Author

@huaxingao @srowen @dongjoon-hyun Thanks for reviewing!

kazuyukitanimura pushed a commit to kazuyukitanimura/spark that referenced this pull request Aug 10, 2022
(same commit message as above, cherry picked from commits ad5427e and d5e90cf)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>