[SPARK-31007][ML] KMeans optimization based on triangle-inequality #27758
Conversation
In the current impl, the following lemma is used in KMeans:

Lemma 0: let x be a point, let c be a center and o be the origin; then d(x, c) >= |d(x, o) - d(c, o)| = |norm(x) - norm(c)|.

```scala
val center = centers(i)
// Since `\|a - b\| \geq |\|a\| - \|b\||`, we can use this lower bound to avoid unnecessary
// distance computation.
var lowerBoundOfSqDist = center.norm - point.norm
lowerBoundOfSqDist = lowerBoundOfSqDist * lowerBoundOfSqDist
if (lowerBoundOfSqDist < bestDistance) {
  ...
}
```

This can only be applied in EuclideanDistance. According to Using the Triangle Inequality to Accelerate K-Means, we can go further: there are another two lemmas that can be used.

Lemma 1: let x be a point, and let b and c be centers; if d(b, c) >= 2 * d(x, b), then d(x, c) >= d(x, b).

This PR is mainly for Lemma 1. Its idea is quite simple: if point x is close enough to center b (closer than a pre-computed radius of b), then we can say point x belongs to center b without computing the distances between x and the other centers. It can be used in both training and prediction. It can be directly applied in EuclideanDistance, but not in CosineDistance; for CosineDistance we can get a variant in the space of radian/angle (see the discussion below).

Lemma 2: let x be a point, and let b and c be centers; then d(x, c) >= max{0, d(x, b) - d(b, c)}. The application of Lemma 2 is a little more complex.
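To make the Lemma 1 short-circuit concrete, here is a minimal self-contained sketch (hypothetical names and shapes, Euclidean distance, at least two centers assumed; not the PR's code):

```scala
object Lemma1Sketch {
  private def dist(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(a.indices.map(i => (a(i) - b(i)) * (a(i) - b(i))).sum)

  // radius(i) = half the distance from center i to its nearest other center.
  // By Lemma 1, d(x, c_i) < radius(i) implies c_i is the closest center to x.
  def computeRadii(centers: Array[Array[Double]]): Array[Double] =
    centers.indices.map { i =>
      centers.indices.filter(_ != i).map(j => dist(centers(i), centers(j))).min / 2
    }.toArray

  // Assignment with the short-circuit: stop scanning as soon as the point
  // falls inside some center's radius.
  def findClosest(
      point: Array[Double],
      centers: Array[Array[Double]],
      radii: Array[Double]): (Int, Double) = {
    var bestIndex = 0
    var bestDistance = Double.PositiveInfinity
    var i = 0
    while (i < centers.length) {
      val d = dist(point, centers(i))
      if (d < radii(i)) return (i, d) // Lemma 1: no other center can be closer
      if (d < bestDistance) { bestDistance = d; bestIndex = i }
      i += 1
    }
    (bestIndex, bestDistance)
  }
}
```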
env: bin/spark-shell --driver-memory=64G

testCode:

```scala
import org.apache.spark.ml.linalg._
import org.apache.spark.ml.clustering._
val df = spark.read.format("libsvm").load("/data1/Datasets/webspam/webspam_wc_normalized_trigram.svm.10k")
df.persist()
val km = new KMeans().setK(16).setMaxIter(5)
val kmm = km.fit(df)
val results = Seq(2,4,8,16).map { k =>
val km = new KMeans().setK(k).setMaxIter(20);
val start = System.currentTimeMillis;
val kmm = km.fit(df);
val end = System.currentTimeMillis;
val train = end - start;
kmm.transform(df).count; // trigger the lazy computation of radii
val start2 = System.currentTimeMillis;
kmm.transform(df).count;
val end2 = System.currentTimeMillis;
val predict = end2 - start2;
val result = (k, train, kmm.summary.numIter, kmm.summary.trainingCost, predict);
println(result);
result
}
val df2 = spark.read.format("libsvm").load("/data1/Datasets/a9a/a9a")
df2.persist()
val km = new KMeans().setK(16).setMaxIter(10)
val kmm = km.fit(df2)
val results2 = Seq(2,4,8,16,32,64,128,256).map { k =>
val km = new KMeans().setK(k).setMaxIter(20);
val start = System.currentTimeMillis;
val kmm = km.fit(df2);
val end = System.currentTimeMillis;
val train = end - start;
kmm.transform(df2).count; // trigger the lazy computation of radii
val start2 = System.currentTimeMillis;
kmm.transform(df2).count;
val end2 = System.currentTimeMillis;
val predict = end2 - start2;
val result = (k, train, kmm.summary.numIter, kmm.summary.trainingCost, predict);
println(result);
result
}
```

1, sparse dataset: webspam, numInstances: 10,000, numFeatures: 8,289,919
2, dense dataset: a9a, numInstances: 32,561, numFeatures: 123
We can see that: (benchmark results table omitted)
Test build #119166 has finished for PR 27758 at commit
Yeah, does this actually speed things up? The benchmarks don't show much win, but I agree it could in theory. It would have to be a pretty large k.
```scala
/**
 * @return Radii of centers. If distance between point x and center c is less than
 *         the radius of center c, then center c is the closest center to point x.
 */
def computeRadii(centers: Array[VectorWithNorm]): Array[Double] = {
```
Is there any value in this default impl? It's overridden in both subclasses. Or is it to support future impls that may not be able to use this optimization?
You are right, I should remove this. Not all distances can be expected to obey such a triangle inequality.
```scala
/**
 * @return the index of the closest center to the given point, as well as the cost.
 */
def findClosest(
    centers: Array[VectorWithNorm],  // was: TraversableOnce[VectorWithNorm]
    radii: Array[Double],
```
Do you need the other implementation, if you have this one? This short-circuits the case where you find a point within a cluster radius. Is the other one kept to support cosine distance?
```scala
if (d < r) {
  bestDistance = d
  bestIndex = i
  found = true
```
Just return here rather than carry a flag around?
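Applied to the snippet above, that suggestion might look roughly like this (a hypothetical standalone shape, not the PR's code):

```scala
// Sketch of the suggested refactor: return directly from the scan instead of
// carrying a `found` flag. `ds` are point-to-center distances, `rs` the radii.
def firstWithinRadius(ds: Array[Double], rs: Array[Double]): Option[Int] = {
  var i = 0
  while (i < ds.length) {
    if (ds(i) < rs(i)) return Some(i) // early return replaces `found = true`
    i += 1
  }
  None
}
```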
```scala
val center = centers(i)
// Since `\|a - b\| \geq |\|a\| - \|b\||`, we can use this lower bound to avoid unnecessary
// distance computation.
var lowerBoundOfSqDist = center.norm - point.norm
```
Nit: I'd just make these two vals for clarity
```scala
if (d < r * r) {
  bestDistance = d
  bestIndex = i
  found = true
```
Same, just return?
```scala
// d = 1 - cos(x)
// r = 1 - cos(x/2) = 1 - sqrt((cos(x) + 1) / 2) = 1 - sqrt(1 - d/2)
distances.map(d => 1 - math.sqrt(1 - d / 2))
```
Cosine distance doesn't obey the triangle inequality; is this meant to be angular distance? For my benefit (and possibly comments) how is r the angular distance here?
Yes, cosine distance doesn't obey the triangle inequality, but the following lemma can still be applied:

Given a point x and centers b and c: if angle(x,b) < angle(b,c)/2, then angle(x,b) < angle(x,c), and hence cos_distance(x,b) = 1 - cos(x,b) < cos_distance(x,c) = 1 - cos(x,c).

That is because of PRINCIPLES FROM GEOMETRY, point 3: "Each side of a spherical triangle is less than the sum of the other two." In spherical geometry, the shortest distance between two points is an arc of a great circle, and the triangle inequality holds provided the distance between two points on a sphere is the length of a minor spherical line segment (that is, one with central angle in [0, π]) with those endpoints.

So:

angle(x,b) + angle(x,c) > angle(b,c), and angle(x,b) < angle(b,c)/2
=> angle(x,c) > angle(b,c)/2 > angle(x,b)
=> cos_distance(x,c) > cos_distance(x,b)

And by the half-angle identity:

angle(x,b) < angle(b,c)/2
<=> cos(x,b) > sqrt{ (cos(b,c) + 1)/2 }
<=> cos_distance(x,b) < 1 - sqrt{ (cos(b,c) + 1)/2 } = 1 - sqrt{ 1 - cos_distance(b,c)/2 }

=> Given two centers b and c, if point x has cos_distance(x,b) < 1 - sqrt{ 1 - cos_distance(b,c)/2 }, then point x belongs to center b.
In short, cos_distance does not obey the triangle inequality, so we can NOT say: if cos_distance(b,x) < cos_distance(b,c)/2, then cos_distance(b,x) < cos_distance(c,x).
However, the arc distance (or angle) does obey "each side of a spherical triangle is less than the sum of the other two", so we can get an angular bound, and then a cos_distance bound: if cos_distance(b,x) < 1 - sqrt{ 1 - cos_distance(b,c)/2 }, then cos_distance(b,x) < cos_distance(c,x).
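A standalone numeric sanity check of this bound (an illustrative sketch only, not PR code; `unitVector` and `cosDistance` are ad-hoc helpers):

```scala
// Check: if cos_distance(x, b) < 1 - sqrt(1 - cos_distance(b, c) / 2),
// then x is closer (in cosine distance) to b than to c.
val rng = new scala.util.Random(0)
def unitVector(n: Int): Array[Double] = {
  val v = Array.fill(n)(rng.nextGaussian())
  val norm = math.sqrt(v.map(e => e * e).sum)
  v.map(_ / norm)
}
def cosDistance(a: Array[Double], b: Array[Double]): Double =
  1.0 - a.indices.map(i => a(i) * b(i)).sum
(1 to 100000).foreach { _ =>
  val b = unitVector(5); val c = unitVector(5); val x = unitVector(5)
  val radius = 1 - math.sqrt(1 - cosDistance(b, c) / 2)
  if (cosDistance(x, b) < radius) {
    assert(cosDistance(x, b) < cosDistance(x, c)) // never fires
  }
}
```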
OK, the last expression comes from the cos(2x) identity.
```scala
val iterationStartTime = System.nanoTime()
...
instr.foreach(_.logNumFeatures(centers.head.vector.size))
...
// Execute iterations of Lloyd's algorithm until converged
while (iteration < maxIterations && !converged) {
  val radiiStartTime = System.nanoTime()
```
You might remove this after testing; it's not super important.
Force-pushed from c423b74 to dd2aff7.
I updated the code, then retested the above test suite; the prediction is a little faster, but not significantly. I also tested with cosine distance; the results are: (results table omitted)
We can see that KMeans impls with cosine distance have (almost) the same convergence.
Test build #119208 has finished for PR 27758 at commit
If I'm reading that right, it's slower to train in many cases in the first instance, and slightly faster at scale in the second. I'm surprised because I'd have thought it's more easily a win, even with the overhead of calculating stats. Is the purpose more about prediction speed? I just wonder how much one is worth vs the other.
@srowen Sorry to reply late. I missed the emails from GitHub.
It also helps save training time, if the dataset is large enough: the cost of computing the stats is about O(k^2 * m), while the cost of computing distances in one iteration is O(k * n * m), where m is the number of features and n is the number of instances. I guess I can compute the stats distributedly in some cases (when k is large). I just marked this PR WIP for two reasons: (reasons omitted)
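As a back-of-the-envelope illustration of that ratio (using the dataset sizes from the benchmark above; the O(·) constants are ignored):

```scala
// Per-iteration cost of the stats (O(k^2 * m)) relative to the distance
// computation (O(k * n * m)) reduces to k / n, independent of m.
val k = 16
val nWebspam = 10000 // webspam numInstances
val nA9a = 32561     // a9a numInstances
println(f"webspam: stats ≈ ${100.0 * k / nWebspam}%.2f%% of one iteration's distance cost")
println(f"a9a:     stats ≈ ${100.0 * k / nA9a}%.2f%% of one iteration's distance cost")
```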
PS it might be worth benchmarking again anyway, as I just merged the change that uses native BLAS for level 1 operations of vectors >= 256 elements.
Force-pushed from dd2aff7 to b4fabb1.
Test build #121275 has finished for PR 27758 at commit
If it's no worse for small k, I think this is fine. I'm slightly concerned about the extra complexity, but it is probably worth the speedup.
```scala
/**
 * Statistics used in triangle inequality to obtain useful bounds to find closest centers.
 *
 * @return A symmetric matrix containing statistics, matrix(i)(j) represents:
```
It might be too hard to implement, but if it's symmetric you only need half this many elements to represent. The indexing becomes more complex though
Good idea, it is symmetric. I will study how other impls like GMM store only the upper triangular part of the matrix.
Maybe it would be helpful to support a symmetric dense matrix in .linalg, since it is used in many places?
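For reference, a minimal sketch of packed upper-triangular storage (row-major, diagonal included; hypothetical code, not what GMM or this PR ends up using):

```scala
// A symmetric k x k matrix stored as its packed upper triangle:
// k * (k + 1) / 2 entries instead of k * k.
val k = 4
val packed = new Array[Double](k * (k + 1) / 2)

// Map (i, j) to a packed offset; indices are mirrored so that
// packedIndex(i, j) == packedIndex(j, i), as for a symmetric stats matrix.
def packedIndex(i: Int, j: Int): Int = {
  val (lo, hi) = if (i <= j) (i, j) else (j, i)
  lo * k - lo * (lo - 1) / 2 + (hi - lo) // start of row `lo` plus column offset
}
```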
```scala
while (j < k) {
  val d = distance(centers(i), centers(j))
  val s = computeStatistics(d)
  stats(i)(j) = s
```
If you're micro-optimizing, I suppose you can lift stats(i) out of the loop, but it may not do anything.
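Applied to the loop above, the micro-optimization would look roughly like this (a hypothetical standalone shape; `computeStatistics` stands in for the PR's per-pair statistic):

```scala
// Hoist the outer-row lookup out of the inner loop: one array dereference
// instead of k of them per row.
def fillRow(
    stats: Array[Array[Double]],
    dists: Array[Array[Double]],
    computeStatistics: Double => Double,
    i: Int,
    k: Int): Unit = {
  val row = stats(i) // lifted out of the loop
  var j = 0
  while (j < k) {
    row(j) = computeStatistics(dists(i)(j))
    j += 1
  }
}
```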
```scala
/**
 * @return the index of the closest center to the given point, as well as the cost.
 */
def findClosest(
```
Is there any clean way to avoid duplicating most of this code? Maybe not. It looks almost identical to the above, though.
Test build #121400 has finished for PR 27758 at commit
Test build #121401 has finished for PR 27758 at commit
compute stats distributedly
nit
package upper tri
need collect
Force-pushed from 0afc845 to b050973.
Test build #121517 has finished for PR 27758 at commit
Test build #121518 has finished for PR 27758 at commit
retest this please
Test build #121524 has finished for PR 27758 at commit
Looking pretty fine if the speedup still holds and it doesn't change behavior
When this condition holds, it can skip computing the distances to the other centers. If not, does it fall back to the normal K-means path? Does it hurt by introducing more overhead from the pre-calculation when k is large, since you need to pre-calculate whenever the centers are updated in each iteration, and the condition always needs to hold to get the benefit?
@xwu99 Thanks for reviewing!
Yes, this PR will use two short-circuits: (list omitted)
I have optimized it by making the pre-calculation distributed (when k is large). I think the triangle-inequality based optimization is complementary with the BLAS optimization: when we do not enable Level-3 BLAS, the triangle-inequality approach will still work.
Test build #121571 has finished for PR 27758 at commit
If there are no more comments, LGTM
Merged to master
Thanks for reviewing!
What changes were proposed in this pull request?
Apply Lemma 1 in Using the Triangle Inequality to Accelerate K-Means: let x be a point, and let b and c be centers; if d(b, c) >= 2 * d(x, b), then d(x, c) >= d(x, b).
It can be directly applied in EuclideanDistance, but not in CosineDistance.
However, for CosineDistance we can luckily get a variant in the space of radian/angle.
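Concretely, per the derivation in the review thread, the CosineDistance radius of a center b would be (assuming, as in the Euclidean case, a minimum over the other centers c):

r(b) = min over c != b of ( 1 - sqrt{ 1 - cos_distance(b, c) / 2 } )

and any point x with cos_distance(x, b) < r(b) is guaranteed to have b as its closest center.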
Why are the changes needed?
It helps improve the performance of prediction and training (mostly).
Does this PR introduce any user-facing change?
No
How was this patch tested?
Existing testsuites.