
[SPARK-30699][ML][PYSPARK] GMM blockify input vectors #27473

Closed
zhengruifeng wants to merge 4 commits into master from blockify_gmm

Conversation

@zhengruifeng (Contributor) commented Feb 6, 2020

What changes were proposed in this pull request?

1. Add a new param blockSize.
2. If blockSize == 1, keep the original behavior (code path trainOnRows).
3. If blockSize > 1, standardize and stack input vectors into blocks (like ALS/MLP), code path trainOnBlocks; see the sketch below.
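
A minimal sketch of the stacking step, assuming ml.linalg types (stackToBlock is an illustrative helper, not the PR's internal API):

import org.apache.spark.ml.linalg.{DenseMatrix, Vector}

// Illustrative only: stack blockSize input vectors row-wise into one
// column-major DenseMatrix, so the E-step can evaluate densities with a
// single level-3 BLAS call per block instead of one gemv per row.
def stackToBlock(vectors: Seq[Vector]): DenseMatrix = {
  val numRows = vectors.length
  val numCols = vectors.head.size
  val values = Array.ofDim[Double](numRows * numCols)
  vectors.zipWithIndex.foreach { case (vec, i) =>
    val arr = vec.toArray
    var j = 0
    while (j < numCols) {
      // element (i, j) of a column-major matrix lives at j * numRows + i
      values(j * numRows + i) = arr(j)
      j += 1
    }
  }
  new DenseMatrix(numRows, numCols, values)
}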

Why are the changes needed?

Performance gains on the dense dataset HIGGS:
1. saves about 45% RAM;
2. about 3x faster with OpenBLAS.

Does this PR introduce any user-facing change?

Adds a new expert param, blockSize.

How was this patch tested?

Added test suites.

@SparkQA commented Feb 6, 2020

Test build #117976 has finished for PR 27473 at commit 448ed80.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng zhengruifeng changed the title [SPARK-30699][ML][PYSPARK] GMM blockify input vectors [SPARK-30699][WIP][ML][PYSPARK] GMM blockify input vectors Feb 6, 2020
@zhengruifeng zhengruifeng reopened this May 6, 2020
@SparkQA commented May 6, 2020

Test build #122341 has finished for PR 27473 at commit 448ed80.

  • This patch fails MiMa tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA commented May 6, 2020

Test build #122342 has finished for PR 27473 at commit 473e5a2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng zhengruifeng changed the title [SPARK-30699][WIP][ML][PYSPARK] GMM blockify input vectors [SPARK-30699][ML][PYSPARK] GMM blockify input vectors May 6, 2020
@SparkQA commented May 6, 2020

Test build #122351 has finished for PR 27473 at commit 35e007e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng (Contributor, Author) commented May 6, 2020

Test on the first 1M rows of HIGGS:

test code:

import org.apache.spark.ml.clustering._
import org.apache.spark.storage.StorageLevel
import org.apache.spark.ml.linalg._

val df = spark.read.format("libsvm").load("/data1/Datasets/higgs/HIGGS.1m").repartition(1)
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count


val gmm = new GaussianMixture().setSeed(0).setK(4).setMaxIter(2).setBlockSize(64)
gmm.fit(df)


val results = Seq(1, 4, 16, 64, 256, 1024, 4096).map { size =>
  val start = System.currentTimeMillis
  val model = gmm.setK(4).setMaxIter(20).setBlockSize(size).fit(df)
  val end = System.currentTimeMillis
  (size, model, end - start)
}

results.map(_._2.summary.numIter)
results.map(_._2.summary.logLikelihood)
results.map(_._3)

Results WITHOUT native BLAS:

scala> results.map(_._2.summary.numIter)
res3: Seq[Int] = List(20, 20, 20, 20, 20, 20, 20)

scala> results.map(_._2.summary.logLikelihood)
res4: Seq[Double] = List(-2.3353357834421366E7, -2.3353357834421184E7, -2.3353357834421184E7, -2.3353357834421184E7, -2.3353357834421184E7, -2.3353357834421184E7, -2.3353357834421184E7)

scala> results.map(_._3)
res5: Seq[Long] = List(105777, 113261, 110608, 106573, 108141, 109825, 113094)

It is surprising that there is a small performance regression on dense input: 105777 ms at blockSize=1 vs 106573 ms at blockSize=64.

[screenshot gmm_1: blockSize==1]

[screenshot gmm_1024: blockSize==1024]

Results WITH native BLAS (OPENBLAS_NUM_THREADS=1):

scala> results.map(_._2.summary.numIter)
res3: Seq[Int] = List(20, 20, 20, 20, 20, 20, 20)

scala> results.map(_._2.summary.logLikelihood)
res4: Seq[Double] = List(-2.3353357834421374E7, -2.3353357834422573E7, -2.3353357834422797E7, -2.335335783442225E7, -2.3353357834422205E7, -2.3353357834422156E7, -2.335335783442218E7)

scala> results.map(_._3)
res5: Seq[Long] = List(108005, 54975, 39802, 35807, 35027, 36369, 38717)

When OpenBLAS is used, this PR obtains about a 3x speedup.

[screenshot gmm_openBlas_1: blockSize==1 with OpenBLAS]

[screenshot gmm_openBlas_1024: blockSize==1024 with OpenBLAS]


Comparison to master (WITHOUT native BLAS):

scala> val start = System.currentTimeMillis; val model = gmm.setK(4).setMaxIter(20).fit(df); val end = System.currentTimeMillis; end - start
start: Long = 1587976220511                                                     
model: org.apache.spark.ml.clustering.GaussianMixtureModel = GaussianMixtureModel: uid=GaussianMixture_753da885644b, k=4, numFeatures=28
end: Long = 1587976324361
res4: Long = 103850

This PR keeps the original behavior and performance when blockSize == 1; a sketch of the dispatch follows.
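
For illustration, that dispatch could look like the stand-in below (trainOnRows and trainOnBlocks are the code-path names from the PR description; the types and bodies here are stubs, not Spark's):

object DispatchSketch {
  type Instances = Seq[Array[Double]]

  def trainOnRows(instances: Instances): String =
    s"per-row path over ${instances.length} rows (original behavior)"

  def trainOnBlocks(instances: Instances, blockSize: Int): String =
    s"blocked path: ${instances.grouped(blockSize).size} blocks of up to $blockSize rows"

  // blockSize == 1 keeps the unchanged per-row aggregation, so users who
  // never touch the new param see identical results and performance.
  def train(instances: Instances, blockSize: Int): String = {
    require(blockSize >= 1, "blockSize must be positive")
    if (blockSize == 1) trainOnRows(instances)
    else trainOnBlocks(instances, blockSize)
  }
}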

@SparkQA commented May 6, 2020

Test build #122358 has finished for PR 27473 at commit ba2a4f3.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@zhengruifeng (Contributor, Author) commented

@srowen I think GMM may be a special case: I found that on dense input it suffers a small regression. (Other impls like LoR/LiR are accelerated significantly even without native BLAS.)
But when I enable OpenBLAS, this PR is about 3x faster than the existing impl, and about 5x faster in the prediction stage (5 sec -> 1 sec).

@xwu99 This is for GMM; I think it is similar to KMeans. I would be happy if you could help review this.

Commit history (force-push titles, deduplicated): init (×10), use nativeBLAS for dense input, add py, refactor (×3), revert BLAS.ger (×3), simplify, opt_blas (×2), nit (×8)

@SparkQA commented May 7, 2020

Test build #122392 has finished for PR 27473 at commit 0eb0f07.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented May 7, 2020

Test build #122398 has finished for PR 27473 at commit 31f8907.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng (Contributor, Author) commented

retest this please

@SparkQA commented May 9, 2020

Test build #122463 has finished for PR 27473 at commit 31f8907.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented May 9, 2020

Test build #122466 has finished for PR 27473 at commit a3a005e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng (Contributor, Author) commented

Merged to master

@zhengruifeng zhengruifeng deleted the blockify_gmm branch May 12, 2020 04:54
@HyukjinKwon (Member) commented

Let's not merge without review or approval, @zhengruifeng.

@srowen (Member) commented May 13, 2020

Meta-comment - I have reviewed similar block-ification PRs and don't have a problem with them, and trust the judgment of @zhengruifeng as a result of the discussion. He now knows this code better than anyone. I know he's a committer and will "own" his work, including fixes if needed. I think this line of work is OK.

@HyukjinKwon (Member) commented

This PR might be fine, and I understand the ML side isn't reviewed very actively either. But it might be best to have somebody follow up on his work. If this becomes a pattern, I think it's a problem.

@mengxr (Contributor) commented Jun 11, 2020

@srowen Even if you reviewed some similar PRs from @zhengruifeng, could you explicitly give an LGTM on each of the PRs before merge? Committers who saw PRs like #28473 would raise the same question as @HyukjinKwon did.

@zhengruifeng Please wait for LGTM before you merge a PR in the future. Committers should follow the review process strictly.

@srowen (Member) commented Jun 11, 2020

@mengxr why not review them yourself?

@zhengruifeng (Contributor, Author) commented

@mengxr OK, I will be more patient with reviews.
Actually, I did not ping Owen on some of those PRs; I will involve more ML committers/contributors in future PRs and tickets.

@mengxr (Contributor) commented Jun 11, 2020

@zhengruifeng I trust that a committer merges a PR if and only if he/she is confident in the change. However, for a non-trivial change, the committer's own confidence does not count: you have to find another committer to review and approve before merging, no matter how long you might wait. If no one reviews it, it means the community is not interested in the change. Your options are then to keep the PR open or close it, not to merge it.

@srowen: @WeichenXu123 and I will take a look at the merged changes.

@@ -55,7 +55,7 @@ class MultivariateGaussian @Since("2.0.0") (
    */
   @transient private lazy val tuple = {
     val (rootSigmaInv, u) = calculateCovarianceConstants
-    val rootSigmaInvMat = Matrices.fromBreeze(rootSigmaInv)
+    val rootSigmaInvMat = Matrices.fromBreeze(rootSigmaInv).toDense
Review comment (Contributor):

See comment in #29782

@@ -81,6 +81,36 @@ class MultivariateGaussian @Since("2.0.0") (
     u - 0.5 * BLAS.dot(v, v)
   }

+  private[ml] def pdf(X: Matrix): DenseVector = {
Review comment (Contributor):

See comment in #29782
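
The added pdf(X: Matrix) evaluates one density per row of a block. A minimal Breeze sketch of the batched math, assuming the same constants as the single-vector path above (v = rootSigmaInv * (x - mu), logpdf = u - 0.5 * dot(v, v)); the PR's actual implementation works on ml.linalg types and differs in detail:

import breeze.linalg._
import breeze.numerics.exp

// Sketch only, not the PR's code: batched log-density for a block x of
// shape (blockSize x numFeatures), one input vector per row.
def logpdfBlock(x: DenseMatrix[Double], mu: DenseVector[Double],
                rootSigmaInv: DenseMatrix[Double], u: Double): DenseVector[Double] = {
  val delta = x(*, ::) - mu            // subtract the mean from every row
  val v = delta * rootSigmaInv.t       // one gemm replaces blockSize gemv calls
  val sqNorms = sum(v *:* v, Axis._1)  // squared norm of each row of v
  sqNorms.map(s => u - 0.5 * s)        // per-row analogue of u - 0.5 * BLAS.dot(v, v)
}

// The density itself is the elementwise exp of the log-density.
def pdfBlock(x: DenseMatrix[Double], mu: DenseVector[Double],
             rootSigmaInv: DenseMatrix[Double], u: Double): DenseVector[Double] =
  exp(logpdfBlock(x, mu, rootSigmaInv, u))

Replacing per-row gemv calls with a single gemm per block is what lets native BLAS pay off in the benchmarks above.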
