
[SPARK-30699][ML][PYSPARK] GMM blockify input vectors #27473

Closed
zhengruifeng wants to merge 4 commits into master from blockify_gmm

Conversation

@zhengruifeng (Contributor) commented Feb 6, 2020

What changes were proposed in this pull request?

1. Add a new param blockSize.
2. If blockSize == 1, keep the original behavior (code path trainOnRows).
3. If blockSize > 1, standardize and stack input vectors into blocks (like ALS/MLP), code path trainOnBlocks; see the sketch below.
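
A minimal sketch of the stacking step, assuming ml.linalg types (stackToBlock is an illustrative helper, not the PR's internal API):

import org.apache.spark.ml.linalg.{DenseMatrix, Vector}

// Illustrative only: stack blockSize input vectors row-wise into one
// column-major DenseMatrix, so the E-step can evaluate densities with a
// single level-3 BLAS call per block instead of one gemv per row.
def stackToBlock(vectors: Seq[Vector]): DenseMatrix = {
  val numRows = vectors.length
  val numCols = vectors.head.size
  val values = Array.ofDim[Double](numRows * numCols)
  vectors.zipWithIndex.foreach { case (vec, i) =>
    val arr = vec.toArray
    var j = 0
    while (j < numCols) {
      // element (i, j) of a column-major matrix lives at j * numRows + i
      values(j * numRows + i) = arr(j)
      j += 1
    }
  }
  new DenseMatrix(numRows, numCols, values)
}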

Why are the changes needed?

Performance gains on the dense dataset HIGGS:
1. saves about 45% RAM;
2. about 3x faster with OpenBLAS.

Does this PR introduce any user-facing change?

Adds a new expert param, blockSize.

How was this patch tested?

Added test suites.

@SparkQA commented Feb 6, 2020

Test build #117976 has finished for PR 27473 at commit 448ed80.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng zhengruifeng changed the title [SPARK-30699][ML][PYSPARK] GMM blockify input vectors [SPARK-30699][WIP][ML][PYSPARK] GMM blockify input vectors Feb 6, 2020
@zhengruifeng zhengruifeng reopened this May 6, 2020
@SparkQA commented May 6, 2020

Test build #122341 has finished for PR 27473 at commit 448ed80.

  • This patch fails MiMa tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA commented May 6, 2020

Test build #122342 has finished for PR 27473 at commit 473e5a2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng zhengruifeng changed the title [SPARK-30699][WIP][ML][PYSPARK] GMM blockify input vectors [SPARK-30699][ML][PYSPARK] GMM blockify input vectors May 6, 2020
@SparkQA commented May 6, 2020

Test build #122351 has finished for PR 27473 at commit 35e007e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng (Contributor, Author) commented May 6, 2020

Test on the first 1M rows of HIGGS:

test code:

import org.apache.spark.ml.clustering._
import org.apache.spark.storage.StorageLevel
import org.apache.spark.ml.linalg._

val df = spark.read.format("libsvm").load("/data1/Datasets/higgs/HIGGS.1m").repartition(1)
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count


val gmm = new GaussianMixture().setSeed(0).setK(4).setMaxIter(2).setBlockSize(64)
gmm.fit(df)


val results = Seq(1, 4, 16, 64, 256, 1024, 4096).map { size =>
  val start = System.currentTimeMillis
  val model = gmm.setK(4).setMaxIter(20).setBlockSize(size).fit(df)
  val end = System.currentTimeMillis
  (size, model, end - start)
}

results.map(_._2.summary.numIter)
results.map(_._2.summary.logLikelihood)
results.map(_._3)

Results WITHOUT native BLAS:

scala> results.map(_._2.summary.numIter)
res3: Seq[Int] = List(20, 20, 20, 20, 20, 20, 20)

scala> results.map(_._2.summary.logLikelihood)
res4: Seq[Double] = List(-2.3353357834421366E7, -2.3353357834421184E7, -2.3353357834421184E7, -2.3353357834421184E7, -2.3353357834421184E7, -2.3353357834421184E7, -2.3353357834421184E7)

scala> results.map(_._3)
res5: Seq[Long] = List(105777, 113261, 110608, 106573, 108141, 109825, 113094)

It is surprising that there is a small performance regression on dense input: 105777 ms at blockSize=1 vs 106573 ms at blockSize=64.

[screenshot gmm_1: blockSize==1]

[screenshot gmm_1024: blockSize==1024]

Results WITH native BLAS (OPENBLAS_NUM_THREADS=1):

scala> results.map(_._2.summary.numIter)
res3: Seq[Int] = List(20, 20, 20, 20, 20, 20, 20)

scala> results.map(_._2.summary.logLikelihood)
res4: Seq[Double] = List(-2.3353357834421374E7, -2.3353357834422573E7, -2.3353357834422797E7, -2.335335783442225E7, -2.3353357834422205E7, -2.3353357834422156E7, -2.335335783442218E7)

scala> results.map(_._3)
res5: Seq[Long] = List(108005, 54975, 39802, 35807, 35027, 36369, 38717)

When OpenBLAS is used, this PR obtains about a 3x speedup.

[screenshot gmm_openBlas_1: blockSize==1 with OpenBLAS]

[screenshot gmm_openBlas_1024: blockSize==1024 with OpenBLAS]


Comparison to master (WITHOUT native BLAS):

scala> val start = System.currentTimeMillis; val model = gmm.setK(4).setMaxIter(20).fit(df); val end = System.currentTimeMillis; end - start
start: Long = 1587976220511                                                     
model: org.apache.spark.ml.clustering.GaussianMixtureModel = GaussianMixtureModel: uid=GaussianMixture_753da885644b, k=4, numFeatures=28
end: Long = 1587976324361
res4: Long = 103850

This PR keeps the original behavior and performance when blockSize == 1; a sketch of the dispatch follows.
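
For illustration, that dispatch could look like the stand-in below (trainOnRows and trainOnBlocks are the code-path names from the PR description; the types and bodies here are stubs, not Spark's):

object DispatchSketch {
  type Instances = Seq[Array[Double]]

  def trainOnRows(instances: Instances): String =
    s"per-row path over ${instances.length} rows (original behavior)"

  def trainOnBlocks(instances: Instances, blockSize: Int): String =
    s"blocked path: ${instances.grouped(blockSize).size} blocks of up to $blockSize rows"

  // blockSize == 1 keeps the unchanged per-row aggregation, so users who
  // never touch the new param see identical results and performance.
  def train(instances: Instances, blockSize: Int): String = {
    require(blockSize >= 1, "blockSize must be positive")
    if (blockSize == 1) trainOnRows(instances)
    else trainOnBlocks(instances, blockSize)
  }
}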

@SparkQA commented May 6, 2020

Test build #122358 has finished for PR 27473 at commit ba2a4f3.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@zhengruifeng (Contributor, Author) commented

@srowen I think GMM may be a special case: I found that on dense input it suffers a small regression. (Other impls like LoR/LiR are accelerated significantly even without native BLAS.)
But when I enable OpenBLAS, this PR is about 3x faster than the existing impl, and about 5x faster in the prediction stage (5 sec -> 1 sec).

@xwu99 This is for GMM; I think it is similar to KMeans. I would be happy if you could help review this.

Commit history (force-push titles, deduplicated): init (×10), use nativeBLAS for dense input, add py, refactor (×3), revert BLAS.ger (×3), simplify, opt_blas (×2), nit (×8)

@SparkQA commented May 7, 2020

Test build #122392 has finished for PR 27473 at commit 0eb0f07.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented May 7, 2020

Test build #122398 has finished for PR 27473 at commit 31f8907.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng (Contributor, Author) commented

retest this please

@SparkQA commented May 9, 2020

Test build #122463 has finished for PR 27473 at commit 31f8907.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented May 9, 2020

Test build #122466 has finished for PR 27473 at commit a3a005e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng (Contributor, Author) commented

Merged to master

@zhengruifeng zhengruifeng deleted the blockify_gmm branch May 12, 2020 04:54
@HyukjinKwon (Member) commented

Let's not merge without review or approval, @zhengruifeng.

@srowen (Member) commented May 13, 2020

Meta-comment - I have reviewed similar block-ification PRs and don't have a problem with them, and trust the judgment of @zhengruifeng as a result of the discussion. He now knows this code better than anyone. I know he's a committer and will "own" his work, including fixes if needed. I think this line of work is OK.

@HyukjinKwon (Member) commented

This PR might be fine, and I understand the ML side isn't reviewed very actively either. But it might be best to have somebody follow up on his work. If this becomes a pattern, I think it's a problem.

@mengxr (Contributor) commented Jun 11, 2020

@srowen Even if you reviewed some similar PRs from @zhengruifeng, could you explicitly give an LGTM on each of the PRs before merge? Committers who saw PRs like #28473 would raise the same question as @HyukjinKwon did.

@zhengruifeng Please wait for LGTM before you merge a PR in the future. Committers should follow the review process strictly.

@srowen (Member) commented Jun 11, 2020

@mengxr why not review them yourself?

@zhengruifeng (Contributor, Author) commented

@mengxr OK, I will be more patient with reviews.
Actually, I did not ping Owen on some of those PRs; I will involve more ML committers/contributors in future PRs and tickets.

@mengxr (Contributor) commented Jun 11, 2020

@zhengruifeng I trust that a committer merges a PR if and only if he/she is confident in the change. However, for a non-trivial change, the committer's own confidence does not count: you have to find another committer to review and approve before merging, no matter how long you might wait. If no one reviews it, it means the community is not interested in the change. Your options are then to keep the PR open or close it, not to merge it.

@srowen: @WeichenXu123 and I will take a look at the merged changes.

@@ -55,7 +55,7 @@ class MultivariateGaussian @Since("2.0.0") (
    */
   @transient private lazy val tuple = {
     val (rootSigmaInv, u) = calculateCovarianceConstants
-    val rootSigmaInvMat = Matrices.fromBreeze(rootSigmaInv)
+    val rootSigmaInvMat = Matrices.fromBreeze(rootSigmaInv).toDense
Review comment (Contributor):

See comment in #29782

@@ -81,6 +81,36 @@ class MultivariateGaussian @Since("2.0.0") (
     u - 0.5 * BLAS.dot(v, v)
   }

+  private[ml] def pdf(X: Matrix): DenseVector = {
Review comment (Contributor):

See comment in #29782
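
The added pdf(X: Matrix) evaluates one density per row of a block. A minimal Breeze sketch of the batched math, assuming the same constants as the single-vector path above (v = rootSigmaInv * (x - mu), logpdf = u - 0.5 * dot(v, v)); the PR's actual implementation works on ml.linalg types and differs in detail:

import breeze.linalg._
import breeze.numerics.exp

// Sketch only, not the PR's code: batched log-density for a block x of
// shape (blockSize x numFeatures), one input vector per row.
def logpdfBlock(x: DenseMatrix[Double], mu: DenseVector[Double],
                rootSigmaInv: DenseMatrix[Double], u: Double): DenseVector[Double] = {
  val delta = x(*, ::) - mu            // subtract the mean from every row
  val v = delta * rootSigmaInv.t       // one gemm replaces blockSize gemv calls
  val sqNorms = sum(v *:* v, Axis._1)  // squared norm of each row of v
  sqNorms.map(s => u - 0.5 * s)        // per-row analogue of u - 0.5 * BLAS.dot(v, v)
}

// The density itself is the elementwise exp of the log-density.
def pdfBlock(x: DenseMatrix[Double], mu: DenseVector[Double],
             rootSigmaInv: DenseMatrix[Double], u: Double): DenseVector[Double] =
  exp(logpdfBlock(x, mu, rootSigmaInv, u))

Replacing per-row gemv calls with a single gemm per block is what lets native BLAS pay off in the benchmarks above.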
