
[SPARK-31454][ML] An optimized K-Means based on DenseMatrix and GEMM #28229

Closed
wants to merge 10 commits

Conversation

@xwu99 (Contributor) commented Apr 16, 2020

What changes were proposed in this pull request?

Adding an optimized K-Means implementation based on DenseMatrix and GEMM to improve performance. JIRA: https://issues.apache.org/jira/browse/SPARK-31454.

Why are the changes needed?

The main computation in K-Means is calculating the distances between individual points and the center points. The current K-Means implementation is vector-based, which can't take advantage of optimized native BLAS libraries.

When the original points are represented as dense vectors, our approach is to change the original input data structure to a DenseMatrix-based one by grouping several points together. The original distance calculations can then be translated into a matrix multiplication, so optimized native GEMM routines (Intel MKL, OpenBLAS, etc.) can be used. This approach can also work with sparse vectors, despite larger memory consumption when translating sparse vectors to a dense matrix.
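The GEMM reformulation described above can be sketched in a few lines. This is an illustrative NumPy version (not the PR's Scala code): squared Euclidean distances between a block of points and all centers reduce to a single matrix multiplication via the expansion ||x - c||^2 = ||x||^2 - 2·(x·c) + ||c||^2.

```python
import numpy as np

def block_distances(X, C):
    # Squared Euclidean distances between a block of points X (m x d)
    # and centers C (k x d). The cross term X @ C.T is the one GEMM that
    # native BLAS (MKL, OpenBLAS) can accelerate.
    x_sq = np.sum(X * X, axis=1, keepdims=True)   # shape (m, 1): ||x||^2
    c_sq = np.sum(C * C, axis=1)                  # shape (k,):   ||c||^2
    return x_sq - 2.0 * (X @ C.T) + c_sq          # shape (m, k)

points = np.array([[0.0, 0.0], [3.0, 4.0]])
centers = np.array([[0.0, 0.0], [3.0, 0.0]])
print(block_distances(points, centers))  # [[0., 9.], [25., 16.]]
```

Grouping `rowsPerMatrix` points per block amortizes the per-call overhead and lets the BLAS kernel use SIMD and cache blocking, which is the effect the PR relies on.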

Does this PR introduce any user-facing change?

No. Config parameters control whether this implementation is turned on, without modifying public interfaces.

How was this patch tested?

Ran the original K-Means and this implementation with the same dataset and parameters and compared the results.

@AmplabJenkins

Can one of the admins verify this patch?

@xwu99 xwu99 changed the title [WIP][SPARK-31454] An optimized K-Means based on DenseMatrix and GEMM [ML][SPARK-31454] An optimized K-Means based on DenseMatrix and GEMM Apr 16, 2020
@xwu99 xwu99 changed the title [ML][SPARK-31454] An optimized K-Means based on DenseMatrix and GEMM [SPARK-31454][ML] An optimized K-Means based on DenseMatrix and GEMM Apr 16, 2020
@srowen (Member) commented Apr 20, 2020

@zhengruifeng is optimizing this (differently) in #27758

@zhengruifeng (Contributor) commented Apr 21, 2020

@srowen Thanks for pinging me

@xwu99 Could you please provide some performance results of your PR?

I had similar attempts to optimize KMeans based on high-level BLAS.
I also blockified vectors into blocks and used BLAS.gemm to find the best costs. But I found that:
1, it causes a performance regression when the input dataset is sparse (I notice that you add spark.ml.kmeans.matrixImplementation.rowsPerMatrix; I am not sure whether we should have two implementations);
2, when the input dataset is dense, I found no performance gain when distanceMeasure = EUCLIDEAN; while distanceMeasure = EUCLIDEAN, about 10%~20% speedup can be obtained;
3, native BLAS (OpenBLAS) did not help much if a single thread is used (which is suggested in Spark);

Then I switched to another optimization approach based on the triangle inequality; it works on both dense and sparse datasets, and gains about 10%~30% when numFeatures and/or k are large.
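For readers unfamiliar with the triangle-inequality approach mentioned here, a minimal sketch follows (Python, in the spirit of Elkan-style pruning; not the actual Spark implementation). The bound used: if dist(best, c) >= 2 · dist(x, best), then dist(x, c) >= dist(best, c) - dist(x, best) >= dist(x, best), so center c provably cannot beat the current best and its distance computation is skipped.

```python
import numpy as np

def assign_with_pruning(X, C):
    # Precompute pairwise center-to-center distances (k x k), reused for
    # every point; this is cheap when k << number of points.
    inter = np.linalg.norm(C[:, None, :] - C[None, :, :], axis=2)
    labels = np.empty(len(X), dtype=int)
    skipped = 0
    for i, x in enumerate(X):
        best, best_d = 0, np.linalg.norm(x - C[0])
        for j in range(1, len(C)):
            if inter[best, j] >= 2.0 * best_d:
                skipped += 1        # pruned: c_j provably no closer
                continue
            d = np.linalg.norm(x - C[j])
            if d < best_d:
                best, best_d = j, d
        labels[i] = best
    return labels, skipped

X = np.array([[0.0, 0.0], [10.0, 0.0]])
C = np.array([[0.0, 0.0], [10.0, 0.0], [11.0, 0.0]])
print(assign_with_pruning(X, C))  # labels [0, 1], 3 distance computations skipped
```

The pruning pays off exactly in the regime described: when k is large, many centers are far from a point's current best and most distance evaluations are avoided, independent of whether the vectors are sparse or dense.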

@srowen (Member) commented Apr 21, 2020

Yeah I think we tried this and it hurt perf on sparse input, no? I'd have to dig it out..

@xwu99 (Contributor, Author) commented Apr 21, 2020

@srowen Thank you for linking us!

@xwu99 Could you please provide some performance results of your PR?
Our preliminary benchmark shows this approach can boost training performance by 3.5x with Intel MKL. I can provide further benchmarks later.

I had similar attempts to optimize KMeans based on high-level BLAS.
I also blockified vectors into blocks and used BLAS.gemm to find the best costs. But I found that:
1, it causes a performance regression when the input dataset is sparse (I notice that you add spark.ml.kmeans.matrixImplementation.rowsPerMatrix; I am not sure whether we should have two implementations);

This config is there so as not to impact the original implementation. If the general idea is OK, we can switch to the best-performing implementation under different conditions; that's not unusual in other parts of the MLlib code.

2, when the input dataset is dense, I found no performance gain when distanceMeasure = EUCLIDEAN; while distanceMeasure = EUCLIDEAN, about 10%~20% speedup can be obtained;
3, native BLAS (OpenBLAS) did not help much if a single thread is used (which is suggested in Spark);

Did you benchmark native BLAS on a machine with AVX2 or AVX512? The native optimization takes advantage not only of multi-threading but also SIMD, cache, etc.

Then I switched to another optimization approach based on the triangle inequality; it works on both dense and sparse datasets, and gains about 10%~30% when numFeatures and/or k are large.

I do think it's a good idea! But it's still not a general speedup for all cases; the gain assumes some specific conditions. We still need the general K-Means.

@xwu99 (Contributor, Author) commented Apr 21, 2020

Yeah I think we tried this and it hurt perf on sparse input, no? I'd have to dig it out..

@srowen I will benchmark sparse cases, but could we use this for dense input only? It's not unusual in other parts of MLlib, such as in BLAS, to switch between sparse and dense cases.
And I also think it depends on the sparsity level: if the data is very sparse, translating it to dense not only hurts perf but also wastes memory; if it's represented as sparse but not actually very sparse, we may still gain?
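The sparsity trade-off can be made concrete with a rough back-of-the-envelope model (the byte sizes are a hypothetical illustration, not from the PR): a sparse vector stores an index plus a value per nonzero, while densifying stores one double per feature, so the memory blowup grows as density falls.

```python
def densify_blowup(num_features, nnz):
    # Assumed sizing: sparse = int32 index (4 B) + float64 value (8 B)
    # per nonzero; dense = one float64 (8 B) per feature.
    sparse_bytes = nnz * (4 + 8)
    dense_bytes = num_features * 8
    return dense_bytes / sparse_bytes

# At 1% density, densifying costs ~67x more memory ...
print(round(densify_blowup(10_000, 100), 1))   # 66.7
# ... but at 50% density the blowup is modest, so GEMM may still win.
print(round(densify_blowup(10_000, 5_000), 2)) # 0.13x? no: 1.33... wait
```

(The second call returns about 0.13? No: 10_000*8 / (5_000*12) ≈ 1.33, i.e. only a 1.33x blowup at 50% density, which supports the "not very sparse may still gain" intuition above.)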

@zhengruifeng (Contributor) commented Apr 21, 2020

Did you benchmark native BLAS on a machine with AVX2 or AVX512? The native optimization takes advantage not only of multi-threading but also SIMD, cache, etc.

I tested with OpenBLAS (OPENBLAS_NUM_THREADS=1) on an i7-8850 CPU, which supports AVX2, not AVX512;

I do think it's a good idea! But it's still not a general speedup for all cases; the gain assumes some specific conditions. We still need the general K-Means.

When k and numFeatures are small, there is not much optimization space for the triangle inequality. But I guess this also applies to high-level BLAS: suppose k=2 or k=64; I guess BLAS.gemm with k=2 may not gain as much speedup as with k=64?

It's not unusual in other parts of MLlib, such as in BLAS, to switch between sparse and dense cases?

There are some algorithms (in ml.stat) that can switch between sparse/dense, but no classification/regression/clustering impls support it now.

@zhengruifeng (Contributor):

I am not against this PR.
I don't think adding a Spark conf is a good idea, but maybe we could add a parameter for end users to switch between impls? Or check the first vector, like the existing impl?

@xwu99 (Contributor, Author) commented Apr 21, 2020

@zhengruifeng I am OK with an inline switch instead of SparkConf.

My general points:

  1. Matrix multiplication is a routine that is highly optimized by the industry. If we take advantage of it, we can leverage hardware vendors' low-level optimizations when available.
  2. Sparse matrices also have native optimizations, but these are not part of standard BLAS, and the current BLAS in Spark doesn't support the optimized versions.
  3. I do believe the parameters require tuning (such as k, rowsPerMatrix), and we can do more benchmarks and give guidelines to end users on how to set them for best performance.

@zhengruifeng (Contributor):

I am OK if we can avoid a performance regression on sparse datasets. It is up to the end users to choose the right impl.

@srowen What do you think about it?

@zhengruifeng (Contributor):

also ping @mengxr @WeichenXu123
What about adding an option and letting the end user choose whether to enable high-level BLAS?

@xwu99 (Contributor, Author) commented Apr 28, 2020

@zhengruifeng I saw your PR was merged; I will rebase. I am preparing some benchmarks. Let's focus on the dense case first. For sparse cases, we can use the original path.
@srowen @mengxr @WeichenXu123 do you have more feedback?

@zhengruifeng (Contributor):

I saw your PR was merged, I will rebase.

I had some reverted PRs on using high-level BLAS in LoR/LiR/SVC/GMM; they were reverted because of performance regressions on sparse datasets.
I am now working on it again, using param blockSize==1 to choose the impl.
I am also waiting for more feedback. If nobody objects, I will merge them.

There are some common utils in those PRs which should also be used in KMeans. So I think you can rebase this PR after SVC gets merged.

@xwu99 (Contributor, Author) commented Apr 29, 2020

I saw your PR was merged, I will rebase.

I had some reverted PRs on using high-level BLAS in LoR/LiR/SVC/GMM; they were reverted because of performance regressions on sparse datasets.
I am now working on it again, using param blockSize==1 to choose the impl.
I am also waiting for more feedback. If nobody objects, I will merge them.

There are some common utils in those PRs which should also be used in KMeans. So I think you can rebase this PR after SVC gets merged.

OK. Could you also let me know the PRs you are reworking, since we are also working on enabling high-level BLAS not only for K-Means but also for other algos in MLlib? I can help review them rather than duplicate efforts.

@zhengruifeng (Contributor):

@xwu99 My previous works include:
LinearSVC: #27360
LogisticRegression: #27374
LinearRegression: #27396
GaussianMixture: #27473
KMeans: https://github.com/apache/spark/compare/master...zhengruifeng:blockify_km?expand=1, not send

I'm reworking LinearSVC/LogisticRegression/LinearRegression/GaussianMixture. For KMeans, I am glad you can take it over.

I just recreated a new PR for LinearSVC; the main idea is to use the expert param blockSize to choose the path. The original path will be chosen by default to avoid performance regression on sparse datasets.

If nobody objects, I will merge it, and then the other three impls, LogisticRegression/LinearRegression/GaussianMixture (since they depend on the first one, I am not recreating their PRs right now).

@zhengruifeng (Contributor):

@xwu99 There was a ticket for this.
I have now merged high-level BLAS support for LinearSVC and LogisticRegression.

I can help review this PR on KMeans. You can list some performance details, like dataset, numFeatures, numInstances, performance without native BLAS (MKL), performance with native BLAS...

@xwu99 (Contributor, Author) commented May 7, 2020

@xwu99 There was a ticket for this.
I have now merged high-level BLAS support for LinearSVC and LogisticRegression.

I can help review this PR on KMeans. You can list some performance details, like dataset, numFeatures, numInstances, performance without native BLAS (MKL), performance with native BLAS...

I will do this. Thanks in advance!

@zhengruifeng (Contributor):

@xwu99 I think you can also refer to those two PRs, since some utils were added.

@xwu99 (Contributor, Author) commented May 7, 2020

@zhengruifeng btw, there is a closed PR for ALS which is my colleague's work from before he left. I would like to rework it. Could you also review it and add it to the task list?

@xwu99 xwu99 force-pushed the kmeans-matrix-impl branch from 5c5db78 to ed95b93 on May 10, 2020 06:37
@zhengruifeng (Contributor) left a comment:

Thanks for updating this! In general, I suggest implementing the new path (trainOnBlocks) on the .ml side.
BTW, are there any detailed performance test results?

centers_num: Int): DenseMatrix = {
val points_num = points_matrix.numRows
val ret = DenseMatrix.zeros(points_num, centers_num)
for ((row, index) <- points_matrix.rowIter.zipWithIndex) {
@zhengruifeng (Contributor):

nit: There is array copying in matrix.rowIter

@xwu99 (Contributor, Author):

I can access the raw value array of the matrix, but I didn't find a way in Java to use a subarray without copying and then use BLAS.axpy. Would adding the values individually be faster?

@xwu99 (Contributor, Author) commented May 12, 2020

hibench-kmeans.txt

@zhengruifeng attaching my former benchmark for the dense case. I will retest after fixing the code.

@xwu99 xwu99 force-pushed the kmeans-matrix-impl branch from 54c60e8 to 5c3fb7e on May 22, 2020 08:47
@zhengruifeng (Contributor):

@xwu99 Thanks for your work. The speedup is promising.

Since this issue (blockify+gemv/gemm) needs more discussion with other committers, I am retesting those algorithms (current results are attached in https://issues.apache.org/jira/browse/SPARK-31783).

I'm afraid I can review this PR only after an agreement with the other committers is reached.

@xwu99 (Contributor, Author) commented May 31, 2020

@xwu99 Thanks for your work. The speedup is promising.

Since this issue (blockify+gemv/gemm) needs more discussion with other committers, I am retesting those algorithms (current results are attached in https://issues.apache.org/jira/browse/SPARK-31783).

I'm afraid I can review this PR only after an agreement with the other committers is reached.

Thanks a lot! Please let me know if any more input is needed.

@github-actions (bot) commented Sep 9, 2020

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Sep 9, 2020
@xwu99 (Contributor, Author) commented Sep 9, 2020

@zhengruifeng, any progress on this?

@github-actions github-actions bot closed this Sep 10, 2020
@zhengruifeng (Contributor) commented Dec 18, 2020

@xwu99 Sorry for the late reply.

We have just made SVC/LiR/LoR/AFT use blocks instead of instances in 3.1.0, but in a new adaptive way of blockifying instances (#30009 and #30355).

In your above performance tests, KMeans was about 2.25x faster with MKL on a dense dataset.

But now I have some new concerns:

1, the performance of GEMM is highly related to the underlying native BLAS, while the performance gap between native BLAS and f2jBLAS for GEMV and DOT is relatively small:

1.1 in the performance tests in SPARK-31783,

binary-lor: enabling native BLAS does not bring much speedup
[benchmark image]

multi-lor: enabling native BLAS doubles the performance
[benchmark image]

1.2 GMM based on GEMM is 3x faster than the old impl, but only if native BLAS is used; with f2jBLAS, it is even slower than the old impl. Partially because of this, we then reverted it.

1.3 ALS based on GEMM is even slower than that based on DOT, without native BLAS;

1.4 in my previous attempts to use GEMM in KMeans (also mentioned above), I found that using GEMM hurt performance when the dataset is sparse.
When the dataset is dense, with f2j, GEMM is only about 10%~20% faster than the old impl (2.4.x) in my tests. I guess this is partially because GEMM disables the short-circuit used in the existing impl.

So I think using GEMM to accelerate KMeans will only help when 1) the input dataset is dense, and 2) native BLAS is enabled. But in my opinion, most likely a Spark cluster does not support native BLAS by default.

2, large k
Different from multi-lor, whose number of classes is likely a relatively small number, k in KMeans can be a large number. In my recent practical work, we set k>5000 to group vectors and then searched for the nearest neighbors within each group (recall this way is much better than with LSH).
In each block, GEMM needs a buffer of size k*blockSize, which may be dangerous.
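A quick back-of-the-envelope check of this k*blockSize buffer concern (the block size is an illustrative assumption, not a value from the PR):

```python
k = 5000            # clusters, as in the large-k scenario above
block_size = 4096   # hypothetical rows per block
bytes_per_double = 8

# The per-block GEMM output is a k x blockSize distance matrix of doubles.
buffer_mb = k * block_size * bytes_per_double / (1024 ** 2)
print(f"{buffer_mb:.2f} MB")  # 156.25 MB for a single in-flight block
```

With several tasks per executor each holding such a buffer, this quickly pressures executor memory, which is why large k makes the blocked GEMM path risky.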

In summary, I am now very conservative here, and am considering optimizing KMeans in other ways, like using GEMV (which works in ALS, but vectors in ALS are always dense; I am not sure about its performance on sparse datasets).

also ping @srowen, since I guess you may be interested in this field.
