
[SPARK-31454][ML] An optimized K-Means based on DenseMatrix and GEMM #28229

Closed
wants to merge 10 commits

Conversation

@xwu99 (Contributor) commented Apr 16, 2020

What changes were proposed in this pull request?

Adding an optimized K-Means implementation based on DenseMatrix and GEMM to improve performance. JIRA: https://issues.apache.org/jira/browse/SPARK-31454.

Why are the changes needed?

The main computation in K-Means is calculating the distances between individual points and the center points. The current K-Means implementation is vector-based, which can't take advantage of optimized native BLAS libraries.

When the original points are represented as dense vectors, our approach is to change the original input data structure to a DenseMatrix-based one by grouping several points together. The original distance calculations can then be translated into a matrix multiplication, so optimized native GEMM routines (Intel MKL, OpenBLAS, etc.) can be used. This approach can also work with sparse vectors, despite larger memory consumption when translating sparse vectors to a dense matrix.
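The GEMM reformulation described above can be sketched in a few lines. This is an illustrative NumPy version (not the PR's Scala code): squared Euclidean distances between a block of points and all centers reduce to a single matrix multiplication via the expansion ||x - c||^2 = ||x||^2 - 2·(x·c) + ||c||^2.

```python
import numpy as np

def block_distances(X, C):
    # Squared Euclidean distances between a block of points X (m x d)
    # and centers C (k x d). The cross term X @ C.T is the one GEMM that
    # native BLAS (MKL, OpenBLAS) can accelerate.
    x_sq = np.sum(X * X, axis=1, keepdims=True)   # shape (m, 1): ||x||^2
    c_sq = np.sum(C * C, axis=1)                  # shape (k,):   ||c||^2
    return x_sq - 2.0 * (X @ C.T) + c_sq          # shape (m, k)

points = np.array([[0.0, 0.0], [3.0, 4.0]])
centers = np.array([[0.0, 0.0], [3.0, 0.0]])
print(block_distances(points, centers))  # [[0., 9.], [25., 16.]]
```

Grouping `rowsPerMatrix` points per block amortizes the per-call overhead and lets the BLAS kernel use SIMD and cache blocking, which is the effect the PR relies on.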

Does this PR introduce any user-facing change?

No. Config parameters control whether this implementation is turned on, without modifying public interfaces.

How was this patch tested?

Ran the original K-Means and this implementation with the same dataset and parameters and compared the results.

@AmplabJenkins

Can one of the admins verify this patch?

@xwu99 xwu99 changed the title [WIP][SPARK-31454] An optimized K-Means based on DenseMatrix and GEMM [ML][SPARK-31454] An optimized K-Means based on DenseMatrix and GEMM Apr 16, 2020
@xwu99 xwu99 changed the title [ML][SPARK-31454] An optimized K-Means based on DenseMatrix and GEMM [SPARK-31454][ML] An optimized K-Means based on DenseMatrix and GEMM Apr 16, 2020
@srowen (Member) commented Apr 20, 2020

@zhengruifeng is optimizing this (differently) in #27758

@zhengruifeng (Contributor) commented Apr 21, 2020

@srowen Thanks for pinging me

@xwu99 Could you please provide some performance results of your PR?

I had similar attempts to optimize KMeans based on high-level BLAS.
I also blockified vectors into blocks and used BLAS.gemm to find the best costs. But I found that:
1, it causes a performance regression when the input dataset is sparse (I notice that you add spark.ml.kmeans.matrixImplementation.rowsPerMatrix; I am not sure whether we should have two implementations);
2, when the input dataset is dense, I found no performance gain when distanceMeasure = EUCLIDEAN; while distanceMeasure = EUCLIDEAN, about 10%~20% speedup can be obtained;
3, native BLAS (OpenBLAS) did not help much if a single thread is used (which is suggested in Spark);

Then I switched to another optimization approach based on the triangle inequality; it works on both dense and sparse datasets, and gains about 10%~30% when numFeatures and/or k are large.
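For readers unfamiliar with the triangle-inequality approach mentioned here, a minimal sketch follows (Python, in the spirit of Elkan-style pruning; not the actual Spark implementation). The bound used: if dist(best, c) >= 2 · dist(x, best), then dist(x, c) >= dist(best, c) - dist(x, best) >= dist(x, best), so center c provably cannot beat the current best and its distance computation is skipped.

```python
import numpy as np

def assign_with_pruning(X, C):
    # Precompute pairwise center-to-center distances (k x k), reused for
    # every point; this is cheap when k << number of points.
    inter = np.linalg.norm(C[:, None, :] - C[None, :, :], axis=2)
    labels = np.empty(len(X), dtype=int)
    skipped = 0
    for i, x in enumerate(X):
        best, best_d = 0, np.linalg.norm(x - C[0])
        for j in range(1, len(C)):
            if inter[best, j] >= 2.0 * best_d:
                skipped += 1        # pruned: c_j provably no closer
                continue
            d = np.linalg.norm(x - C[j])
            if d < best_d:
                best, best_d = j, d
        labels[i] = best
    return labels, skipped

X = np.array([[0.0, 0.0], [10.0, 0.0]])
C = np.array([[0.0, 0.0], [10.0, 0.0], [11.0, 0.0]])
print(assign_with_pruning(X, C))  # labels [0, 1], 3 distance computations skipped
```

The pruning pays off exactly in the regime described: when k is large, many centers are far from a point's current best and most distance evaluations are avoided, independent of whether the vectors are sparse or dense.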

@srowen (Member) commented Apr 21, 2020

Yeah I think we tried this and it hurt perf on sparse input, no? I'd have to dig it out..

@xwu99 (Contributor, Author) commented Apr 21, 2020

@srowen Thank you for linking us!

@xwu99 Could you please provide some performance results of your PR?
Our preliminary benchmark shows this approach can boost training performance by 3.5x with Intel MKL. I can provide further benchmarks later.

I had similar attempts to optimize KMeans based on high-level BLAS.
I also blockified vectors into blocks and used BLAS.gemm to find the best costs. But I found that:
1, it causes a performance regression when the input dataset is sparse (I notice that you add spark.ml.kmeans.matrixImplementation.rowsPerMatrix; I am not sure whether we should have two implementations);

This config is there so as not to impact the original implementation. If the general idea is OK, we can switch to the best-performing implementation under different conditions; that's not unusual in other parts of the MLlib code.

2, when the input dataset is dense, I found no performance gain when distanceMeasure = EUCLIDEAN; while distanceMeasure = EUCLIDEAN, about 10%~20% speedup can be obtained;
3, native BLAS (OpenBLAS) did not help much if a single thread is used (which is suggested in Spark);

Did you benchmark native BLAS on a machine with AVX2 or AVX512? The native optimization takes advantage not only of multi-threading but also SIMD, cache, etc.

Then I switched to another optimization approach based on the triangle inequality; it works on both dense and sparse datasets, and gains about 10%~30% when numFeatures and/or k are large.

I do think it's a good idea! But it's still not a general speedup for all cases; the gain assumes some specific conditions. We still need the general K-Means.

@xwu99 (Contributor, Author) commented Apr 21, 2020

Yeah I think we tried this and it hurt perf on sparse input, no? I'd have to dig it out..

@srowen I will benchmark sparse cases, but could we use this for dense input only? It's not unusual in other parts of MLlib, such as in BLAS, to switch between sparse and dense cases.
And I also think it depends on the sparsity level: if the data is very sparse, translating it to dense not only hurts perf but also wastes memory; if it's represented as sparse but not actually very sparse, we may still gain?
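The sparsity trade-off can be made concrete with a rough back-of-the-envelope model (the byte sizes are a hypothetical illustration, not from the PR): a sparse vector stores an index plus a value per nonzero, while densifying stores one double per feature, so the memory blowup grows as density falls.

```python
def densify_blowup(num_features, nnz):
    # Assumed sizing: sparse = int32 index (4 B) + float64 value (8 B)
    # per nonzero; dense = one float64 (8 B) per feature.
    sparse_bytes = nnz * (4 + 8)
    dense_bytes = num_features * 8
    return dense_bytes / sparse_bytes

# At 1% density, densifying costs ~67x more memory ...
print(round(densify_blowup(10_000, 100), 1))   # 66.7
# ... but at 50% density the blowup is modest, so GEMM may still win.
print(round(densify_blowup(10_000, 5_000), 2)) # 0.13x? no: 1.33... wait
```

(The second call returns about 0.13? No: 10_000*8 / (5_000*12) ≈ 1.33, i.e. only a 1.33x blowup at 50% density, which supports the "not very sparse may still gain" intuition above.)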

@zhengruifeng (Contributor) commented Apr 21, 2020

Did you benchmark native BLAS on a machine with AVX2 or AVX512? The native optimization takes advantage not only of multi-threading but also SIMD, cache, etc.

I tested with OpenBLAS (OPENBLAS_NUM_THREADS=1) on an i7-8850 CPU, which supports AVX2, not AVX512;

I do think it's a good idea! But it's still not a general speedup for all cases; the gain assumes some specific conditions. We still need the general K-Means.

When k and numFeatures are small, there is not much optimization space for the triangle inequality. But I guess this also applies to high-level BLAS: suppose k=2 or k=64; I guess BLAS.gemm with k=2 may not gain as much speedup as with k=64?

It's not unusual in other parts of MLlib, such as in BLAS, to switch between sparse and dense cases?

There are some algorithms (in ml.stat) that can switch between sparse/dense, but no classification/regression/clustering impls support it now.

@zhengruifeng (Contributor):

I am not against this PR.
I don't think adding a Spark conf is a good idea, but maybe we could add a parameter for end users to switch between impls? Or check the first vector, like the existing impl?

@xwu99 (Contributor, Author) commented Apr 21, 2020

@zhengruifeng I am OK with an inline switch instead of SparkConf.

My general points:

  1. Matrix multiplication is a routine that is highly optimized by the industry. If we take advantage of it, we can leverage hardware vendors' low-level optimizations when available.
  2. Sparse matrices also have native optimizations, but these are not part of standard BLAS, and the current BLAS in Spark doesn't support the optimized versions.
  3. I do believe the parameters require tuning (such as k, rowsPerMatrix), and we can do more benchmarks and give guidelines to end users on how to set them for best performance.

@zhengruifeng (Contributor):

I am OK if we can avoid a performance regression on sparse datasets. It is up to the end users to choose the right impl.

@srowen What do you think about it?

@zhengruifeng (Contributor):

also ping @mengxr @WeichenXu123
What about adding an option and letting the end user choose whether to enable high-level BLAS?

@xwu99 (Contributor, Author) commented Apr 28, 2020

@zhengruifeng I saw your PR was merged; I will rebase. I am preparing some benchmarks. Let's focus on the dense case first. For sparse cases, we can use the original path.
@srowen @mengxr @WeichenXu123 do you have more feedback?

@zhengruifeng (Contributor):

I saw your PR was merged, I will rebase.

I had some reverted PRs on using high-level BLAS in LoR/LiR/SVC/GMM; they were reverted because of performance regressions on sparse datasets.
I am now working on it again, using param blockSize==1 to choose the impl.
I am also waiting for more feedback. If nobody objects, I will merge them.

There are some common utils in those PRs which should also be used in KMeans. So I think you can rebase this PR after SVC gets merged.

@xwu99 (Contributor, Author) commented Apr 29, 2020

I saw your PR was merged, I will rebase.

I had some reverted PRs on using high-level BLAS in LoR/LiR/SVC/GMM; they were reverted because of performance regressions on sparse datasets.
I am now working on it again, using param blockSize==1 to choose the impl.
I am also waiting for more feedback. If nobody objects, I will merge them.

There are some common utils in those PRs which should also be used in KMeans. So I think you can rebase this PR after SVC gets merged.

OK. Could you also let me know the PRs you are reworking, since we are also working on enabling high-level BLAS not only for K-Means but also for other algos in MLlib? I can help review them rather than duplicate efforts.

@zhengruifeng (Contributor):

@xwu99 My previous works include:
LinearSVC: #27360
LogisticRegression: #27374
LinearRegression: #27396
GaussianMixture: #27473
KMeans: https://github.com/apache/spark/compare/master...zhengruifeng:blockify_km?expand=1, not send

I'm reworking LinearSVC/LogisticRegression/LinearRegression/GaussianMixture. For KMeans, I am glad you can take it over.

I just recreated a new PR for LinearSVC; the main idea is to use the expert param blockSize to choose the path. The original path will be chosen by default to avoid performance regression on sparse datasets.

If nobody objects, I will merge it, and then the other three impls, LogisticRegression/LinearRegression/GaussianMixture (since they depend on the first one, I am not recreating their PRs right now).

@zhengruifeng (Contributor):

@xwu99 There was a ticket for this.
I have now merged high-level BLAS support for LinearSVC and LogisticRegression.

I can help review this PR on KMeans. You can list some performance details, like dataset, numFeatures, numInstances, performance without native BLAS (MKL), performance with native BLAS...

@xwu99 (Contributor, Author) commented May 7, 2020

@xwu99 There was a ticket for this.
I have now merged high-level BLAS support for LinearSVC and LogisticRegression.

I can help review this PR on KMeans. You can list some performance details, like dataset, numFeatures, numInstances, performance without native BLAS (MKL), performance with native BLAS...

I will do this. Thanks in advance!

@zhengruifeng (Contributor):

@xwu99 I think you can also refer to those two PRs, since some utils were added.

@xwu99 (Contributor, Author) commented May 7, 2020

@zhengruifeng btw, there is a closed PR for ALS which is my colleague's work from before he left. I would like to rework it. Could you also review it and add it to the task list?

@xwu99 xwu99 force-pushed the kmeans-matrix-impl branch from 5c5db78 to ed95b93 on May 10, 2020 06:37
@zhengruifeng (Contributor) left a comment:

Thanks for updating this! In general, I suggest implementing the new path (trainOnBlocks) on the .ml side.
BTW, are there any detailed performance test results?

centers_num: Int): DenseMatrix = {
val points_num = points_matrix.numRows
val ret = DenseMatrix.zeros(points_num, centers_num)
for ((row, index) <- points_matrix.rowIter.zipWithIndex) {
@zhengruifeng (Contributor):

nit: There is array copying in matrix.rowIter

@xwu99 (Contributor, Author):

I can access the raw value array of the matrix, but I didn't find a way in Java to use a subarray without copying and then use BLAS.axpy. Would adding the values individually be faster?

@xwu99 (Contributor, Author) commented May 12, 2020

hibench-kmeans.txt

@zhengruifeng attaching my former benchmark for the dense case. I will retest after fixing the code.

@xwu99 xwu99 force-pushed the kmeans-matrix-impl branch from 54c60e8 to 5c3fb7e on May 22, 2020 08:47
@zhengruifeng (Contributor):

@xwu99 Thanks for your work. The speedup is promising.

Since this issue (blockify+gemv/gemm) needs more discussion with other committers, I am retesting those algorithms (current results are attached in https://issues.apache.org/jira/browse/SPARK-31783).

I'm afraid I can review this PR only after an agreement with the other committers is reached.

@xwu99 (Contributor, Author) commented May 31, 2020

@xwu99 Thanks for your work. The speedup is promising.

Since this issue (blockify+gemv/gemm) needs more discussion with other committers, I am retesting those algorithms (current results are attached in https://issues.apache.org/jira/browse/SPARK-31783).

I'm afraid I can review this PR only after an agreement with the other committers is reached.

Thanks a lot! Please let me know if any more input is needed.

@github-actions (bot) commented Sep 9, 2020

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Sep 9, 2020
@xwu99 (Contributor, Author) commented Sep 9, 2020

@zhengruifeng, any progress on this?

@github-actions github-actions bot closed this Sep 10, 2020
@zhengruifeng (Contributor) commented Dec 18, 2020

@xwu99 Sorry for the late reply.

We have just made SVC/LiR/LoR/AFT use blocks instead of instances in 3.1.0, but in a new adaptive way of blockifying instances (#30009 and #30355).

In your above performance tests, KMeans was about 2.25x faster with MKL on a dense dataset.

But now I have some new concerns:

1, the performance of GEMM is highly related to the underlying native BLAS, while the performance gap between native BLAS and f2jBLAS for GEMV and DOT is relatively small:

1.1 in the performance tests in SPARK-31783,

binary-lor: enabling native BLAS does not bring much speedup
[benchmark image]

multi-lor: enabling native BLAS doubles the performance
[benchmark image]

1.2 GMM based on GEMM is 3x faster than the old impl, but only if native BLAS is used; with f2jBLAS, it is even slower than the old impl. Partially because of this, we then reverted it.

1.3 ALS based on GEMM is even slower than that based on DOT, without native BLAS;

1.4 in my previous attempts to use GEMM in KMeans (also mentioned above), I found that using GEMM hurt performance when the dataset is sparse.
When the dataset is dense, with f2j, GEMM is only about 10%~20% faster than the old impl (2.4.x) in my tests. I guess this is partially because GEMM disables the short-circuit used in the existing impl.

So I think using GEMM to accelerate KMeans will only help when 1) the input dataset is dense, and 2) native BLAS is enabled. But in my opinion, most likely a Spark cluster does not support native BLAS by default.

2, large k
Different from multi-lor, whose number of classes is likely a relatively small number, k in KMeans can be a large number. In my recent practical work, we set k>5000 to group vectors and then searched for the nearest neighbors within each group (recall this way is much better than with LSH).
In each block, GEMM needs a buffer of size k*blockSize, which may be dangerous.
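A quick back-of-the-envelope check of this k*blockSize buffer concern (the block size is an illustrative assumption, not a value from the PR):

```python
k = 5000            # clusters, as in the large-k scenario above
block_size = 4096   # hypothetical rows per block
bytes_per_double = 8

# The per-block GEMM output is a k x blockSize distance matrix of doubles.
buffer_mb = k * block_size * bytes_per_double / (1024 ** 2)
print(f"{buffer_mb:.2f} MB")  # 156.25 MB for a single in-flight block
```

With several tasks per executor each holding such a buffer, this quickly pressures executor memory, which is why large k makes the blocked GEMM path risky.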

In summary, I am now very conservative here, and am considering optimizing KMeans in other ways, like using GEMV (which works in ALS, but vectors in ALS are always dense; I am not sure about its performance on sparse datasets).

also ping @srowen, since I guess you may be interested in this field.
