[SPARK-31454][ML] An optimized K-Means based on DenseMatrix and GEMM #28229
Conversation
Can one of the admins verify this patch?
@zhengruifeng is optimizing this (differently) in #27758
@srowen Thanks for pinging me. @xwu99 Could you please provide some performance results of your PR? I had similar attempts to optimize KMeans based on high-level BLAS. Then I switched to another optimization approach based on the triangle inequality; it works on both dense and sparse datasets, and will gain about 10%~30% when
Yeah I think we tried this and it hurt perf on sparse input, no? I'd have to dig it out..
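For context, the triangle-inequality pruning mentioned here (the idea behind Elkan-style K-Means) avoids computing many point-to-center distances: if a point's distance to its current best center is at most half the distance between that center and another center, the other center cannot be closer. The sketch below is my own minimal illustration of that bound (the function name and values are mine, not code from #27758):

```python
import math

def can_skip(d_x_best, d_best_other):
    # By the triangle inequality, d(x, other) >= d(best, other) - d(x, best).
    # If d(x, best) <= d(best, other) / 2, then d(x, other) >= d(x, best),
    # so 'other' cannot be strictly closer and its distance need not be computed.
    return d_x_best <= d_best_other / 2.0

x, best, other = (0.0, 0.0), (1.0, 0.0), (10.0, 0.0)
d_x_best = math.dist(x, best)
d_best_other = math.dist(best, other)
assert can_skip(d_x_best, d_best_other)   # 'other' is provably farther
assert math.dist(x, other) >= d_x_best    # the bound indeed holds
```

This pruning works regardless of whether the vectors are dense or sparse, which is why it avoids the sparse-input regression discussed below.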
@srowen Thank you for linking us!
This config exists so as not to impact the original implementation. If the general idea is OK, we can switch to the best-performing implementation under different conditions; that's not unusual in other parts of the MLlib code.
Did you benchmark with native BLAS on a machine with AVX2 or AVX512? The native optimization takes advantage not only of multi-threading but also of SIMD, cache, etc.
I do think it's a good idea! But it's still not a general speedup for all cases; the gain assumes some specific conditions. We still need to keep the general K-Means.
@srowen I will benchmark sparse cases, but could we use this for the dense case only? It's not unusual in other parts of MLlib, such as in BLAS, to switch between sparse and dense cases.
I tested with OpenBLAS (OPENBLAS_NUM_THREADS=1) on an i7-8850 CPU, which supports AVX2 but not AVX512;
When
There are some algorithms (in
I am not against this PR.
@zhengruifeng I am OK with an inline switch instead of SparkConf. My general points:
I am OK if we can avoid a performance regression on sparse datasets. It is up to the end users to choose the right impl. @srowen What do you think about it?
also ping @mengxr @WeichenXu123
@zhengruifeng I saw your PR was merged; I will rebase. I am preparing some benchmarks. Let's focus on the dense case first. For sparse cases, we can use the original path.
I had some reverted PRs on using high-level BLAS in LoR/LiR/SVC/GMM; they were reverted because of performance regressions on sparse datasets. There are some common utils in those PRs which should also be used in KMeans, so I think you can rebase this PR after SVC gets merged.
OK. Could you also let me know the PRs you are reworking, since we are also working on enabling high-level BLAS not only for K-Means but also for other algos in MLlib? I can help review them rather than duplicate efforts.
@xwu99 My previous works include:
I'm reworking on
I just recreated a new PR for LinearSVC; the main idea is to use an expert param
If nobody objects, I will merge it, and then the other three impls (since they depend on the first one, I am not recreating PRs right now)
@xwu99 There was a ticket for this. I can help review this PR on KMeans; you can list some performance details, like dataset, numFeatures, numInstances, performance without native BLAS (MKL), performance with native BLAS...
I will do this. Thanks in advance!
@xwu99 I think you can also refer to those two PRs, since some utils were added.
@zhengruifeng BTW, there is a closed PR for ALS, which is my colleague's work before he left. I would like to rework it. Could you also review it and add it to the task list?
Thanks for updating this! In general, I suggest implementing the new path (trainOnBlocks) in the .ml side.
BTW, are there any detailed performance test results?
mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala
mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeansBlocksImpl.scala
centers_num: Int): DenseMatrix = {
  val points_num = points_matrix.numRows
  val ret = DenseMatrix.zeros(points_num, centers_num)
  for ((row, index) <- points_matrix.rowIter.zipWithIndex) {
nit: There is array copying in matrix.rowIter
I can access the raw value array of the matrix, but I didn't find a way in Java to use a subarray without copying and still use BLAS.axpy. Would adding values individually be faster?
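The copy-avoidance concern above can be illustrated outside the JVM: in NumPy (used here only as an illustration, not as a suggestion for the Scala code), a row of a matrix is a zero-copy view into the flat value buffer, so an axpy-style update touches the underlying storage directly with no subarray copy. On the JVM side, one would instead need offset/stride-taking BLAS entry points to get the same effect over a slice of the raw array.

```python
import numpy as np

# A (rows, cols) matrix backed by one flat buffer, as in DenseMatrix's
# internal value array.
flat = np.arange(12, dtype=np.float64)
mat = flat.reshape(3, 4)     # a view: shares the buffer with `flat`
assert mat.base is flat      # no copy was made by reshape

row = mat[1]                 # zero-copy view of row 1 (elements 4..7)
row += 2.0 * np.ones(4)      # axpy-style update: y += alpha * x, in place
assert flat[4] == 6.0        # the flat buffer saw the in-place update
```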
@zhengruifeng Attaching my former benchmark for the dense case. I will retest after fixing the code.
@xwu99 Thanks for your work. The speedup is promising. Since this issue (blockify + gemv/gemm) needs more discussion with other committers, I am retesting those algorithms (current results are attached in https://issues.apache.org/jira/browse/SPARK-31783). I'm afraid I can review this PR only after an agreement with the other committers is reached.
Thanks a lot! Please let me know if any more input is needed.
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
@zhengruifeng, any progress on this?
@xwu99 Sorry for the late reply. We have just made SVC/LiR/LoR/AFT use blocks instead of instances in 3.1.0, but in a new adaptive way of blockifying instances (#30009 and #30355). In your above performance tests, KMeans was about 2.25x faster with
But now I have some new concerns:
1. The performance of GEMM is highly related to the underlying native BLAS, while the performance gap between native BLAS and
1.1 In the performance tests in SPARK-31783, for binary LoR, enabling native BLAS does not bring much speedup; for multinomial LoR, enabling native BLAS doubles the performance.
1.2 GMM based on GEMM is 3X faster than the old impl only if native BLAS is used; with
1.3 ALS based on GEMM is even slower than ALS based on DOT, without native BLAS;
1.4 In my previous attempts to use GEMM in KMeans (also mentioned above), I found that using GEMM hurt performance when the dataset is sparse.
So I think using GEMM to accelerate KMeans will only help when 1) the input dataset is dense, and 2) native BLAS is enabled. But in my opinion, most likely, a Spark cluster does not support native BLAS by default.
2. Large
In summary, I am now very conservative here, and am considering optimizing KMeans in other ways, like using GEMV (which works in ALS, but vectors in ALS are always dense; I am not sure about its performance on sparse datasets).
Also ping @srowen, since I guess you may be interested in this field.
What changes were proposed in this pull request?
Adding an optimized K-Means implementation based on DenseMatrix and GEMM to improve performance. JIRA: https://issues.apache.org/jira/browse/SPARK-31454.
Why are the changes needed?
The main computations in K-Means are the distance calculations between individual points and the center points. The current K-Means implementation is vector-based, which can't take advantage of optimized native BLAS libraries.
When the original points are represented as dense vectors, our approach is to change the original input data structure to a DenseMatrix-based one by grouping several points together. The original distance calculations can then be translated into a matrix multiplication, so optimized native GEMM routines (Intel MKL, OpenBLAS, etc.) can be used. This approach can also work with sparse vectors, despite the larger memory consumption of translating sparse vectors into a dense matrix.
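The translation from per-point distance calculations to one GEMM rests on the expansion ||x - c||^2 = ||x||^2 + ||c||^2 - 2 x.c: the cross terms for a whole block of points against all centers are exactly one matrix multiplication. A minimal NumPy sketch of the idea (illustrative only; the PR itself implements this in Scala over DenseMatrix blocks):

```python
import numpy as np

def pairwise_sq_dists(points, centers):
    """Squared Euclidean distances between every point and every center,
    computed with a single GEMM: ||x - c||^2 = ||x||^2 + ||c||^2 - 2 x.c"""
    p_norms = np.sum(points ** 2, axis=1)[:, None]   # shape (n, 1)
    c_norms = np.sum(centers ** 2, axis=1)[None, :]  # shape (1, k)
    cross = points @ centers.T                       # the GEMM, shape (n, k)
    return p_norms + c_norms - 2.0 * cross

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))   # a block of 5 points in 3 dimensions
C = rng.standard_normal((2, 3))   # 2 centers
d = pairwise_sq_dists(X, C)
# Matches the naive per-pair computation (up to floating-point rounding).
naive = np.array([[np.sum((x - c) ** 2) for c in C] for x in X])
assert np.allclose(d, naive)
```

Because the dominant cost is the `points @ centers.T` product, a multi-threaded, SIMD-optimized GEMM (MKL, OpenBLAS) accelerates the whole distance step, which is the effect this PR targets.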
Does this PR introduce any user-facing change?
No. Config parameters control whether this implementation is turned on, without modifying public interfaces.
How was this patch tested?
Run the original K-Means and this implementation on the same dataset with the same parameters and compare the results.