
[SPARK-21389][ML][MLLIB] Optimize ALS recommendForAll by gemm #18624

Closed
mpjlu wants to merge 9 commits from the OptimizeAlsByGEMM branch

Conversation


@mpjlu commented Jul 13, 2017

What changes were proposed in this pull request?

In Spark 2.2, ALS recommendForAll was optimized to use a hand-written matrix multiplication and then pick the topK items for each block pair. That method effectively reduces the GC problem. However, with a native BLAS GEMM such as Intel MKL or OpenBLAS, matrix multiplication is roughly 10x faster than the hand-written method.

I have rewritten recommendForAll with GEMM and measured about a 50% improvement over the master recommendForAll method.

The key points of this optimization (a rough sketch follows below):
1. Use GEMM to replace the hand-written matrix multiplication.
2. Use a matrix to hold the temporary results, which greatly reduces GC and compute time. The master method creates many small objects, which is why calling GEMM directly, on its own, does not give good performance.
3. Use sort-and-merge to get the topK items, so the priority queue does not need to be traversed twice.
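A minimal sketch of the per-block computation behind points 1 and 2 (simplified and illustrative only: the method name, the plain sortBy top-k selection, and the argument layout are my own, not the exact code in this PR):

```scala
import org.apache.spark.mllib.linalg.DenseMatrix

// Score one block of users against one block of items with a single GEMM call,
// then take the top `num` items per user. srcFactors and dstFactors are
// rank x m and rank x n factor blocks stored column-major.
def recommendBlockPair(
    srcIds: Array[Int],
    srcFactors: DenseMatrix,
    dstIds: Array[Int],
    dstFactors: DenseMatrix,
    num: Int): Seq[(Int, Array[(Int, Double)])] = {
  // One native BLAS GEMM instead of m * n hand-written dot products.
  val scores = srcFactors.transpose.multiply(dstFactors) // m x n DenseMatrix
  srcIds.indices.map { i =>
    val row = Array.tabulate(dstIds.length)(j => (dstIds(j), scores(i, j)))
    // Simplified top-k; the PR keeps results in flat arrays and merges
    // sorted partial results across item blocks instead (point 3).
    (srcIds(i), row.sortBy(-_._2).take(num))
  }
}
```

Each such per-block result is then keyed by the user-block index and merged across item blocks.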

Test Result:
479818 users, 13727 products, rank = 10, topK = 20.
3 workers, each with 35 cores. Native BLAS is Intel MKL.

| Block size | 1000 | 2000 | 4000 | 8000 |
| --- | --- | --- | --- | --- |
| Master method | 40s | 39.4s | 39.5s | 39.1s |
| This method | 26.5s | 25.9s | 26s | 27.1s |

Performance Improvement: (OldTime - NewTime)/NewTime = about 50%

How was this patch tested?


val m = srcIds.length
val n = dstIds.length
val dstIdMatrix = new Array[Int](m * num)
val scoreMatrix = Array.fill[Double](m * num)(Double.MinValue)
Member

By the way, MinValue is not the most negative value, but the smallest positive value. Is that what you want here?
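For reference, a standalone check of what these constants evaluate to in Scala versus Java (not part of the patch):

```scala
// Scala's Double.MinValue is the most negative finite double (-Double.MaxValue),
// whereas Java's Double.MIN_VALUE is the smallest positive double.
println(Double.MinValue)            // -1.7976931348623157E308
println(java.lang.Double.MIN_VALUE) // 4.9E-324
```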

var size = pq.size
while(size > 0) {
size -= 1
val factor = pq.poll
Member

poll() because it has side effects

i += 1
// pq.size maybe less than num, corner case
j += num
pq.clear
Member

clear()

k += 1
}
var size = pq.size
while(size > 0) {
Member

You'll need to fix up a few style things like a space after while

Member

Why not add a nonEmpty / isEmpty method for this?

Author

Do you mean add an isEmpty method for PriorityQueue? Thanks.

(index -> (srcIds, dstIdMatrix, new DenseMatrix(m, num, scoreMatrix)))
}
ratings.aggregateByKey(null: Array[Int], null: Array[Int], null: DenseMatrix)(
(rateSum, rate) => {
Member

Braces aren't needed in these args, just put them on one line

(rateSum1, rateSum2) => {
mergeFunc(rateSum1, rateSum2, num)
}
).flatMap(value => {
Member

.flatMap { value => to avoid redundant parens

Member

Also, use case (...) instead of value to name its elements. The ._2, ._3 below are hard to understand

var rate_index = 0
while (j < num) {
if (rate._3(i, rate_index) > rateSum._3(i, sum_index)) {
tempIdMatrix(i * num + j) = rate._2(i * num + rate_index)
Member

Might be worth storing i * num in a local to avoid recomputing it

@SparkQA

SparkQA commented Jul 13, 2017

Test build #79584 has finished for PR 18624 at commit e8a40ed.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 13, 2017

Test build #79587 has finished for PR 18624 at commit 963dae2.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mpjlu changed the title from "[SPARK-21389][ML][MLLIB] Optimize ALS recommendForAll by gemm with about 50% performance improvement" to "[SPARK-21389][ML][MLLIB] Optimize ALS recommendForAll by gemm" on Jul 13, 2017
@SparkQA

SparkQA commented Jul 14, 2017

Test build #79597 has finished for PR 18624 at commit e5881f0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}
(index, (srcIds, dstIdMatrix, new DenseMatrix(m, num, scoreMatrix)))
}
ratings.aggregateByKey(null: Array[Int], null: Array[Int], null: DenseMatrix)(
Contributor

This is aggregating by key which in this case appears to be the "block index". What is the benefit then? Since each block will have a unique index, there would be no intermediate aggregation.

@mpjlu
Author

mpjlu commented Jul 14, 2017

A user block, after the Cartesian product, will generate many partial result blocks (one per item block), and all of these blocks need to be aggregated. Thanks.
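To make the aggregation concrete: with the 13727 products from the benchmark above and a block size of 2000, each user block is paired with ceil(13727 / 2000) = 7 item blocks after the Cartesian product, so seven partial top-k results arrive under the same user-block key. The combine step then boils down to merging already-sorted top-k arrays; a minimal sketch of such a merge (an illustrative helper, not the PR's exact code):

```scala
// Merge two per-user top-k lists, each already sorted by descending score,
// into one sorted list of at most `num` entries (one merge-sort pass, no re-sort).
def mergeTopK(
    a: Array[(Int, Double)],
    b: Array[(Int, Double)],
    num: Int): Array[(Int, Double)] = {
  val out = new Array[(Int, Double)](math.min(num, a.length + b.length))
  var i = 0
  var j = 0
  var k = 0
  while (k < out.length) {
    // Take the larger head of the two sorted inputs.
    if (j >= b.length || (i < a.length && a(i)._2 >= b(j)._2)) {
      out(k) = a(i)
      i += 1
    } else {
      out(k) = b(j)
      j += 1
    }
    k += 1
  }
  out
}
```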

var size = pq.size
while (size > 0) {
size -= 1
val factor = pq.poll()
Contributor

Is it really necessary to add poll? For size of k (which is usually very small), the approach of pq.foreach should suffice and is simpler

Contributor

The queue is length num - which is typically 10, 20, or perhaps in extreme cases in the low 100's. So is there really any performance benefit here? Even if so it would be marginal, and I believe it's cleaner to just use foreach and sorted, and not worth adding the poll method.

Author

When num = 20, using sorted here gives a prediction time of about 31s, while using poll gives about 26s. I think this difference is large. I have tested many times, and the results are consistent.

Author

Hi @MLnick, thanks for your review.
My original test for sorted used pq.toArray.sorted with an explicit Ordering on (Int, Double), because pq.toArray.sorted(-_._2) gave a build error. Probably because of boxing/unboxing, the performance was very bad.
Now I use pq.toArray.sortBy(-_._2), and the performance is better than poll: about 25s vs. 26s for poll.
Thanks.
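A small, self-contained illustration of the two options being compared, using java.util.PriorityQueue rather than the queue in the PR (identifiers and sample data here are illustrative only):

```scala
import java.util.{PriorityQueue => JPriorityQueue}

// Min-heap on score, as used for bounded top-k: the smallest kept score sits at the head.
val ord = Ordering.by[(Int, Double), Double](_._2)
val pq = new JPriorityQueue[(Int, Double)](16, ord)
Seq((1, 0.3), (2, 0.9), (3, 0.5)).foreach(pq.add)

// Option 1: drain with poll(); elements come out in ascending score order,
// so writing from the back of the output array yields descending order with no sort.
val byPoll = new Array[(Int, Double)](pq.size)
var i = byPoll.length - 1
while (!pq.isEmpty) { byPoll(i) = pq.poll(); i -= 1 }

// Option 2: copy out and sort; simpler, but allocates and sorts an extra array.
Seq((1, 0.3), (2, 0.9), (3, 0.5)).foreach(pq.add) // refill, since option 1 drained the queue
val bySort = pq.toArray(new Array[(Int, Double)](0)).sortBy(-_._2)
```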

@mpjlu
Author

mpjlu commented Jul 14, 2017

We need the values to be in order here.

@mpjlu
Author

mpjlu commented Jul 14, 2017

Without poll, we would have to use toArray.sorted, whose performance is bad.

@mpjlu
Author

mpjlu commented Jul 17, 2017

I have checked the results against the master method, and the recommendation results are correct.
The test suite on master is too simple and should be updated. I will update it.
Thanks.

@SparkQA

SparkQA commented Jul 17, 2017

Test build #79655 has finished for PR 18624 at commit 4a9b350.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 17, 2017

Test build #79664 has finished for PR 18624 at commit a5bf101.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MLnick
Contributor

MLnick commented Sep 4, 2017

@mpjlu sorry for the delay on this. I think the idea of the change is good, but I still need to review it in detail. One concern I have is that it now looks quite convoluted, so I want to see if we can simplify the implementation somehow.

@MLnick
Contributor

MLnick commented Sep 4, 2017

@mpjlu also feel free to look into enhancing the tests. One issue, however, is that we don't want to add too much to the run time, as the ALS suite is already very heavy on time.

@mpjlu
Author

mpjlu commented Sep 4, 2017

Thanks @MLnick, I think the ML ALS suite is OK; it's just the MLlib ALS suite that is too simple. One possible enhancement is to add the same test cases as the ML ALS suite. What do you think?

@jiangxb1987
Contributor

ping @WeichenXu123

Contributor

@WeichenXu123 left a comment

One of my concerns is that your change increases the memory cost, here:
L301 val ratings = srcFactors.transpose.multiply(dstFactors)
If we tune the blockSize to some large value, the matrix multiplication here could OOM.

That said, I appreciate your optimization of the aggregate/merge stage. The sort & merge will be more efficient than merging via a priority queue.
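As a rough estimate of that cost: the intermediate ratings matrix holds up to blockSize × blockSize doubles per block pair, so at blockSize = 8000 that is up to 8000 × 8000 × 8 B ≈ 512 MB per task, versus roughly 8 MB at blockSize = 1000.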

def blockify(
rank: Int,
features: RDD[(Int, Array[Double])]): RDD[(Array[Int], DenseMatrix)] = {
val blockSize = 2000 // TODO: tune the block size
Contributor

So will you add a parameter for this?

Author

Yes, we have another PR to set this value: SPARK-20443.
If the blockSize is large enough, it is possible to OOM. In my tests the blockSize was set from 1000 to 8000, and the performance of this PR is better than master across that range.
The performance is also about the same for blockSizes from 1000 to 8000.

@WeichenXu123
Contributor

But I agree with the issue @MLnick mentioned: the code now looks convoluted. Can you try to simplify it?

@mpjlu
Author

mpjlu commented Nov 10, 2017

Thanks @WeichenXu123, I will think about how to simplify the code.

@mpjlu
Author

mpjlu commented Dec 12, 2017

Hi @holdenk, this is the PR we discussed at the Strata conference.
[image: matrix multiply-topk-strata]

I have thought about the code again. For performance, we can continue to optimize it, because we can merge the block matrices before the shuffle.
For pure code simplification, I haven't found a good approach.
Any change needs a lot of performance testing, and since I am now on a new team, I will not have the test environment for a while.
I would be glad if anyone wants to continue this work.
Thanks.

@SparkQA

SparkQA commented Dec 12, 2017

Test build #84749 has finished for PR 18624 at commit 8ac8196.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mpjlu force-pushed the OptimizeAlsByGEMM branch from 8ac8196 to f36706a on December 12, 2017 03:30
@SparkQA

SparkQA commented Dec 12, 2017

Test build #84750 has finished for PR 18624 at commit f36706a.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mpjlu
Author

mpjlu commented Dec 12, 2017

retest this please

@SparkQA

SparkQA commented Dec 12, 2017

Test build #84752 has finished for PR 18624 at commit f36706a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mpjlu
Author

mpjlu commented Jan 15, 2018

Because I don't have the environment to continue this work, I will close it. Thanks.
