[SPARK-25664][SQL][TEST] Refactor JoinBenchmark to use main method #22661

wangyum · 2018-10-07T08:38:28Z

What changes were proposed in this pull request?

Refactor JoinBenchmark to use main method.

Use spark-submit:

bin/spark-submit --class org.apache.spark.sql.execution.benchmark.JoinBenchmark --jars ./core/target/spark-core_2.11-3.0.0-SNAPSHOT-tests.jar ./sql/catalyst/target/spark-sql_2.11-3.0.0-SNAPSHOT-tests.jar

Generate benchmark result:

SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.JoinBenchmark"

How was this patch tested?

manual tests

SparkQA · 2018-10-07T12:25:44Z

Test build #97080 has finished for PR 22661 at commit 4339b1c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-10-07T23:38:48Z

Test build #97090 has finished for PR 22661 at commit 4859a9f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

wangyum · 2018-10-09T05:20:42Z

cc @dongjoon-hyun

dongjoon-hyun · 2018-10-10T23:12:11Z

sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/JoinBenchmark.scala

    val N = 20 << 20
    val M = 1 << 16

-    val dim = broadcast(sparkSession.range(M).selectExpr("id as k", "cast(id as string) as v"))


So, this removes a redundant definition, right?

dongjoon-hyun · 2018-10-10T23:13:10Z

sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/JoinBenchmark.scala

-      val dim = broadcast(sparkSession.range(M).selectExpr("cast(id/10 as long) as k"))
-      val df = sparkSession.range(N).join(dim, (col("id") % M) === col("k"))
+    codegenBenchmark("Join w long duplicated", N) {
+      val dim = broadcast(spark.range(M).selectExpr("cast(id/10 as long) as k"))


According to another benchmark case in this file, broadcast seems to be placed outside of codegenBenchmark. What do you think about this?

dongjoon-hyun · 2018-10-11T09:16:46Z

sql/core/benchmarks/JoinBenchmark-results.txt

+Join w 2 ints:                           Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
+------------------------------------------------------------------------------------------------
+Join w 2 ints wholestage off              138514 / 139178          0.2        6604.9       1.0X
+Join w 2 ints wholestage on               129908 / 140869          0.2        6194.5       1.1X


Ur, is this correct? Previously, we had the following.

*Join w 2 ints codegen=false              4426 / 4501          4.7         211.1       1.0X
*Join w 2 ints codegen=true                791 /  818         26.5          37.7       5.6X

I think it's correct. I ran it on master:

build/sbt "sql/test-only *benchmark.JoinBenchmark"
......
[info] JoinBenchmark:
[info] - broadcast hash join, long key !!! IGNORED !!!
[info] - broadcast hash join, long key with duplicates !!! IGNORED !!!
Running benchmark: Join w 2 ints
Running case: Join w 2 ints wholestage off
Stopped after 2 iterations, 307335 ms
Running case: Join w 2 ints wholestage on
Stopped after 5 iterations, 687107 ms
Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
Join w 2 ints:                           Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
Join w 2 ints wholestage off              153532 / 153668          0.1        7321.0       1.0X
Join w 2 ints wholestage on               132075 / 137422          0.2        6297.8       1.2X

Oh, interesting. Although it's beyond the scope, could you run on branch-2.4 and branch-2.3 please, too?
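For readers comparing the result tables in this thread, the columns appear to be related by simple arithmetic on the best time and the row count N = 20 << 20. The following is a minimal sketch under that assumption, not the code of Spark's Benchmark class; `BenchmarkColumns`, `ResultRow`, and `derive` are hypothetical names introduced only for illustration.

```scala
// Sketch only: reproduces the arithmetic behind one row of a results table.
// BenchmarkColumns, ResultRow and derive are hypothetical names, not Spark APIs.
object BenchmarkColumns {
  final case class ResultRow(rateMPerSec: Double, perRowNs: Double, relative: Double)

  /** Derive the table columns from the best time (ms), the row count,
    * and the best time (ms) of the first (baseline) case. */
  def derive(bestMs: Double, rows: Long, baselineBestMs: Double): ResultRow =
    ResultRow(
      rateMPerSec = rows / (bestMs / 1e3) / 1e6, // million rows per second
      perRowNs = bestMs * 1e6 / rows,            // nanoseconds per row
      relative = baselineBestMs / bestMs         // speed-up over the baseline case
    )

  def main(args: Array[String]): Unit = {
    val n = 20L << 20 // 20971520 rows, the N used in JoinBenchmark
    // Best times taken from the "Join w 2 ints" rows quoted in this thread.
    val off = derive(138514, n, 138514)
    val on = derive(129908, n, 138514)
    println(f"wholestage off: ${off.rateMPerSec}%.1f M/s, ${off.perRowNs}%.1f ns/row, ${off.relative}%.1fX")
    println(f"wholestage on:  ${on.rateMPerSec}%.1f M/s, ${on.perRowNs}%.1f ns/row, ${on.relative}%.1fX")
  }
}
```

With the best times from the new results file, the derived values agree with the table to one decimal place, which suggests the very large per-row cost is a property of the measured run rather than a reporting error.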

dongjoon-hyun · 2018-10-11T09:19:34Z

sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/JoinBenchmark.scala

  def broadcastHashJoinLongKeyWithDuplicates(): Unit = {
    val N = 20 << 20
    val M = 1 << 16
-
+    val dim = broadcast(spark.range(M).selectExpr("cast(id/10 as long) as k"))


For this change, we need to rerun the benchmark to get a new result.

SparkQA · 2018-10-11T11:18:06Z

Test build #97243 has finished for PR 22661 at commit 2baaf35.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-10-11T14:48:09Z

Test build #97249 has finished for PR 22661 at commit 00c4950.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2018-10-11T16:50:48Z

sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/JoinBenchmark.scala

@@ -19,229 +19,161 @@ package org.apache.spark.sql.execution.benchmark

 import org.apache.spark.sql.execution.joins._
 import org.apache.spark.sql.functions._
+import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.types.IntegerType

 /**
 * Benchmark to measure performance for aggregate primitives.


aggregate primitives -> joins

dongjoon-hyun · 2018-10-11T16:58:04Z

sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/JoinBenchmark.scala

-     *shuffle hash join codegen=false          2005 / 2010          2.1         478.0       1.0X
-     *shuffle hash join codegen=true           1773 / 1792          2.4         422.7       1.1X
-     */
+  override def runBenchmarkSuite(): Unit = {


Could you wrap the following (lines 168~177) with something like runBenchmark("Join Benchmark")?

SparkQA · 2018-10-11T20:51:53Z

Test build #97279 has finished for PR 22661 at commit 3be13b1.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

wangyum · 2018-10-11T23:37:02Z

retest this please

SparkQA · 2018-10-12T03:18:46Z

Test build #97287 has finished for PR 22661 at commit 3be13b1.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2018-10-12T05:55:07Z

sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/JoinBenchmark.scala

-    runBenchmark("merge join", N) {
-      val df1 = sparkSession.range(N).selectExpr(s"id * 2 as k1")
-      val df2 = sparkSession.range(N).selectExpr(s"id * 3 as k2")
+    codegenBenchmark("merge join", N) {


merge join -> sort merge join

dongjoon-hyun · 2018-10-12T06:02:29Z

@wangyum . Could you review and merge wangyum#18 ?

dongjoon-hyun · 2018-10-12T06:04:33Z

sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/JoinBenchmark.scala

-     *-------------------------------------------------------------------------------------------
-     *Join w 2 ints codegen=false              4426 / 4501          4.7         211.1       1.0X
-     *Join w 2 ints codegen=true                791 /  818         26.5          37.7       5.6X
-     */


Hi, @cloud-fan , @gatorsmile , @davies , @rxin .

We are seeing a performance slowdown in this benchmark. However, this is not a regression because it is consistent across 2.0.2 ~ 2.4.0-rc3.

Join w 2 ints:                           Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
Join w 2 ints wholestage off              157742 / 158892          0.1        7521.7       1.0X
Join w 2 ints wholestage on               134290 / 152917          0.2        6403.4       1.2X

Judging from the original performance numbers, they seem to have been measured when HashJoin.rewriteKeyExpr used a simple upcast to bigint. With the current code, however, HashJoin.rewriteKeyExpr uses shiftleft operations.

scala> val df = spark.range(N).join(dim2, (col("id") % M).cast(IntegerType) === col("k1") && (col("id") % M).cast(IntegerType) === col("k2"))

scala> val df2 = spark.range(N).join(dim2, (col("id") % M) === col("k1") && (col("id") % M) === col("k2"))

scala> df.explain
== Physical Plan ==
*(2) BroadcastHashJoin [cast((id#8L % 65536) as int), cast((id#8L % 65536) as int)], [k1#2, k2#3], Inner, BuildRight
:- *(2) Range (0, 20971520, step=1, splits=8)
+- BroadcastExchange HashedRelationBroadcastMode(List((shiftleft(cast(input[0, int, false] as bigint), 32) | (cast(input[1, int, false] as bigint) & 4294967295))))
   +- *(1) Project [cast(id#0L as int) AS k1#2, cast(id#0L as int) AS k2#3, cast(id#0L as string) AS v#4]
      +- *(1) Range (0, 65536, step=1, splits=8)

scala> df2.explain
== Physical Plan ==
*(2) BroadcastHashJoin [(id#23L % 65536), (id#23L % 65536)], [cast(k1#2 as bigint), cast(k2#3 as bigint)], Inner, BuildRight
:- *(2) Range (0, 20971520, step=1, splits=8)
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint), cast(input[1, int, false] as bigint)))
   +- *(1) Project [cast(id#0L as int) AS k1#2, cast(id#0L as int) AS k2#3, cast(id#0L as string) AS v#4]
      +- *(1) Range (0, 65536, step=1, splits=8)

Did we really want to measure the difference in HashJoin.rewriteKeyExpr?

Any advice is welcome and thank you in advance, @cloud-fan , @gatorsmile , @davies , @rxin .

This seems to be caused by the bug fix: #15390

So the performance is reasonable.

Thank you for confirmation, @cloud-fan !
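The shiftleft expression that appears in HashedRelationBroadcastMode in the first plan above packs two int keys into one bigint. The following is a minimal sketch of that arithmetic only, not Spark's implementation of HashJoin.rewriteKeyExpr; `PackedKey`, `pack`, and `unpack` are hypothetical names used for illustration.

```scala
// Sketch only: mirrors the expression shown in the plan,
//   shiftleft(cast(k1 as bigint), 32) | (cast(k2 as bigint) & 4294967295)
// PackedKey, pack and unpack are hypothetical names, not Spark APIs.
object PackedKey {
  /** Pack two 32-bit keys into one 64-bit key: k1 in the high half, k2 in the low half. */
  def pack(k1: Int, k2: Int): Long =
    (k1.toLong << 32) | (k2.toLong & 0xFFFFFFFFL)

  /** Recover the two original keys from a packed key. */
  def unpack(key: Long): (Int, Int) =
    ((key >>> 32).toInt, key.toInt)

  def main(args: Array[String]): Unit = {
    val key = pack(7, -1)
    println(key)         // packed 64-bit key
    println(unpack(key)) // the original pair
  }
}
```

The mask on the second key matters: without it, a negative k2 would sign-extend into the upper 32 bits and overwrite k1. The second plan, which only upcasts each key to bigint, keeps two separate keys instead of one packed key.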

wangyum · 2018-10-12T08:52:24Z

core/src/test/scala/org/apache/spark/benchmark/Benchmark.scala

@@ -200,11 +200,12 @@ private[spark] object Benchmark {
  def getProcessorName(): String = {
    val cpu = if (SystemUtils.IS_OS_MAC_OSX) {
      Utils.executeAndGetOutput(Seq("/usr/sbin/sysctl", "-n", "machdep.cpu.brand_string"))
+        .stripLineEnd


Because the Mac output has one more line than the Linux output:
28f9b9a#diff-45c96c65f7c46bc2d84843a7cb92f22fL7

Ur.. I'm not a fan of piggy-backing. Okay.
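The effect of the `.stripLineEnd` call added in this hunk can be sketched as follows. This is an illustration only: it assumes the sysctl output ends with a trailing newline, uses a hard-coded string in place of the real command output, and `StripLineEndDemo` and `clean` are hypothetical names.

```scala
// Sketch only: a hard-coded string stands in for the output of
// /usr/sbin/sysctl -n machdep.cpu.brand_string (assumed to end with a newline).
// StripLineEndDemo and clean are hypothetical names, not part of Spark.
object StripLineEndDemo {
  /** Remove a single trailing line terminator, if present. */
  def clean(raw: String): String = raw.stripLineEnd

  def main(args: Array[String]): Unit = {
    val raw = "Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz\n"
    val cleaned = clean(raw)
    // Without the strip, printing the name with println would emit an extra blank line.
    println(s"[${cleaned}]")
    println(raw.length - cleaned.length) // number of characters removed
  }
}
```

A string that has no trailing newline is returned unchanged, so the call is harmless if the command output is already trimmed.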

SparkQA · 2018-10-12T11:31:31Z

Test build #97299 has finished for PR 22661 at commit 28f9b9a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2018-10-12T12:36:40Z

sql/core/benchmarks/JoinBenchmark-results.txt

+Join w 2 ints:                           Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
+------------------------------------------------------------------------------------------------
+Join w 2 ints wholestage off              173174 / 173183          0.1        8257.6       1.0X
+Join w 2 ints wholestage on               166350 / 198362          0.1        7932.2       1.0X


It surprises me that whole-stage codegen doesn't help. We should investigate it later.

SparkQA · 2018-10-12T13:19:40Z

Test build #97301 has finished for PR 22661 at commit cd8b664.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun

+1, LGTM. Thank you, @wangyum and @cloud-fan .

Merged to master.

## What changes were proposed in this pull request?

Refactor `JoinBenchmark` to use main method.

1. use `spark-submit`:
```console
bin/spark-submit --class org.apache.spark.sql.execution.benchmark.JoinBenchmark --jars ./core/target/spark-core_2.11-3.0.0-SNAPSHOT-tests.jar ./sql/catalyst/target/spark-sql_2.11-3.0.0-SNAPSHOT-tests.jar
```

2. Generate benchmark result:
```console
SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.JoinBenchmark"
```

## How was this patch tested?

manual tests

Closes apache#22661 from wangyum/SPARK-25664.

Lead-authored-by: Yuming Wang <yumwang@ebay.com>
Co-authored-by: Yuming Wang <wgyumg@gmail.com>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

Refactor JoinBenchmark

4339b1c

Put SQLConf.SHUFFLE_PARTITIONS.key to next line

4859a9f

dongjoon-hyun reviewed Oct 10, 2018

Move broadcast outside codegenBenchmark

2baaf35

dongjoon-hyun reviewed Oct 11, 2018

rerun benchmark

00c4950

dongjoon-hyun reviewed Oct 11, 2018

address comment

3be13b1

dongjoon-hyun reviewed Oct 12, 2018

dongjoon-hyun and others added 2 commits October 12, 2018 16:10

Updat result (#18)

28f9b9a

merge join -> sort merge join

cd8b664

wangyum commented Oct 12, 2018

cloud-fan reviewed Oct 12, 2018

dongjoon-hyun approved these changes Oct 12, 2018

asfgit closed this in e965fb5 Oct 12, 2018
