[SQL] Rewrite join implementation to allow streaming of one relation. #250

marmbrus · 2014-03-27T04:23:52Z

Previously, we were materializing everything in memory. This also uses the projection interface, so it will be easier to plug in code gen (it's ported from that branch).

@rxin @liancheng
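
For readers following along, here is a minimal sketch of the general idea described above: materialize only one relation into a hash table and stream the other through it. It is not the code in this PR. `StreamingHashJoinSketch`, `hashJoin`, the `Row = Seq[Any]` alias, and the key-extractor functions are hypothetical names for illustration, and the output column order (streamed row first) is an assumption.

```scala
import scala.collection.mutable

object StreamingHashJoinSketch {
  // Hypothetical stand-in for a Catalyst row: just a sequence of values.
  type Row = Seq[Any]

  /** Returns true if any value in the key is null; such keys never match in an equi-join. */
  def anyNull(key: Row): Boolean = key.exists(_ == null)

  /**
   * Joins two relations on equal keys. Only `buildSide` is held in memory;
   * `streamedSide` is consumed lazily, one row at a time.
   */
  def hashJoin(
      buildSide: Iterator[Row],
      streamedSide: Iterator[Row],
      buildKey: Row => Row,
      streamKey: Row => Row): Iterator[Row] = {
    // Materialize the build side, grouping rows by their join key.
    val hashTable = mutable.HashMap.empty[Row, mutable.ArrayBuffer[Row]]
    buildSide.foreach { row =>
      val key = buildKey(row)
      if (!anyNull(key)) {
        hashTable.getOrElseUpdate(key, mutable.ArrayBuffer.empty[Row]) += row
      }
    }

    // Iterator.flatMap is lazy, so streamed rows are only pulled as the
    // caller consumes the result.
    streamedSide.flatMap { streamedRow =>
      val key = streamKey(streamedRow)
      if (anyNull(key)) {
        Iterator.empty
      } else {
        hashTable.get(key) match {
          case Some(matches) => matches.iterator.map(buildRow => streamedRow ++ buildRow)
          case None => Iterator.empty
        }
      }
    }
  }
}
```

In this sketch the caller chooses which relation to pass as `buildSide`; passing the smaller one keeps the in-memory hash table small.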

AmplabJenkins · 2014-03-27T04:24:21Z

Merged build triggered.

AmplabJenkins · 2014-03-27T04:24:21Z

Merged build started.

AmplabJenkins · 2014-03-27T05:12:58Z

Merged build finished.

AmplabJenkins · 2014-03-27T05:12:59Z

One or more automated tests failed
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13497/

AmplabJenkins · 2014-03-27T06:58:20Z

Merged build triggered. One or more automated tests failed

AmplabJenkins · 2014-03-27T06:58:29Z

Merged build started. One or more automated tests failed

rxin · 2014-03-27T07:03:20Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Row.scala

+  /** Returns true if there are any NULL values in this row. */
+  def anyNull: Boolean = {
+    var i = 0
+    while(i < length) {


space after while and if...
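
To illustrate the style point, here is a sketch of how the method might read with a space after `while` and `if`. The diff excerpt above is truncated, so the loop body is an assumption, and `SimpleRow` with its `isNullAt` helper is a hypothetical, simplified stand-in for the real `Row` trait.

```scala
// Hypothetical, simplified stand-in for Catalyst's Row; only what the sketch needs.
class SimpleRow(values: Array[Any]) {
  def length: Int = values.length

  def isNullAt(i: Int): Boolean = values(i) == null

  /** Returns true if there are any NULL values in this row. */
  def anyNull: Boolean = {
    var i = 0
    // Note the space between the keyword and the opening parenthesis.
    while (i < length) {
      if (isNullAt(i)) { return true }
      i += 1
    }
    false
  }
}
```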

AmplabJenkins · 2014-03-27T07:59:46Z

Merged build finished. All automated tests passed.

AmplabJenkins · 2014-03-27T07:59:46Z

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13504/

AmplabJenkins · 2014-03-31T18:32:22Z

Merged build triggered. Build is starting -or- tests failed to complete.

AmplabJenkins · 2014-03-31T18:32:28Z

Merged build started. Build is starting -or- tests failed to complete.

marmbrus · 2014-03-31T18:34:08Z

Hey @rxin, thanks for looking this over. I added TODOs for using Spark's collections, but did not make those changes.

rxin · 2014-03-31T18:46:12Z

lgtm if travis or jenkins (whatever we are using nowadays ...) is happy

AmplabJenkins · 2014-03-31T19:07:23Z

Merged build triggered. Build is starting -or- tests failed to complete.

AmplabJenkins · 2014-03-31T19:07:28Z

Merged build started. Build is starting -or- tests failed to complete.

AmplabJenkins · 2014-03-31T19:30:28Z

Merged build finished. Build is starting -or- tests failed to complete.

AmplabJenkins · 2014-03-31T19:30:28Z

Build is starting -or- tests failed to complete.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13600/

AmplabJenkins · 2014-03-31T20:01:09Z

Merged build finished. All automated tests passed.

AmplabJenkins · 2014-03-31T20:01:09Z

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13601/

rxin · 2014-03-31T22:23:38Z

thanks. merged.

README incorrectly suggests build sources spark-env.sh

This is misleading because the build doesn't source that file. IMO it's better to force people to specify build environment variables on the command line always, like we do in every example, so I'm just removing this doc.

(cherry picked from commit d2efe13)
Signed-off-by: Patrick Wendell <pwendell@gmail.com>

@rxin

Before we were materializing everything in memory. This also uses the projection interface so will be easier to plug in code gen (its ported from that branch).

@rxin @liancheng

Author: Michael Armbrust <michael@databricks.com>

Closes apache#250 from marmbrus/hashJoin and squashes the following commits:

1ad873e [Michael Armbrust] Change hasNext logic back to the correct version.
8e6f2a2 [Michael Armbrust] Review comments.
1e9fb63 [Michael Armbrust] style
bc0cb84 [Michael Armbrust] Rewrite join implementation to allow streaming of one relation.

## What changes were proposed in this pull request?

A configuration parameter spark.databricks.debug.taskKiller.minOutputRows is added. It sets the minimum required number of records that need to be produced at some point in task execution, before the task can be terminated by DatabricksTaskDebugListener.

## How was this patch tested?

Adds unit tests.

Author: Ala Luszczak <ala@databricks.com>

Closes apache#250 from ala/min-output-rows.

…is reused

## What changes were proposed in this pull request?

With this change, we can easily identify the plan difference when subquery is reused.

When the reuse is enabled, the plan looks like

```
== Physical Plan ==
CollectLimit 1
+- *(1) Project [(Subquery subquery240 + ReusedSubquery Subquery subquery240) AS (scalarsubquery() + scalarsubquery())#253]
   :  :- Subquery subquery240
   :  :  +- *(2) HashAggregate(keys=[], functions=[avg(cast(key#13 as bigint))], output=[avg(key)#250])
   :  :     +- Exchange SinglePartition
   :  :        +- *(1) HashAggregate(keys=[], functions=[partial_avg(cast(key#13 as bigint))], output=[sum#256, count#257L])
   :  :           +- *(1) SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#13]
   :  :              +- Scan[obj#12]
   :  +- ReusedSubquery Subquery subquery240
   +- *(1) SerializeFromObject
      +- Scan[obj#12]
```

When the reuse is disabled, the plan looks like

```
== Physical Plan ==
CollectLimit 1
+- *(1) Project [(Subquery subquery286 + Subquery subquery287) AS (scalarsubquery() + scalarsubquery())#299]
   :  :- Subquery subquery286
   :  :  +- *(2) HashAggregate(keys=[], functions=[avg(cast(key#13 as bigint))], output=[avg(key)#296])
   :  :     +- Exchange SinglePartition
   :  :        +- *(1) HashAggregate(keys=[], functions=[partial_avg(cast(key#13 as bigint))], output=[sum#302, count#303L])
   :  :           +- *(1) SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#13]
   :  :              +- Scan[obj#12]
   :  +- Subquery subquery287
   :     +- *(2) HashAggregate(keys=[], functions=[avg(cast(key#13 as bigint))], output=[avg(key)#298])
   :        +- Exchange SinglePartition
   :           +- *(1) HashAggregate(keys=[], functions=[partial_avg(cast(key#13 as bigint))], output=[sum#306, count#307L])
   :              +- *(1) SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#13]
   :                 +- Scan[obj#12]
   +- *(1) SerializeFromObject
      +- Scan[obj#12]
```

## How was this patch tested?

Modified the existing test.

Closes #24258 from gatorsmile/followupSPARK-27279.

Authored-by: gatorsmile <gatorsmile@gmail.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>


Rewrite join implementation to allow streaming of one relation.

bc0cb84

style

1e9fb63

rxin reviewed Mar 27, 2014

Review comments.

8e6f2a2

Change hasNext logic back to the correct version.

1ad873e

asfgit closed this in 5731af5 Mar 31, 2014

marmbrus deleted the hashJoin branch April 1, 2014 23:33

wangyum mentioned this pull request Aug 19, 2020

[SPARK-32444][SQL] Infer filters from DPP #29243

Closed

arjunshroff pushed a commit to arjunshroff/spark that referenced this pull request Nov 24, 2020

[MAPR-SPARK-178] Fix Spark Project Hive unit tests (apache#250)

0bdebf7
