[SPARK-32755][SQL] Maintain the order of expressions in AttributeSet and ExpressionSet #29598

dbaliafroozeh · 2020-08-31T14:37:41Z

What changes were proposed in this pull request?

This PR changes AttributeSet and ExpressionSet to maintain the insertion order of the elements. More specifically, we:

- change the underlying data structure of AttributeSet from HashSet to LinkedHashSet to maintain the insertion order.
- ExpressionSet already uses a list to keep track of the expressions. However, because it extends Scala's immutable.Set, operations such as map and flatMap are delegated to immutable.Set itself, so the result of these operations is no longer an ExpressionSet but whatever implementation the parent class picks. We therefore remove the inheritance from immutable.Set and implement the needed methods directly. ExpressionSet has very specific semantics, and it does not make sense for it to extend immutable.Set anyway.
- change the PlanStabilitySuite to not sort the attributes, so that it can catch changes in the order of expressions across runs.
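The second point can be illustrated with a minimal sketch. It is not the actual ExpressionSet code: `OrderedCanonicalSet` and its `canonical` function are hypothetical stand-ins that use strings instead of Catalyst expressions. The idea it shows is that a standalone class (one that does not extend `immutable.Set`) can define `map` itself, so the result stays the same type and keeps both deduplication and insertion order.

```scala
import scala.collection.mutable

// Hypothetical stand-in for ExpressionSet: deduplicates on a canonical key
// and remembers the original elements in insertion order.
final class OrderedCanonicalSet private (
    private val keys: mutable.Set[String],
    private val originals: mutable.ArrayBuffer[String]) {

  // Adds `e` only if no element with the same canonical form is present.
  def add(e: String): Unit = {
    val key = OrderedCanonicalSet.canonical(e)
    if (keys.add(key)) originals += e
  }

  def contains(e: String): Boolean =
    keys.contains(OrderedCanonicalSet.canonical(e))

  // Defined here rather than inherited, so the result is again an
  // OrderedCanonicalSet and not a generic Set picked by a parent class.
  def map(f: String => String): OrderedCanonicalSet = {
    val out = OrderedCanonicalSet.empty
    originals.foreach(e => out.add(f(e)))
    out
  }

  def toSeq: Seq[String] = originals.toList
  def size: Int = originals.size
}

object OrderedCanonicalSet {
  // Assumed canonical form for this sketch: lower-case, whitespace removed.
  def canonical(e: String): String = e.toLowerCase.filterNot(_.isWhitespace)

  def empty: OrderedCanonicalSet =
    new OrderedCanonicalSet(mutable.HashSet.empty[String], mutable.ArrayBuffer.empty[String])

  def apply(es: Iterable[String]): OrderedCanonicalSet = {
    val s = empty
    es.foreach(s.add)
    s
  }
}
```

With this shape, `OrderedCanonicalSet(Seq("B + 1", "a", "b+1")).map(_ + "!")` is still an `OrderedCanonicalSet`, and its elements come back in the order they were first inserted.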

Why are the changes needed?

Expression identity is based on the ExprId, which is an auto-incremented number. This means that the same query can yield a query plan with different expression ids in different runs. AttributeSet and ExpressionSet internally use a HashSet as the underlying data structure, and therefore cannot guarantee a fixed order of expressions across runs. This is problematic when we want to check for plan changes between runs.
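A small sketch of the motivation, under simplifying assumptions: `Attr` below is a hypothetical attribute whose equality and hash code depend only on its id, and `planOrder` simulates one run in which ids are handed out starting from a run-specific counter value. The iteration order of a hash-based set generally depends on those hash codes, whereas a LinkedHashSet iterates in insertion order regardless of which ids were assigned.

```scala
import scala.collection.mutable

object ExprIdOrderDemo {
  // Hypothetical, simplified attribute: identity follows exprId only.
  final case class Attr(name: String, exprId: Long) {
    override def hashCode(): Int = exprId.hashCode()
    override def equals(o: Any): Boolean = o match {
      case a: Attr => a.exprId == exprId
      case _       => false
    }
  }

  // Simulates one planning run: ids start at `firstId`, which differs per run.
  // Returns the attribute names in the set's iteration order.
  def planOrder(firstId: Long, names: Seq[String]): Seq[String] = {
    val set = mutable.LinkedHashSet.empty[Attr]
    names.zipWithIndex.foreach { case (n, i) => set += Attr(n, firstId + i) }
    set.iterator.map(_.name).toList
  }
}
```

Calling `ExprIdOrderDemo.planOrder` with two different starting ids and the same names yields the same name order both times, which is the property the golden-file comparison relies on.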

Does this PR introduce any user-facing change?

No

How was this patch tested?

Passes PlanStabilitySuite after regenerating the golden files.

hvanhovell · 2020-08-31T14:42:28Z

ok to test

SparkQA · 2020-08-31T17:25:54Z

Test build #128101 has finished for PR 29598 at commit 5afdedb.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-08-31T17:30:30Z

Test build #128102 has finished for PR 29598 at commit 5afdedb.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-08-31T19:35:56Z

Test build #128110 has finished for PR 29598 at commit f235b53.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-09-01T13:56:46Z

Test build #128145 has finished for PR 29598 at commit f3a79a3.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-09-01T14:17:23Z

Test build #128149 has finished for PR 29598 at commit b0478f3.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

hvanhovell · 2020-09-01T14:38:21Z

jenkins retest this please

SparkQA · 2020-09-01T18:43:48Z

Test build #128153 has finished for PR 29598 at commit b0478f3.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-09-01T22:01:09Z

Test build #128156 has finished for PR 29598 at commit 9590555.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

hvanhovell

LGTM

hvanhovell · 2020-08-31T15:26:05Z

sql/catalyst/src/main/scala-2.12/org/apache/spark/sql/catalyst/expressions/ExpressionSet.scala

-    }
+  def -(elem: Expression): ExpressionSet = {
+    val newSet = clone()
+    newSet.remove(elem)


Isn't this more efficient?:

ExpressionSet(baseSet.filter(_ != elem.canonicalized), originals.filter(_.canonicalized != elem.canonicalized))
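For readers unfamiliar with the two-collection layout, here is a minimal sketch of the filter-based removal suggested above. It uses strings in place of expressions, and `canonical` is an assumed stand-in for `canonicalized`; it makes no claim about which approach is faster in Spark.

```scala
object RemoveByFilter {
  // Assumed canonical form for this sketch: lower-case, whitespace removed.
  def canonical(e: String): String = e.toLowerCase.filterNot(_.isWhitespace)

  // Returns new (baseSet, originals) without `elem`. The inputs are not
  // mutated, and `originals` keeps its relative order because filter on a
  // Seq preserves order.
  def minus(
      baseSet: Set[String],
      originals: Seq[String],
      elem: String): (Set[String], Seq[String]) = {
    val key = canonical(elem)
    (baseSet.filter(k => k != key), originals.filter(o => canonical(o) != key))
  }
}
```

Compared with cloning and then removing, this builds both result collections in a single pass each and never touches the receiver.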

cloud-fan · 2020-09-04T08:02:12Z

sql/catalyst/src/main/scala-2.12/org/apache/spark/sql/catalyst/expressions/ExpressionSet.scala

@@ -27,6 +27,10 @@ object ExpressionSet {
    expressions.foreach(set.add)
    set
  }
+


We should apply the same change in ExpressionSet under the scala-2.13 source tree. @dbaliafroozeh can you open a followup PR?

@cloud-fan good catch, I thought I already deleted the ExpressionSet in 2.13. Note that we don't want it anymore as ExpressionSet doesn't extend Set anymore. I'll open a followup PR for that.

### What changes were proposed in this pull request?
This PR is a followup on #29598 and removes the `ExpressionSet` class from the 2.13 branch.

### Why are the changes needed?
`ExpressionSet` does not extend Scala `Set` anymore and this class is no longer needed in the 2.13 branch.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Passes existing tests

Closes #29648 from dbaliafroozeh/RemoveExpressionSetFrom2.13Branch.

Authored-by: Ali Afroozeh <ali.afroozeh@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>

LuciferYang · 2020-09-08T04:44:12Z

Sorry to leave a message on a completed issue. @cloud-fan @dbaliafroozeh This patch seems to introduce different behavior between Scala 2.12 and Scala 2.13.

With this patch, the number of failed cases in the sub-suites of PlanStabilitySuite increased.

For example, if we execute

mvn clean test -pl sql/core -Dtest=none -DwildcardSuites=org.apache.spark.sql.TPCDSV2_7_PlanStabilitySuite -Pscala-2.13 -am

The test result without this patch is

Tests: succeeded 32, failed 0, canceled 0, ignored 0, pending 0
All tests passed.

and with this patch is

Tests: succeeded 1, failed 31, canceled 0, ignored 0, pending 0
*** 31 TESTS FAILED ***

I haven't found the root cause yet. Do you have any ideas on how to fix this problem?

cloud-fan · 2020-09-08T04:58:50Z

This should have been merged before we have PlanStabilitySuite, as the query plans in golden files were kind of random previously. That's why this PR updates PlanStabilitySuite.

If your spark fork has different golden files for PlanStabilitySuite, you should just re-generate golden files after this patch.

LuciferYang · 2020-09-08T05:26:42Z

@cloud-fan Maybe I didn't describe it clearly. I am now using the master branch of the Spark source to run the Maven test with Scala 2.12:

mvn clean test -pl sql/core -Dtest=none -DwildcardSuites=org.apache.spark.sql.TPCDSV2_7_PlanStabilitySuite -am

All tests passed.

Running the Maven test with Scala 2.13:

mvn clean test -pl sql/core -Dtest=none -DwildcardSuites=org.apache.spark.sql.TPCDSV2_7_PlanStabilitySuite -am -Pscala-2.13

31 TESTS FAILED

Without this patch, all tests pass with both Scala 2.12 and Scala 2.13.

LuciferYang · 2020-09-08T05:30:40Z

So do we always need to re-generate the golden files with Scala 2.13? Or do we need different golden files for different Scala versions? That feels a little unreasonable...

Or do you mean the additional failures in Scala 2.13 are caused by some other, unknown reason?

cloud-fan · 2020-09-08T06:05:11Z

Interesting. So AttributeSet and ExpressionSet behave differently under scala 2.12 and 2.13. @Ngone51 can you take a look?

LuciferYang · 2020-09-08T06:43:00Z

@Ngone51 Scala 2.13 first needs some simple compilation fixes, in QueryPlan and ShuffleBlockFetcherIterator.

cloud-fan · 2020-09-08T06:54:42Z

@LuciferYang do you have a branch that contains the compilation fix?

LuciferYang · 2020-09-08T06:58:57Z

@cloud-fan @Ngone51 we can use #29660

LuciferYang · 2020-09-08T13:31:36Z

@cloud-fan @Ngone51 I think the reason for this problem is

spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AttributeSet.scala

Lines 106 to 113 in e7d9a24

    
    def --(other: Iterable[NamedExpression]): AttributeSet = {
      other match {
        case otherSet: AttributeSet =>
          new AttributeSet(baseSet -- otherSet.baseSet)
        case _ =>
          new AttributeSet(baseSet -- other.map(a => new AttributeEquals(a.toAttribute)))
      }
    }

The -- method of LinkedHashSet differs between Scala 2.12 and Scala 2.13.

In Scala 2.12 it uses the implementation in SetLike.scala, as follows:

override def --(xs: GenTraversableOnce[A]): This = clone() --= xs.seq

In Scala 2.13 it uses the implementation in Set.scala, as follows:

def -- (that: IterableOnce[A]): C = fromSpecific(coll.toSet.removedAll(that))

From the above code we can see that in Scala 2.13 the order of the result changes after baseSet -- otherSet.baseSet.

Maybe we can use baseSet.diff(otherSet.baseSet) instead of baseSet -- otherSet.baseSet, but I need to confirm that this is feasible.

dbaliafroozeh · 2020-09-08T14:03:15Z

@cloud-fan @LuciferYang we can also try to use Java's LinkedHashSet here if there is a difference between different versions of Scala's mutable.LinkedHashSet.

hvanhovell · 2020-09-08T14:06:23Z

@dbaliafroozeh -- returns a set, which gets converted into a linked set by AttributeSet.apply(..). The 2.12 implementation will actually return a linked set, whereas the new one probably returns a plain set.

@LuciferYang can you open a PR that basically uses 2.12 implementation inside AttributeSet?

LuciferYang · 2020-09-08T14:17:25Z

> @LuciferYang can you open a PR that basically uses 2.12 implementation inside AttributeSet?

@hvanhovell OK, I will open a new followup PR.

LuciferYang · 2020-09-08T14:21:05Z

@hvanhovell Should I use clone() --= xs.seq instead of --, or use the diff method?

cloud-fan · 2020-09-08T14:40:35Z

I'd prefer diff if it works, as it's simpler.

hvanhovell · 2020-09-08T14:42:05Z

Anything that explicitly maintains the insertion order (i.e. returns a LinkedHashSet) will do :).

hvanhovell · 2020-09-08T14:43:03Z

diff returns a set and does not fall in that category.
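One way to read the requirement stated in this exchange is sketched below. It is an illustration only, not the code of any follow-up PR: `OrderedDiff.minus` is a hypothetical helper that builds the result as a new LinkedHashSet by walking the base set in insertion order, so it does not depend on how a particular Scala version implements -- or diff.

```scala
import scala.collection.mutable

object OrderedDiff {
  // Returns the elements of `base` that are not in `other`, in the insertion
  // order of `base`. The result is built explicitly as a LinkedHashSet
  // instead of relying on the return type of -- or diff.
  def minus[A](
      base: mutable.LinkedHashSet[A],
      other: Iterable[A]): mutable.LinkedHashSet[A] = {
    val toRemove = other.toSet
    val result = mutable.LinkedHashSet.empty[A]
    base.foreach { a => if (!toRemove.contains(a)) result += a }
    result
  }
}
```

For example, removing `Seq(3, 1)` from a LinkedHashSet built from `5, 3, 9, 1, 7` leaves `5, 9, 7` in that order, and `base` itself is left unchanged.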

Maintain the order of expressions in AttributeSet and ExpressionSet

5afdedb

probot-autolabeler bot added the SQL label Aug 31, 2020

dbaliafroozeh changed the title ~~Maintain the order of expressions in AttributeSet and ExpressionSet~~ [SPARK-32755][SQL] Maintain the order of expressions in AttributeSet and ExpressionSet Aug 31, 2020

Regenerate the files

f235b53

dbaliafroozeh added 2 commits September 1, 2020 12:02

Regenerate files

f3a79a3

Regenerate more files

b0478f3

Use ExpressionSet in CostBasedJoinReorder

9590555

hvanhovell approved these changes Sep 3, 2020


hvanhovell closed this in 0a6043f Sep 3, 2020

cloud-fan reviewed Sep 4, 2020


dbaliafroozeh mentioned this pull request Sep 4, 2020

[SPARK-32800][SQL] Remove ExpressionSet from the 2.13 branch #29648

Closed

bersprockets mentioned this pull request Feb 3, 2022

[SPARK-37290][SQL] - Exponential planning time in case of non-deterministic function #35233

Closed

ulysses-you mentioned this pull request Apr 11, 2022

[SPARK-38836][SQL] Improve the performance of ExpressionSet #36121

Closed
