Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-32755][SQL] Maintain the order of expressions in AttributeSet and ExpressionSet #29598

Closed

Conversation

dbaliafroozeh
Copy link
Contributor

@dbaliafroozeh dbaliafroozeh commented Aug 31, 2020

What changes were proposed in this pull request?

This PR changes AttributeSet and ExpressionSet to maintain the insertion order of the elements. More specifically, we:

  • change the underlying data structure of AttributeSet from HashSet to LinkedHashSet to maintain the insertion order.
  • ExpressionSet already uses a list to keep track of the expressions, however, since it is extending Scala's immutable.Set class, operations such as map and flatMap are delegated to the immutable.Set itself. This means that the result of these operations is not an instance of ExpressionSet anymore, rather it's a implementation picked up by the parent class. We also remove this inheritance from immutable.Set and implement the needed methods directly. ExpressionSet has a very specific semantics and it does not make sense to extend immutable.Set anyway.
  • change the PlanStabilitySuite to not sort the attributes, to be able to catch changes in the order of expressions in different runs.

Why are the changes needed?

Expressions identity is based on the ExprId which is an auto-incremented number. This means that the same query can yield a query plan with different expression ids in different runs. AttributeSet and ExpressionSet internally use a HashSet as the underlying data structure, and therefore cannot guarantee the a fixed order of operations in different runs. This can be problematic in cases we like to check for plan changes in different runs.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Passes PlanStabilitySuite after regenerating the golden files.

@dbaliafroozeh dbaliafroozeh changed the title Maintain the order of expressions in AttributeSet and ExpressionSet [SPARK-32755][SQL] Maintain the order of expressions in AttributeSet and ExpressionSet Aug 31, 2020
@hvanhovell
Copy link
Contributor

ok to test

@SparkQA
Copy link

SparkQA commented Aug 31, 2020

Test build #128101 has finished for PR 29598 at commit 5afdedb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 31, 2020

Test build #128102 has finished for PR 29598 at commit 5afdedb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 31, 2020

Test build #128110 has finished for PR 29598 at commit f235b53.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Sep 1, 2020

Test build #128145 has finished for PR 29598 at commit f3a79a3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Sep 1, 2020

Test build #128149 has finished for PR 29598 at commit b0478f3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell
Copy link
Contributor

jenkins retest this please

@SparkQA
Copy link

SparkQA commented Sep 1, 2020

Test build #128153 has finished for PR 29598 at commit b0478f3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Sep 1, 2020

Test build #128156 has finished for PR 29598 at commit 9590555.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

@hvanhovell hvanhovell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

}
def -(elem: Expression): ExpressionSet = {
val newSet = clone()
newSet.remove(elem)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this more efficient?:

ExpressionSet(baseSet.filter(_ != e. canonicalized), originals.filter(_.canonicalized != e.canonicalized))

@hvanhovell hvanhovell closed this in 0a6043f Sep 3, 2020
@@ -27,6 +27,10 @@ object ExpressionSet {
expressions.foreach(set.add)
set
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should apply the same change in ExpressionSet under the scala-2.13 source tree. @dbaliafroozeh can you open a followup PR?

Copy link
Contributor Author

@dbaliafroozeh dbaliafroozeh Sep 4, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cloud-fan good catch, I thought I already deleted the ExpressionSet in 2.13. Note that we don't want it anymore as ExpressionSet doesn't extend Set anymore. I'll open a followup PR for that.

HyukjinKwon pushed a commit that referenced this pull request Sep 6, 2020
### What changes were proposed in this pull request?
This PR is a followup on #29598 and removes the `ExpressionSet` class from the 2.13 branch.

### Why are the changes needed?
`ExpressionSet` does not extend Scala `Set` anymore and this class is no longer needed in the 2.13 branch.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Passes existing tests

Closes #29648 from dbaliafroozeh/RemoveExpressionSetFrom2.13Branch.

Authored-by: Ali Afroozeh <ali.afroozeh@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
@LuciferYang
Copy link
Contributor

Sorry to leave a message in a completed issue. @cloud-fan @dbaliafroozeh This patch seems to bring about some different behavior between use Scala 2.12 and Scala 2.13.

I found that the number of failed cases increased with this patch of the sub-suites of PlanStabilitySuite.

For example, if we execute

mvn clean test -pl sql/core -Dtest=none -DwildcardSuites=org.apache.spark.sql.TPCDSV2_7_PlanStabilitySuite -Pscala-2.13 -am 

The test result with out this patch is

Tests: succeeded 32, failed 0, canceled 0, ignored 0, pending 0
All tests passed.

and with this patch is

Tests: succeeded 1, failed 31, canceled 0, ignored 0, pending 0
*** 31 TESTS FAILED ***

I haven't found the root cause yet. Do you have any good ideas for fix this problems?

@cloud-fan
Copy link
Contributor

This should have been merged before we have PlanStabilitySuite, as the query plans in golden files were kind of random previously. That's why this PR updates PlanStabilitySuite.

If your spark fork has different golden files for PlanStabilitySuite, you should just re-generate golden files after this patch.

@LuciferYang
Copy link
Contributor

@cloud-fan Maybe I didn't describe it clearly, now I use the master of spark-source to execute maven test with Scala 2.12

mvn clean test -pl sql/core -Dtest=none -DwildcardSuites=org.apache.spark.sql.TPCDSV2_7_PlanStabilitySuite -am 

All tests passed.

execute maven test with Scala 2.13

mvn clean test -pl sql/core -Dtest=none -DwildcardSuites=org.apache.spark.sql.TPCDSV2_7_PlanStabilitySuite -am -Pscala-2.13 

31 TESTS FAILED

without this patch execute maven test , both Scala 2.12 and Scala 2.13 All tests passed

@LuciferYang
Copy link
Contributor

LuciferYang commented Sep 8, 2020

So always need to re-generate golden files with Scala 2.13? Or we need to use different golden files for different Scala verision, feels a little unreasonable...

Or do you mean the additional failure cases in Scala 2.13 is caused by other unknown reasons?

@cloud-fan
Copy link
Contributor

Interesting. So AttributeSet and ExpressionSet behave differently under scala 2.12 and 2.13. @Ngone51 can you take a look?

@LuciferYang
Copy link
Contributor

@Ngone51 need some simple fix on compilation for Scala 2.13 , QueryPlan and ShuffleBlockFetcherIterator.

@cloud-fan
Copy link
Contributor

@LuciferYang do you have a branch that contains the compilation fix?

@LuciferYang
Copy link
Contributor

@cloud-fan @Ngone51 we can use #29660

@LuciferYang
Copy link
Contributor

LuciferYang commented Sep 8, 2020

@cloud-fan @Ngone51 I think the reason for this problem is

def --(other: Iterable[NamedExpression]): AttributeSet = {
other match {
case otherSet: AttributeSet =>
new AttributeSet(baseSet -- otherSet.baseSet)
case _ =>
new AttributeSet(baseSet -- other.map(a => new AttributeEquals(a.toAttribute)))
}
}

The -- method of LinkedHashSet is different between Scala 2.12 and Scala 2.13.

in Scala 2.12 it use implementation of SetLike.scala as follow:

override def --(xs: GenTraversableOnce[A]): This = clone() --= xs.seq

in Scala 2.13 it use implementation of Set.scala as follow:

def -- (that: IterableOnce[A]): C = fromSpecific(coll.toSet.removedAll(that))

From the above code we can found that in Scala 2.13 after baseSet -- otherSet.baseSet, the result order changed .

Maybe we can use baseSet.diff(otherSet.baseSet) instead of baseSet -- otherSet.baseSet, but I need to confirm the feasibility.

@dbaliafroozeh
Copy link
Contributor Author

@cloud-fan @LuciferYang we can also try to use Java's LinkedHashSet here if there is a difference between different versions of Scala's mutable.LinkedHashSet.

@hvanhovell
Copy link
Contributor

@dbaliafroozeh -- returns a set, which gets converted into linked set by AttributeSet.apply(..). The 2.12 implementation will actually return a linked set, whereas the new probably returns a set.

@LuciferYang can you open a PR that basically uses 2.12 implementation inside AttributeSet?

@LuciferYang
Copy link
Contributor

LuciferYang commented Sep 8, 2020

@LuciferYang can you open a PR that basically uses 2.12 implementation inside AttributeSet?

@hvanhovell Ok ~ I will give a new followup pr

@LuciferYang
Copy link
Contributor

LuciferYang commented Sep 8, 2020

@hvanhovell use clone() --= xs.seq instead of -- or use diff method?

@cloud-fan
Copy link
Contributor

I'd prefer diff if it works, as it's simpler.

@hvanhovell
Copy link
Contributor

Anything that explicitly maintains the insertion order (i.e. returns a LinkedHashSet) will do :).

@hvanhovell
Copy link
Contributor

diff returns a set and does not fall in that category.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants