-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-50525][SQL] Define InsertMapSortInRepartitionExpressions Optimizer Rule #49144
Conversation
Hi @harshmotw-db, @hvanhovell, @cloud-fan in continuation to #49080 discussion and SPARK-50525. |
5c01b4e
to
37ef34f
Compare
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
Outdated
Show resolved
Hide resolved
sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala
Outdated
Show resolved
Hide resolved
It's a breaking change, shall we just support it now? We can handle it in |
@cloud-fan this change will impact users indeed, but it was done in considerations to #49080 and #48909. If you think it would be better to extend repartition in the similar fashion as |
consistency is good but breaking change is scary, even if the old behavior returns the wrong result. |
Okay, thank you for your point @cloud-fan! Just to double check if i've got everything correctly before further implementation: instead of prohibiting map expressions for partitioning, we can implemented a case map: MapData =>
val (kt, vt) = dataType match {
case udt: UserDefinedType[_] =>
val mapType = udt.sqlType.asInstanceOf[MapType]
mapType.keyType -> mapType.valueType
case MapType(kt, vt, _) => kt -> vt
}
val keys = map.keyArray()
val values = map.valueArray()
var result = seed
var i = 0
while (i < map.numElements()) {
result = hash(keys.get(i, kt), kt, result)
result = hash(values.get(i, vt), vt, result)
i += 1
}
result Please let me know if im missing something or if you have any other recommendations! |
@MaxGekk could you please check this PR again? what are your thoughts? |
We can rename |
@ostronaut what does it take to get this PR moving? |
No blockers for now! If this is fine, i will implement changes as suggested by @cloud-fan, where map will be changed to sorted map in same way as InsertMapSortInGroupingExpressions. If you have any other comments, please let me know! |
@cloud-fan, @hvanhovell, @MaxGekk ready for review after applying all suggestions! |
* SELECT * FROM TABLE DISTRIBUTE BY map_column => | ||
* SELECT * FROM TABLE DISTRIBUTE BY map_sort(map_column) | ||
*/ | ||
object InsertMapSortInRepartitionExpressions extends Rule[LogicalPlan] { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we combine these two rules so that we only need to traverse the plan once?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
initially i wanted to do the same, but logic for InsertMapSortInGroupingExpressions
and InsertMapSortInRepartitionExpressions
is quite different: Grouping produces a new output after applying the changes, while Repartition only updates existing RepartitionByExpression
by replacing partitionExpressions
.
Also, there is a dependency between InsertMapSortInGroupingExpressions
and PullOutGroupingExpressions
, as mentioned in this comment.
For those reasons i decided to split them into separate Rules. But if you think performance saving from reduced traverse will be significant, we can combine those.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's fine to use transformUpWithNewOutput
for both. If we hit RepartitionByExpression
, we return Nil
as the new output.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think this will make things more complex, while there is no need to return new output for RepartitionByExpression
. Also, to prevent traverse on every plan we have added two conditions in InsertMapSortInRepartitionExpressions
:
_.containsPattern(REPARTITION_OPERATION)
ascond
totransformUpWithPruning
.if rep.partitionExpressions.exists(mapTypeExistsRecursively)
in case matching.
So i would keep these two independent from each other.
0d7adec
to
75e323b
Compare
...atalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InsertMapSortExpression.scala
Outdated
Show resolved
Hide resolved
...atalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InsertMapSortExpression.scala
Show resolved
Hide resolved
...atalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InsertMapSortExpression.scala
Outdated
Show resolved
Hide resolved
d6592d2
to
20c76e4
Compare
There is a test failure in |
- SPARK-47148: AQE should avoid to submit shuffle job on cancellation test failed on initial run, but after re-run it succeeded. So i dont think its related to this PR. Most likely this test is unstable. Note: the issue was: |
Hi @cloud-fan, @MaxGekk. Can we merge this PR? |
The Spark Connect failure is unrelated, thanks, merging to master! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, @ostronaut and @cloud-fan . This seems to break non-ANSI mode. Could you take a look at the CI failures?
According to the logs, 4 suites failed due to this.
|
@@ -428,6 +428,33 @@ class DataFrameSuite extends QueryTest | |||
} | |||
} | |||
|
|||
test("repartition by MapType") { | |||
Seq("int", "long", "float", "double", "decimal(10, 2)", "string", "varchar(6)").foreach { dt => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This newly added test case fails at NON-ANSI mode at the test case, "decimal(10, 2)"
.
…y MapType` test assumption ### What changes were proposed in this pull request? This is a follow-up to recover the NON-ANSI mode CI failure by adding a test assumption clearly. - #49144 ### Why are the changes needed? **BEFORE** ``` $ SPARK_ANSI_SQL_MODE=false build/sbt "sql/testOnly *.DataFrameSuite -- -z MapType" [info] *** 1 TEST FAILED *** [error] Failed tests: [error] org.apache.spark.sql.DataFrameSuite ``` **AFTER** ``` $ SPARK_ANSI_SQL_MODE=false build/sbt "sql/testOnly *.DataFrameSuite -- -z MapType" [info] All tests passed. ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manually test with `SPARK_ANSI_SQL_MODE=false`. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #49457 from dongjoon-hyun/SPARK-50525. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…icits` ### What changes were proposed in this pull request? Related to #49144. scala 2.12 is failing with `ArrayImplicits`, which is in use for `ShowTablesExec.isTempView` method. This PR removes `org.apache.spark.util.ArrayImplicits._` from `ShowTablesExec` and uses default Seq instead. ### Why are the changes needed? To fix failing scala 2.12 compilation isssu. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing init tests and actions run. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #50008 from ostronaut/features/ShowTablesExec-remove-ArrayImplicits. Authored-by: Dima <dimanowq@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…icits` Related to #49144. scala 2.12 is failing with `ArrayImplicits`, which is in use for `ShowTablesExec.isTempView` method. This PR removes `org.apache.spark.util.ArrayImplicits._` from `ShowTablesExec` and uses default Seq instead. To fix failing scala 2.12 compilation isssu. No Existing init tests and actions run. No. Closes #50008 from ostronaut/features/ShowTablesExec-remove-ArrayImplicits. Authored-by: Dima <dimanowq@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 4d15f64) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…icits` Related to #49144. scala 2.12 is failing with `ArrayImplicits`, which is in use for `ShowTablesExec.isTempView` method. This PR removes `org.apache.spark.util.ArrayImplicits._` from `ShowTablesExec` and uses default Seq instead. To fix failing scala 2.12 compilation isssu. No Existing init tests and actions run. No. Closes #50008 from ostronaut/features/ShowTablesExec-remove-ArrayImplicits. Authored-by: Dima <dimanowq@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 4d15f64) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?
In the current version of Spark, its possible to use
MapType
as column for repartitioning. ButMapData
does not implementequals
andhashCode
(in according to SPARK-9415 and [SPARK-16135][SQL] Remove hashCode and equals in ArrayBasedMapData). Considering that, hash value for same Maps can be different.In an attempt to run
xxhash64
orhash
function onMapType
,org.apache.spark.sql.catalyst.ExtendedAnalysisException: [DATATYPE_MISMATCH.HASH_MAP_TYPE] Cannot resolve "xxhash64(value)" due to data type mismatch: Input to the function `xxhash64` cannot contain elements of the "MAP" type. In Spark, same maps may have different hashcode, thus hash expressions are prohibited on "MAP" elements. To restore previous behavior set "spark.sql.legacy.allowHashOnMapType" to "true".;
will be thrown.Also, when trying to run
ds.distinct(col("value"))
, wherevalue
hasMapType
, the following exception is thrown:org.apache.spark.sql.catalyst.ExtendedAnalysisException: [UNSUPPORTED_FEATURE.SET_OPERATION_ON_MAP_TYPE] The feature is not supported: Cannot have MAP type columns in DataFrame which calls set operations (INTERSECT, EXCEPT, etc.), but the type of column `value` is "MAP<INT, STRING>".;
With the above consideration, a new
InsertMapSortInRepartitionExpressions
Rule[LogicalPlan]
was implemented to insertmapsort
for everyMapType
inRepartitionByExpression.partitionExpressions
.Why are the changes needed?
To keep
repartition
API for MapType consistent.Does this PR introduce any user-facing change?
No.
How was this patch tested?
Unit tests.
Was this patch authored or co-authored using generative AI tooling?
No