-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-50503][SQL] Prohibit partitioning by Variant data #49080
[SPARK-50503][SQL] Prohibit partitioning by Variant data #49080
Conversation
@gene-db This PR blocks Variants from being used in repartitioning. |
sql/core/src/test/scala/org/apache/spark/sql/DataFrameSetOperationsSuite.scala
Outdated
Show resolved
Hide resolved
@@ -96,6 +96,13 @@ trait CheckAnalysis extends PredicateHelper with LookupCatalog with QueryErrorsB | |||
case _ => None | |||
} | |||
|
|||
protected def variantExprInPartitionExpression(plan: LogicalPlan): Option[Expression] = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
qq, Map and Variant should be excluded in the same scenarios. Can you check if we are covering everything for Variant by checking what do for Map?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Repartition by map is currently allowed. I don't know if that is intended. My previous PR which blocked set operations on Variant followed the same pattern as Maps.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should also add a check for Map I suppose.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That could be quite risky considering several Spark workloads could currently be repartitioning by map types. It's fine for Variant as Variant is a new feature.
Also, broken "repartitioning" isn't a correctness problem since query results should be agnostic of the partitioning right? I think it would result in suboptimal performance at worst.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@harshmotw-db no it will be an actual correctness issue. The problem - like with variant - is that the order of key-value pairs in a map is undefined, e.g. Map(a ->b, b -> c) == Map(b -> c, a -> b)
. However this is not your problem. Your point about it being a problem for existing workloads is correct, however that would only mean we would have to add a feature flag. cc @cloud-fan @HyukjinKwon
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is the ticket: https://issues.apache.org/jira/browse/SPARK-50525
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah good catch! I think we can fix it now by adding MapSort
expression for the Repartition
node in InsertMapSortInGroupingExpressions
(and rename the rule)
@hvanhovell Can you please go over this PR again? Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Merging to master |
What changes were proposed in this pull request?
Prior to this PR, repartition by Variant producing expressions wasn't blocked during analysis. It should be blocked because Variant equality is not defined. It is similar to this PR which blocked Variant from Set operations.
Why are the changes needed?
Variant equality is not defined yet and therefore shouldn't be allowed in repartitioning.
Does this PR introduce any user-facing change?
Yes, prior to this PR, Variants repartition did not throw a well defined error.
How was this patch tested?
Unit tests.
Was this patch authored or co-authored using generative AI tooling?
No.