Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-50503][SQL] Prohibit partitioning by Variant data #49080

Conversation

harshmotw-db
Copy link
Contributor

What changes were proposed in this pull request?

Prior to this PR, repartition by Variant producing expressions wasn't blocked during analysis. It should be blocked because Variant equality is not defined. It is similar to this PR which blocked Variant from Set operations.

Why are the changes needed?

Variant equality is not defined yet and therefore shouldn't be allowed in repartitioning.

Does this PR introduce any user-facing change?

Yes, prior to this PR, Variants repartition did not throw a well defined error.

How was this patch tested?

Unit tests.

Was this patch authored or co-authored using generative AI tooling?

No.

@harshmotw-db harshmotw-db marked this pull request as ready for review December 5, 2024 21:08
@github-actions github-actions bot added the SQL label Dec 5, 2024
@harshmotw-db
Copy link
Contributor Author

@gene-db This PR blocks Variants from being used in repartitioning.

@@ -96,6 +96,13 @@ trait CheckAnalysis extends PredicateHelper with LookupCatalog with QueryErrorsB
case _ => None
}

protected def variantExprInPartitionExpression(plan: LogicalPlan): Option[Expression] =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

qq, Map and Variant should be excluded in the same scenarios. Can you check if we are covering everything for Variant by checking what do for Map?

Copy link
Contributor Author

@harshmotw-db harshmotw-db Dec 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Repartition by map is currently allowed. I don't know if that is intended. My previous PR which blocked set operations on Variant followed the same pattern as Maps.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also add a check for Map I suppose.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That could be quite risky considering several Spark workloads could currently be repartitioning by map types. It's fine for Variant as Variant is a new feature.
Also, broken "repartitioning" isn't a correctness problem since query results should be agnostic of the partitioning right? I think it would result in suboptimal performance at worst.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@harshmotw-db no it will be an actual correctness issue. The problem - like with variant - is that the order of key-value pairs in a map is undefined, e.g. Map(a ->b, b -> c) == Map(b -> c, a -> b). However this is not your problem. Your point about it being a problem for existing workloads is correct, however that would only mean we would have to add a feature flag. cc @cloud-fan @HyukjinKwon

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah good catch! I think we can fix it now by adding MapSort expression for the Repartition node in InsertMapSortInGroupingExpressions (and rename the rule)

cc @stevomitric @nebojsa-db

@HyukjinKwon HyukjinKwon changed the title [SPARK-50503] Prohibit partitioning by Variant data [SPARK-50503][SQL] Prohibit partitioning by Variant data Dec 6, 2024
@harshmotw-db
Copy link
Contributor Author

@hvanhovell Can you please go over this PR again? Thanks!

Copy link
Contributor

@hvanhovell hvanhovell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@hvanhovell
Copy link
Contributor

Merging to master

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants