
[SPARK-50615][SQL] Push variant into scan. #49235

Closed

Conversation

@chenhao-db (Contributor) commented Dec 18, 2024

What changes were proposed in this pull request?

It adds an optimizer rule that pushes variant extraction into the scan: the variant type is rewritten as a struct type producing all requested fields, and the variant extraction expressions are rewritten as struct accesses. This will be the foundation of the variant shredding reader. The rule must be disabled at this point because the scan is not yet able to recognize the special struct.
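
As a hedged illustration (the table name `t` and column `v` are hypothetical; the plan shapes follow the rule's own documentation quoted later in this thread), this is the query shape the rule targets:

import org.apache.spark.sql.SparkSession

// A minimal sketch, assuming a Parquet table `t` with a variant column `v`.
val spark = SparkSession.builder().appName("variant-pushdown-demo").getOrCreate()

// Without the rule, both extractions decode the full variant binary per row.
// With PushVariantIntoScan enabled (and the reader implemented), the optimizer
// rewrites `v` in the scan to struct<0: int, 1: string> and turns the two
// expressions below into struct accesses v.0 and v.1.
val df = spark.sql(
  """SELECT variant_get(v, '$.a', 'int')    AS a,
    |       variant_get(v, '$.b', 'string') AS b
    |FROM t""".stripMargin)
df.explain(true)  // inspect the optimized plan for the rewritten scan schema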

Why are the changes needed?

It is necessary for good performance when reading shredded variants. With this rule (and the reader implemented), the scan only needs to fetch the shredded columns required by the plan.
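
For context, a hedged sketch of a shredded on-disk layout (field names follow the Parquet variant-shredding draft; the exact physical schema is an assumption, not taken from this PR). If the plan only requests `$.a` as int, the scan can fetch just that shredded column:

import org.apache.spark.sql.types._

// Illustrative shredded layout for a variant column `v` holding objects
// like {"a": 1, "b": "x"}; names follow the variant-shredding draft spec.
val shreddedSchema = StructType(Seq(
  StructField("metadata", BinaryType),
  StructField("value", BinaryType),  // fallback for rows not shredded into typed_value
  StructField("typed_value", StructType(Seq(
    StructField("a", StructType(Seq(
      StructField("value", BinaryType),        // per-field fallback
      StructField("typed_value", IntegerType)  // what a `$.a`-as-int query reads
    ))),
    StructField("b", StructType(Seq(
      StructField("value", BinaryType),
      StructField("typed_value", StringType)
    )))
  )))
))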

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit test.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions bot added the SQL label Dec 18, 2024
@chenhao-db (Contributor, Author)

@cloud-fan @gene-db @cashmand Please help review, thanks!

@HyukjinKwon HyukjinKwon changed the title [SPARK-50615] Push variant into scan. [SPARK-50615][SQL] Push variant into scan. Dec 19, 2024
@@ -95,7 +95,8 @@ class SparkOptimizer(
       EliminateLimits,
       ConstantFolding),
     Batch("User Provided Optimizers", fixedPoint, experimentalMethods.extraOptimizations: _*),
-    Batch("Replace CTE with Repartition", Once, ReplaceCTERefWithRepartition)))
+    Batch("Replace CTE with Repartition", Once, ReplaceCTERefWithRepartition),
+    Batch("Push Variant Into Scan", Once, PushVariantIntoScan)))
Contributor:

why does this rule need to be at the end of the optimizer?

Comment on lines 146 to 147
def addVariantFields(attrId: ExprId, dataType: DataType, defaultValue: Any,
    path: Seq[Int]): Unit = {
Contributor:

Suggested change
-  def addVariantFields(attrId: ExprId, dataType: DataType, defaultValue: Any,
-      path: Seq[Int]): Unit = {
+  def addVariantFields(
+      attrId: ExprId,
+      dataType: DataType,
+      defaultValue: Any,
+      path: Seq[Int]): Unit = {

Comment on lines 198 to 199
private def addField(map: HashMap[RequestedVariantField, Int],
    field: RequestedVariantField): Unit = {
Contributor:

Suggested change
-  private def addField(map: HashMap[RequestedVariantField, Int],
-      field: RequestedVariantField): Unit = {
+  private def addField(
+      map: HashMap[RequestedVariantField, Int],
+      field: RequestedVariantField): Unit = {

}
}

private def rewritePlan(originalPlan: LogicalPlan,
Contributor:

ditto, fix indentation please

// - Filter [v.0 = 1]
// - Relation [v: struct<0: int, 1: string, 2: variant>]
// The struct fields are annotated with `VariantMetadata` to indicate the extraction path.
object PushVariantIntoScan extends Rule[LogicalPlan] {
Contributor:

This rule matches a similar plan pattern to SchemaPruning; shall we also put this rule in the earlyScanPushDownRules of SparkOptimizer?
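
(For reference, a rough sketch of what that suggestion amounts to; the surrounding rule list is elided and the exact placement is an assumption, not the merged code:)

// In SparkOptimizer, instead of a dedicated batch at the end of the optimizer:
override def earlyScanPushDownRules: Seq[Rule[LogicalPlan]] =
  Seq(
    SchemaPruning,       // already prunes nested struct fields at the scan
    PushVariantIntoScan  // ...followed by the remaining early push-down rules
  )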

@chenhao-db (Contributor, Author)

@cloud-fan Thanks! I have made the changes you recommended.

@chenhao-db chenhao-db requested a review from cloud-fan December 20, 2024 03:01
@cloud-fan (Contributor)

thanks, merging to master!

@cloud-fan cloud-fan closed this in 78592a0 Dec 20, 2024
cloud-fan pushed a commit that referenced this pull request Dec 24, 2024
### What changes were proposed in this pull request?

It adds support for the variant struct in the Parquet reader. The concept of the variant struct was introduced in #49235. It includes all the extracted fields from a variant column that the query requests.

### Why are the changes needed?

By producing the variant struct in the Parquet reader, we can avoid reading/rebuilding the full variant and achieve more efficient variant processing.
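
A rough, hedged picture (reusing the plan notation from the rule's documentation in #49235; the query and table are hypothetical, and this is not the literal reader output):

// Query: SELECT variant_get(v, '$.a', 'int') FROM t
// Without this PR, the scan must read all shredded columns and rebuild the
// full variant `v` before the extraction runs on the binary value.
// With this PR, the Parquet reader produces the variant struct directly:
//   - Project [v.0]
//     - Relation [v: struct<0: int>]   // only the shredded data for `$.a` is read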

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #49263 from chenhao-db/spark_variant_struct_reader.

Authored-by: Chenhao Li <chenhao.li@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>