
[SPARK-50644][SQL] Read variant struct in Parquet reader. #49263

Closed

Conversation

chenhao-db (Contributor):

What changes were proposed in this pull request?

This PR adds support for variant struct in the Parquet reader. The concept of a variant struct was introduced in #49235: it contains all the fields that the query extracts from a variant column.

Why are the changes needed?

By producing the variant struct directly in the Parquet reader, we can avoid reading and rebuilding the full variant value, which makes variant processing more efficient.
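
For concreteness, a hedged illustration (not from the PR; the table path, column names, and the SparkSession named spark are all made-up assumptions): a query that extracts only two paths from a variant column v. With the Parquet reader producing the variant struct, such a query does not need the full variant value rebuilt for every row.

// Illustration only: variant_get is Spark's variant extraction expression;
// the table path and field paths below are hypothetical.
val df = spark.read.parquet("/path/to/events")
  .selectExpr(
    "variant_get(v, '$.user.id', 'int') AS user_id",
    "variant_get(v, '$.event.name', 'string') AS event_name")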

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit test.

Was this patch authored or co-authored using generative AI tooling?

No.

github-actions bot added the SQL label on Dec 21, 2024
HyukjinKwon changed the title from "[SPARK-50644] Read variant struct in Parquet reader." to "[SPARK-50644][SQL] Read variant struct in Parquet reader." on Dec 23, 2024
@@ -188,6 +570,32 @@ case object SparkShreddingUtils {
scalarSchema, objectSchema, arraySchema)
}

// Convert a scalar variant schema into a Spark scalar type.
def scalarSchemaToSparkType(scalar: VariantSchema.ScalarType): DataType = scalar match {
case _: VariantSchema.StringType => StringType
Member:

I wonder if we can have a util that lists the supported types here... otherwise, it's very likely we'll miss updating this place when we happen to support more types.

Contributor Author (chenhao-db):

I think we can just keep it as it is because:

  1. Essentially, this function only needs to process the VariantSchema returned by buildVariantSchema, which we control and will know about when it changes.
  2. I cannot think of an easy way to list all supported types; decimal types alone can have many precision-scale combinations (see the toy sketch below).
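
A toy Scala sketch of the second point (illustrative only; these are not the PR's VariantSchema types): a schema family with a parameterized member such as decimal cannot be enumerated as a fixed list of supported types.

sealed trait ToyScalarSchema
case object ToyStringSchema extends ToyScalarSchema
final case class ToyDecimalSchema(precision: Int, scale: Int) extends ToyScalarSchema

// A single match arm has to cover an unbounded number of precision/scale combinations,
// so there is no finite list of "supported types" to enumerate.
def toTypeName(s: ToyScalarSchema): String = s match {
  case ToyStringSchema => "string"
  case ToyDecimalSchema(p, sc) => s"decimal($p,$sc)"
}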

chenhao-db (Contributor Author):

@cloud-fan @cashmand @gene-db could you help review? Thanks!

@@ -390,6 +394,11 @@ object ParquetReadSupport extends Logging {
.named(parquetRecord.getName)
}

private def clipVariantSchema(parquetType: GroupType, variantStruct: StructType): GroupType = {
// TODO(SHREDDING): clip `parquetType` to retain the necessary columns.
Contributor:

So does this require a new Parquet version to support column pruning for variant?

Contributor:

I don't think it requires a new Parquet version - it should be possible to clip it in sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadSupport.scala in the same way that unused struct fields are clipped. The logic of deciding which fields can be clipped is more complicated, though.

chenhao-db (Contributor Author), Dec 23, 2024:

It doesn't. In this function, we will have custom logic to clip parquetType to retain only the columns needed to read variantStruct. That part will come in a future PR to avoid making this single PR too big.
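
A hedged sketch only (not the future PR's logic): the mechanical part of clipping a Parquet GroupType down to a set of requested field names with parquet-mr's Types builder, similar to how unused struct fields are clipped elsewhere in ParquetReadSupport. The harder part discussed above, deciding which shredded sub-columns a requested variant path actually needs, is not shown; the helper name is hypothetical.

import org.apache.parquet.schema.{GroupType, Types}
import scala.jdk.CollectionConverters._

def clipToRequestedFields(parquetType: GroupType, requested: Set[String]): GroupType = {
  // Keep only the fields whose names the query actually requested.
  val kept = parquetType.getFields.asScala.filter(f => requested.contains(f.getName))
  Types
    .buildGroup(parquetType.getRepetition)
    .addFields(kept.toSeq: _*)
    .named(parquetType.getName)
}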

cloud-fan (Contributor):

thanks, merging to master!

cloud-fan closed this in 2c1c4d2 on Dec 24, 2024
cashmand (Contributor) left a comment:

One question for a possible future follow-up.

override def readFromTyped(row: InternalRow, topLevelMetadata: Array[Byte]): Any = {
if (castProject == null) {
return if (targetType.isInstanceOf[StringType]) {
UTF8String.fromString(rebuildVariant(row, topLevelMetadata).toJson(castArgs.zoneId))
Contributor:

In the case where the target type is string and the typed_value type is also string, would this add a lot of overhead? Is it worth specializing that case, since it seems like one that's likely to be common? I guess more generally, is rebuildVariant a heavier-than-necessary hammer if typed_value is any scalar?

Contributor Author (chenhao-db):

I think there is a misunderstanding. If the target type is string and the typed_value type is also string, castProject will not be null, and the code will not take the rebuild path. I also measured the cost of castProject, and it turns out to be small. For string -> string specifically, if I replace the whole readFromTyped with row.getUTF8String(schema.typedIdx), the performance improvement is <10%.
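
To make the two paths concrete, a toy sketch of the dispatch described above (simplified names, not the PR code): the rebuild-and-toJson branch is only the fallback used when no cast projection exists.

sealed trait TypedReadPath
case object UseCastProject extends TypedReadPath      // e.g. shredded string read as a string target
case object RebuildThenToJson extends TypedReadPath   // fallback when the target type is string
case object RebuildThenCast extends TypedReadPath     // fallback for other target types

def choosePath(hasCastProject: Boolean, targetIsString: Boolean): TypedReadPath =
  if (hasCastProject) UseCastProject
  else if (targetIsString) RebuildThenToJson
  else RebuildThenCast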

Contributor:

Oh, got it, I missed that we only do this in the castProject == null case. Thanks for measuring the performance impact of readFromTyped.
