[SPARK-40876][SQL][FOLLOWUP] Widening type promotion from integers to decimal in Parquet vectorized reader #44803

johanl-db · 2024-01-19T16:53:46Z

What changes were proposed in this pull request?

This is a follow-up to #44368 and #44513. It implements an additional type promotion from integers to decimals in the Parquet vectorized reader, bringing it to parity with the non-vectorized reader in that regard.

Why are the changes needed?

This allows reading Parquet files that have different schemas and mix decimals and integers, e.g. reading files containing either Decimal(15, 2) or INT32 as Decimal(15, 2), as long as the requested decimal type is large enough to accommodate the integer values without precision loss.
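The "large enough" condition can be sketched as a digit-count check. The following is a minimal standalone illustration, not the code from this PR: the class and method names are hypothetical, and the digit counts (10 for INT32, 19 for INT64) are simply the lengths of the largest value of each type, so the thresholds Spark actually applies may differ.

```java
// Hypothetical helper, not part of Spark: decides whether every value of a
// Parquet integer type fits into decimal(precision, scale) without loss.
public class IntToDecimalCheck {
    // Number of decimal digits in Integer.MAX_VALUE (2147483647).
    static final int INT32_DIGITS = 10;
    // Number of decimal digits in Long.MAX_VALUE (9223372036854775807).
    static final int INT64_DIGITS = 19;

    // The digits left of the decimal point must cover the widest integer value.
    static boolean fits(int integerDigits, int precision, int scale) {
        return precision - scale >= integerDigits;
    }

    public static void main(String[] args) {
        System.out.println(fits(INT32_DIGITS, 15, 2)); // true
        System.out.println(fits(INT32_DIGITS, 12, 0)); // true
        System.out.println(fits(INT32_DIGITS, 10, 2)); // false
        System.out.println(fits(INT64_DIGITS, 15, 2)); // false
    }
}
```

Under these assumptions, Decimal(15, 2) leaves 13 digits for the integer part, which is enough for any INT32 but not for every INT64.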

Does this PR introduce any user-facing change?

Yes, the following now succeeds when using the vectorized Parquet reader:

  Seq(20).toDF("a").select($"a".cast(IntegerType)).write.parquet(path)
  spark.read.schema("a decimal(12, 0)").parquet(path).collect()

It failed before with the vectorized reader and succeeded with the non-vectorized reader.

How was this patch tested?

Tests added to ParquetTypeWideningSuite.
Updated the relevant ParquetQuerySuite test.

Was this patch authored or co-authored using generative AI tooling?

No

dongjoon-hyun

cc @sunchao

johanl-db · 2024-01-22T17:39:02Z

@cloud-fan since you reviewed the two previous changes around type promotion in Parquet readers: this adds (byte, short, int, long) -> decimal promotion to the vectorized reader. The non-vectorized reader already supports it.

cloud-fan · 2024-01-23T08:19:19Z

...src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java

        boolean needsUpcast = sparkType == LongType || sparkType == DoubleType ||
-          (isDate && sparkType == TimestampNTZType) ||


why do we remove isDate?

This was redundant, since reading an INT32 as TimestampNTZType necessarily requires converting the value. The fact that this only happens for Parquet dates isn't really relevant here, and with the current change this would be the only case where we look at the Parquet type annotation, which is a bit confusing.

Oh I see, this is inside isLazyDecodingSupported
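To illustrate why an INT32 date can never be handed to a TimestampNTZType column as-is: Parquet DATE values count days since the Unix epoch, while Spark's TimestampNTZType stores microseconds since the epoch. A minimal sketch of that conversion follows. The class and method names are hypothetical, and this is not the reader's actual code.

```java
// Hypothetical illustration, not Spark code.
public class DateToTimestampNtz {
    static final long MICROS_PER_DAY = 24L * 60 * 60 * 1_000_000;

    // Converts days since 1970-01-01 to microseconds since 1970-01-01T00:00:00,
    // throwing ArithmeticException instead of wrapping if the result overflows a long.
    static long daysToMicros(int days) {
        return Math.multiplyExact((long) days, MICROS_PER_DAY);
    }

    public static void main(String[] args) {
        System.out.println(daysToMicros(1));  // 86400000000
        System.out.println(daysToMicros(-1)); // -86400000000
    }
}
```

Because every value has to be rescaled like this, the stored INT32 values cannot be reused unchanged, whatever the Parquet type annotation says.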

cloud-fan · 2024-01-23T09:34:44Z

...test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetTypeWideningSuite.scala

+      toType = DecimalType(toPrecision, 2),
+      expectError = fromPrecision > toPrecision &&
+        // parquet-mr allows reading decimals into a smaller precision decimal type without
+        // checking for overflows. See test below checking for the overflow case in parquet-mr.


for non-vectorized parquet reader, what's the behavior? silent overflow?

Decimal values are set to null on overflow, see https://github.com/apache/spark/pull/44803/files/8935b284e5519038f78fd95c8d12e66224f29d63#diff-a5cfd7285f9adf95b2aeea90aa57cc35d2b8c6bddaa0f4652172d30a264d3614R347

Integers wrap around on overflow on the other hand:
https://github.com/apache/spark/pull/44803/files/8935b284e5519038f78fd95c8d12e66224f29d63#diff-a5cfd7285f9adf95b2aeea90aa57cc35d2b8c6bddaa0f4652172d30a264d3614R363

Arguably not great, but changing it would be a breaking change.

and vectorized reader just doesn't allow it?

Yes, the vectorized reader throws an exception, which is what this test checks.
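The three behaviors discussed in this thread can be contrasted in a standalone sketch. It uses plain `java.math.BigDecimal` and hypothetical method names rather than Spark's readers, so it only mirrors the outcomes described above: null on decimal overflow, wrap-around on integer overflow, and an exception when the conversion is rejected up front.

```java
import java.math.BigDecimal;

// Hypothetical illustration, not Spark code.
public class OverflowBehaviors {
    // Decimal overflow handled by nulling: returns null when the value needs
    // more digits than the target precision allows.
    static BigDecimal toPrecisionOrNull(BigDecimal value, int precision) {
        return value.precision() > precision ? null : value;
    }

    // Integer overflow handled by wrapping: a narrowing cast keeps the low 32 bits.
    static int narrowWithWrap(long value) {
        return (int) value;
    }

    // Rejecting a narrowing conversion before any value is read.
    static void requireWidening(int fromPrecision, int toPrecision) {
        if (fromPrecision > toPrecision) {
            throw new IllegalArgumentException(
                "cannot narrow precision " + fromPrecision + " to " + toPrecision);
        }
    }

    public static void main(String[] args) {
        System.out.println(toPrecisionOrNull(new BigDecimal("123.45"), 4)); // null
        System.out.println(narrowWithWrap(2147483648L)); // -2147483648
        requireWidening(5, 7); // accepted, no exception
    }
}
```

The sketch makes the trade-off visible: the first two strategies silently change data, while the third surfaces the schema mismatch to the caller.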

dongjoon-hyun

+1, LGTM.

dongjoon-hyun

Oh, @johanl-db .

It seems that you forgot to add the [FOLLOWUP] tag to the PR title.
In addition, please use a new JIRA from now on. You have already made 5 commits with the same JIRA ID. This is very bad for traceability in the community because we cannot keep track in JIRA when we need to revert one of your contributions.

$ git log --oneline | grep SPARK-40876
0356ac00947 [SPARK-40876][SQL] Widening type promotion from integers to decimal in Parquet vectorized reader
ee2a87b4642 [SPARK-40876][SQL][TESTS][FOLLOW-UP] Remove invalid decimal test case when ANSI mode is on
d439e34d6bd [SPARK-40876][SQL] Widening type promotion for decimals with larger scale in Parquet readers
c1888cdf536 [SPARK-40876][SQL][TESTS][FOLLOWUP] Fix failed test in `ParquetTypeWideningSuite` when `SPARK_ANSI_SQL_MODE` is set to true
3361f25dc0f [SPARK-40876][SQL] Widening type promotions in Parquet readers

johanl-db · 2024-01-24T10:49:51Z

Understood, I'll make sure to create a separate ticket for each PR in the future (and use tags).

dongjoon-hyun · 2024-01-25T00:16:37Z

Thank you, @johanl-db .

github-actions bot added the SQL label Jan 19, 2024

dongjoon-hyun reviewed Jan 19, 2024

Add widening type promotion from integers to decimals in Parquet vectorized reader

ec1a8bc

johanl-db force-pushed the SPARK-40876-widening-promotion-int-to-decimal branch from 9bb5d09 to ec1a8bc Compare January 22, 2024 08:02

johanl-db added 2 commits January 22, 2024 11:04

Fix import order

21290c6

Don't read unsigned integers as decimals

8935b28

cloud-fan reviewed Jan 23, 2024

cloud-fan approved these changes Jan 23, 2024

dongjoon-hyun approved these changes Jan 23, 2024

dongjoon-hyun closed this in 0356ac0 Jan 23, 2024

dongjoon-hyun reviewed Jan 23, 2024

dongjoon-hyun changed the title ~~[SPARK-40876][SQL] Widening type promotion from integers to decimal in Parquet vectorized reader~~ [SPARK-40876][SQL][FOLLOWUP] Widening type promotion from integers to decimal in Parquet vectorized reader Jan 23, 2024

jackierwzhang mentioned this pull request Aug 6, 2024

[SPARK-49082][SQL] Widening type promotions in AvroDeserializer #47582

Closed
