[SPARK-40876][SQL][FOLLOWUP] Widening type promotion from integers to decimal in Parquet vectorized reader #44803

johanl-db
Contributor

What changes were proposed in this pull request?

This is a follow-up to #44368 and #44513. It implements an additional type promotion, from integers to decimals, in the Parquet vectorized reader, bringing it to parity with the non-vectorized reader in that regard.

Why are the changes needed?

This allows reading Parquet files that have different schemas and mix decimals and integers, e.g. reading files containing either Decimal(15, 2) or INT32 as Decimal(15, 2), as long as the requested decimal type is large enough to accommodate the integer values without precision loss.
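
The "large enough" condition can be illustrated with a small sketch. This is not Spark's implementation: the class `IntToDecimalWidening`, the enum `IntegerKind`, and the method `canWidenToDecimal` are hypothetical names, and the digit counts are simply the number of decimal digits in each type's maximum value. Spark's own check may use different bounds.

```java
/** Hypothetical helper, not part of Spark: does every value of an integer type fit a decimal type? */
final class IntToDecimalWidening {

    /** Integer types named in this PR, with the digit count of their largest value. */
    enum IntegerKind {
        BYTE(3),   // 127
        SHORT(5),  // 32767
        INT(10),   // 2147483647
        LONG(19);  // 9223372036854775807

        final int maxDigits;

        IntegerKind(int maxDigits) {
            this.maxDigits = maxDigits;
        }
    }

    /** Returns true when decimal(precision, scale) can hold any value of the integer type exactly. */
    static boolean canWidenToDecimal(IntegerKind from, int precision, int scale) {
        if (precision <= 0 || scale < 0 || scale > precision) {
            throw new IllegalArgumentException("invalid decimal(" + precision + ", " + scale + ")");
        }
        // Digits left of the decimal point must cover the widest value of the integer type.
        return precision - scale >= from.maxDigits;
    }

    public static void main(String[] args) {
        System.out.println(canWidenToDecimal(IntegerKind.INT, 15, 2)); // true: 13 integer digits
        System.out.println(canWidenToDecimal(IntegerKind.INT, 12, 0)); // true: 12 integer digits
        System.out.println(canWidenToDecimal(IntegerKind.INT, 9, 0));  // false: 9 integer digits
    }
}
```

Under these assumptions, INT32 read as Decimal(15, 2) leaves 13 digits for the integer part, which covers the 10 digits an `int` can need.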

Does this PR introduce any user-facing change?

Yes, the following now succeeds when using the vectorized Parquet reader:

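  // Write an INT32 column, then read it back as a decimal wide enough to hold any int.
  Seq(20).toDF("a").select($"a".cast(IntegerType).as("a")).write.parquet(path)
  spark.read.schema("a decimal(12, 0)").parquet(path).collect()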

It failed before with the vectorized reader and succeeded with the non-vectorized reader.

How was this patch tested?

  • Tests added to ParquetTypeWideningSuite.
  • Updated the relevant ParquetQuerySuite test.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label Jan 19, 2024
Member

@dongjoon-hyun dongjoon-hyun left a comment

@johanl-db johanl-db force-pushed the SPARK-40876-widening-promotion-int-to-decimal branch from 9bb5d09 to ec1a8bc Compare January 22, 2024 08:02
@johanl-db
Contributor Author

@cloud-fan, since you reviewed the two previous changes around type promotion in the Parquet readers: this adds (byte, short, int, long) -> decimal promotion to the vectorized reader. The non-vectorized reader already supports it.

boolean needsUpcast = sparkType == LongType || sparkType == DoubleType ||
(isDate && sparkType == TimestampNTZType) ||
Contributor

why do we remove isDate?

Contributor Author

This was redundant, since reading an INT32 as TimestampNTZType necessarily requires converting the value. The fact that this only happens for Parquet dates isn't really relevant here, and with the current change this would be the only case where we look at the Parquet type annotation, which is a bit confusing.

Contributor

Oh I see, this is inside isLazyDecodingSupported
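
The point made in this thread can be sketched as follows. This is a simplified illustration, not Spark's code: `LazyDecodingSketch` and `RequestedType` are hypothetical names, and the sketch assumes a plain INT32 column where lazy decoding is only possible if the stored values can be used without conversion.

```java
/** Hypothetical sketch, not Spark's code: when can a plain INT32 column be decoded lazily? */
final class LazyDecodingSketch {

    /** Stand-ins for the Spark types mentioned in the review thread. */
    enum RequestedType { INTEGER, LONG, DOUBLE, TIMESTAMP_NTZ, DECIMAL }

    /** True when every INT32 value must be converted before it matches the requested type. */
    static boolean needsUpcast(RequestedType requested) {
        // No date check is needed: an INT32 read as TIMESTAMP_NTZ always requires conversion.
        return requested == RequestedType.LONG
            || requested == RequestedType.DOUBLE
            || requested == RequestedType.TIMESTAMP_NTZ
            || requested == RequestedType.DECIMAL;
    }

    /** Lazy decoding is only considered when the stored values can be used as they are. */
    static boolean isLazyDecodingSupported(RequestedType requested) {
        return !needsUpcast(requested);
    }

    public static void main(String[] args) {
        for (RequestedType type : RequestedType.values()) {
            System.out.println(type + " -> lazy decoding: " + isLazyDecodingSupported(type));
        }
    }
}
```

In this simplified model the decision depends only on the requested type, which is why looking at the Parquet date annotation adds nothing.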

toType = DecimalType(toPrecision, 2),
expectError = fromPrecision > toPrecision &&
// parquet-mr allows reading decimals into a smaller precision decimal type without
// checking for overflows. See test below checking for the overflow case in parquet-mr.
Contributor

For the non-vectorized Parquet reader, what's the behavior? Silent overflow?

Contributor Author

@johanl-db johanl-db Jan 23, 2024

Contributor

and vectorized reader just doesn't allow it?

Contributor Author

Yes, the vectorized reader throws an exception, which is what this test checks.

Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM.

Member

@dongjoon-hyun dongjoon-hyun left a comment

Oh, @johanl-db .

  • It seems that you forgot to add [FOLLOWUP] tag to the PR title.
  • In addition, please use a new JIRA from now on. You have already made 5 commits with the same JIRA ID. This is very bad for traceability in the community, because we cannot keep track of them in JIRA when we need to revert one of your contributions.
$ git log --oneline | grep SPARK-40876
0356ac00947 [SPARK-40876][SQL] Widening type promotion from integers to decimal in Parquet vectorized reader
ee2a87b4642 [SPARK-40876][SQL][TESTS][FOLLOW-UP] Remove invalid decimal test case when ANSI mode is on
d439e34d6bd [SPARK-40876][SQL] Widening type promotion for decimals with larger scale in Parquet readers
c1888cdf536 [SPARK-40876][SQL][TESTS][FOLLOWUP] Fix failed test in `ParquetTypeWideningSuite` when `SPARK_ANSI_SQL_MODE` is set to true
3361f25dc0f [SPARK-40876][SQL] Widening type promotions in Parquet readers

@dongjoon-hyun dongjoon-hyun changed the title from "[SPARK-40876][SQL] Widening type promotion from integers to decimal in Parquet vectorized reader" to "[SPARK-40876][SQL][FOLLOWUP] Widening type promotion from integers to decimal in Parquet vectorized reader" Jan 23, 2024
@johanl-db
Contributor Author

Understood, I'll make sure to create a separate ticket for each PR in the future (and use tags).

@dongjoon-hyun
Member

Thank you, @johanl-db .
