-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-40876][SQL][FOLLOWUP] Widening type promotion from integers to decimal in Parquet vectorized reader #44803
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @sunchao
9bb5d09
to
ec1a8bc
Compare
@cloud-fan since you reviewed the two previous changes around type promotion in parquet readers, this is adding (byte, short, int, long) -> decimal to the vectorized reader - the non-vectorized reader can already do it. |
boolean needsUpcast = sparkType == LongType || sparkType == DoubleType || | ||
(isDate && sparkType == TimestampNTZType) || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we remove isDate
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was redundant since reading an INT32 as TimestampNTZType necessarily requires converting the value. The fact that this only happens for parquet dates isn't really relevant here and with the current change this would be the only case where we look at the parquet type annotation which is a bit confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I see, this is inside isLazyDecodingSupported
toType = DecimalType(toPrecision, 2), | ||
expectError = fromPrecision > toPrecision && | ||
// parquet-mr allows reading decimals into a smaller precision decimal type without | ||
// checking for overflows. See test below checking for the overflow case in parquet-mr. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for non-vectorized parquet reader, what's the behavior? silent overflow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Decimal values are set to null on overflow, see https://github.com/apache/spark/pull/44803/files/8935b284e5519038f78fd95c8d12e66224f29d63#diff-a5cfd7285f9adf95b2aeea90aa57cc35d2b8c6bddaa0f4652172d30a264d3614R347
Integers wrap around on overflow on the other hand:
https://github.com/apache/spark/pull/44803/files/8935b284e5519038f78fd95c8d12e66224f29d63#diff-a5cfd7285f9adf95b2aeea90aa57cc35d2b8c6bddaa0f4652172d30a264d3614R363
Arguably not great but changing it would be a breaking change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and vectorized reader just doesn't allow it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes the vectorized reader throws an exception which this test is checking
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, @johanl-db .
- It seems that you forgot to add
[FOLLOWUP]
tag to the PR title. - In addition, please use a new JIRA from now. You already made 5 commits with the same JIRA ID. This is very bad in terms of traceability in the community because we cannot keep track in JIRA when we need to revert one of your contributions.
$ git log --oneline | grep SPARK-40876
0356ac00947 [SPARK-40876][SQL] Widening type promotion from integers to decimal in Parquet vectorized reader
ee2a87b4642 [SPARK-40876][SQL][TESTS][FOLLOW-UP] Remove invalid decimal test case when ANSI mode is on
d439e34d6bd [SPARK-40876][SQL] Widening type promotion for decimals with larger scale in Parquet readers
c1888cdf536 [SPARK-40876][SQL][TESTS][FOLLOWUP] Fix failed test in `ParquetTypeWideningSuite` when `SPARK_ANSI_SQL_MODE` is set to true
3361f25dc0f [SPARK-40876][SQL] Widening type promotions in Parquet readers
Understood, I'll make sure to create separate tickets for each PRs in the future (and use tags). |
Thank you, @johanl-db . |
What changes were proposed in this pull request?
This is a follow-up from #44368 and #44513, implementing an additional type promotion from integers to decimals in the parquet vectorized reader, bringing it at parity with the non-vectorized reader in that regard.
Why are the changes needed?
This allows reading parquet files that have different schemas and mix decimals and integers - e.g reading files containing either
Decimal(15, 2)
andINT32
asDecimal(15, 2)
- as long as the requested decimal type is large enough to accommodate the integer values without precision loss.Does this PR introduce any user-facing change?
Yes, the following now succeeds when using the vectorized Parquet reader:
It failed before with the vectorized reader and succeeded with the non-vectorized reader.
How was this patch tested?
ParquetWideningTypeSuite
ParquetQuerySuite
test.Was this patch authored or co-authored using generative AI tooling?
No