[SPARK-45905][SQL] Least common type between decimal types should retain integral digits first #43781
Conversation
Could you re-trigger the failed CI test pipeline, @cloud-fan?
Also, cc @kazuyukitanimura.
@@ -5425,7 +5434,7 @@ class SQLConf extends Serializable with Logging with SqlApiConf {
   }

   def legacyRaiseErrorWithoutErrorClass: Boolean =
-  getConf(SQLConf.LEGACY_RAISE_ERROR_WITHOUT_ERROR_CLASS)
+    getConf(SQLConf.LEGACY_RAISE_ERROR_WITHOUT_ERROR_CLASS)
noise? (whitespace changes best made in a non-bugfix PR?)
Since I touched this file, I just fixed the wrong indentation.
| e1 - e2 | max(s1, s2) + max(p1 - s1, p2 - s2) + 1 | max(s1, s2) |
| e1 * e2 | p1 + p2 + 1 | s1 + s2 |
| e1 / e2 | p1 - s1 + s2 + max(6, s1 + p2 + 1) | max(6, s1 + p2 + 1) |
| e1 % e2 | min(p1 - s1, p2 - s2) + max(s1, s2) | max(s1, s2) |
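(A worked instance of the `*` row above, not quoted from the docs: multiplying a decimal(10, 2) by a decimal(10, 2) gives result precision p1 + p2 + 1 = 21 and result scale s1 + s2 = 4, i.e. decimal(21, 4), before any truncation to the 38-digit maximum.)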
AFAIK, the arithmetic operations did not strictly follow this rule.
Which one does not follow? The final decimal type can be different, as there is one more truncation step.
`*` and `/`.

For example:

val a = Decimal(100)  // p: 10, s: 0
val b = Decimal(-100) // p: 10, s: 0
val c = a * b         // Decimal(-10000), p: 5, s: 0
This is not the Spark SQL Multiply. Please take a look at Multiply#resultDecimalType.
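A minimal way to observe the type that Spark SQL's Multiply actually produces, assuming a running SparkSession named `spark` (the hand-constructed `Decimal` values above bypass this code path):

```scala
// decimal(10, 0) * decimal(10, 0): per the quoted table, result precision is
// p1 + p2 + 1 = 21 and result scale is s1 + s2 = 0, which fits within 38 digits.
val df = spark.sql("SELECT CAST(100 AS DECIMAL(10, 0)) * CAST(-100 AS DECIMAL(10, 0)) AS c")
df.printSchema()
// Expected to print something like:
// root
//  |-- c: decimal(21,0) (nullable = true)
```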
    } else {
      // If we have to reduce the precision, we should retain the digits in the integral part first,
      // as they are more significant to the value. Here we reduce the scale as well to drop the
      // digits in the fractional part.
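To make the new branch concrete (an illustrative calculation, not quoted from the PR): the least common type of decimal(38, 20) and decimal(30, 0) needs max(18, 30) = 30 integral digits and max(20, 0) = 20 fractional digits, 50 digits in total. The legacy behavior returns decimal(38, 20), which leaves room for only 18 integral digits; with this change the 30 integral digits are kept and the scale is cut to 38 - 30 = 8, giving decimal(38, 8).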
Looks good.
LGTM.
Could you take a look at the following failure too? I'm wondering if it's related or not.
[info] - array contains function *** FAILED *** (321 milliseconds)
[info] Expected exception org.apache.spark.sql.AnalysisException to be thrown, but no exception was thrown (DataFrameFunctionsSuite.scala:1534)
@@ -64,7 +65,11 @@ object DecimalPrecision extends TypeCoercionRule {
   def widerDecimalType(p1: Int, s1: Int, p2: Int, s2: Int): DecimalType = {
     val scale = max(s1, s2)
     val range = max(p1 - s1, p2 - s2)
-    DecimalType.bounded(range + scale, scale)
+    if (conf.getConf(SQLConf.LEGACY_RETAIN_FRACTION_DIGITS_FIRST)) {
+      DecimalType.bounded(range + scale, scale)
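For reference, a simplified, self-contained sketch of the two strategies (not the actual Spark implementation; it returns (precision, scale) pairs and only assumes the 38-digit maximum of DecimalType):

```scala
import scala.math.{max, min}

object WiderDecimalSketch {
  val MaxPrecision = 38

  // Legacy behavior: compute the wider type, then clamp precision and scale to 38,
  // which can silently drop integral digits.
  def legacyWider(p1: Int, s1: Int, p2: Int, s2: Int): (Int, Int) = {
    val scale = max(s1, s2)
    val range = max(p1 - s1, p2 - s2)
    (min(range + scale, MaxPrecision), min(scale, MaxPrecision))
  }

  // New behavior: when the wider type exceeds 38 digits, keep the integral digits
  // and give the fractional part only whatever room is left.
  def newWider(p1: Int, s1: Int, p2: Int, s2: Int): (Int, Int) = {
    val scale = max(s1, s2)
    val range = max(p1 - s1, p2 - s2)
    val intDigits = min(range, MaxPrecision)
    val fracDigits = min(MaxPrecision - intDigits, scale)
    (intDigits + fracDigits, fracDigits)
  }

  def main(args: Array[String]): Unit = {
    println(legacyWider(38, 20, 30, 0)) // (38,20): only 18 integral digits survive
    println(newWider(38, 20, 30, 0))    // (38,8): all 30 integral digits survive
  }
}
```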
There are many usages of DecimalType.bounded. Why do we only change the behavior here?
To limit the scope to type coercion only. Some arithmetic operations also call it to determine the result decimal type and I don't want to change that part.
@@ -4541,6 +4541,15 @@ object SQLConf {
     .booleanConf
     .createWithDefault(false)

+  val LEGACY_RETAIN_FRACTION_DIGITS_FIRST =
+    buildConf("spark.sql.legacy.decimalLeastCommonType.retainFractionDigitsFirst")
TBH the conf name is a bit long. How about spark.sql.legacy.decimal.retainFractionDigitsOnTruncate? But the naming really depends on the scope of the behavior change, as I asked in https://github.com/apache/spark/pull/43781/files#r1393331222
docs/sql-ref-ansi-compliance.md
| e1 / e2 | p1 - s1 + s2 + max(6, s1 + p2 + 1) | max(6, s1 + p2 + 1) |
| e1 % e2 | min(p1 - s1, p2 - s2) + max(s1, s2) | max(s1, s2) |

The truncation rule is also different for arithmetic operations: they retain at least 6 digits in the fractional part, which means we can only reduce `scale` to 6.
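As a worked illustration of this 6-digit floor (numbers computed here, not quoted from the docs): dividing a decimal(38, 10) by a decimal(20, 5) asks for precision 38 - 10 + 5 + max(6, 10 + 20 + 1) = 64 and scale max(6, 31) = 31. That exceeds 38, so the 64 - 31 = 33 integral digits are kept and the scale is reduced, but only down to the floor: max(38 - 33, 6) = 6, giving decimal(38, 6). A quick check against a real session, assuming a SparkSession named `spark` with default settings:

```scala
spark.sql("SELECT CAST(1 AS DECIMAL(38, 10)) / CAST(3 AS DECIMAL(20, 5)) AS q").printSchema()
// Expected to show q as decimal(38,6).
```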
Should we mention what happens if we cannot truncate the fractional part to make it fit into the maximum precision? Overflow?
.doc("When set to true, we will try to retain the fraction digits first rather than " + | ||
"integral digits, when getting a least common type between decimal types, and the " + | ||
"result decimal precision exceeds the max precision.") |
.doc("When set to true, we will try to retain the fraction digits first rather than " + | |
"integral digits, when getting a least common type between decimal types, and the " + | |
"result decimal precision exceeds the max precision.") | |
.doc("When set to true, we will try to retain the fraction digits first rather than " + | |
"integral digits as prior Spark 4.0, when getting a least common type between decimal types, and the " + | |
"result decimal precision exceeds the max precision.") |
I think this looks reasonable. Is there a reproducer that can be added as a unit test to show the issue in an e2e example? Like any operator mentioned in the description?
Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
+1, LGTM (only one minor comment about config version).
I think the updated tests show the problem.
The failure is unrelated.
I'm merging it to master, thanks for the reviews!
What changes were proposed in this pull request?
This is kind of a follow-up of #20023.
It's simply wrong to cut the decimal precision to 38 when a wider decimal type exceeds the max precision, as this drops integral digits and makes the decimal value very likely to overflow.
In #20023, we fixed this issue for arithmetic operations, but many other operations suffer from the same issue: Union, binary comparison, in subquery, create_array, coalesce, etc.
This PR fixes all the remaining operators, without the min scale limitation, which should be applied to division and multiplication only according to the SQL Server doc: https://learn.microsoft.com/en-us/sql/t-sql/data-types/precision-scale-and-length-transact-sql?view=sql-server-ver15
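A hypothetical end-to-end illustration with one of the affected operators (the types are worked out from the new rule, assuming a SparkSession named `spark`):

```scala
// coalesce needs the least common type of decimal(38, 20) and decimal(30, 0).
val df = spark.sql(
  "SELECT coalesce(CAST(1 AS DECIMAL(38, 20)), CAST(1 AS DECIMAL(30, 0))) AS c")
df.printSchema()
// Before this change: c is decimal(38,20), which holds only 18 integral digits, so large
// decimal(30, 0) inputs overflow.
// After this change:  c is decimal(38,8), keeping all 30 integral digits.
```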
Why are the changes needed?
To produce a reasonable wider decimal type.
Does this PR introduce any user-facing change?
Yes, the final data type of these operators will change if it is a decimal type whose precision exceeds the max precision and whose scale is not 0.
How was this patch tested?
Updated tests.
Was this patch authored or co-authored using generative AI tooling?
No