
[SPARK-45905][SQL] Least common type between decimal types should retain integral digits first #43781

Closed
wants to merge 9 commits into master

Conversation

@cloud-fan (Contributor):

What changes were proposed in this pull request?

This is kind of a followup of #20023.

It's simply wrong to cut the decimal precision to 38 when a wider decimal type exceeds the max precision: doing so drops integral digits and makes the decimal value very likely to overflow.

In #20023, we fixed this issue for arithmetic operations, but many other operators suffer from the same issue: Union, binary comparison, IN subquery, create_array, coalesce, etc.

This PR fixes all the remaining operators, without the min scale limitation, which should apply to division and multiplication only, according to the SQL Server doc: https://learn.microsoft.com/en-us/sql/t-sql/data-types/precision-scale-and-length-transact-sql?view=sql-server-ver15
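
A minimal reproducer sketch in spark-shell (the exact result types here are assumptions based on this description, not verified output from the patch):

// decimal(38, 0) union decimal(38, 38): the exact wider type needs precision 76,
// which exceeds the max of 38. Previously the result was cut to decimal(38, 38),
// dropping all integral digits, so even the value 1 could no longer fit.
// With this change the scale is reduced instead (e.g. to decimal(38, 0)),
// retaining the integral digits.
val df1 = spark.sql("SELECT CAST(1 AS DECIMAL(38, 0)) AS v")
val df2 = spark.sql("SELECT CAST(0.5 AS DECIMAL(38, 38)) AS v")
df1.union(df2).printSchema()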

Why are the changes needed?

To produce a reasonable wider decimal type.

Does this PR introduce any user-facing change?

Yes. The result data type of these operators will change if it is a decimal type whose precision exceeds the max precision and whose scale is not 0.

How was this patch tested?

Updated tests.

Was this patch authored or co-authored using generative AI tooling?

No

The github-actions bot added the SQL label on Nov 13, 2023.
@dongjoon-hyun (Member) left a comment:

Could you re-trigger the failed CI test pipeline, @cloud-fan?

Also, cc @kazuyukitanimura.

@@ -5425,7 +5434,7 @@ class SQLConf extends Serializable with Logging with SqlApiConf {
   }
 
   def legacyRaiseErrorWithoutErrorClass: Boolean =
-      getConf(SQLConf.LEGACY_RAISE_ERROR_WITHOUT_ERROR_CLASS)
+    getConf(SQLConf.LEGACY_RAISE_ERROR_WITHOUT_ERROR_CLASS)
Contributor:

noise? (whitespace changes best made in a non-bugfix PR?)

@cloud-fan (Author):

Since I touched this file, I just fixed the wrong indentation.

| Operation | Result precision                        | Result scale        |
|-----------|-----------------------------------------|---------------------|
| e1 - e2   | max(s1, s2) + max(p1 - s1, p2 - s2) + 1 | max(s1, s2)         |
| e1 * e2   | p1 + p2 + 1                             | s1 + s2             |
| e1 / e2   | p1 - s1 + s2 + max(6, s1 + p2 + 1)      | max(6, s1 + p2 + 1) |
| e1 % e2   | min(p1 - s1, p2 - s2) + max(s1, s2)     | max(s1, s2)         |
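
As a worked check of the division row (operand types chosen for illustration): for decimal(10, 2) / decimal(5, 3), the result precision is p1 - s1 + s2 + max(6, s1 + p2 + 1) = 10 - 2 + 3 + max(6, 8) = 19 and the result scale is max(6, s1 + p2 + 1) = 8, i.e. decimal(19, 8), before any truncation to the max precision of 38.
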
Contributor:

AFAIK, the arithmetic operations did not strictly follow this rule.

@cloud-fan (Author):

Which one does not follow it? The final decimal type can differ, as there is one more truncation step.

Contributor:

* and /. For example:

val a = Decimal(100)  // precision: 10, scale: 0
val b = Decimal(-100) // precision: 10, scale: 0
val c = a * b         // Decimal(-10000), precision: 5, scale: 0

@cloud-fan (Author):

This is not the Spark SQL multiplication. Please take a look at Multiply#resultDecimalType.
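
For reference, a minimal sketch of the * row as the table above defines it (the helper name is hypothetical; the real logic lives in Multiply#resultDecimalType and includes a further truncation step):

// Declared result type of e1 * e2 per the table:
// precision = p1 + p2 + 1, scale = s1 + s2.
// For the example above, decimal(10, 0) * decimal(10, 0) gives decimal(21, 0);
// the runtime value -10000 only needs 5 digits, but the declared type follows
// the formula, not the value.
def multiplyResultType(p1: Int, s1: Int, p2: Int, s2: Int): (Int, Int) =
  (p1 + p2 + 1, s1 + s2)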

    } else {
      // If we have to reduce the precision, we should retain the digits in the integral part first,
      // as they are more significant to the value. Here we reduce the scale as well to drop the
      // digits in the fractional part.
Contributor:

Looks good.

@cloud-fan (Author):

cc @gengliangwang @viirya @wangyum

@wangyum (Member) left a comment:

LGTM.

@dongjoon-hyun (Member) left a comment:

Could you take a look at the following failure too? I'm wondering if it's related or not.

[info] - array contains function *** FAILED *** (321 milliseconds)
[info]   Expected exception org.apache.spark.sql.AnalysisException to be thrown, but no exception was thrown (DataFrameFunctionsSuite.scala:1534)

@@ -64,7 +65,11 @@ object DecimalPrecision extends TypeCoercionRule {
   def widerDecimalType(p1: Int, s1: Int, p2: Int, s2: Int): DecimalType = {
     val scale = max(s1, s2)
     val range = max(p1 - s1, p2 - s2)
-    DecimalType.bounded(range + scale, scale)
+    if (conf.getConf(SQLConf.LEGACY_RETAIN_FRACTION_DIGITS_FIRST)) {
+      DecimalType.bounded(range + scale, scale)
@gengliangwang (Member) commented on Nov 14, 2023:

There are many usages of DecimalType.bounded. Why do we only change the behavior here?

@cloud-fan (Author):

To limit the scope to type coercion only. Some arithmetic operations also call it to determine the result decimal type and I don't want to change that part.
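
For context, a sketch of what the full method could look like with both branches. The non-legacy arm is an assumption reconstructed from the PR description (the diff above is truncated), and it assumes the imports already present in DecimalPrecision.scala (scala.math max/min, SQLConf):

def widerDecimalType(p1: Int, s1: Int, p2: Int, s2: Int): DecimalType = {
  val scale = max(s1, s2)
  val range = max(p1 - s1, p2 - s2)
  if (conf.getConf(SQLConf.LEGACY_RETAIN_FRACTION_DIGITS_FIRST)) {
    // Legacy: cap the precision at 38, silently dropping integral digits.
    DecimalType.bounded(range + scale, scale)
  } else {
    // New: keep all `range` integral digits and shrink the scale (down to 0)
    // until the type fits into the max precision of 38.
    val newScale = max(0, min(scale, DecimalType.MAX_PRECISION - range))
    DecimalType(min(range + newScale, DecimalType.MAX_PRECISION), newScale)
  }
}

With this shape, decimal(38, 0) vs decimal(38, 38) widens to decimal(38, 0) instead of the legacy decimal(38, 38).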

@@ -4541,6 +4541,15 @@ object SQLConf {
       .booleanConf
       .createWithDefault(false)
 
+  val LEGACY_RETAIN_FRACTION_DIGITS_FIRST =
+    buildConf("spark.sql.legacy.decimalLeastCommonType.retainFractionDigitsFirst")
Member:

TBH the conf name is a bit long. How about spark.sql.legacy.decimal.retainFractionDigitsOnTruncate? But the naming really depends on the scope of the behavior change, as I asked in https://github.com/apache/spark/pull/43781/files#r1393331222

| e1 / e2 | p1 - s1 + s2 + max(6, s1 + p2 + 1) | max(6, s1 + p2 + 1) |
| e1 % e2 | min(p1 - s1, p2 - s2) + max(s1, s2) | max(s1, s2) |

The truncation rule is also different for arithmetic operations: they retain at least 6 digits in the fractional part, which means we can only reduce `scale` to 6.
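
For comparison, a sketch of that arithmetic truncation rule, modeled on DecimalType.adjustPrecisionScale in org.apache.spark.sql.types (simplified here, assuming a non-negative scale; MINIMUM_ADJUSTED_SCALE is 6):

// When the precision exceeds MAX_PRECISION (38), keep all integral digits and
// cut the scale, but never below min(scale, 6): arithmetic results retain at
// least 6 fractional digits if the original scale had that many.
def adjustPrecisionScale(precision: Int, scale: Int): DecimalType = {
  if (precision <= DecimalType.MAX_PRECISION) {
    DecimalType(precision, scale)
  } else {
    val intDigits = precision - scale
    val minScale = math.min(scale, 6) // MINIMUM_ADJUSTED_SCALE
    val adjustedScale = math.max(DecimalType.MAX_PRECISION - intDigits, minScale)
    DecimalType(DecimalType.MAX_PRECISION, adjustedScale)
  }
}

Note that if the integral part alone needs more than 38 digits, no scale reduction can help and the value can still overflow at runtime.
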
Member:

Should we mention what happens if we cannot truncate the fractional part enough to make it fit into the maximum precision? Overflow?

Comment on lines 4547 to 4549
.doc("When set to true, we will try to retain the fraction digits first rather than " +
"integral digits, when getting a least common type between decimal types, and the " +
"result decimal precision exceeds the max precision.")
Member:

Suggested change:
- .doc("When set to true, we will try to retain the fraction digits first rather than " +
-   "integral digits, when getting a least common type between decimal types, and the " +
-   "result decimal precision exceeds the max precision.")
+ .doc("When set to true, we will try to retain the fraction digits first rather than " +
+   "integral digits as prior Spark 4.0, when getting a least common type between decimal types, and the " +
+   "result decimal precision exceeds the max precision.")

@viirya (Member) left a comment:

I think this looks reasonable. Is there a reproducer that can be added as a unit test to show the issue in an e2e example? Like any operator mentioned in the description?

cloud-fan and others added 3 commits on November 15, 2023 (Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>)
@dongjoon-hyun (Member) left a comment:

+1, LGTM (only one minor comment about config version).

@cloud-fan (Author):

"Is there a reproducer that can be added as a unit test to show the issue in an e2e example?"

I think the updated tests show the problem.

@cloud-fan (Author) commented on Nov 15, 2023:

The failure is unrelated:

Extension error:
Could not import extension sphinx_copybutton (exception: No module named 'sphinx_copybutton')
make: *** [Makefile:35: html] Error 2

I'm merging it to master, thanks for the reviews!
