
[SPARK-25454][SQL] Avoid precision loss in division with decimal with negative scale #22450

Closed
wants to merge 7 commits

Conversation

mgaido91
Contributor

What changes were proposed in this pull request?

Our rules for determining decimal precision and scale reflect Hive's and SQL Server's. The problem is that Spark allows negative scale, whereas those other systems do not, so our rule for division doesn't take into account the case when the scale is negative.

The PR makes our rule compatible with decimals having negative scale too.
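For reference, a minimal spark-shell sketch of the kind of query affected (the same one added to the golden file in this PR); the exact output formatting depends on the Spark version:

// 1000e6 becomes a decimal literal with negative scale since 2.3.
spark.sql("select 26393499451 / 1000e6").show()
// Before this fix: 26.393499 (the result type has too small a scale)
// With this fix: the full quotient 26.393499451 is representable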

How was this patch tested?

added UTs

@SparkQA

SparkQA commented Sep 18, 2018

Test build #96180 has finished for PR 22450 at commit 7c4b454.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91
Contributor Author

retest this please

@mgaido91
Contributor Author

cc @cloud-fan @dongjoon-hyun

@SparkQA

SparkQA commented Sep 19, 2018

Test build #96226 has finished for PR 22450 at commit 7c4b454.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91
Contributor Author

retest this please

@@ -83,4 +83,7 @@ select 12345678912345678912345678912.1234567 + 9999999999999999999999999999999.1
select 123456789123456789.1234567890 * 1.123456789123456789;
select 12345678912345.123456789123 / 0.000000012345678;

-- division with negative scale operands
select 26393499451/ 1000e6;
Member

super nit... add space 26393499451 /

Member

btw, what's the result of this query in v2.3?

Contributor Author

sure, thanks. In 2.3 this truncates the output by some digits, i.e. it would return 26.393499

Member

oh... thanks.

@@ -129,16 +129,17 @@ object DecimalPrecision extends TypeCoercionRule {
resultType)

case Divide(e1 @ DecimalType.Expression(p1, s1), e2 @ DecimalType.Expression(p2, s2)) =>
val adjP2 = if (s2 < 0) p2 - s2 else p2
Contributor

This rule was added a long time ago; do you mean this is a long-standing bug?

Contributor Author

Yes, I think this is clearer in the related JIRA description and comments. The problem is that we have never properly handled decimals with negative scale here. The point is: before 2.3, this could happen only if someone created a specific literal from a BigDecimal, like lit(BigDecimal(100e6)); since 2.3, it can happen with any constant like 100e6 in the SQL code. So the problem has been there for a while, but we haven't seen it because it was less likely to happen.
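For illustration, a small sketch (plain Scala, using java.math.BigDecimal directly to make the scale semantics explicit) showing how such a literal carries a negative scale:

import java.math.BigDecimal

// "100e6" parses to unscaled value 100 with scale -6 (i.e. 100 * 10^6).
val d = new BigDecimal("100e6")
d.unscaledValue  // 100
d.scale          // -6
d.precision      // 3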

Another solution would be to avoid having decimals with a negative scale at all. But that is quite a breaking change, so I'd avoid it until a 3.0 release at least.

Contributor

ah, I see. Can we add a test in DataFrameSuite with a decimal literal?

Contributor

can we update the documentation of this rule to reflect this change?

Contributor Author

sure, but if you agree I'll try and find a better place than DataFrameSuite. I'd prefer adding the new tests to ArithmeticExpressionSuite. Is that ok for you?

Contributor

SGTM

@SparkQA

SparkQA commented Sep 19, 2018

Test build #96229 has finished for PR 22450 at commit 7c4b454.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 19, 2018

Test build #96237 has finished for PR 22450 at commit 520b64e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -40,10 +40,13 @@ import org.apache.spark.sql.types._
* e1 + e2 max(s1, s2) + max(p1-s1, p2-s2) + 1 max(s1, s2)
* e1 - e2 max(s1, s2) + max(p1-s1, p2-s2) + 1 max(s1, s2)
* e1 * e2 p1 + p2 + 1 s1 + s2
* e1 / e2 p1 - s1 + s2 + max(6, s1 + p2 + 1) max(6, s1 + p2 + 1)
* e1 / e2 max(p1-s1+s2, 0) + max(6, s1+adjP2+1) max(6, s1+adjP2+1)
Contributor

This is very critical. Is there any other database using this formula?

Contributor Author

I don't think so. The other DBs whose formulas I know are Hive and MS SQL, which don't allow negative scales, so they don't have this problem. The formula is actually unchanged from before; it just handles a negative scale.
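To make it concrete, a minimal standalone sketch of the updated result-type computation for division (illustrative names, not the actual DecimalPrecision code, and ignoring the MAX_PRECISION/allowPrecisionLoss adjustment that is applied afterwards):

// decimal(p1, s1) / decimal(p2, s2) => (precision, scale) of the result
def divideResultType(p1: Int, s1: Int, p2: Int, s2: Int): (Int, Int) = {
  // adjP2 is the precision the divisor would have if its scale were 0,
  // i.e. with no negative scale.
  val adjP2 = if (s2 < 0) p2 - s2 else p2
  val scale = math.max(6, s1 + adjP2 + 1)
  // Integral digits of the result, clamped at 0 so that precision >= scale.
  val intDigits = math.max(p1 - s1 + s2, 0)
  (intDigits + scale, scale)
}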

@cloud-fan
Contributor

cloud-fan commented Sep 19, 2018

I feel there are more places we need to fix for negative scale. I couldn't find any design doc for negative scale in Spark and I believe we supported it by accident.

That said, fixing division is just fixing the specific case the user reported. This is not ideal. We should either officially support negative scale and fix all the cases, or officially forbid negative scale. However, neither of them can be made into a bug fix for branch 2.3 and 2.4.

Instead, I'm proposing a different fix: unofficially forbid negative scale. Users can still create a decimal value with negative scale, but Spark itself should avoid generating such values. See #22470

@mgaido91
Contributor Author

@cloud-fan I checked the other operations and they have no issue with negative scale. This is the reason why this fix is only for Divide: it is the only operation which wasn't dealing with it properly.

I also thought about doing that, but I chose not to in order to avoid introducing regressions. Anyway, I'll elaborate more in your PR. Thanks.

@SparkQA

SparkQA commented Sep 19, 2018

Test build #96254 has finished for PR 22450 at commit 27a9ea6.

  • This patch fails from timeout after a configured wait of 400m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91
Contributor Author

retest this please

@SparkQA

SparkQA commented Sep 20, 2018

Test build #96289 has finished for PR 22450 at commit 27a9ea6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

My major concerns are:

  1. how to prove Divide is the only operator having problems with negative scale?
  2. how to prove the fix here is correct and covers all the corner cases?

I'm reconsidering this proposal.

Do you think it makes sense for spark.sql.decimalOperations.allowPrecisionLoss to also toggle how literal promotion happens (the old way vs. the new way)?

@mgaido91
Contributor Author

Let me answer all your 3 points:

how to prove Divide is the only operator having problems with negative scale?

You can check the other operations. Let's go through all of them:

  • Add and Subtract: we never use p1 or p2 alone, but always p1-s1 and p2-s2, so having a negative scale is the same as having scale 0 and the corresponding precision;
  • Multiply: this is the easiest to handle. A negative scale can only cause the scale of the result to be negative too, and that is correct: in multiplication the scale just shifts the result to the right/left, which is exactly what a negative scale does. There are no issues with it.
  • Remainder/Pmod: same reasoning as for Add and Subtract.

how to prove the fix here is correct and covers all the corner cases?

We can add more and more test cases, but let's check the logic of the divide rule. The precision has 2 parts: the digits before the decimal point - i.e. intDigits - (p1-s1+s2) and the digits after - i.e. the scale - (max(6, s1 + p2 + 1)), which are summed; then we have the scale itself. The intDigits are fine for the 1st operand, as the p1-s1 term natively handles a negative scale, but if the second operand has a negative scale, the number of intDigits can become negative. This is not something we can allow, as we would end up with a precision lower than the scale, which we don't support. Hence, we need to set the intDigits to 0 if they become negative. As far as the scale is concerned, we already have a guard against a negative scale, as we take the max of it and 6. But we have another problem, i.e. we are using p2 alone. So in case s2 is negative, in order to get the same p2 we would have had if we avoided negative scales, we need to adjust it to p2 - s2. No other cases are present.
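As a worked example (the operand types here are assumptions for illustration: say 26393499451 is promoted to decimal(20, 0) and the literal 1000e6 is typed as decimal(4, -6)):

// Hypothetical operand types for 26393499451 / 1000e6 (see the assumptions above).
val (p1, s1, p2, s2) = (20, 0, 4, -6)

// Old rule:
val oldScale     = math.max(6, s1 + p2 + 1)              // 6
val oldPrecision = p1 - s1 + s2 + oldScale               // 20 -> decimal(20, 6): value truncated to 26.393499

// New rule: widen p2 for the negative scale and clamp the integral digits at 0.
val adjP2        = if (s2 < 0) p2 - s2 else p2           // 10
val newScale     = math.max(6, s1 + adjP2 + 1)           // 11
val newPrecision = math.max(p1 - s1 + s2, 0) + newScale  // 25 -> decimal(25, 11): 26.393499451 fits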

Do you think it makes sense for spark.sql.decimalOperations.allowPrecisionLoss to also toggle how literal promotion happens (the old way vs. the new way)?

I think what we can do is always forbid negative scale when handling literal promotion, regardless of the value of that flag. I think this can safely be done, as negative scales in this case were not supported before 2.3. But anyway, this would just reduce the number of cases in which this can happen... So I think we can do that, but it is not a definitive solution. If you want, I can do this in the scope of this PR, or I can create a new one, or we can do this in the PR you just closed.

@dilipbiswal
Contributor

dilipbiswal commented Sep 20, 2018

@mgaido91
Again, this may be something to think about in the 3.0 timeframe. I just checked two databases, Presto and DB2. Both of them treat literals such as 1e26 as double.

db2

db2 => describe select 1e26 from cast

 Column Information

 Number of columns: 1

 SQL type              Type length  Column name                     Name length
 --------------------  -----------  ------------------------------  -----------
 480   DOUBLE                    8  1                                         1

db2 => describe select 1.23 from cast

 Column Information

 Number of columns: 1

 SQL type              Type length  Column name                     Name length
 --------------------  -----------  ------------------------------  -----------
 484   DECIMAL                3, 2  1                                         1

presto

presto:default> explain select 2.34E10;
                                    Query Plan                                    
----------------------------------------------------------------------------------
 - Output[_col0] => [expr:double]                                                 
         Cost: {rows: 1 (10B), cpu: 10.00, memory: 0.00, network: 0.00}  

Should Spark do the same? What would be the repercussions if we did that?

@mgaido91
Contributor Author

yes @dilipbiswal, Hive does the same.

@cloud-fan
Contributor

cloud-fan commented Sep 20, 2018

@mgaido91 well, your long explanation makes me think we should have a design doc about it, and have more people review it to ensure we cover all the corner cases. And it seems there are more problems, like the data type of literals (e.g. 1e3).

I'd suggest we pick a safer approach: allow users to fully turn off #20023, which is done by #22494.

@mgaido91
Contributor Author

seems there are more problems like the data type of literals

sorry, I didn't get what you mean here; could you please explain?

your long explanation makes me think we should have a design doc about it

I'll prepare a design doc and attach it to the JIRA asap then.

@cloud-fan
Contributor

@dilipbiswal showed that DB2 and Presto treat 1e100 as double instead of decimal. We should consider this option and see what its consequences are.

@mgaido91
Contributor Author

oh, I see what you mean now, thanks. Yes, Hive does the same. We may have to completely revisit our parsing of literals, but since it is a breaking change I am not sure it will be possible before 3.0. And if the next release is going to be 2.5, this probably means we have to wait before doing anything like that.

@cloud-fan
Contributor

given how complex it is, I feel we can start the design in 2.5 and implement it in 3.0. What do you think?

@mgaido91
Contributor Author

I think there are 2 separate topics here:

  • Handling negative scale in decimal operations
    I am writing the design doc and I'll update this PR if needed (anyway I'll add more test cases). I think we can target this work for 2.5 too.
  • Handling/parsing of literals and numbers in general
    This is way more complex, I think: it involves more places in the code base and more potential breaking changes. On this I 100% agree that we should start designing it ASAP and target the implementation for 3.0.

Do you agree on the above distinction?

Thanks for your time and help here anyway

@cloud-fan
Contributor

totally agree, thanks for looking into it!

@mgaido91
Contributor Author

Thank you for your help and guidance.

@SparkQA

SparkQA commented Sep 21, 2018

Test build #96417 has finished for PR 22450 at commit 4e240d9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91
Contributor Author

mgaido91 commented Nov 5, 2018

@cloud-fan I have seen no more comments on the design doc for a while. I also wrote an email to the dev list about one week ago to check whether there were further comments, but I have seen none. As I don't see any concerns about this, shall we go ahead with this PR? Thanks.

@mgaido91
Contributor Author

kindly ping @cloud-fan , thanks

@mgaido91
Contributor Author

mgaido91 commented Dec 4, 2018

@cloud-fan this has been stuck for a while now. Is there something blocking this? Is there something I can do? Thanks.

@mgaido91
Contributor Author

cc also @viirya who commented on the design doc

@cloud-fan
Contributor

It's Spark 3.0 now, shall we consider forbidding negative scale? This is an undocumented and broken feature, and many databases don't support it either. I checked that Parquet doesn't support negative scale, which means Spark can't support it well, as Spark can't write it to data sources like Parquet.

cc @dilipbiswal

@mgaido91
Contributor Author

@cloud-fan yes, it sounds reasonable. I can start working on it if you agree. My only concern is that we won't be able to support operations/workloads which were running in 2.x. So we are introducing a limitation. Shall we send an email to the dev list in order to check and decide with a broader audience how to go ahead?

@cloud-fan
Contributor

sure, thanks!

@mgaido91
Contributor Author

mgaido91 commented Jan 2, 2019

cc @rxin too, who answered the email in the dev list. Since I saw only his comment in favor of this approach and none against, shall we go ahead with this?

@mgaido91
Contributor Author

mgaido91 commented Jan 7, 2019

@cloud-fan I added some comments according to what you suggested on the mailing list. Please let me know if you were thinking of something different. Thanks.

@SparkQA

SparkQA commented Jan 7, 2019

Test build #100903 has finished for PR 22450 at commit dd19f7f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 9, 2019

Test build #100945 has finished for PR 22450 at commit 97b9c56.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91
Contributor Author

any comments on this, @cloud-fan @rxin? I think the approach here reflects what was agreed on the mailing list, and I think I addressed all the comments made there. Could you please take a look at this again? Thanks.

@mgaido91
Contributor Author

mgaido91 commented Mar 8, 2019

@cloud-fan @maropu @rxin sorry for pinging you again. This has been stale for a while now. I think on the mailing list it was agreed to go this way, but in case we want to do something different we can discuss it and I can close this PR if needed. What do you think? Thanks.

@cloud-fan
Contributor

cloud-fan commented Mar 14, 2019

Sorry for the late reply; I've struggled with this for a long time.

AFAIK there are 2 ways to define a decimal type:

  1. The Java way. precision is the number of digits in the unscaledValue, and scale decides how to convert the unscaledValue to the actual decimal value, according to unscaledValue * 10^-scale.

This means,
123.45 = 12345 * 10^-2, so precision is 5, scale is 2.
0.00123 = 123 * 10^-5, so precision is 3, scale is 5.
12300 = 123 * 10^2, so precision is 3, scale is -2.

  2. The SQL way (at least in some old versions of the SQL standard). precision is the number of digits in the decimal value, scale is the number of digits after the dot in the decimal value.

This means,
123.45 has 5 digits in total, 2 digits after dot, so precision is 5, scale is 2.
0.00123 has 5 digits in total(ignore the integral part), and 5 digits after dot, so precision is 5, scale is 5.
12300 has 5 digits in total, no digits after dot, so precision is 5, scale is 0.

AFAIK many databases follow the SQL way to define the decimal type, i.e. 0 <= scale <= precision, although the Java way is more flexible. If Spark does not want to follow the SQL way to define the decimal type, I think we should follow the Java way, instead of some middle state between SQL and Java.
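For reference, java.math.BigDecimal follows the first (Java) definition, which is easy to verify with a small illustrative sketch:

import java.math.BigDecimal

// precision = digits of the unscaled value; value = unscaledValue * 10^-scale
new BigDecimal("123.45")   // unscaled 12345, precision 5, scale 2
new BigDecimal("0.00123")  // unscaled 123,   precision 3, scale 5
new BigDecimal("1.23E+4")  // unscaled 123,   precision 3, scale -2 (the value 12300)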

We should also clearly list the tradeoffs of different options.

As an example, if we want to keep supporting negative scale:

  1. we would want to support scale > precision as well, to be consistent with the Java way to define decimal type.
  2. need to fix some corner cases of precision loss(what this PR is trying to fix)
  3. bad compatibility with data sources. (sql("select 1e10 as a").write.parquet("/tmp/tt") would fail)
  4. may have unknown pitfalls, as it's not widely supported by other databases.
  5. fully backward compatible

@mgaido91
Contributor Author

thanks for your comment @cloud-fan. I am not sure what you are proposing here.

AFAIK many databases follow the SQL way to define the decimal type, i.e. 0 <= scale <= precision

I'd argue that this is not true. Many SQL DBs do not have this rule, although some indeed have it (e.g. SQL Server).

I think I tried to state the tradeoffs of the different choices on the mailing list, i.e.:

  • If we ensure that the scale must be non-negative, we have a backward compatibility issue, since there may be operations which we are not able to support anymore (not just producing different results, but being unable to represent the result at all, causing a failure);
  • If we leave things as they are now, i.e. we allow negative scale, let me comment on your 5 points:
  1. we would want to support scale > precision as well, to be consistent with the Java way to define decimal type.

I'm not sure this is a good idea in terms of compatibility with data sources; since this is a corner case which doesn't exist now, I think introducing it may lead to potential problems in that sense, and we may then be unable to remove the support for backward compatibility reasons. We may try to do that under a config flag, though.

  2. need to fix some corner cases of precision loss (what this PR is trying to fix)
  3. bad compatibility with data sources. (sql("select 1e10 as a").write.parquet("/tmp/tt") would fail)

Yes, this is indeed a problem, but it is a problem which is already present, so I think we can consider this PR independent from this issue, as it changes nothing wrt it.

  4. may have unknown pitfalls, as it's not widely supported by other databases.

Not sure what you mean here.

  5. fully backward compatible

This PR is backward compatible, as there is no change when the scale is non-negative; when it is negative, it fixes the computation of the precision of the result of division, so the only change is the precision of the division result. We can also introduce a config to stay on the safe side, but this is basically a fix for a situation which was handled badly before, so I see no reason to turn off this behavior...

@cloud-fan
Contributor

Many SQL DBs do not have this rule

Can you list some of them? I checked SQL Server, PostgreSQL, MySQL, Presto, and Hive; none of them allow negative scale.

The thing I want to avoid is, fixing a specific issue without looking at the big picture. The big picture is, how Spark should define decimal type?

There are two directions we can go:

  1. forbid negative scale to follow other databases. If we want to do this, we should find out the cases that will be broken, and put them in the release notes.
  2. allow negative scale, but also allow scale > precision. I would reject a proposal that only allows negative scale, as its definition will be hard to generalize.

Let's recap the 2 definitions of decimal type:

  1. decimal = unscaledValue * 10^-scale where precision is the number of digits of the unscaledValue.
  2. precision is the number of total digits, scale is the number of digits of the fraction part.

If we only allow negative scale, it fits neither of the two definitions.

That said, I'm OK with either of the directions, but not something in the middle. Personally I'd prefer the first direction, as that's what other mainstream databases do.

@dilipbiswal
Contributor

dilipbiswal commented Mar 15, 2019

Personally I'd prefer the first direction, as that's how other mainstream databases do.

+1
Just an FYI - DB2 also does not allow negative scale.

@mgaido91
Contributor Author

Can you list some of them?

Oracle for instance allows negative scales.

we should find out the cases that will be broken, and put them in the release notes.

Well, the cases are obvious, i.e. when there are negative scales which define numbers that don't fit in (38, 0), we surely cannot handle them anymore; moreover, we need to convert to non-negative scale all the cases where the user provides a Scala/Java decimal with negative scale, and we may cause some precision loss in cases where a negative scale and a non-negative one are the inputs of an arithmetic operation.
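For example, a value that is only representable thanks to a negative scale (an illustrative sketch; 38 is Spark's maximum decimal precision):

import java.math.BigDecimal

val d = new BigDecimal("1E+50")
d.precision             // 1, fits as a decimal with scale -50
d.scale                 // -50
d.setScale(0).precision // 51, does not fit in decimal(38, 0)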

Personally I'd prefer the first direction, as that's how other mainstream databases do.

I agree with you in general, but when I think of the backward compatibility issues I am doubtful about it. And in the discussion on the mailing list I think @rxin preferred to avoid these breaking changes too.

I would reject a proposal that only allows negative scale, as its definition will be hard to generalize.

Just to be clear, here I am not proposing to allow negative scales, I am just proposing to handle them properly in the operations since they happen to be present.

@mgaido91
Contributor Author

Since it seems there is no agreement on this PR, I am closing it. We can reopen it if we change our minds. Thanks everybody for the reviews and the discussion!

@mgaido91 mgaido91 closed this Mar 28, 2019