
[SPARK-25454][SQL] Avoid precision loss in division with decimal with negative scale #22450

Closed
wants to merge 7 commits

Conversation

mgaido91
Contributor

What changes were proposed in this pull request?

Our rules for determining decimal precision and scale reflect Hive's and SQL Server's. The problem is that Spark allows negative scale, whereas those other systems do not, so our rule for division doesn't take into account the case when the scale is negative.

The PR makes our rule compatible with decimals having negative scale too.
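For reference, a minimal spark-shell sketch of the kind of query affected (the same one added to the golden file in this PR); the exact output formatting depends on the Spark version:

// 1000e6 becomes a decimal literal with negative scale since 2.3.
spark.sql("select 26393499451 / 1000e6").show()
// Before this fix: 26.393499 (the result type has too small a scale)
// With this fix: the full quotient 26.393499451 is representable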

How was this patch tested?

added UTs

@SparkQA

SparkQA commented Sep 18, 2018

Test build #96180 has finished for PR 22450 at commit 7c4b454.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91
Contributor Author

retest this please

@mgaido91
Contributor Author

cc @cloud-fan @dongjoon-hyun

@SparkQA

SparkQA commented Sep 19, 2018

Test build #96226 has finished for PR 22450 at commit 7c4b454.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91
Contributor Author

retest this please

@@ -83,4 +83,7 @@ select 12345678912345678912345678912.1234567 + 9999999999999999999999999999999.1
select 123456789123456789.1234567890 * 1.123456789123456789;
select 12345678912345.123456789123 / 0.000000012345678;

-- division with negative scale operands
select 26393499451/ 1000e6;
Member

super nit... add space 26393499451 /

Member

btw, what's the result of this query in v2.3?

Contributor Author

sure, thanks. In 2.3 this truncates the output by some digits, i.e. it would return 26.393499

Member

oh... thanks.

@@ -129,16 +129,17 @@ object DecimalPrecision extends TypeCoercionRule {
resultType)

case Divide(e1 @ DecimalType.Expression(p1, s1), e2 @ DecimalType.Expression(p2, s2)) =>
val adjP2 = if (s2 < 0) p2 - s2 else p2
Contributor

This rule was added a long time ago; do you mean this is a long-standing bug?

Contributor Author

Yes, I think this is clearer in the related JIRA description and comments. The problem is that we have never properly handled decimals with negative scale here. The point is: before 2.3, this could happen only if someone created a specific literal from a BigDecimal, like lit(BigDecimal(100e6)); since 2.3, it can happen with any constant like 100e6 in the SQL code. So the problem has been there for a while, but we haven't seen it because it was less likely to happen.
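For illustration, a small sketch (plain Scala, using java.math.BigDecimal directly to make the scale semantics explicit) showing how such a literal carries a negative scale:

import java.math.BigDecimal

// "100e6" parses to unscaled value 100 with scale -6 (i.e. 100 * 10^6).
val d = new BigDecimal("100e6")
d.unscaledValue  // 100
d.scale          // -6
d.precision      // 3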

Another solution would be to avoid having decimals with a negative scale at all. But that is quite a breaking change, so I'd avoid it until a 3.0 release at least.

Contributor

ah, I see. Can we add a test in DataFrameSuite with a decimal literal?

Contributor

can we update the documentation of this rule to reflect this change?

Contributor Author

sure, but if you agree I'll try and find a better place than DataFrameSuite. I'd prefer adding the new tests to ArithmeticExpressionSuite. Is that ok for you?

Contributor

SGTM

@SparkQA

SparkQA commented Sep 19, 2018

Test build #96229 has finished for PR 22450 at commit 7c4b454.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 19, 2018

Test build #96237 has finished for PR 22450 at commit 520b64e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -40,10 +40,13 @@ import org.apache.spark.sql.types._
* e1 + e2 max(s1, s2) + max(p1-s1, p2-s2) + 1 max(s1, s2)
* e1 - e2 max(s1, s2) + max(p1-s1, p2-s2) + 1 max(s1, s2)
* e1 * e2 p1 + p2 + 1 s1 + s2
* e1 / e2 p1 - s1 + s2 + max(6, s1 + p2 + 1) max(6, s1 + p2 + 1)
* e1 / e2 max(p1-s1+s2, 0) + max(6, s1+adjP2+1) max(6, s1+adjP2+1)
Contributor

This is very critical. Is there any other database using this formula?

Contributor Author

I don't think so. The other DBs whose formulas I know are Hive and MS SQL, which don't allow negative scales, so they don't have this problem. The formula is actually unchanged from before; it just handles a negative scale.
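To make it concrete, a minimal standalone sketch of the updated result-type computation for division (illustrative names, not the actual DecimalPrecision code, and ignoring the MAX_PRECISION/allowPrecisionLoss adjustment that is applied afterwards):

// decimal(p1, s1) / decimal(p2, s2) => (precision, scale) of the result
def divideResultType(p1: Int, s1: Int, p2: Int, s2: Int): (Int, Int) = {
  // adjP2 is the precision the divisor would have if its scale were 0,
  // i.e. with no negative scale.
  val adjP2 = if (s2 < 0) p2 - s2 else p2
  val scale = math.max(6, s1 + adjP2 + 1)
  // Integral digits of the result, clamped at 0 so that precision >= scale.
  val intDigits = math.max(p1 - s1 + s2, 0)
  (intDigits + scale, scale)
}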

@cloud-fan
Contributor

cloud-fan commented Sep 19, 2018

I feel there are more places we need to fix for negative scale. I couldn't find any design doc for negative scale in Spark and I believe we supported it by accident.

That said, fixing division is just fixing the specific case the user reported. This is not ideal. We should either officially support negative scale and fix all the cases, or officially forbid negative scale. However, neither of them can be made into a bug fix for branch 2.3 and 2.4.

Instead, I'm proposing a different fix: unofficially forbid negative scale. Users can still create a decimal value with negative scale, but Spark itself should avoid generating such values. See #22470

@mgaido91
Contributor Author

@cloud-fan I checked the other operations and they have no issue with negative scale. This is the reason why this fix is only for Divide: it is the only operation which wasn't dealing with it properly.

I also thought about doing that, but I chose not to in order to avoid introducing regressions. Anyway, I'll elaborate more in your PR. Thanks.

@SparkQA

SparkQA commented Sep 19, 2018

Test build #96254 has finished for PR 22450 at commit 27a9ea6.

  • This patch fails from timeout after a configured wait of 400m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91
Contributor Author

retest this please

@SparkQA

SparkQA commented Sep 20, 2018

Test build #96289 has finished for PR 22450 at commit 27a9ea6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

My major concerns are:

  1. how to prove Divide is the only operator having problems with negative scale?
  2. how to prove the fix here is correct and covers all the corner cases?

I'm reconsidering this proposal.

Do you think it makes sense for spark.sql.decimalOperations.allowPrecisionLoss to also toggle how literal promotion happens (the old way vs. the new way)?

@mgaido91
Contributor Author

Let me answer all your 3 points:

how to prove Divide is the only operator having problems with negative scale?

You can check the other operations. Let's go through all of them:

  • Add and Subtract: we never use p1 or p2 alone, but always p1-s1 and p2-s2, so having a negative scale is the same as having scale 0 and the corresponding precision;
  • Multiply: this is the easiest to handle. A negative scale can only cause the scale of the result to be negative too, and that is correct: in multiplication the scale just shifts the result to the right/left, which is exactly what a negative scale does. There are no issues with it.
  • Remainder/Pmod: same reasoning as for Add and Subtract.

how to prove the fix here is correct and covers all the corner cases?

We can add more and more test cases, but let's check the logic of the divide rule. The precision has 2 parts: the digits before the decimal point - i.e. intDigits - (p1-s1+s2) and the digits after - i.e. the scale - (max(6, s1 + p2 + 1)), which are summed; then we have the scale itself. The intDigits are fine for the 1st operand, as the p1-s1 term natively handles a negative scale, but if the second operand has a negative scale, the number of intDigits can become negative. This is not something we can allow, as we would end up with a precision lower than the scale, which we don't support. Hence, we need to set the intDigits to 0 if they become negative. As far as the scale is concerned, we already have a guard against a negative scale, as we take the max of it and 6. But we have another problem, i.e. we are using p2 alone. So in case s2 is negative, in order to get the same p2 we would have had if we avoided negative scales, we need to adjust it to p2 - s2. No other cases are present.
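As a worked example (the operand types here are assumptions for illustration: say 26393499451 is promoted to decimal(20, 0) and the literal 1000e6 is typed as decimal(4, -6)):

// Hypothetical operand types for 26393499451 / 1000e6 (see the assumptions above).
val (p1, s1, p2, s2) = (20, 0, 4, -6)

// Old rule:
val oldScale     = math.max(6, s1 + p2 + 1)              // 6
val oldPrecision = p1 - s1 + s2 + oldScale               // 20 -> decimal(20, 6): value truncated to 26.393499

// New rule: widen p2 for the negative scale and clamp the integral digits at 0.
val adjP2        = if (s2 < 0) p2 - s2 else p2           // 10
val newScale     = math.max(6, s1 + adjP2 + 1)           // 11
val newPrecision = math.max(p1 - s1 + s2, 0) + newScale  // 25 -> decimal(25, 11): 26.393499451 fits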

Do you think it makes sense for spark.sql.decimalOperations.allowPrecisionLoss to also toggle how literal promotion happens (the old way vs. the new way)?

I think what we can do is always forbid negative scale when handling literal promotion, regardless of the value of that flag. I think this can safely be done, as negative scales in this case were not supported before 2.3. But anyway, this would just reduce the number of cases in which this can happen... So I think we can do that, but it is not a definitive solution. If you want, I can do this in the scope of this PR, or I can create a new one, or we can do this in the PR you just closed.

@dilipbiswal
Contributor

dilipbiswal commented Sep 20, 2018

@mgaido91
Again, this may be something to think about in the 3.0 timeframe. I just checked two databases, Presto and DB2. Both of them treat literals such as 1e26 as double.

db2

db2 => describe select 1e26 from cast

 Column Information

 Number of columns: 1

 SQL type              Type length  Column name                     Name length
 --------------------  -----------  ------------------------------  -----------
 480   DOUBLE                    8  1                                         1

db2 => describe select 1.23 from cast

 Column Information

 Number of columns: 1

 SQL type              Type length  Column name                     Name length
 --------------------  -----------  ------------------------------  -----------
 484   DECIMAL                3, 2  1                                         1

presto

presto:default> explain select 2.34E10;
                                    Query Plan                                    
----------------------------------------------------------------------------------
 - Output[_col0] => [expr:double]                                                 
         Cost: {rows: 1 (10B), cpu: 10.00, memory: 0.00, network: 0.00}  

Should Spark do the same? What would be the repercussions if we did that?

@mgaido91
Contributor Author

yes @dilipbiswal, Hive does the same.

@cloud-fan
Contributor

cloud-fan commented Sep 20, 2018

@mgaido91 well, your long explanation makes me think we should have a design doc about it, and have more people review it to ensure we cover all the corner cases. And it seems there are more problems, like the data type of literals (e.g. 1e3).

I'd suggest we pick a safer approach: allow users to fully turn off #20023, which is done by #22494.

@mgaido91
Contributor Author

seems there are more problems like the data type of literals

sorry, I didn't get what you mean here; could you please explain?

your long explanation makes me think we should have a design doc about it

I'll prepare a design doc and attach it to the JIRA asap then.

@cloud-fan
Contributor

@dilipbiswal showed that DB2 and Presto treat 1e100 as double instead of decimal. We should consider this option and see what its consequences are.

@mgaido91
Contributor Author

oh, I see what you mean now, thanks. Yes, Hive does the same. We may have to completely revisit our parsing of literals, but since it is a breaking change I am not sure it will be possible before 3.0. And if the next release is going to be 2.5, this probably means we have to wait before doing anything like that.

@cloud-fan
Contributor

given how complex it is, I feel we can start the design in 2.5 and implement it in 3.0. What do you think?

@mgaido91
Contributor Author

I think there are 2 separate topics here:

  • Handling negative scale in decimal operations
    I am writing the design doc and I'll update this PR if needed (anyway I'll add more test cases). I think we can target this work for 2.5 too.
  • Handling/parsing of literals and numbers in general
    This is way more complex, I think: it involves more places in the code base and more potential breaking changes. On this I 100% agree that we should start designing it ASAP and target the implementation for 3.0.

Do you agree on the above distinction?

Thanks for your time and help here anyway

@cloud-fan
Contributor

totally agree, thanks for looking into it!

@mgaido91
Contributor Author

Thank you for your help and guidance.

@SparkQA

SparkQA commented Sep 21, 2018

Test build #96417 has finished for PR 22450 at commit 4e240d9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91
Contributor Author

mgaido91 commented Nov 5, 2018

@cloud-fan I have seen no more comments on the design doc for a while. I also wrote an email to the dev list about one week ago to check whether there were further comments, but I have seen none. As I don't see any concerns about this, shall we go ahead with this PR? Thanks.

@mgaido91
Contributor Author

kindly ping @cloud-fan , thanks

@mgaido91
Contributor Author

mgaido91 commented Dec 4, 2018

@cloud-fan this has been stuck for a while now. Is there something blocking this? Is there something I can do? Thanks.

@mgaido91
Contributor Author

cc also @viirya who commented on the design doc

@cloud-fan
Contributor

It's Spark 3.0 now, shall we consider forbidding negative scale? This is an undocumented and broken feature, and many databases don't support it either. I checked that Parquet doesn't support negative scale, which means Spark can't support it well, as Spark can't write it to data sources like Parquet.

cc @dilipbiswal

@mgaido91
Contributor Author

@cloud-fan yes, it sounds reasonable. I can start working on it if you agree. My only concern is that we won't be able to support operations/workloads which were running in 2.x. So we are introducing a limitation. Shall we send an email to the dev list in order to check and decide with a broader audience how to go ahead?

@cloud-fan
Contributor

sure, thanks!

@mgaido91
Contributor Author

mgaido91 commented Jan 2, 2019

cc @rxin too, who answered the email in the dev list. Since I saw only his comment in favor of this approach and none against, shall we go ahead with this?

@mgaido91
Contributor Author

mgaido91 commented Jan 7, 2019

@cloud-fan I added some comments according to what you suggested on the mailing list. Please let me know if you were thinking of something different. Thanks.

@SparkQA

SparkQA commented Jan 7, 2019

Test build #100903 has finished for PR 22450 at commit dd19f7f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 9, 2019

Test build #100945 has finished for PR 22450 at commit 97b9c56.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91
Contributor Author

any comments on this, @cloud-fan @rxin? I think the approach here reflects what was agreed on the mailing list, and I think I addressed all the comments made there. Could you please take a look at this again? Thanks.

@mgaido91
Contributor Author

mgaido91 commented Mar 8, 2019

@cloud-fan @maropu @rxin sorry for pinging you again. This has been stale for a while now. I think on the mailing list it was agreed to go this way, but in case we want to do something different we can discuss it and I can close this PR if needed. What do you think? Thanks.

@cloud-fan
Contributor

cloud-fan commented Mar 14, 2019

Sorry for the late reply; I've struggled with this for a long time.

AFAIK there are 2 ways to define a decimal type:

  1. The Java way. precision is the number of digits in the unscaledValue, and scale decides how to convert the unscaledValue to the actual decimal value, according to unscaledValue * 10^-scale.

This means,
123.45 = 12345 * 10^-2, so precision is 5, scale is 2.
0.00123 = 123 * 10^-5, so precision is 3, scale is 5.
12300 = 123 * 10^2, so precision is 3, scale is -2.

  2. The SQL way (at least in some old versions of the SQL standard). precision is the number of digits in the decimal value, scale is the number of digits after the dot in the decimal value.

This means,
123.45 has 5 digits in total, 2 digits after dot, so precision is 5, scale is 2.
0.00123 has 5 digits in total(ignore the integral part), and 5 digits after dot, so precision is 5, scale is 5.
12300 has 5 digits in total, no digits after dot, so precision is 5, scale is 0.

AFAIK many databases follow the SQL way to define the decimal type, i.e. 0 <= scale <= precision, although the Java way is more flexible. If Spark does not want to follow the SQL way to define the decimal type, I think we should follow the Java way, instead of some middle state between SQL and Java.
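For reference, java.math.BigDecimal follows the first (Java) definition, which is easy to verify with a small illustrative sketch:

import java.math.BigDecimal

// precision = digits of the unscaled value; value = unscaledValue * 10^-scale
new BigDecimal("123.45")   // unscaled 12345, precision 5, scale 2
new BigDecimal("0.00123")  // unscaled 123,   precision 3, scale 5
new BigDecimal("1.23E+4")  // unscaled 123,   precision 3, scale -2 (the value 12300)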

We should also clearly list the tradeoffs of different options.

As an example, if we want to keep supporting negative scale:

  1. we would want to support scale > precision as well, to be consistent with the Java way to define decimal type.
  2. need to fix some corner cases of precision loss(what this PR is trying to fix)
  3. bad compatibility with data sources. (sql("select 1e10 as a").write.parquet("/tmp/tt") would fail)
  4. may have unknown pitfalls, as it's not widely supported by other databases.
  5. fully backward compatible

@mgaido91
Contributor Author

thanks for your comment @cloud-fan. I am not sure what you are proposing here.

AFAIK many databases follow the SQL way to define the decimal type, i.e. 0 <= scale <= precision

I'd argue that this is not true. Many SQL DBs do not have this rule, although some indeed have it (e.g. SQL Server).

I think I tried to state the tradeoffs of the different choices on the mailing list, i.e.:

  • If we ensure that the scale must be non-negative, we have a backward compatibility issue, since there may be operations which we are not able to support anymore (not just producing different results, but being unable to represent the result at all, causing a failure);
  • If we leave things as they are now, i.e. we allow negative scale, let me comment on your 5 points:
  1. we would want to support scale > precision as well, to be consistent with the Java way to define decimal type.

I'm not sure this is a good idea in terms of compatibility with data sources; since this is a corner case which doesn't exist now, I think introducing it may lead to potential problems in that sense, and we may then be unable to remove the support for backward compatibility reasons. We may try to do that under a config flag, though.

  2. need to fix some corner cases of precision loss (what this PR is trying to fix)
  3. bad compatibility with data sources. (sql("select 1e10 as a").write.parquet("/tmp/tt") would fail)

Yes, this is indeed a problem, but it is a problem which is already present, so I think we can consider this PR independent from this issue, as it changes nothing wrt it.

  4. may have unknown pitfalls, as it's not widely supported by other databases.

Not sure what you mean here.

  5. fully backward compatible

This PR is backward compatible, as there is no change when the scale is non-negative; when it is negative, it fixes the computation of the precision of the result of division, so the only change is the precision of the division result. We can also introduce a config to stay on the safe side, but this is basically a fix for a situation which was handled badly before, so I see no reason to turn off this behavior...

@cloud-fan
Contributor

Many SQL DBs do not have this rule

Can you list some of them? I checked SQL Server, PostgreSQL, MySQL, Presto, and Hive; none of them allow negative scale.

The thing I want to avoid is, fixing a specific issue without looking at the big picture. The big picture is, how Spark should define decimal type?

There are two directions we can go:

  1. forbid negative scale to follow other databases. If we want to do this, we should find out the cases that will be broken, and put them in the release notes.
  2. allow negative scale, but also allow scale > precision. I would reject a proposal that only allows negative scale, as its definition will be hard to generalize.

Let's recap the 2 definitions of decimal type:

  1. decimal = unscaledValue * 10^-scale where precision is the number of digits of the unscaledValue.
  2. precision is the number of total digits, scale is the number of digits of the fraction part.

If we only allow negative scale, it fits neither of the two definitions.

That said, I'm OK with either of the directions, but not something in the middle. Personally I'd prefer the first direction, as that's what other mainstream databases do.

@dilipbiswal
Contributor

dilipbiswal commented Mar 15, 2019

Personally I'd prefer the first direction, as that's how other mainstream databases do.

+1
Just an FYI - DB2 also does not allow negative scale.

@mgaido91
Contributor Author

Can you list some of them?

Oracle for instance allows negative scales.

we should find out the cases that will be broken, and put them in the release notes.

Well, the cases are obvious, i.e. when there are negative scales which define numbers that don't fit in (38, 0), we surely cannot handle them anymore; moreover, we need to convert to non-negative scale all the cases where the user provides a Scala/Java decimal with negative scale, and we may cause some precision loss in cases where a negative scale and a non-negative one are the inputs of an arithmetic operation.
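For example, a value that is only representable thanks to a negative scale (an illustrative sketch; 38 is Spark's maximum decimal precision):

import java.math.BigDecimal

val d = new BigDecimal("1E+50")
d.precision             // 1, fits as a decimal with scale -50
d.scale                 // -50
d.setScale(0).precision // 51, does not fit in decimal(38, 0)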

Personally I'd prefer the first direction, as that's how other mainstream databases do.

I agree with you in general, but when I think of the backward compatibility issues I am doubtful about it. And in the discussion on the mailing list I think @rxin preferred to avoid these breaking changes too.

I would reject a proposal that only allows negative scale, as its definition will be hard to generalize.

Just to be clear, here I am not proposing to allow negative scales, I am just proposing to handle them properly in the operations since they happen to be present.

@mgaido91
Contributor Author

Since it seems there is no agreement on this PR, I am closing it. We can reopen it if we change our minds. Thanks everybody for the reviews and the discussion!

@mgaido91 mgaido91 closed this Mar 28, 2019