[SPARK-25454][SQL] Avoid precision loss in division with decimal with negative scale #22450
Conversation
Test build #96180 has finished for PR 22450 at commit
retest this please
Test build #96226 has finished for PR 22450 at commit
retest this please
@@ -83,4 +83,7 @@ select 12345678912345678912345678912.1234567 + 9999999999999999999999999999999.1
select 123456789123456789.1234567890 * 1.123456789123456789;
select 12345678912345.123456789123 / 0.000000012345678;

-- division with negative scale operands
select 26393499451/ 1000e6;
super nit... add space 26393499451 /
btw, what's the result of this query in v2.3?
sure, thanks. In 2.3 this truncates the output by some digits, i.e. it would return 26.393499
oh... thanks.
@@ -129,16 +129,17 @@ object DecimalPrecision extends TypeCoercionRule {
resultType)

case Divide(e1 @ DecimalType.Expression(p1, s1), e2 @ DecimalType.Expression(p2, s2)) =>
  val adjP2 = if (s2 < 0) p2 - s2 else p2
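For readers following the diff, here is a hypothetical standalone sketch (not the PR's actual code) of how the adjusted divisor precision feeds the division precision/scale formula this PR documents in DecimalPrecision (the table appears further down in the thread); the 38-digit cap applied afterwards by the real rule is omitted.

// Hypothetical helper mirroring the documented division formula.
def divideResultType(p1: Int, s1: Int, p2: Int, s2: Int): (Int, Int) = {
  // a negative scale s2 means the divisor carries |s2| extra integral digits
  val adjP2 = if (s2 < 0) p2 - s2 else p2
  val intDigits = math.max(p1 - s1 + s2, 0)   // digits before the decimal point
  val scale = math.max(6, s1 + adjP2 + 1)     // result scale
  (intDigits + scale, scale)                  // (precision, scale), before the 38-digit cap
}

// e.g. divideResultType(20, 0, 4, -6) == (25, 11)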
This rule was added a long time ago, do you mean this is a long-standing bug?
Yes, I think this is clearer in the related JIRA description and comments. The problem is that here we have never properly handled decimals with a negative scale. The point is: before 2.3, this could happen only if someone created a specific literal from a BigDecimal, like lit(BigDecimal(100e6)); since 2.3, this can happen with every constant like 100e6 in the SQL code. So the problem has been there for a while, but we haven't seen it because it was less likely to happen.
Another solution would be avoiding decimals with a negative scale altogether. But that is quite a breaking change, so I'd avoid it until at least a 3.0 release.
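For context, a minimal sketch (not part of the PR) of how such a negative-scale decimal arises. It assumes a Spark 2.x shell, where negative scales are accepted; the exact precision and scale come from how BigDecimal is built from a double and may vary with the Scala version.

import org.apache.spark.sql.functions.lit

// 100e6 is the double 1.0E8; BigDecimal(1.0E8) goes through Double.toString ("1.0E8"),
// which yields unscaled value 10 with scale -7, i.e. a negative scale.
val bd = BigDecimal(100e6)
bd.scale                 // expected: -7
lit(bd).expr.dataType    // expected: DecimalType(2,-7)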
ah I see. Can we add a test in DataFrameSuite with a decimal literal?
can we update the documentation of this rule to reflect this change?
sure, but if you agree I'll try and find a better place than DataFrameSuite. I'd prefer adding the new tests to ArithmeticExpressionSuite. Is that ok for you?
SGTM
Test build #96229 has finished for PR 22450 at commit
Test build #96237 has finished for PR 22450 at commit
@@ -40,10 +40,13 @@ import org.apache.spark.sql.types._
* e1 + e2      max(s1, s2) + max(p1-s1, p2-s2) + 1      max(s1, s2)
* e1 - e2      max(s1, s2) + max(p1-s1, p2-s2) + 1      max(s1, s2)
* e1 * e2      p1 + p2 + 1                              s1 + s2
* e1 / e2      p1 - s1 + s2 + max(6, s1 + p2 + 1)       max(6, s1 + p2 + 1)
* e1 / e2      max(p1-s1+s2, 0) + max(6, s1+adjP2+1)    max(6, s1+adjP2+1)
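As a worked example of the old and new formulas (a sketch, assuming 26393499451 is promoted from a bigint literal to DecimalType(20,0) and 1000e6 is parsed as DecimalType(4,-6), i.e. unscaled value 1000 with scale -6):

old: precision = p1-s1+s2 + max(6, s1+p2+1)             = 14 + max(6, 5)  = 20
     scale     = max(6, s1+p2+1)                        = max(6, 5)       = 6
     => decimal(20, 6), so 26.393499451 comes back as 26.393499

new: adjP2     = p2 - s2                                = 4 + 6           = 10
     precision = max(p1-s1+s2, 0) + max(6, s1+adjP2+1)  = 14 + 11         = 25
     scale     = max(6, s1+adjP2+1)                     = max(6, 11)      = 11
     => decimal(25, 11), which preserves 26.393499451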
This is very critical. Is there any other database using this formula?
I don't think so; the other DBs whose formulas I know are Hive and MS SQL, which don't allow negative scales, so they don't have this problem. The formula is actually not changed from before; it just handles a negative scale.
I feel there are more places we need to fix for negative scale. I couldn't find any design doc for negative scale in Spark and I believe we supported it by accident. That said, fixing division just fixes the specific case the user reported. This is not ideal. We should either officially support negative scale and fix all the cases, or officially forbid negative scale. However, neither of them can be made into a bug fix for branch 2.3 and 2.4. Instead, I'm proposing a different fix: unofficially forbid negative scale. Users can still create a decimal value with negative scale, but Spark itself should avoid generating such values. See #22470
@cloud-fan I checked the other operations and they have no issue with negative scale. This is the reason why this fix is only for Divide: it is the only operation which wasn't dealing with it properly. I also thought about doing that, but I chose not to in order not to introduce regressions. Anyway, I'll argue more in your PR. Thanks.
Test build #96254 has finished for PR 22450 at commit
retest this please
Test build #96289 has finished for PR 22450 at commit
My major concern is:
I'm reconsidering this proposal
Let me answer all your 3 points:
You can check the other operations. Let's go through all of them:
We can add more and more test cases, but let's check the logic of the divide rule. In the precision we have 2 parts: the digits before the decimal point, i.e. intDigits (
I think what we can do is always forbid negative scales when handling this, regardless of the value of that flag. I think this can safely be done, as negative scales in this case were not supported before 2.3. But anyway, this would just reduce the number of cases when this can happen... So I think we can do that, but it is not a definitive solution. If you want, I can do this in the scope of this PR, or I can create a new one, or we can do it in the PR you just closed.
@mgaido91
db2 => describe select 1e26 from cast
Column Information
Number of columns: 1
SQL type              Type length  Column name                     Name length
--------------------  -----------  ------------------------------  -----------
480   DOUBLE          8            1                               1

db2 => describe select 1.23 from cast
Column Information
Number of columns: 1
SQL type              Type length  Column name                     Name length
--------------------  -----------  ------------------------------  -----------
484   DECIMAL         3, 2         1                               1

presto:default> explain select 2.34E10;
Query Plan
----------------------------------------------------------------------------------
- Output[_col0] => [expr:double]
    Cost: {rows: 1 (10B), cpu: 10.00, memory: 0.00, network: 0.00}

Should Spark do the same? What would be the repercussions if we did that?
yes @dilipbiswal, Hive does the same.
@mgaido91 well, your long explanation makes me think we should have a design doc about it, and have more people review it and ensure we cover all the corner cases. And it seems there are more problems, like the data type of literals (e.g. ...). I'd suggest we pick a safer approach: allow users to fully turn off #20023, which is done by #22494
sorry, I didn't get what you mean here, could you please explain?
I'll prepare a design doc and attach it to the JIRA ASAP then.
@dilipbiswal showed that DB2 and Presto treat ...
oh, I see now what you mean, thanks. Yes, Hive does the same. We may have to completely revisit our parsing of literals, but since it is a breaking change I am not sure it will be possible before 3.0. And if the next release is going to be 2.5, this probably means that we have to wait before doing anything like that.
given how complex it is, I feel we can start the design at 2.5 and implement it at 3.0, what do you think?
I think there are 2 separate topics here:
Do you agree on the above distinction? Thanks for your time and help here anyway
totally agree, thanks for looking into it!
Thank you for your help and guidance.
Test build #96417 has finished for PR 22450 at commit
@cloud-fan I have seen no more comments on the design doc for a while. I also wrote an email to the dev list about one week ago to check whether there were further comments, but I have seen none. As I don't see any concerns about this, shall we go ahead with this PR? Thanks.
kindly ping @cloud-fan, thanks
@cloud-fan this has been stuck for a while now. Is there something blocking this? Is there something I can do? Thanks.
cc also @viirya who commented on the design doc
It's Spark 3.0 now, shall we consider forbidding negative scale? This is an undocumented and broken feature, and many databases don't support it either. I checked that Parquet doesn't support negative scale, which means Spark can't support it well, as Spark can't write it to data sources like Parquet. cc @dilipbiswal
@cloud-fan yes, it sounds reasonable. I can start working on it if you agree. My only concern is that we won't be able to support operations/workloads which were running in 2.x, so we are introducing a limitation. Shall we send an email to the dev list to check and decide with a broader audience how to go ahead?
sure, thanks!
cc @rxin too, who answered the email in the dev list. Since I saw only his comment in favor of this approach and none against, shall we go ahead with this?
@cloud-fan I added some comments according to what you suggested on the mailing list. Please let me know if you were thinking of something different. Thanks.
Test build #100903 has finished for PR 22450 at commit
Test build #100945 has finished for PR 22450 at commit
any comments on this @cloud-fan @rxin? I think the approach here reflects what was agreed on in the mailing list, and I think I addressed all the comments made there. Could you please take a look at this again? Thanks.
@cloud-fan @maropu @rxin sorry for pinging you again. This has been stale for a while now. I think it was agreed in the mailing list to go this way, but in case we want to do something different we can discuss it and I can eventually close this PR. What do you think? Thanks.
Sorry for the late reply; I've struggled with this for a long time. AFAIK there are 2 ways to define a decimal type:
This means, ...
This means, ...
AFAIK many databases follow the SQL way to define the decimal type, i.e. ...
We should also clearly list the tradeoffs of the different options. As an example, if we want to keep supporting negative scale:
thanks for your comment @cloud-fan. I am not sure what you are proposing here.
I'd argue that this is not true. Many SQL DBs do not have this rule, although some indeed have it (e.g. SQL Server). I think I tried to state the tradeoffs of the different choices in the mailing list, i.e.:
Not sure this is a good idea in terms of compatibility with data sources; since this is a corner case which doesn't exist now, I think introducing it may lead us to potential problems in that sense, and we may be unable to remove the support for backward compatibility reasons. We may try and do that under a config flag, though.
Yes, this is indeed a problem, but it is a problem which is already present, so I think we can consider this PR independent of this issue, as it changes nothing wrt it.
Not sure what you mean here.
This PR is designed so that there is no change when the scale is non-negative; when it is negative, it fixes the computation of the precision of the result of the division, so the only change is the precision of the result of the division. We can also introduce a config to stay on the safe side, but this is basically a fix for a situation which was handled badly before, so I see no reason to turn off this behavior...
Can you list some of them? I checked SQL Server, PostgreSQL, MySQL, Presto, Hive: none of them allow negative scale. The thing I want to avoid is fixing a specific issue without looking at the big picture. The big picture is: how should Spark define the decimal type? There are two directions we can go:
Let's recap the 2 definitions of decimal type:
If we only allow negative scale, it fits neither of the two definitions. That said, I'm OK with either of the directions, but not something in the middle. Personally I'd prefer the first direction, as that's what other mainstream databases do.
+1
Oracle, for instance, allows negative scales.
Well, the cases are obvious, i.e. when there are negative scales which define numbers that don't fit in (38, 0), surely we cannot handle them anymore; moreover, we need to convert to a non-negative scale all the cases when the user provides a Scala/Java decimal with a negative scale, and we may cause some precision loss when a negative scale and a non-negative one are the inputs of an arithmetic operation.
I agree with you in general, but when I think about the backward compatibility issues I am doubtful about it. And in the discussion on the mailing list I think @rxin preferred to avoid these breaking changes too.
Just to be clear, here I am not proposing to allow negative scales; I am just proposing to handle them properly in the operations, since they happen to be present.
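To illustrate the first case with hypothetical values (not from the PR): a constant such as 1E40 is representable with a negative scale but not once scales are forced to be non-negative, because Spark's maximum decimal precision is 38.

1E40 as decimal(1, -40) -> representable (1 digit of precision, scale -40)
1E40 as decimal(41, 0)  -> not representable (needs 41 integral digits, above the 38-digit maximum)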
Since it seems there is no agreement on this PR, I am closing it. We can reopen it if we change our mind. Thanks everybody for the reviews and the discussion!
What changes were proposed in this pull request?
Our rules for determining decimal precision and scale reflect Hive's and SQL Server's. The problem is that Spark allows negative scale, whereas in those other systems this is not possible, so our rule for division doesn't take into account the case when the scale is negative.
The PR makes our rule compatible with decimals having a negative scale too.
How was this patch tested?
added UTs