SPARK-11420 Updating Stddev support via Imperative Aggregate #9380
@@ -1135,7 +992,76 @@ abstract class CentralMomentAgg(child: Expression) extends ImperativeAggregate w
moments(4) = buffer.getDouble(fourthMomentOffset)
}

getStatistic(n, mean, moments)
if (n == 0.0) null
else if (n == 1.0) 0.0
I don't believe we want this behavior, since these edge cases should be handled in the getStatistic implementation. If you see the previous PR, we established that Skewness and Kurtosis should yield Double.NaN when n == 1.0, but other functions like VariancePop should yield 0.0.
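To make the point above concrete, here is a minimal sketch of handling the edge cases inside each statistic rather than in the shared evaluation path. It is not Spark's actual code: the trait and object names and the `(n, m2, m3)` parameter layout are hypothetical, and returning `Double.NaN` for an empty group is assumed from the behavior described in this thread.

```scala
// Hypothetical sketch: each statistic owns its edge cases.
// n is the count; m2 and m3 are the sums of squared and cubed deviations from the mean.
sealed trait CentralMomentStat {
  def getStatistic(n: Double, m2: Double, m3: Double): Double
}

object VariancePopSketch extends CentralMomentStat {
  // A single value has m2 == 0.0, so n == 1.0 naturally yields 0.0.
  def getStatistic(n: Double, m2: Double, m3: Double): Double =
    if (n == 0.0) Double.NaN else m2 / n
}

object SkewnessSketch extends CentralMomentStat {
  // Skewness is undefined when there is no spread (which includes n == 1.0).
  def getStatistic(n: Double, m2: Double, m3: Double): Double =
    if (n == 0.0 || m2 == 0.0) Double.NaN
    else math.sqrt(n) * m3 / math.sqrt(m2 * m2 * m2)
}
```

With this shape, a shared `else if (n == 1.0) 0.0` branch is unnecessary, because the population variance already evaluates to 0.0 for one row while skewness reports NaN.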
+1
So for skewness and kurtosis, when count = 1 we want to return null instead of 0. I can address it, but instead of returning Double.NaN, should we return null for stddev/variance when count = 0? null would be in line with all other stats functions, like min, max...
I propose returning null in all cases where Double.NaN is currently returned, and changing getStatistic() to return Any instead of Double.
@JihongMA I'm not sure about that. I don't think we should return |
getStatistic() will continue to return a Double value for normal cases; it would return null only for edge cases. Is there a strong reason to return Double.NaN? When count = 0, all the other stats functions (min, max, avg, ...) return null.
@mengxr Please take another look. |
@mengxr Rebased onto @rxin's changes [SPARK-11490]: stddev / variance are now mapped to the corresponding sample stddev / variance. I checked that Hive doesn't support this mapping, but I found that other MPP databases such as Presto use the same alias mapping.
add to whitelist |
ok to test |
@JihongMA I don't know if there are any strong reasons in terms of catalyst. However, personally I think we should separate changing the return type and |
+1 on @yu-iskw 's suggestion. Let's keep the changes in this PR minimal. Just replace |
BTW with #9480, we might not need to replace it with imperative aggregate anymore. |
Test build #45135 has finished for PR 9380 at commit
Test build #45256 has finished for PR 9380 at commit
Even though it's simple, I think this implementation is boxing the result, which could result in slower performance on real workloads (but is harder to see in micro benchmarks).
Which part is boxing the result? I tested the following on master with changes from #9480:
val df = sqlContext.range(100000000)
df.select(var_samp("id")).show() // ~7.5s
df.select(stddev_samp("id")).show() // ~10s
Both have low GC activity.
The eval call is boxing which you aren't going to see without a groupby. |
But eval only happens once per group. |
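The boxing concern in this exchange can be illustrated in plain Scala, independent of Spark. The sketch below uses hypothetical method names and only shows the language-level effect: a `Double` returned through a method typed as `Any` is wrapped in a `java.lang.Double` object, whereas a primitive return type avoids the allocation.

```scala
object BoxingSketch {
  // Returning a primitive through Any forces a java.lang.Double allocation.
  def evalAsAny(variance: Double): Any = math.sqrt(variance)

  // Same computation with a primitive return type; no wrapper object is needed.
  def evalAsDouble(variance: Double): Double = math.sqrt(variance)
}
```

Whether this matters depends on how often the call happens. If it runs once per group, as noted above, the cost scales with the number of groups rather than the number of rows.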
@JihongMA Could you merge the current master? There are some merge conflicts. For
> mean(c())
[1] NA
> var(c(1))
[1] NA
> np.mean([])
Out[1] = na
> np.var([1], ddof=1)
Out[2] = nan
@marmbrus I think we can move the implementation from imperative to declarative in 1.7. This PR is to re-use the |
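The R and NumPy results quoted above match what plain floating-point arithmetic gives for a single observation. As a small sketch (the object and method names are hypothetical, and `m2` is assumed to be the sum of squared deviations from the mean), the sample variance divides by `n - 1`, so one row produces `0.0 / 0.0`, which is NaN under IEEE 754, while the population variance divides by `n` and produces 0.0.

```scala
object VarianceSketch {
  // Sample variance: for n == 1.0 and m2 == 0.0 this is 0.0 / 0.0, i.e. NaN.
  def sampleVariance(n: Double, m2: Double): Double = m2 / (n - 1.0)

  // Population variance: for n == 1.0 and m2 == 0.0 this is 0.0.
  def popVariance(n: Double, m2: Double): Double = m2 / n
}
```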
Test build #45678 has finished for PR 9380 at commit
SparkR support has just been added so this change breaks tests
@felixcheung Thank you! This is the change I made to make it pass for R; I am not familiar with R.
df3 <- agg(gd, age = "stddev")
@JihongMA yap that should fix them |
@AmplabJenkins please retest the change. |
Jenkins, test this please. |
@JihongMA thanks for the update! Could you revert |
Test build #45749 has finished for PR 9380 at commit
test this please |
Test build #45755 has finished for PR 9380 at commit
LGTM. Merged into master and branch-1.6. Thanks! Btw, there is a minor style issue I marked inline. @JihongMA Could you submit another PR to change the output of |
switched stddev support from DeclarativeAggregate to ImperativeAggregate. Author: JihongMa <linlin200605@gmail.com> Closes #9380 from JihongMA/SPARK-11420.
@mengxr Sure, I will take care of mean via a separate PR.
@mengxr do we want to change the behavior for min, max as well? |
No, |
Sounds good. |