[SPARK-10641][SQL] Add Skewness and Kurtosis Support #9003
Conversation
//case class Average(child: Expression) extends CentralMomentAgg(child) {
Aggregate functions which depend on lower order moments can easily be implemented using the CentralMomentAgg base class. I have commented them out for now, but left them for discussion purposes.
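A hedged illustration of that point, using the getStatistic hook that comes up later in this thread (the names and signature are assumptions, not the PR's final code):

  // The mean only needs moments up to order 1, so it falls out of the same base class.
  case class Average(child: Expression) extends CentralMomentAgg(child) {
    override protected val momentOrder = 1
    override def getStatistic(n: Double, mean: Double, moments: Array[Double]): Double = mean
  }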
@sethah, are you working on this? My suggestion is to implement skewness and kurtosis in a single aggregate function that implements ImperativeAggregate based on the discussion in SPARK-10953, but do not change existing implementation of standard deviation and variance. This makes the PR simpler and easier to review and merge.
@mengxr I am working on it, and I have incorporated changes from the note you posted on the Jira - thanks for that! I pushed some changes. I had already changed the existing implementation of stddev unfortunately, but I can revert those changes so that the PR is a bit smaller. I was thinking that since variance hasn't been merged yet, I could include it in the PR, but if you think it's best to leave it out I can do that as well. I tested the imperative versions vs the codegen versions I had before and they are significantly faster. I will work on getting the changes for stddev reverted so that it will be ready for review.
val updateValue = v match {
  case d: java.lang.Number => d.doubleValue()
  case _ => 0.0
}
please double check that it will handle all numeric types correctly, e.g. Decimal?
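For illustration, a minimal sketch of one way the match above could cover Decimal as well, reusing the v binding from the snippet; the Decimal case here is an assumption about the fix, not the PR's final code:

  import org.apache.spark.sql.types.Decimal

  // Decimal does not extend java.lang.Number, so it needs its own case
  // before falling back to the Number branch.
  val updateValue = v match {
    case d: Decimal => d.toDouble
    case n: java.lang.Number => n.doubleValue()
    case _ => 0.0
  }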
@sethah Besides inline comments, we can also think about how to simplify the class inheritance. It would be better if we don't need to touch
add to whitelist
@mengxr I think having

def eval(buffer: InternalRow): Any = this match {
  case _: VariancePop => { }
  case _: Variance => { }
  case _: Skewness => { }
  case _: Kurtosis => { }
}
Test build #43961 has finished for PR 9003 at commit
@sethah The storage of the central moments is currently exposed to each subclass, e.g.:

val M0 = buffer.getDouble(mutableAggBufferOffset)
val M2 = buffer.getDouble(mutableAggBufferOffset + 2)
val M4 = buffer.getDouble(mutableAggBufferOffset + 4)

I suggest adding an abstract method to override:

override final def eval(buffer: InternalRow): Any = {
  val n = buffer.getDouble(offset)
  val mean = buffer.getDouble(offset + 1)
  val moments = Array.ofDim[Double](maxMoment + 1)
  moments(0) = 1.0
  moments(1) = 0.0
  // fill in other moments up to maxMoment
  getStatistic(n, mean, moments)
}

Then we can hide the storage of the moments inside the base class.
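For illustration, a hedged sketch of how a concrete subclass could implement the proposed hook; the class name follows the PR, but the exact signature of getStatistic and the skewness formula (which assumes the buffered moments are raw sums, not divided by n) are assumptions:

  case class Skewness(child: Expression) extends CentralMomentAgg(child) {
    // Highest central moment this statistic needs.
    override protected val momentOrder = 3

    // Skewness from the count and central moment sums: sqrt(n) * m3 / m2^(3/2).
    override def getStatistic(n: Double, mean: Double, moments: Array[Double]): Double = {
      val m2 = moments(2)
      val m3 = moments(3)
      if (n == 0.0 || m2 == 0.0) Double.NaN
      else math.sqrt(n) * m3 / math.sqrt(m2 * m2 * m2)
    }
  }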
Test build #44039 has finished for PR 9003 at commit
Test build #44179 has finished for PR 9003 at commit
/**
 * Compute aggregate statistic from sufficient moments.
 * @param centralMoments Length `momentOrder + 1` array of central moments needed to
 *                       compute the aggregate stat.
It would be useful to mention whether they are divided by n or not.
LGTM except minor inline comments. Ping @yhuai for a final pass.
Test build #44234 has finished for PR 9003 at commit
n = n1 + n2
buffer1.setDouble(nOffset, n)
delta = mean2 - mean1
deltaN = if (n == 0.0) 0.0 else delta / n
Removed the divide-by-zero case here, which was causing problems when the number of partitions > the number of samples.
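For context, a small standalone sketch of the pairwise merge this diff belongs to, following the linked Wikipedia parallel algorithm; the names are mine, not the PR's. With the guard above, merging two empty partitions (n == 0) leaves the buffer untouched instead of producing NaN:

  // Merge the count, mean and M2 (sum of squared deviations from the mean)
  // of two partitions.
  def merge(n1: Double, mean1: Double, m2a: Double,
            n2: Double, mean2: Double, m2b: Double): (Double, Double, Double) = {
    val n = n1 + n2
    val delta = mean2 - mean1
    val deltaN = if (n == 0.0) 0.0 else delta / n
    val mean = mean1 + deltaN * n2
    val m2 = m2a + m2b + delta * deltaN * n1 * n2
    (n, mean, m2)
  }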
Test build #44267 has finished for PR 9003 at commit
Test build #44272 has finished for PR 9003 at commit
@sethah It seems that
@mengxr Corrected
Test build #44366 has finished for PR 9003 at commit
I guess this is still failing HiveComparisonTest due to a small error in the variance:

[info] !== HIVE - 1 row(s) ==   == CATALYST - 1 row(s) ==
abs(20428.072876000006 - 20428.07287599999) = 1.4552e-11

Not sure if we need to change the test?
Had an offline discussion with @yhuai. We can remove

Btw, we can remove
override def prettyName: String = "variance_samp"

override def toString: String = s"VAR_SAMP($child)"
}
What will be the error message if we call this function when spark.sql.useAggregate2=false? It would be good to provide a meaningful error message.
@yhuai This is the error I get when calling one of the new functions, skewness, on a DataFrame when spark.sql.useAggregate2=false:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 2, localhost): org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree:{colName}#2
I'm not exactly sure where we ought to throw the error. I'd appreciate any tips on how to do this.
Can you change with AggregateExpression to with AggregateExpression1? Then, in the newInstance method, we throw an UnsupportedOperationException to let users know that they need to set spark.sql.useAggregate2 to true.
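A minimal sketch of that suggestion, assuming the Spark 1.5-era AggregateExpression1 API with a newInstance method as described above; the message text is illustrative:

  // Old-path entry point: fail fast with an actionable message instead of
  // the binding error shown earlier in this thread.
  override def newInstance(): AggregateFunction1 = {
    throw new UnsupportedOperationException(
      "skewness is only supported by the new aggregation code path; " +
        "please set spark.sql.useAggregate2 to true")
  }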
@yhuai Thanks for the suggestion! I think that's working now if you want to review it :)
Test build #44595 has finished for PR 9003 at commit
Test build #44611 has finished for PR 9003 at commit
Thanks @mengxr, I will send a PR for Stddev.
Implementing skewness and kurtosis support based on the following algorithm:
https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Higher-order_statistics
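For reference, a standalone sketch of the single-pass update from that page for the higher-order moment sums; the variable names are mine, not the PR's:

  // One-pass update of count, mean and the central moment sums M2, M3, M4
  // for a new observation x, following the higher-order recurrence on the linked page.
  case class Moments(n: Double, mean: Double, m2: Double, m3: Double, m4: Double) {
    def update(x: Double): Moments = {
      val n1 = n
      val newN = n + 1.0
      val delta = x - mean
      val deltaN = delta / newN
      val deltaN2 = deltaN * deltaN
      val term1 = delta * deltaN * n1
      val newM4 = m4 + term1 * deltaN2 * (newN * newN - 3 * newN + 3) +
        6 * deltaN2 * m2 - 4 * deltaN * m3
      val newM3 = m3 + term1 * deltaN * (newN - 2) - 3 * deltaN * m2
      Moments(newN, mean + deltaN, m2 + term1, newM3, newM4)
    }
  }

Skewness and excess kurtosis then follow from the sums as sqrt(n) * M3 / M2^1.5 and n * M4 / (M2 * M2) - 3, per the same page.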