-
Notifications
You must be signed in to change notification settings - Fork 45
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add rf_agg_approx_quantiles function #429
Conversation
Initial implementation with quantiles Signed-off-by: Jason T. Brown <jason@astraea.earth>
|
||
describe("DataFrame.tileStats extension methods") { | ||
|
||
val df = TestData.sampleGeoTiff.toDF() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would like to see a test of this with projected raster tile, raster ref, raster source etc. maybe in integration tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can get there with Expressions...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I working on some alternative approaches to discuss that I think will be more "idiomatic RasterFrames".
@@ -155,6 +155,9 @@ trait DataFrameMethods[DF <: DataFrame] extends MethodExtensions[DF] with Metada | |||
def withPrefixedColumnNames(prefix: String): DF = | |||
self.columns.foldLeft(self)((df, c) ⇒ df.withColumnRenamed(c, s"$prefix$c").asInstanceOf[DF]) | |||
|
|||
/** */ | |||
def tileStat(): RasterFrameStatFunctions = new RasterFrameStatFunctions(self) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Working on an alternative approach to discuss, one that takes an approach more in line with the other aggregate functions (i.e. rf_agg_foo_bar
). While I know that Spark DataFrames uses this extension method approach in a couple of places (stats
, na
, ...) I think it adds an inconsistency that can feel arbitrary if the same thing can be achieved with columnar functions. na
is a good example of one that can't, because it's effectively implementing a builder for NA handling, but I don't think were in that situation.
* | ||
* @since 2.0.0 | ||
*/ | ||
def approxTileQuantile( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this copied? If so, we need to attribute and reference, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ya oops
|
||
describe("DataFrame.tileStats extension methods") { | ||
|
||
val df = TestData.sampleGeoTiff.toDF() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can get there with Expressions...
val result = df.tileStat().approxTileQuantile( | ||
Array("tile", "tilePlus2"), | ||
Array(0.25, 0.75), | ||
0.00001 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any way to default this? Or is it strongly scale dependent?
…lt-in functions all the way to a custom aggregate expression.
See this for a couple alternate approaches that don't depend DataFrame extension methods. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See alternate approaches and let's converge on one.
This reverts commit 456914d.
Alternate approaches to exposing quantile summaries API
Attempt to register with SQL Signed-off-by: Jason T. Brown <jason@astraea.earth>
@metasim tests failing on the SQL expression. Not sure how to work in all the different parameters. We could easily register some functions to compute default percentiles like median, quartiles, quantiles, deciles, etc at default relative error. |
@vpipkt I think enabling SQL support is going to take some more work, given the SQL function parameter requirements.... perhaps we need to look for something in the official API that uses non-columnar parameters to a function (does it exist)? For some reason I'm loath to make all of those parameters columnar, but perhaps that's the proper way to do it I think the current question is do we merge this without support for SQL, or do we work this out first? |
I am fine with a little divergence in the API between SQL and python/scala over this. Possible paths:
|
Signed-off-by: Jason T. Brown <jason@astraea.earth>
Latest commit should pass tests and is basically the option 1 above Recommend we add an issue for option 2, lower priority. |
Signed-off-by: Jason T. Brown <jason@astraea.earth>
tile.foreachDouble(d => if (!isNoData(d)) result = result.insert(d)) | ||
buffer.update(0, result.toRow) | ||
} | ||
else buffer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Compiler is complaining about this. Since the function returns Unit
this line could be omitted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeh, that was me not thinking "mutably". It can go.
Signed-off-by: Jason T. Brown <jason@astraea.earth>
Initial implementation with quantiles
To Do