-
Notifications
You must be signed in to change notification settings - Fork 121
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. Weβll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add Expr|Series.rolling_mean
method
#1290
Conversation
Expr|Series.rolling_mean
Expr|Series.rolling_mean
method
thanks @FBruzzesi ! |
Sure, even better to start simpler from there :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
awesome, thanks
pending on the unstable api warning being introduced (#1367), I just have a comment about pyarrow
if we're not sure, i think it's ok to leave it out for now
maybe one day we can get to the point where just having something be part of the Narwhals API is enough to put light pressure on dataframe authors to support a function π
narwhals/_arrow/utils.py
Outdated
@@ -452,3 +455,40 @@ def _parse_time_format(arr: pa.Array) -> str: | |||
if pc.all(matches.is_valid()).as_py(): | |||
return time_fmt | |||
return "" | |||
|
|||
|
|||
def _window_agg( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we're still iterating in Python here...dunno, maybe we should just raise NotImplementedError here?
curious to see a timing comparison of:
- convert to pandas, do rolling mean, back to pyarrow
- do it with this function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will give it a shot and report back :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep this is definitly orders of magnitude slower. I will try with a specified implementation
Thanks for the feedback Marco
My concern is that for plotly we convert dataframe that support interchange protocol to pyarrow and not to pandas. I am afraid that not supporting it would break some user workflows. If that's how we aim to proceed, I would suggest to convert interchange protocol to pandas as it was done before the narwhals PR
That would be nice π |
are we sure it would break a workflow? as far as I can tell, in plotly, you convert to pandas for non-pandas input in trendline so, if someone was using plotly before and interchanging to pandas, then it means they already have pandas installed, so this shouldn't break their workflow if i understand correctly |
Yes correct, this is the current behavior, yet the use case for supporting rolling, expanding, and ewm is to eventually completely remove such conversion in trendline plotly module and use the narwhals functionalities instead. Therefore, if a user provides a dataframe with interchange protocol that we don't support natively, then it gets converted to arrow, and if rolling is not supported, then it would end up breaking. Does this make sense? |
would it work to use the narwhals functionalities if available, and otherwise just convert to pandas? |
@@ -732,3 +733,39 @@ def validate_strict_and_pass_though( | |||
msg = "Cannot pass both `strict` and `pass_through`" | |||
raise ValueError(msg) | |||
return pass_through | |||
|
|||
|
|||
def _validate_rolling_arguments( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure is this is the right place to keep private functionalities
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
narwhals/_arrow/series.py
Outdated
rolling_sum = self.rolling_sum( | ||
window_size=window_size, min_periods=min_periods, center=center | ||
) | ||
rolling_count = ( | ||
(~self.is_null()) | ||
.cast(self._dtypes.Int32()) | ||
.rolling_sum(window_size=window_size, min_periods=min_periods, center=center) | ||
) | ||
return rolling_sum / rolling_count |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a lazy way of doing it. I wanted to give it a try, yet timing on 1M rows is 50% slower than pandas.
Performances can be enhanced with the same routine of rolling_sum
, and dividing by the count at the end. I will address it.
What type of PR is this? (check all applicable)
Related issues
Checklist
If you have comments or can explain your changes, please do so below.
Opening as draft because of the following:
anymost other aggregate functions, but would like a feedback on that