-
Notifications
You must be signed in to change notification settings - Fork 121
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add support for Series|Expr.skew
method
#1173
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome effort, thanks @CarloLepelaars , good to have you as contributor! Looks like there's a doctest failure
Thanks for the kind words! Doctest should be fixed now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for updating, just left some comments (i'm a little tired today though so sorry if my comments don't make sense 😅 )
btw, if you wanted to just fix a typo somewhere in a separate pr (or, say, take #1170), then once you're already a contributor, CI will always run automatically without me having to approve and run - just bringing this up in case it makes it easier for you |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @CarloLepelaars, thanks for the PR!
I left a few comments - the main challenge seems to be how different implementations are between pandas and polars native methods. However polars provide the formula it uses for the computation. It should be possible to reproduce that with native methods or using the series/expr methods that are already implemented in narwhals :)
narwhals/_arrow/series.py
Outdated
@@ -298,6 +299,17 @@ def std(self, ddof: int = 1) -> int: | |||
|
|||
return pc.stddev(self._native_series, ddof=ddof) # type: ignore[no-any-return] | |||
|
|||
def skew(self) -> float: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although it would end up returning a pyarrow scalar, I think we should keep the implementation with native methods, or you can reuse methods implemented, such as all elementary operations
narwhals/series.py
Outdated
@@ -519,6 +519,40 @@ def mean(self) -> Any: | |||
""" | |||
return self._compliant_series.mean() | |||
|
|||
def skew(self) -> Any: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as Expr.skew
, polars exposes a bias
parameter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See conversation in narwhals/expr.py
This is indeed challenging @FBruzzesi. I've made it so every backend returns the biased population skewness, but we can potentially include an option for the unbiased skewness. |
Hmm, any idea what this last error for Marimo Python 3.12 is about? This is the only workflow breaking. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @CarloLepelaars thanks for adjusting! This looks better now!
I left a comment for the pyarrow case, and I have other two considerations:
- Should we account for the
len(ser) < 3
case and return 0? - It may be worth checking that the numbers are same even when nulls are present
Let's see, this is where Pandas diverges from the rest. To make it consistent we should only handle the case where I thought that Pandas uses the SciPy implementation of
Good one! Can add a case in |
Yes, we are trying to stick with polars api and behavior, so let's manually force that if needed!
That would be great - if it is too much though, we can also make it in a follow up PR |
I've covered the cases as discussed and made them consistent with Polars behavior. |
Thanks for addressing the cases, the CI failure seems unrelated. However I am still not quite sure that we are matching polars behavior. When counting number of elements for the base cases, we should ignore null values, then (pseudo code): if n_not_nulls==0:
return None # same as pl.Series([]).skew() and pl.Series([None]).skew()
elif n_not_nulls==1:
return float("nan") # same as pl.Series([1]).skew() and pl.Series([1, None]).skew()
elif n_not_nulls==2:
return 0.0 # same as pl.Series([1, 2]).skew() and pl.Series([1, 2, None]).skew()
else:
return <compute_skew> |
Implemented your suggestions for nan policy. There is only one edge case left for Dask, where it outputs |
Hey @CarloLepelaars, thanks for adjusting. CI is failing because in #1224 ,
Regarding dask, I am not able to try it now, it could definitly be a tricky one to get right! I am ok with marking it as xfail in tests for now |
def skew(self: Self) -> Self: | ||
return self._from_call( | ||
lambda _input: _input.skew(), | ||
"skew", | ||
returns_scalar=True, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In case of dask, the behavior is not 100% consistent with polars for length 0, 1, 2.
Honestly, I am ok with that. The majority of use cases, especially if distributed data is needed should not involve those sizes to begin with
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
awesome work, thanks - just got a comment on the warnings
tests/expr_and_series/unary_test.py
Outdated
data = {"a": [1], "b": [2], "c": [float("nan")]} | ||
# Dask runs into a divide by zero RuntimeWarning for 1 element skew. | ||
with warnings.catch_warnings(): | ||
warnings.simplefilter("ignore") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any change we could make this a more targeted warning filter? just in case we accidentally filter out warnings we should pay attention to
i.e. warnings.filter
with message, category, action
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used my favorite trick once again 😂
thanks both! should be good, will do another check but this should make it into the next release |
Awesome, thank you both for working with me on this! Interesting trick to match the warning to Dask only. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @CarloLepelaars , and @FBruzzesi for review!
in general we're returning native scalars (e.g. numpy scalars for pandas, pyarrow scalars for pyarrow) so I've kept that consistent with the rest of the api here
just pushed a fix as the will merge on green and this can enter the next release 🥦 |
This PR adds
skew
to Narwhals. Support is added for Polars, Pandas-like, Arrow and Dask.Checklist