feat: add support for `Series|Expr.skew` method #1173

CarloLepelaars · 2024-10-14T15:51:04Z

This PR adds skew to Narwhals. Support is added for Polars, Pandas-like, Arrow and Dask.

Checklist

Code follows style guide (ruff)
Tests added
Documented the changes

MarcoGorelli

Awesome effort, thanks @CarloLepelaars, good to have you as a contributor! Looks like there's a doctest failure.

CarloLepelaars · 2024-10-14T16:29:58Z

Thanks for the kind words! Doctest should be fixed now.

MarcoGorelli

thanks for updating, just left some comments (i'm a little tired today though so sorry if my comments don't make sense 😅 )

narwhals/_arrow/series.py

narwhals/_pandas_like/series.py

narwhals/expr.py

MarcoGorelli · 2024-10-14T17:15:13Z

btw, if you wanted to just fix a typo somewhere in a separate pr (or, say, take #1170), then once you're already a contributor, CI will always run automatically without me having to approve and run - just bringing this up in case it makes it easier for you

FBruzzesi

Hey @CarloLepelaars, thanks for the PR!

I left a few comments - the main challenge seems to be how different the implementations are between the pandas and polars native methods. However, polars provides the formula it uses for the computation. It should be possible to reproduce that with native methods or using the series/expr methods that are already implemented in narwhals :)

narwhals/_arrow/namespace.py

FBruzzesi · 2024-10-14T18:15:44Z

narwhals/_arrow/series.py

@@ -298,6 +299,17 @@ def std(self, ddof: int = 1) -> int:

        return pc.stddev(self._native_series, ddof=ddof)  # type: ignore[no-any-return]

+    def skew(self) -> float:


Although it would end up returning a pyarrow scalar, I think we should keep the implementation with native methods, or you can reuse the methods that are already implemented, such as the elementary operations.

narwhals/_pandas_like/namespace.py

narwhals/_pandas_like/series.py

narwhals/_polars/namespace.py

narwhals/expr.py

narwhals/_pandas_like/series.py

FBruzzesi · 2024-10-14T18:21:53Z

narwhals/series.py

@@ -519,6 +519,40 @@ def mean(self) -> Any:
        """
        return self._compliant_series.mean()

+    def skew(self) -> Any:


Same as Expr.skew, polars exposes a bias parameter

See conversation in narwhals/expr.py

CarloLepelaars · 2024-10-14T20:03:46Z

Hey @CarloLepelaars, thanks for the PR!

I left a few comments - the main challenge seems to be how different the implementations are between the pandas and polars native methods. However, polars provides the formula it uses for the computation. It should be possible to reproduce that with native methods or using the series/expr methods that are already implemented in narwhals :)

This is indeed challenging @FBruzzesi. I've made it so every backend returns the biased population skewness, but we can potentially include an option for the unbiased skewness.
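For reference, the biased population skewness mentioned here is the third central moment divided by the second central moment raised to the power 1.5. Below is a minimal pure-Python sketch of that formula, not the code from this PR; the name `biased_skew` is made up for illustration, and empty or constant inputs are deliberately not handled.

```python
from __future__ import annotations


def biased_skew(values: list[float]) -> float:
    """Biased (population) skewness: m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    # Central moments divide by n, with no small-sample correction.
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2**1.5


symmetric = biased_skew([1.0, 2.0, 3.0])      # symmetric data
right_tailed = biased_skew([1.0, 2.0, 10.0])  # long right tail
```

Symmetric data gives a skewness of zero, and data with a long right tail gives a positive value.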

CarloLepelaars · 2024-10-17T18:31:08Z

Hmm, any idea what this last error for Marimo Python 3.12 is about? This is the only workflow breaking.

FAILED tests/_plugins/ui/_impl/tables/test_narwhals.py::TestNarwhalsTableManagerFactory::test_complex_data_field_types - TypeError: write() argument must be str, not dict

FBruzzesi

Hey @CarloLepelaars thanks for adjusting! This looks better now!

I left a comment for the pyarrow case, and I have two other considerations:

Should we account for the len(ser) < 3 case and return 0?
It may be worth checking that the numbers are the same even when nulls are present

narwhals/_arrow/series.py

narwhals/series.py

CarloLepelaars · 2024-10-18T13:04:07Z

Should we account for the len(ser) < 3 case and return 0?

Let's see, this is where Pandas diverges from the rest. To make it consistent we should only handle the case where len(data)==2. In that case Pandas and PyArrow can return 0. Do you also think that is the way to go?

I thought that Pandas uses the SciPy implementation of skew under the hood, but apparently they are different?

>>> import pandas as pd
>>> import polars as pl
>>> from scipy.stats import skew
>>> sample_data = [2, 10]
>>> scipy_skew = skew(sample_data)
>>> pandas_skew = pd.Series(sample_data).skew()
>>> polars_skew = pl.Series(sample_data).skew()
>>> print("Skewness for 2 elements:")
>>> print(f"SciPy:  {scipy_skew:.6f}")
>>> print(f"Pandas: {pandas_skew:.6f}")
>>> print(f"Polars: {polars_skew:.6f}")

Skewness for 2 elements:
SciPy:  0.000000
Pandas: nan
Polars: 0.000000
# ----------------------------------------------
>>> sample_data = [2]
>>> scipy_skew = skew(sample_data)
>>> pandas_skew = pd.Series(sample_data).skew()
>>> polars_skew = pl.Series(sample_data).skew()
>>> print("Skewness for 1 element:")
>>> print(f"SciPy:  {scipy_skew:.6f}")
>>> print(f"Pandas: {pandas_skew:.6f}")
>>> print(f"Polars: {polars_skew:.6f}")

Skewness for 1 element:
SciPy:  nan
Pandas: nan
Polars: nan
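One likely explanation for the difference, stated as an assumption rather than something confirmed in this thread: pandas applies a small-sample bias correction whose denominator contains (n - 2), while SciPy's default and Polars use the biased estimator. A rough sketch of the adjusted Fisher-Pearson coefficient follows; `adjusted_skew` is a hypothetical helper, not pandas' actual code, and constant inputs are not handled.

```python
from __future__ import annotations

import math


def adjusted_skew(values: list[float]) -> float:
    """Bias-corrected skewness: sqrt(n * (n - 1)) / (n - 2) * g1."""
    n = len(values)
    if n < 3:
        # The correction factor divides by (n - 2), so it is undefined here.
        return float("nan")
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    g1 = m3 / m2**1.5  # biased skewness
    return math.sqrt(n * (n - 1)) / (n - 2) * g1


two_elements = adjusted_skew([2.0, 10.0])
three_elements = adjusted_skew([1.0, 2.0, 10.0])
```

Under this assumption, a two-element input has no defined adjusted skewness, which would match the nan that pandas reports above.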

It may be worth checking that the numbers are the same even when nulls are present

Good one! Can add a case in unary_test.py that has nulls.

FBruzzesi · 2024-10-18T13:33:07Z

Let's see, this is where Pandas diverges from the rest. To make it consistent we should only handle the case where len(data)==2. In that case Pandas and PyArrow can return 0. Do you also think that is the way to go?

Yes, we are trying to stick with polars api and behavior, so let's manually force that if needed!

Good one! Can add a case in unary_test.py that has nulls.

That would be great - if it is too much though, we can also make it in a follow up PR

CarloLepelaars · 2024-10-18T15:28:34Z

@FBruzzesi

I've covered the cases as discussed and made them consistent with Polars behavior. unary_test.py now also covers data with nan and cases where there are less than 3 rows.

FBruzzesi · 2024-10-18T21:12:44Z

I've covered the cases as discussed and made them consistent with Polars behavior. unary_test.py now also covers data with nan and cases where there are less than 3 rows.

Thanks for addressing the cases, the CI failure seems unrelated.

However I am still not quite sure that we are matching polars behavior. When counting number of elements for the base cases, we should ignore null values, then (pseudo code):

if n_not_nulls==0:
    return None   # same as pl.Series([]).skew() and pl.Series([None]).skew()
elif n_not_nulls==1:
    return float("nan")  # same as pl.Series([1]).skew() and pl.Series([1, None]).skew()
elif n_not_nulls==2:
    return 0.0  # same as pl.Series([1, 2]).skew() and pl.Series([1, 2, None]).skew()
else:
    return <compute_skew>
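The pseudo code above could be fleshed out roughly as follows. This is a plain-Python sketch over a list that may contain None, not the implementation from this PR. The name `skew_with_base_cases` is hypothetical, and returning nan for zero variance is an assumption: a later comment in this thread only says that returning 0 in that branch was not right.

```python
from __future__ import annotations

import math


def skew_with_base_cases(values: list[float | None]) -> float | None:
    # Nulls are dropped before counting, as suggested above.
    non_null = [x for x in values if x is not None]
    n = len(non_null)
    if n == 0:
        return None
    if n == 1:
        return float("nan")
    if n == 2:
        return 0.0
    mean = sum(non_null) / n
    m2 = sum((x - mean) ** 2 for x in non_null) / n
    m3 = sum((x - mean) ** 3 for x in non_null) / n
    if m2 == 0:
        # Assumption: constant data is reported as nan (a 0/0 case).
        return float("nan")
    return m3 / m2**1.5


empty_case = skew_with_base_cases([None])
single_case = skew_with_base_cases([1.0, None])
pair_case = skew_with_base_cases([1.0, 2.0, None])
```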

CarloLepelaars · 2024-10-23T13:18:43Z

Implemented your suggestions for nan policy. There is only one edge case left for Dask, where it outputs nan instead of 0.0 with 2 non null elements. Not sure how to adjust _dask/expr.py to account for that.

FBruzzesi · 2024-10-24T07:23:02Z

Hey @CarloLepelaars, thanks for adjusting. CI is failing because in #1224, compare_dicts was renamed to assert_equal_data.

Implemented your suggestions for nan policy. There is only one edge case left for Dask, where it outputs nan instead of 0.0 with 2 non null elements. Not sure how to adjust _dask/expr.py to account for that.

Regarding dask, I am not able to try it now, it could definitely be a tricky one to get right! I am ok with marking it as xfail in tests for now

FBruzzesi · 2024-11-12T21:08:24Z

narwhals/_dask/expr.py

+    def skew(self: Self) -> Self:
+        return self._from_call(
+            lambda _input: _input.skew(),
+            "skew",
+            returns_scalar=True,
+        )


In the case of dask, the behavior is not 100% consistent with polars for lengths 0, 1 and 2.
Honestly, I am ok with that. The majority of use cases, especially if distributed data is needed, should not involve those sizes to begin with.

MarcoGorelli

awesome work, thanks - just got a comment on the warnings

MarcoGorelli · 2024-11-13T11:42:17Z

tests/expr_and_series/unary_test.py

+    data = {"a": [1], "b": [2], "c": [float("nan")]}
+    # Dask runs into a divide by zero RuntimeWarning for 1 element skew.
+    with warnings.catch_warnings():
+        warnings.simplefilter("ignore")


any chance we could make this a more targeted warning filter? just in case we accidentally filter out warnings we should pay attention to

i.e. warnings.filterwarnings with message, category, action
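A targeted filter along these lines could look like the sketch below, using `warnings.filterwarnings` from the standard library. The warning messages and the `emit_warnings` helper are made up for illustration; the exact text that Dask emits is not shown in this thread, so the pattern would need to be adapted.

```python
import warnings


def emit_warnings() -> None:
    # Stand-in for a computation that warns; both messages are made up.
    warnings.warn("divide by zero encountered in divide", RuntimeWarning)
    warnings.warn("something we do want to see", UserWarning)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # record everything by default
    # Ignore only RuntimeWarnings whose message starts with this pattern.
    warnings.filterwarnings(
        "ignore",
        message="divide by zero",
        category=RuntimeWarning,
    )
    emit_warnings()

# Only the unrelated warning is still visible.
remaining = [str(w.message) for w in caught]
```

Unlike a blanket `warnings.simplefilter("ignore")`, any other warning raised inside the block still surfaces.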

I used my favorite trick once again 😂

MarcoGorelli · 2024-11-13T13:17:20Z

thanks both! should be good, will do another check but this should make it into the next release

CarloLepelaars · 2024-11-13T14:18:25Z

Awesome, thank you both for working with me on this! Interesting trick to match the warning to Dask only.


MarcoGorelli

thanks @CarloLepelaars, and @FBruzzesi for review!

in general we're returning native scalars (e.g. numpy scalars for pandas, pyarrow scalars for pyarrow) so I've kept that consistent with the rest of the api here


MarcoGorelli · 2024-11-23T12:16:16Z

just pushed a fix as the else part of m3 / (m2**1.5) if m2 != 0 else 0 wasn't right nor tested

will merge on green and this can enter the next release 🥦

Implement skew for Arrow, Pandas-like and Polars

90d9742

CarloLepelaars changed the title ~~Skewness~~ feat: skew Oct 14, 2024

CarloLepelaars changed the title ~~feat: skew~~ feat: skew Oct 14, 2024

github-actions bot added the enhancement New feature or request label Oct 14, 2024

MarcoGorelli reviewed Oct 14, 2024


Fix doctests

c82fec1

MarcoGorelli reviewed Oct 14, 2024


narwhals/_arrow/series.py (outdated)

narwhals/_pandas_like/series.py (outdated)

narwhals/expr.py (outdated)

FBruzzesi reviewed Oct 14, 2024


CarloLepelaars added 2 commits October 14, 2024 21:43

Remove skew in namespace. Remove n > 3 requirement. Fix expr doc

e118e4d

Use biases population skewness

2530f81

CarloLepelaars added 5 commits October 15, 2024 18:09

Add pyarrow example for skew Expr

fc37529

Merge branch 'main' into feat/skew

be2f503

Fix: Add a_skew to schema

02fdb4c

Use native operation for PandasLikeSeries skew. Dask skew expr

895be9c

Use native pyarrow operations for skew

a3b71bc

Merge branch 'main' into feat/skew

9ed06d7

FBruzzesi reviewed Oct 17, 2024


narwhals/_arrow/series.py (outdated)

narwhals/series.py (outdated)

Simplify arrow skew. non-trivial example for series.skew.

4ff077d

unary_test with nan data. 2 element and 1 element unary tests

11efd49

Fix doctest for Series skew

26a64f8

Make skew nan policy consistent with Polars

2014036

FBruzzesi added 3 commits October 29, 2024 08:40

Merge branch 'main' into feat/skew

aaada24

merge main

3e7eeab

merge main and add test for coverage

7f6fe07

FBruzzesi reviewed Nov 12, 2024


FBruzzesi changed the title ~~feat: skew~~ feat: add support for Series|Expr.skew method Nov 12, 2024

MarcoGorelli reviewed Nov 13, 2024


FBruzzesi added 2 commits November 13, 2024 13:45

Merge branch 'main' into feat/skew

2f2912c

match RuntimeWarning for dask only

082664f

MarcoGorelli and others added 3 commits November 23, 2024 11:57

Merge remote-tracking branch 'upstream/main' into feat/skew

56299a3

stay pyarrow-native longer

2dac3f3

[pre-commit.ci] auto fixes from pre-commit.com hooks

0d3b6ec

for more information, see https://pre-commit.ci

MarcoGorelli approved these changes Nov 23, 2024


MarcoGorelli added 2 commits November 23, 2024 12:14

fix mistake

7f91b19

Merge branch 'feat/skew' of github.com:CarloLepelaars/narwhals into feat/skew

3399530

doctest

87f71d7

MarcoGorelli merged commit 35c34f4 into narwhals-dev:main Nov 23, 2024
23 checks passed

CarloLepelaars deleted the feat/skew branch November 23, 2024 13:55
