Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Support skipna parameter in GroupBy min, max, prod, median, var, std and sem methods #60752

Open
wants to merge 12 commits into
base: main
Choose a base branch
from

Conversation

snitish
Copy link
Contributor

@snitish snitish commented Jan 22, 2025

Second (and final) batch of GroupBy reductions being enhanced to support the skipna parameter.

@snitish snitish marked this pull request as draft January 22, 2025 01:33
pandas/_libs/groupby.pyx Outdated Show resolved Hide resolved
@snitish snitish changed the title ENH: Support skipna parameter in GroupBy prod, var, std and sem methods ENH: Support skipna parameter in GroupBy min, max, prod, var, std and sem methods Jan 22, 2025
@snitish snitish marked this pull request as ready for review January 22, 2025 04:14
@snitish snitish changed the title ENH: Support skipna parameter in GroupBy min, max, prod, var, std and sem methods ENH: Support skipna parameter in GroupBy min, max, prod, median, var, std and sem methods Jan 22, 2025
Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good. One question

pandas/core/_numba/kernels/min_max_.py Outdated Show resolved Hide resolved
pandas/_libs/groupby.pyx Outdated Show resolved Hide resolved
Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few smaller comments. @rhshadrach mind taking a look as well?


if not isna_entry:
nobs[lab, j] += 1
oldmean = mean[lab, j]
mean[lab, j] += (val - oldmean) / nobs[lab, j]
out[lab, j] += (val - mean[lab, j]) * (val - oldmean)
elif not skipna:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the case skipna is True wouldn't we still need to assign to out here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so, because if skipna is True and value is NA, we skip the value and thus retain existing behavior.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the expected result when the group has all NA values?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rhshadrach In case of all-NA values, the result would be NA regardless of skipna, i.e. consistent with Series.mean() etc.

>>> pd.Series([np.nan]*10).groupby(by=["A","B"]*5).mean(skipna=True)
A   NaN
B   NaN
dtype: float64
>>> pd.Series([np.nan]*10).groupby(by=["A","B"]*5).mean(skipna=False)
A   NaN
B   NaN
dtype: float64
>>> pd.Series([np.nan]*10).mean(skipna=True)
nan
>>> pd.Series([np.nan]*10).mean(skipna=False)
nan

pandas/core/resample.py Show resolved Hide resolved
pandas/core/resample.py Show resolved Hide resolved
@rhshadrach rhshadrach added Enhancement Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Jan 25, 2025
@rhshadrach rhshadrach added this to the 3.0 milestone Jan 25, 2025
Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Can you add a test (adding to your current parametrizations would be fine) where the entire group is NA.


if not isna_entry:
nobs[lab, j] += 1
oldmean = mean[lab, j]
mean[lab, j] += (val - oldmean) / nobs[lab, j]
out[lab, j] += (val - mean[lab, j]) * (val - oldmean)
elif not skipna:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the expected result when the group has all NA values?


if not skipna and np.isnan(val):
output[lab] = np.nan
nobs_arr[lab] += 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might make no difference, but don't we usually think of NA values as not being observations?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree that it makes no difference, but my rationale was that if skipna is False, NAs can be considered valid observations. Happy to change it if you think it should not update nobs.

Copy link
Member

@rhshadrach rhshadrach Jan 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No real disagreement with your rational (or agreement for that matter 😄), but for the ops I spot checked we consistently do not count NA values as observations, regardless of skipna. I think we should be consistent here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! Removed that line.

pandas/core/groupby/groupby.py Show resolved Hide resolved
pandas/core/resample.py Show resolved Hide resolved
pandas/core/resample.py Show resolved Hide resolved
@snitish
Copy link
Contributor Author

snitish commented Jan 25, 2025

Looks great! Can you add a test (adding to your current parametrizations would be fine) where the entire group is NA.

Thanks for the review @rhshadrach. Added the all-NA tests and responded to comments.

@rhshadrach
Copy link
Member

Failure on the future infer string is unrelated (and is fixed by #60796). Rerunning Ubuntu 310 just to be sure.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: enable skipna on groupby reduction ops
3 participants