ENH: Support skipna parameter in GroupBy min, max, prod, median, var, std and sem methods #60752
Looks pretty good. One question
A few smaller comments. @rhshadrach mind taking a look as well?
if not isna_entry:
    nobs[lab, j] += 1
    oldmean = mean[lab, j]
    mean[lab, j] += (val - oldmean) / nobs[lab, j]
    out[lab, j] += (val - mean[lab, j]) * (val - oldmean)
elif not skipna:
In the case `skipna` is `True`, wouldn't we still need to assign to `out` here?
I don't think so, because if `skipna` is `True` and the value is NA, we skip the value and thus retain the existing behavior.
What's the expected result when the group has all NA values?
@rhshadrach In the case of all-NA values, the result would be NA regardless of `skipna`, i.e. consistent with `Series.mean()` etc.
>>> pd.Series([np.nan]*10).groupby(by=["A","B"]*5).mean(skipna=True)
A NaN
B NaN
dtype: float64
>>> pd.Series([np.nan]*10).groupby(by=["A","B"]*5).mean(skipna=False)
A NaN
B NaN
dtype: float64
>>> pd.Series([np.nan]*10).mean(skipna=True)
nan
>>> pd.Series([np.nan]*10).mean(skipna=False)
nan
Looks great! Can you add a test (adding to your current parametrizations would be fine) where the entire group is NA?
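For readers following the thread, here is a rough pure-Python sketch of the behavior discussed above: a Welford-style running update per group, with an NA value "poisoning" its group when `skipna` is `False`. It is an illustration only, not the Cython kernel from this PR; the function name `grouped_var`, the `poisoned` flag, and the `ddof` default are assumptions made for the example.

```python
import math


def grouped_var(values, labels, ngroups, skipna=True, ddof=1):
    """Illustrative grouped variance (hypothetical helper, not pandas API)."""
    nobs = [0] * ngroups          # count of non-NA observations per group
    mean = [0.0] * ngroups        # running mean per group
    m2 = [0.0] * ngroups          # running sum of squared deviations
    poisoned = [False] * ngroups  # True once an NA is seen with skipna=False

    for val, lab in zip(values, labels):
        if lab < 0 or poisoned[lab]:
            # Negative labels mark rows without a group; a poisoned
            # group stays NA no matter what follows.
            continue
        if not math.isnan(val):
            # Welford update, mirroring the snippet under review.
            nobs[lab] += 1
            oldmean = mean[lab]
            mean[lab] += (val - oldmean) / nobs[lab]
            m2[lab] += (val - mean[lab]) * (val - oldmean)
        elif not skipna:
            # NA with skipna=False: the whole group becomes NA.
            # The NA is not counted as an observation.
            poisoned[lab] = True
        # NA with skipna=True: the value is simply skipped.

    out = []
    for g in range(ngroups):
        if poisoned[g] or nobs[g] <= ddof:
            # Covers all-NA groups too: nobs stays 0, so the result is NA.
            out.append(math.nan)
        else:
            out.append(m2[g] / (nobs[g] - ddof))
    return out


nan = math.nan
vals = [1.0, 2.0, nan, 4.0, 3.0, 5.0]
labs = [0, 0, 0, 0, 1, 1]
print(grouped_var(vals, labs, 2, skipna=True))   # group 0 ignores the NA
print(grouped_var(vals, labs, 2, skipna=False))  # group 0 becomes NaN
print(grouped_var([nan, nan], [0, 0], 1, skipna=True))  # all-NA group: NaN
```

In this sketch an all-NA group yields NaN for either value of `skipna`, which matches the expectation stated in the thread.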
pandas/core/_numba/kernels/var_.py
Outdated
if not skipna and np.isnan(val):
    output[lab] = np.nan
    nobs_arr[lab] += 1
Might make no difference, but don't we usually think of NA values as not being observations?
Agree that it makes no difference, but my rationale was that if skipna is False, NAs can be considered valid observations. Happy to change it if you think it should not update nobs.
No real disagreement with your rationale (or agreement for that matter 😄), but for the ops I spot-checked we consistently do not count NA values as observations, regardless of `skipna`. I think we should be consistent here.
Good point! Removed that line.
Thanks for the review @rhshadrach. Added the all-NA tests and responded to comments.
Failure on the future infer string is unrelated (and is fixed by #60796). Rerunning Ubuntu 310 just to be sure.
doc/source/whatsnew/v3.0.0.rst file if fixing a bug or adding a new feature.
Second (and final) batch of GroupBy reductions being enhanced to support the `skipna` parameter.