BUG: groupby().any() returns True instead of False for groups where timedelta column is all null #59712

sfc-gh-mvashishtha · 2024-09-04T23:21:23Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

pd.DataFrame([pd.Timedelta(1), pd.NaT]).groupby([0, 1]).any()

Issue Description

For other dtypes, like integers and strings, groupby().any() returns False for groups where all the values are null, e.g.

pd.DataFrame([1, None]).groupby([0, 1]).any()

pd.DataFrame(["a", None]).groupby([0, 1]).any()
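
For reference, a minimal sketch of the baseline behavior for those two dtypes. The column name `value` and the explicit `dtype=object` are choices made for this illustration and are not part of the original report. The group labels `[0, 1]` put each row in its own group, so group 1 holds only a null.

```python
import pandas as pd

# Each row is its own group; group 1 contains only a missing value.
group_labels = [0, 1]

# Numeric column: None becomes NaN, which any() skips by default.
numeric = pd.DataFrame({"value": [1, None]})
numeric_result = numeric.groupby(group_labels).any()

# String column, kept as object dtype for this illustration (an assumption).
strings = pd.DataFrame({"value": pd.Series(["a", None], dtype=object)})
string_result = strings.groupby(group_labels).any()

print(numeric_result["value"].tolist())  # [True, False]
print(string_result["value"].tolist())   # [True, False]
```

In both cases the all-null group reports False, which is the result the timedelta case is expected to match.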

Expected Behavior

groupby().any() should return False for groups where all the timedelta values are null.

Installed Versions

INSTALLED VERSIONS

commit : d9cdd2e
python : 3.9.18.final.0
python-bits : 64
OS : Darwin
OS-release : 23.6.0
Version : Darwin Kernel Version 23.6.0: Mon Jul 29 21:13:04 PDT 2024; root:xnu-10063.141.2~1/RELEASE_ARM64_T6020
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.2.2
numpy : 1.26.3
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 68.2.2
pip : 23.3.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 8.18.1
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.4
qtpy : None
pyqt5 : None

rhshadrach · 2024-09-07T12:22:31Z

Thanks for the report! Confirmed on main. It appears to me the issue is that we view the values as integers before computing the mask of whether the values are NA or not.

pandas/pandas/core/groupby/ops.py

Lines 375 to 384 in 80b6850

        values = values.view("int64")
        is_numeric = True
    elif dtype.kind == "b":
        values = values.view("uint8")
    if values.dtype == "float16":
        values = values.astype(np.float32)
    if self.how in ["any", "all"]:
        if mask is None:
            mask = isna(values)

I believe switching the order of these will resolve the bug, PRs to fix are welcome!
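
To illustrate why the order matters, here is a small sketch that stays outside pandas internals. `grouped_any` is a hypothetical helper written for this illustration, not a pandas function. It only mimics a skipna `any` over a single group, given integer values and an NA mask.

```python
import numpy as np
import pandas as pd

# A group whose only value is missing.
values = np.array([np.timedelta64("NaT", "ns")])


def grouped_any(int_values, mask):
    """Hypothetical stand-in for a skipna any(): True if an unmasked value is nonzero."""
    return bool(((int_values != 0) & ~mask).any())


# Current order: view as int64 first, then compute the mask.
# NaT is stored as the minimum int64, which isna() treats as a regular integer.
int_values = values.view("int64")
late_mask = pd.isna(int_values)

# Proposed order: compute the mask while the dtype is still timedelta64.
early_mask = pd.isna(values)

print(grouped_any(int_values, late_mask))   # True  (the reported bug)
print(grouped_any(int_values, early_mask))  # False (the expected result)
```

With the mask computed before the view, the NaT entry is excluded and the all-null group comes out as False.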

rhshadrach · 2024-09-07T12:23:32Z

Hopefully this is an easy fix (and will need a test!), so marking as a good first issue.

vivrdprasanna · 2024-09-07T19:40:47Z

take

40gilad · 2024-09-08T09:13:59Z

took it !

vivrdprasanna · 2024-09-08T23:08:42Z

Hey @40gilad, as a first-time contributor, I'd hoped to take a stab at this issue. I see that you took it upon yourself to submit a proposed fix. No worries at all. I'll move on to another issue, but just wanted to flag this for future reference.

Rahul20037237 · 2024-09-10T00:48:27Z

take

Petroncini · 2024-09-11T02:52:08Z

take

prafulmaka · 2024-09-22T20:42:38Z

What's still pending here?

mpvaldez · 2024-09-25T13:34:32Z

Hi! Can I take this? I want to start collaborating and I found the solution

sfc-gh-mvashishtha added the Bug and Needs Triage labels Sep 4, 2024

rhshadrach added the Groupby, good first issue, and Reduction Operations labels and removed the good first issue and Needs Triage labels Sep 7, 2024

github-actions bot assigned vivrdprasanna Sep 7, 2024

40gilad mentioned this issue Sep 8, 2024

BUG: Fix groupby().any() behavior for timedelta columns with all null values fix issue #59712 #59750

Closed

github-actions bot assigned Rahul20037237 Sep 10, 2024

github-actions bot assigned Petroncini Sep 11, 2024

Petroncini mentioned this issue Sep 11, 2024

BUG: groupby().any() returns true for groups with timedelta all NaT #59782

Merged

Rahul20037237 removed their assignment Sep 17, 2024

vivrdprasanna removed their assignment Sep 21, 2024

rhshadrach added this to the 3.0 milestone Oct 1, 2024

rhshadrach closed this as completed in #59782 Oct 1, 2024
