BUG: GROUPING UNPOPULATED DATAFRAME raises exception - index name clashes with duplicate column name #44350

Closed
3 tasks done
sjvdm opened this issue Nov 8, 2021 · 3 comments
Labels: Apply, Bug, Groupby


sjvdm commented Nov 8, 2021

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the master branch of pandas.

Reproducible Example

import pandas as pd

## GROUPING POPULATED DATAFRAME WORKS - index name does not clash with column name
df = pd.DataFrame(data={'f1': [1, 2], 'f2': [3, 4], 'f3': [5, 6]})
df.groupby('f1').apply(lambda x: x.loc[:, 'f2'].nunique()).reset_index()

#=======================================================================
## GROUPING UNPOPULATED DATAFRAME raises exception - index name clashes with duplicate column name
df = pd.DataFrame(columns=['f1', 'f2', 'f3'])
df = df.groupby('f1').apply(lambda x: x.loc[:, 'f2'].nunique()).reset_index()

# The line above raises:
# ValueError: cannot insert f1, already exists

Issue Description

Grouping a populated dataframe and grouping an empty dataframe behave inconsistently in terms of index naming. When the index is reset, the empty dataframe raises an exception because of a duplicate key.

Expected Behavior

Even when the dataframe is empty, it should not have the groupby key as a column (as with a populated dataframe)

Installed Versions

INSTALLED VERSIONS

commit : 945c9ed
python : 3.8.3.final.0
python-bits : 64
OS : Linux
OS-release : 4.18.0-240.22.1.el8_3.x86_64
Version : #1 SMP Thu Apr 8 19:01:30 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.3.4
numpy : 1.20.3
pytz : 2021.1
dateutil : 2.8.1
pip : 21.3.1
setuptools : 46.4.0
Cython : 0.29.23
pytest : 6.2.4
hypothesis : None
sphinx : None
blosc : None
feather : None

sjvdm added the Bug and Needs Triage labels on Nov 8, 2021
mroeschke added the Groupby label and removed the Needs Triage label on Nov 14, 2021
mroeschke changed the title from "BUG: groupby" to "BUG: GROUPING UNPOPULATED DATAFRAME raises exception - index name clashes with duplicate column name" on Nov 14, 2021
@rhshadrach
Member

Thanks for the report! In both cases the index has the name f1; the difference is in the columns. With an empty frame, apply has no information about which columns should be in the result, and so it just returns all of them.
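
To make the point about the index and the columns concrete, here is a minimal sketch that inspects both results before reset_index is called. The helper name count_f2 is hypothetical and only stands in for the lambda from the report. The exact column list printed for the empty case depends on the pandas version, so no expected output is shown.

```python
import pandas as pd


def count_f2(group):
    # Same aggregation as in the report: number of distinct f2 values in the group.
    return group.loc[:, 'f2'].nunique()


populated = pd.DataFrame(data={'f1': [1, 2], 'f2': [3, 4], 'f3': [5, 6]})
empty = pd.DataFrame(columns=['f1', 'f2', 'f3'])

populated_result = populated.groupby('f1').apply(count_f2)
empty_result = empty.groupby('f1').apply(count_f2)

# The index is named after the grouping key in both cases.
print(populated_result.index.name, empty_result.index.name)

# With data, apply sees one scalar per group and builds a Series.
# Without data, the function is never called, so apply falls back to a DataFrame.
print(type(populated_result).__name__, type(empty_result).__name__)

# Columns of the empty result; on the version from the report these include f1,
# which is what makes reset_index fail.
print(list(empty_result.columns))
```

If f1 appears in the printed column list, calling reset_index on that result would try to insert a second f1 column, which matches the reported ValueError.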

Similar to #41997

> Even when the dataframe is empty, it should not have the groupby key as a column (as with a populated dataframe)

This is not correct.

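import pandas as pd

# Returning each group unchanged keeps the groupby key f1 as a column
df = pd.DataFrame(data={'f1': [1, 2], 'f2': [3, 4], 'f3': [5, 6]})
print(df.groupby('f1').apply(lambda x: x))

# Returning the first row of each group also keeps f1 as a column
df = pd.DataFrame(data={'f1': [1, 1], 'f2': [3, 4], 'f3': [5, 6]})
print(df.groupby('f1').apply(lambda x: x.iloc[0]))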

gives

   f1  f2  f3
0   1   3   5
1   2   4   6

and

0   f1  f2  f3
f1            
1    1   3   5

@rhshadrach
Member

Thanks to #54950, this error can now be avoided by passing include_groups=False to apply. This will be the behavior in the future.

@sjvdm
Author

sjvdm commented Mar 3, 2024

Thanks for the awesome work!
