ENH: Adding pd.options.observed_true_on_all_groupbys #49904

PMLP-novo · 2022-11-25T09:31:30Z

Feature Type

Adding new functionality to pandas
Changing existing functionality in pandas
Removing existing functionality in pandas

Problem Description

After adding observed=True to a lot of my groupbys to avoid memory crashes, I would like to be able to change the default once, at the top of my script.
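To illustrate the problem, here is a minimal sketch (the column names and sizes are made up for illustration). With categorical grouping keys, observed=False produces one group per combination of declared categories, while observed=True keeps only the combinations that occur in the data:

```python
import pandas as pd

# Two categorical columns, each declaring 100 categories,
# but only 3 rows of actual data.
cats = [f"c{i}" for i in range(100)]
df = pd.DataFrame(
    {
        "a": pd.Categorical(["c0", "c1", "c2"], categories=cats),
        "b": pd.Categorical(["c0", "c1", "c2"], categories=cats),
        "x": [1, 2, 3],
    }
)

# observed=False: one result row per combination of categories (100 * 100).
n_all = len(df.groupby(["a", "b"], observed=False)["x"].sum())

# observed=True: only the combinations that actually appear in the data.
n_observed = len(df.groupby(["a", "b"], observed=True)["x"].sum())

print(n_all, n_observed)
```

With more keys or more categories the first result grows multiplicatively, which is where the memory pressure comes from.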

Feature Description

I want to be able to set:

import pandas as pd
pd.options.observed_true_on_all_groupbys

I know this means that the following:
df.groupby("var", observed=False)
would not be respected. But I don't think anybody would want that, and I have tried to make this as clear as possible in the naming.
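Until something like this exists, a user-level workaround could look like the sketch below. The helper name groupby_observed is hypothetical and not part of pandas; unlike the option proposed above, it still respects an explicit observed=False:

```python
import pandas as pd


def groupby_observed(df, by, **kwargs):
    """Hypothetical helper: call df.groupby with observed=True unless the caller overrides it."""
    kwargs.setdefault("observed", True)
    return df.groupby(by, **kwargs)


# Example data: "c" is a declared category that never appears.
df = pd.DataFrame(
    {
        "var": pd.Categorical(["a", "b", "a"], categories=["a", "b", "c"]),
        "x": [1, 2, 3],
    }
)

default_result = groupby_observed(df, "var")["x"].sum()
explicit_result = groupby_observed(df, "var", observed=False)["x"].sum()
```

Here default_result only contains the groups "a" and "b", while explicit_result also contains the unused category "c".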

Alternative Solutions

Implement #43999.
Or emit a warning about memory usage if, for example, more than 100,000,000 buckets are used while any of the grouping variables has fewer than 1,000,000 unique values.
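The warning idea could be sketched roughly as follows. This is not how pandas works internally; the function names, the way the bucket count is estimated (declared categories for categorical keys, distinct values otherwise), and the thresholds as parameters are all assumptions for illustration:

```python
import math
import warnings

import pandas as pd


def estimate_buckets(df, by):
    """Hypothetical estimate of the number of groups a Cartesian groupby would create."""
    sizes = []
    for col in by:
        s = df[col]
        if isinstance(s.dtype, pd.CategoricalDtype):
            sizes.append(len(s.cat.categories))  # declared categories, used or not
        else:
            sizes.append(s.nunique())
    return math.prod(sizes)


def warn_if_many_buckets(df, by, max_buckets=100_000_000, max_unique=1_000_000):
    """Warn when many buckets would be created but some key has few unique values."""
    buckets = estimate_buckets(df, by)
    few_unique = any(df[col].nunique() < max_unique for col in by)
    if buckets > max_buckets and few_unique:
        warnings.warn(
            f"groupby may create about {buckets} groups; consider observed=True",
            stacklevel=2,
        )
        return True
    return False


cats = [f"c{i}" for i in range(100)]
df = pd.DataFrame(
    {
        "a": pd.Categorical(["c0", "c1", "c2"], categories=cats),
        "b": pd.Categorical(["c0", "c1", "c2"], categories=cats),
    }
)

# Small thresholds here so the example triggers without a huge frame.
triggered = warn_if_many_buckets(df, ["a", "b"], max_buckets=1_000, max_unique=10)
```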

Additional Context

A lot of people are facing this problem: https://stackoverflow.com/questions/50051210/avoiding-memory-issues-for-groupby-on-large-pandas-dataframe


PMLP-novo · 2022-11-25T09:39:03Z

An alternative suggestion would be to determine observed at runtime by default. If grouping in the Cartesian way would create more than, let's say, 100,000,000 groups, then we automatically switch to observed=True.
In the code, this would mean having the default observed=None. This solution would be backwards compatible for users who have set observed explicitly.
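A rough sketch of what I mean, written as a standalone function rather than as a change to pandas itself. The name resolve_observed, the max_groups parameter, and the way the group count is estimated are assumptions for illustration:

```python
import math

import pandas as pd


def resolve_observed(df, by, observed=None, max_groups=100_000_000):
    """Hypothetical: pick observed at runtime when the caller leaves it as None."""
    if observed is not None:
        # An explicit True/False from the user is always respected.
        return observed
    n_groups = math.prod(
        len(df[col].cat.categories)
        if isinstance(df[col].dtype, pd.CategoricalDtype)
        else df[col].nunique()
        for col in by
    )
    # Switch to observed=True only when the Cartesian result would be too large.
    return n_groups > max_groups


cats = [f"c{i}" for i in range(100)]
df = pd.DataFrame(
    {
        "a": pd.Categorical(["c0", "c1"], categories=cats),
        "b": pd.Categorical(["c0", "c1"], categories=cats),
        "x": [1, 2],
    }
)

# Low cutoff for the example: 100 * 100 groups exceeds 1,000.
observed = resolve_observed(df, ["a", "b"], max_groups=1_000)
result = df.groupby(["a", "b"], observed=observed)["x"].sum()
```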

rhshadrach · 2022-11-25T22:43:04Z

Thanks for the request. I agree that having the default be observed=False is a bit of a pain point, but I'm -1 on making code df.groupby("var",observed=False) do anything but have observed=False as this would be very counter-intuitive. I'm also -1 on having magic numbers as cutoff points that change the behavior for the same reason.

I think the proper resolution is to change the default to True as in #43999

mroeschke · 2022-11-26T02:16:45Z

Agreed that the proper resolution in #43999 is preferable to having options that side-step behavior. Since we had #35967 previously as an attempt to change this, and with 2.0 being the next release, I think we may be more open to changing the default. Closing.

PMLP-novo added the Enhancement and Needs Triage labels Nov 25, 2022

rhshadrach added the Groupby and Categorical labels Nov 25, 2022

mroeschke closed this as completed Nov 26, 2022
