Deprecate groupby/pivot observed=False default #35967

jseabold · 2020-08-28T21:52:59Z

- Relates to groupby with categorical type returns all combinations #17594
- Closes PERF: groupby with many empty groups memory blowup #30552
- tests added / passed
- passes black pandas
- passes git diff upstream/master -u -- "*.py" | flake8 --diff
- whatsnew entry

Had a relatively small 70k data frame that I was trying to do a groupby sum on blow up on me today. This was the reason. I had something like zip codes and cities as categoricals, expected SQL-like groupby but instead got a cartesian product of 'cities' and 'zips'. Sounds like there was some previous desire to explore a new default.
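For illustration, a minimal sketch of the behavior described above. The column names and values are made up, and `observed` is passed explicitly so the result does not depend on the installed pandas version's default.

```python
import pandas as pd

# Made-up data: each zip code belongs to exactly one city.
df = pd.DataFrame(
    {
        "city": pd.Categorical(["Austin", "Austin", "Boston"]),
        "zip": pd.Categorical(["78701", "78702", "02108"]),
        "sales": [1, 2, 3],
    }
)

# observed=False keeps every city x zip combination, including ones never seen.
all_combos = df.groupby(["city", "zip"], observed=False)["sales"].sum()

# observed=True keeps only the combinations present in the data (SQL-like).
seen_only = df.groupby(["city", "zip"], observed=True)["sales"].sum()

print(len(all_combos), len(seen_only))
```

With 2 cities and 3 zip codes the first result has 6 rows and the second has 3; with thousands of categories per key, the gap grows multiplicatively.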

Didn't try to do any wild stuff to keep up with the stacklevel depending on where this was called from.

jreback · 2020-08-28T22:27:19Z

hmm i thought we had an issue about

jseabold · 2020-08-30T15:10:06Z

Will see about avoiding pytest.warns though the CI stuff doesn't make it real clear why. I think the included warnings assertions didn't like asserting there wasn't a warning but I'll check again.

First question is, do you want this PR? Does it need discussion?

Second question is, what do y'all suggest I do about the warnings coming in the examples? Suppress them? Fix the examples to use the new keyword?

jreback · 2020-08-30T21:34:04Z

we don't use pytest.warns; instead use tm.assert_produces_warning
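A rough sketch of what that looks like in a test, assuming `tm` is the usual alias for `pandas._testing` (a private module used by the pandas test suite, so details may differ between versions). The `old_default` function is a hypothetical stand-in, not code from this PR.

```python
import warnings

import pandas._testing as tm


def old_default():
    # Hypothetical stand-in for a code path that emits the deprecation.
    warnings.warn("the default of observed will change", FutureWarning)
    return "ok"


# Assert that the warning is emitted. The stacklevel check is disabled here
# because this toy function makes no attempt to set a meaningful stacklevel.
with tm.assert_produces_warning(FutureWarning, check_stacklevel=False):
    result = old_default()

# Passing None asserts that no warning is emitted inside the block.
with tm.assert_produces_warning(None):
    quiet = "ok"
```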

i think the intent of the PR is good - haven't looked closely yet.

TomAugspurger · 2020-08-31T16:08:13Z

#30552 is the related issue.

I'm unsure about how to proceed here. We've overloaded Categorical for two purposes: 1: a statistical concept for the fixed set of categories, and 2: the memory-savings for low-cardinality data. I think that observed=False is a good default for Categoricals and a bad default for memory-saving. I think my ideal outcome is to

1. Implement a dtype that just does the dictionary encoding (API: Add Dictionary-encoded Extension Type #20899).
2. Change the default observed=False to observed=None to mean "ask the extension type for its default." Then Categorical can implement observed=False by default and the Dictionary-encoded type can have observed=True.

But that's also much more work than this PR, so like I said, I'm unsure how to proceed.
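To make the observed=None idea concrete, here is a toy sketch in plain Python. The class names, the `default_observed` attribute, and `resolve_observed` are all hypothetical; nothing like this exists in pandas, and how multiple groupers would be combined is left open.

```python
from typing import Optional


class CategoricalLike:
    """Hypothetical stand-in for the statistical Categorical dtype."""

    default_observed = False


class DictEncodedLike:
    """Hypothetical stand-in for a plain dictionary-encoded dtype."""

    default_observed = True


def resolve_observed(observed: Optional[bool], dtype) -> bool:
    # An explicit True/False from the caller always wins.
    if observed is not None:
        return observed
    # observed=None means "ask the extension type for its default".
    return dtype.default_observed


print(resolve_observed(None, CategoricalLike()))
print(resolve_observed(None, DictEncodedLike()))
```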

jseabold · 2020-08-31T17:47:29Z

Yeah, that makes sense as a better solution, and I was of two minds about whether to do this, but I struggled to find an example for when I'd ever want observed=False. Do you have one in mind? (Edit: see your example of survey responses. Let me think through that.). If I really wanted a cartesian product of things I don't actually have in my data, I don't think I'd reach for groupby to do that.

In the above example, I think I'd just never use Categoricals but I'd want all of their sugar for my Dictionary-encoded type.

jseabold · 2020-08-31T17:50:09Z

I started fixing up the doc warnings, but I think there's some more I need to think through with crosstab and dropna.

jseabold · 2020-08-31T17:52:50Z

Yeah, this is basically my position.

#30552 (comment)

jseabold · 2020-08-31T20:10:42Z

Rebased to get rid of the merge conflict. Not sure why the coverage tests are saying I added untested lines.

jseabold · 2020-09-01T15:00:34Z

Going to be a bit of (tedious) work to get the tests and doc builds passing on the warnings as errors runs, let me know whether this is likely to get merged, and I'll come back and fix the tests and look at the crosstab stuff.

TomAugspurger · 2020-09-01T16:12:58Z

I'd say something like 50-80% probability of being accepted? As you say, it's only sometimes where this behavior is desired for (statistical) categorical columns, and it's never desired for the memory-savings purpose.

cc @jankatins, since I suspect this goes back to the original categorical implementation, if you have thoughts.

jankatins · 2020-09-01T19:09:51Z

Yes, the original use case I had in mind was a survey with lots of Likert-like scales: "Strong Disagree ... neutral ... Strong Agree". The original categorical was also built around what R does for factors. All the "problems" started to surface when someone discovered that categoricals save memory and time when dealing with strings.

The basic group by was "aggregator(num_col) per cat_column_y" and it should produce the same structure (ordering, number of rows) no matter whether cat_column_y contained all values or not (so NA/0), e.g. to get nice plots which look structurally similar in a report. For the same reason I would guess it was decided that group bys with two cat columns should show all combinations.
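A small sketch of that use case, with a made-up five-point scale and made-up answers. `observed=False` is passed explicitly so the example does not rely on the default.

```python
import pandas as pd

# Made-up five-point scale; only two of the five levels occur in the data.
scale = ["Strongly Disagree", "Disagree", "Neutral", "Agree", "Strongly Agree"]
answers = pd.Categorical(["Agree", "Neutral", "Agree"], categories=scale, ordered=True)
df = pd.DataFrame({"answer": answers, "n": [1, 1, 1]})

# observed=False keeps all five levels, in scale order, filling unseen ones with 0.
counts = df.groupby("answer", observed=False)["n"].sum()

print(counts)
```

The result has the same five rows regardless of which answers happened to occur, which is what keeps plots across survey questions structurally identical.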

Categorical's defaults are (or at least were at the beginning) all geared towards that use case. If stuff like this (and there were already others) takes over, it makes sense to simply rename it to something like "DictEncodedArrayBase" and add a new Categorical and a "StringDictEncodedArray" on top of it... :-)

jseabold · 2020-09-01T20:30:06Z

> All the "problems" started to surface when someone discovered that categoricals save memory and time when dealing with strings.

Just want to gently push back on the (maybe perceived) notion that this is somehow an abuse of categoricals. High-cardinality, non-independent categoricals/factors are definitely a thing, and the default can not only explode memory but also give nonsensical answers.

I think the departure from (the flavors of) SQL(s I use) was more unexpected for me.

Buut, as I mentioned in the issue thread, I've definitely wanted both behaviors of this just today. Is there some design philosophy here to fall back on for guidance (like prefer standard SQL semantics or refuse the temptation to guess or something)? I'm not sure just a type with sensible defaults is going to solve the issue for me. E.g., I want the observed=True behavior for two independent factors but not for two dependent factors and only when they'd both be included in a groupby. It's the presence of another one that leads me to want the behavior.

All this said, now that I know and am thinking about these things I think I'll always have to specify observed. Part of me just wants to raise an error if observed=None and force folks to think, but that'd be a pretty bad UX.

jankatins · 2020-09-01T22:47:26Z

For me this feels a bit like the "stringsAsFactors" saga on R: https://simplystatistics.org/2015/07/24/stringsasfactors-an-unauthorized-biography/ :-)

jseabold · 2020-09-01T23:06:07Z

> For me this feels a bit like the "stringsAsFactors" saga on R

Ha, indeed. Now you've got me rethinking everything.

jseabold · 2020-09-16T21:53:18Z

Been thinking about this a bit, given the comments and use cases. I have a proposal that may or may not be a good one.

What about making the default None and, in the presence of a categorical, if the default isn't changed, raising an error with a message that makes users choose True or False?

This would be noisy but would avoid the temptation to guess and whatever "my way is the most typical" bias that could creep in. It wouldn't force a "stringsAsFactors" situaish just an "ugh SettingWithCopyWarning" situaish (which I can live with). It also could be temporary until there's another extension dtype ready that's more appropriate for situations where your strings aren't factors (or your factors aren't independent).

Thoughts?

github-actions · 2020-10-17T00:15:58Z

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

TomAugspurger · 2020-10-20T16:00:49Z

> What about making the default None and in the presence of a categorical, if the default isn't changed it raises an error with a message that makes users choose True or False.

That sounds fine to me (with the default None being a warning for now, saying that an exception will be raised in the future).
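Roughly, the warn-now, raise-later idea could look like the sketch below. `check_observed` is a hypothetical helper written for illustration, not the code in this PR and not pandas API; the warning text is also made up.

```python
import warnings

import pandas as pd


def check_observed(df, keys, observed=None):
    """Hypothetical helper: resolve observed, warning if the caller left it unset."""
    if observed is not None:
        return observed
    if any(isinstance(df[key].dtype, pd.CategoricalDtype) for key in keys):
        warnings.warn(
            "Grouping by a categorical without passing observed=True or "
            "observed=False; this will raise in the future.",
            FutureWarning,
            stacklevel=2,
        )
    # Fall back to the existing default (observed=False) for now.
    return False


df = pd.DataFrame({"city": pd.Categorical(["Austin", "Boston"]), "sales": [1, 2]})

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    implicit = check_observed(df, ["city"])        # warns, returns False
    explicit = check_observed(df, ["city"], True)  # silent, returns True
    plain = check_observed(df, ["sales"])          # no categorical key, silent
```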

jreback · 2020-10-20T23:15:11Z

@jseabold yeah sorry haven't gotten to this, but if you can merge master and implement @TomAugspurger's suggestion would be great.

jseabold · 2020-10-20T23:38:49Z

Yeah, sounds good.

jreback · 2020-11-26T19:04:02Z

@jseabold if you have a chance to merge master and fix this up

jseabold · 2020-12-07T19:11:42Z

I must have some global black config that conflicts with what y'all check with. Every time I save a file, it blackens it and I run into a conflict I need to fix.

jorisvandenbossche · 2020-12-07T20:41:21Z

Regarding the future behaviour we want: always raising whenever you have a categorical in your groupby keys also might not be such a great user experience.

Thinking about some other possible alternatives:

- Keep the default of observed=False for a single grouper, but deprecate + eventually change to observed=True for multiple groupers
  - This is basically the behaviour as it was before the observed keyword was introduced. And the memory issues (cartesian product combinatorial explosion) are only an issue with multiple groupers, I think?
  - On the other hand, having different default behaviour for this depending on a single vs multiple grouper might also be confusing / surprising, and gives inconsistent behaviour.
- Detect if there are unobserved categories, and only raise an error if observed is not explicitly specified in that case
  - This would ensure that for a decent group of use cases we don't annoy users with warnings or errors
  - On the other hand, different behaviour depending on the actual values in the column is also not great (e.g. it could work fine on a test dataset, but then start failing on a different dataset if categories are missing)
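As a sketch of what "detect if there are unobserved categories" might mean for a single key, using only the public `.cat` accessor. The helper name is hypothetical, and this says nothing about how pandas would do the check internally.

```python
import pandas as pd


def has_unobserved_categories(values: pd.Series) -> bool:
    """Hypothetical check: True if any declared category never appears."""
    declared = len(values.cat.categories)
    used = len(values.cat.remove_unused_categories().cat.categories)
    return used < declared


# Same dtype, different data: only the second series has an unused category.
full = pd.Series(pd.Categorical(["a", "b"], categories=["a", "b"]))
sparse = pd.Series(pd.Categorical(["a", "a"], categories=["a", "b"]))

print(has_unobserved_categories(full), has_unobserved_categories(sparse))
```

The two series share a dtype but give different answers, which is the data-dependence concern raised in the last bullet.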

jseabold · 2020-12-07T22:23:05Z

Ha, yeah... I don't disagree about the UX. Everything about this smells - type dependent keywords, ambiguous desired behavior... Raising seems to be the only thing that refuses the temptation to guess though. In terms of lesser evils, it's good to have a guiding principle.

Maybe like ascending=False, observed=True will just have to be a thing I type many times / day.

I kind of figured Tom would have new types done by the time that this comes down to actually raising an error, though I haven't really thought through how all of that will go.

github-actions · 2021-01-07T00:31:00Z

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

jseabold · 2021-01-07T16:49:04Z

This needs a decision. I'm going to put off merge conflicts chores until then.

jseabold · 2021-02-12T00:58:05Z

Just blew my computer up again. Unstaling this PR.

mroeschke · 2021-07-11T20:27:45Z

Thanks for the PR @jseabold, but it appears that this discussion and PR have stalled. It appears that this issue should be addressed in some way, but it may be better for discussion to resume on the path forward in #30552 first. Closing, but we can reopen this PR if this is the path the core devs decide on.

jseabold changed the title ~~Depr observed default~~ Deprecate groupby/pivot observed=False default Aug 28, 2020

jseabold force-pushed the depr-observed-default branch from 56825f3 to ba00edb Compare August 31, 2020 19:02

ant1j mentioned this pull request Sep 28, 2020

Categorical in GroupBy with aggregations raise error under specific conditions #36698

Closed

github-actions bot added the Stale label Oct 17, 2020

TomAugspurger mentioned this pull request Nov 18, 2020

Feature Request: add observed- keyword to groupby dask/dask#4371

Closed

jreback added the Categorical (Categorical Data Type) and Deprecate (Functionality to remove in pandas) labels and removed the Stale label Nov 26, 2020

jseabold force-pushed the depr-observed-default branch 2 times, most recently from ffb3244 to faf8b70 Compare November 30, 2020 16:34

jseabold added 10 commits December 7, 2020 09:29

Expect FutureWarning

394fe5b

crosstab doesn't have observed keyword

be6d3c1

Filter fewer warnings

6cca8c0

Hard code observed behavior to silence warning. See pandas-dev#35967

0864317

PR in comment

029edd0

Hardcode vs. filtering warning

5d15dd1

Blacken

fb6e4b0

Silence deprecation warning.

363865b

Hard code default behavior. See pandas-dev#35967

d4d918b

Hardcode default categorical behavior. See pandas-dev#35967

57d99a7

jseabold force-pushed the depr-observed-default branch from cedf7b1 to 739833a Compare December 7, 2020 18:44

jseabold force-pushed the depr-observed-default branch from 739833a to 3b59b23 Compare December 7, 2020 19:17

Will raise in the future

8526064

jseabold force-pushed the depr-observed-default branch from 3b59b23 to 8526064 Compare December 7, 2020 19:55

github-actions bot added the Stale label Jan 7, 2021

mroeschke added the Needs Discussion (Requires discussion from core team before further action) label Apr 2, 2021

mroeschke closed this Jul 11, 2021

jseabold mentioned this pull request Sep 10, 2021

ENH: Add observed keyword to value_counts #43498

Closed

jreback mentioned this pull request Oct 13, 2021

DEPR: Change default to observed=True in DataFrame.groupby #43999

Closed

Liam3851 mentioned this pull request Dec 30, 2021

BUG: group by with categorical columns causes an exception #45128

Closed

mroeschke mentioned this pull request Nov 26, 2022

ENH: Adding pd.options.observed_true_on_all_groupbys #49904

Closed

datapythonista mentioned this pull request Jan 5, 2023

DEPR: Enforce 2.0 deprecations #50579

Closed

jseabold mentioned this pull request Mar 7, 2023

PERF: groupby with many empty groups memory blowup #30552

Closed
