Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deprecate groupby/pivot observed=False default #35967

Closed
wants to merge 28 commits into from

Conversation

jseabold
Copy link
Contributor

@jseabold jseabold commented Aug 28, 2020

Had a relatively small 70k data frame that I was trying to do a groupby sum on blow up on me today. This was the reason. I had something like zip codes and cities as categoricals, expected SQL-like groupby but instead got a cartesian product of 'cities' and 'zips'. Sounds like there was some previous desire to explore a new default.

Didn't try to do any wild stuff to keep up with the stacklevel depending on where this was called from.

@jseabold jseabold changed the title Depr observed default Deprecate groupby/pivot observed=False default Aug 28, 2020
@jreback
Copy link
Contributor

jreback commented Aug 28, 2020

hmm i thought we had an issue about

@jseabold
Copy link
Contributor Author

Will see about avoiding pytest.warns though the CI stuff doesn't make it real clear why. I think the included warnings assertions didn't like asserting there wasn't a warning but I'll check again.

First question is, do you want this PR? Does it need discussion?

Second question is, what do y'all suggest I do about the warnings coming in the examples? Suppress them? Fix the examples to use use the new keyword?

@jreback
Copy link
Contributor

jreback commented Aug 30, 2020

we don't use pytest.warn instead use tm.assert_produces_warning

i think the intent of the PR is good - haven't looked closely yet.

@TomAugspurger
Copy link
Contributor

#30552 is the related issue.

I'm unsure about how to proceed here. We've overloaded Categorical for two purposes: 1: a statistical concept for the fixed set of categories, and 2: the memory-savings for low-cardinality data. I think that observed=True is a good default for Categoricals and a bad default for memory-saving. I think my ideal outcome is to

  1. implement a dtype that just does the dictionary encoding (API: Add Dictionary-encoded Extension Type #20899).
  2. Change the default observed=False to observed=None to mean "ask the extension type for its default." Then Categorical can implement observed=False by default and the Dictionary-encoded type can have observed=True.

But that's also much more work that this PR, so like I said, I'm unsure how to proceed.

@jseabold
Copy link
Contributor Author

jseabold commented Aug 31, 2020

Yeah, that makes sense as a better solution, and I was of two minds about whether to do this, but I struggled to find an example for when I'd ever want observed=False. Do you have one in mind? (Edit: see your example of survey responses. Let me think through that.). If I really wanted a cartesian product of things I don't actually have in my data, I don't think I'd reach for groupby to do that.

In the above example, I think I'd just never use Categoricals but I'd want all of their sugar for my Dictionary-encoded type.

@jseabold
Copy link
Contributor Author

I started fixing up the doc warnings, but I think there's some more I need to think through with crosstab and dropna.

@jseabold
Copy link
Contributor Author

Yeah, this is basically my position.

#30552 (comment)

@jseabold jseabold force-pushed the depr-observed-default branch from 56825f3 to ba00edb Compare August 31, 2020 19:02
@jseabold
Copy link
Contributor Author

Rebased to get rid of the merge conflict. Not sure why the coverage tests are saying I added untested lines.

@jseabold
Copy link
Contributor Author

jseabold commented Sep 1, 2020

Going to be a bit of (tedious) work to get the tests and doc builds passing on the warnings as errors runs, let me know whether this is likely to get merged, and I'll come back and fix the tests and look at the crosstab stuff.

@TomAugspurger
Copy link
Contributor

I'd say something like 50-80% probability of being accepted? As you say, it's only sometimes where this behavior is desired for (statistical) categorical columns, and it's never desired for the memory-savings purpose.

cc @jankatins, since I suspect this goes back to the original categorical implementation, if you have thoughts.

@jankatins
Copy link
Contributor

jankatins commented Sep 1, 2020

Yes, the original usecase I had in mind was a survey with lots of likert like scales: "Strong Disagree ... neutral ... Strong Agree". The original categorical was also build around what R does for factors. All the "problems" started o surface when someone discovered that categoricals save memory and time when dealing with strings.

The basic group by was "aggregator(num_col) per cat_column_y" and it should produce the same structure (ordering,number of rows) no matter if cat_column_y contained all values or not (so NA/0), e.g. to get nice plots which look structural similar in a report. For the same reason I would guess it was decided that group bys with two cat columns should show all combinations.

Categoricals defaults are (or at least were at the beginning) all geared towards that usecase. If stuff like this (and there were already others) take over it makes sense to simply rename it to something like "DictEncodedArrayBase" and add a new Categorical and a "StringDictEncodedArray" on top of it... :-)

@jseabold
Copy link
Contributor Author

jseabold commented Sep 1, 2020

All the "problems" started o surface when someone discovered that categoricals save memory and time when dealing with strings.

Just want to gently push back on the (maybe perceived) notion that this is somehow an abuse of categoricals. High-cardinality, non-independent categoricals/factors are definitely a thing, and the default can not only explode memory but also gives non-sensical answers.

I think the departure from (the flavors of) SQL(s I use) was more unexpected for me.

Buut, as I mentioned in the issue thread, I've definitely wanted both behaviors of this just today. Is there some design philosophy here to fallback to for guidance (like prefer standard SQL semantics or refuse the temptation to guess or something)? I'm not sure just a type with sensible defaults is going to solve the issue for me. E.g., I want the observed=True behavior for two independent factors but not for two dependent factors and only when they'd both be included in a groupby. It's the presence of another one that leads me to want the behavior.

All this said, now that I know and am thinking about these things I think I'll always have to specify observed. Part of me just wants to raise an error if observed=None and force folks to think, but that'd be a pretty bad UX.

@jankatins
Copy link
Contributor

For me this feels a bit like the "stringsAsFactors" saga on R: https://simplystatistics.org/2015/07/24/stringsasfactors-an-unauthorized-biography/ :-)

@jseabold
Copy link
Contributor Author

jseabold commented Sep 1, 2020

For me this feels a bit like the "stringsAsFactors" saga on R

Ha, indeed. Now you've got me rethinking everything.

@jseabold
Copy link
Contributor Author

Been thinking about this a bit, given the comments and use cases. I have a proposal that may or may not be a good one.

What about making the default None and in the presence of a categorical, if the default isn't changed it raises an error with a message that makes users choose True or False.

This would be noisy but would avoid the temptation to guess and whatever "my way is the most typical" bias that could creep in. It wouldn't force a "stringsAsFactors" situaish just an "ugh SettingWithCopyWarning" situaish (which I can live with). It also could be temporary until there's another extension dtype ready that's more appropriate for situations where your strings aren't factors (or your factors aren't independent).

Thoughts?

@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Oct 17, 2020
@TomAugspurger
Copy link
Contributor

What about making the default None and in the presence of a categorical, if the default isn't changed it raises an error with a message that makes users choose True or False.

That sounds fine to me (with the default None being a warning for now, saying that an exception will be raised in the future).

@jreback
Copy link
Contributor

jreback commented Oct 20, 2020

@jseabold yeah sorry haven't gotten to this, but if you can merge master and implement @TomAugspurger suggestion would be great.

@jseabold
Copy link
Contributor Author

Yeah, sounds good.

@jreback
Copy link
Contributor

jreback commented Nov 26, 2020

@jseabold if you have a chance to merge master and fix this up

@jreback jreback added Categorical Categorical Data Type Deprecate Functionality to remove in pandas and removed Stale labels Nov 26, 2020
@jseabold jseabold force-pushed the depr-observed-default branch 2 times, most recently from ffb3244 to faf8b70 Compare November 30, 2020 16:34
@jseabold jseabold force-pushed the depr-observed-default branch from cedf7b1 to 739833a Compare December 7, 2020 18:44
@jseabold
Copy link
Contributor Author

jseabold commented Dec 7, 2020

I must have some global black config that conflicts with what y'all check with. Every time I save a file, it blackens it and I run into a conflict I need to fix.

@jseabold jseabold force-pushed the depr-observed-default branch from 739833a to 3b59b23 Compare December 7, 2020 19:17
@jseabold jseabold force-pushed the depr-observed-default branch from 3b59b23 to 8526064 Compare December 7, 2020 19:55
@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Dec 7, 2020

Regarding the future behaviour we want: always raising whenever you have a categorical in your groupby keys also might not be such a great user experience ..

Thinking about some other possible alternatives:

  • Keep the default of observed=False for a single grouper, but deprecate + eventually change to observed=True for multiple groupers
    • This is basically the behaviour as it was before the observed keyword was introduced. And the memory issues (cartesian product combinatory explosion) only are an issue with multiple groupers, I think?
    • On the other hand, having different default behaviour for this depending on a single vs multiple grouper might also be confusing / surprising, and gives inconsistent behaviour ..
  • Detect if there are unobserved categories, and only raise an error if observed is not explicitly specified in that case
    • This would ensure that for a decent group of use cases we don't annoy users with warnings or errors
    • On the other hand, different behaviour depending on the actual values in the column is also not great (eg it could work fine on a test dataset, but then start failing on a different dataset if categories are missing)

@jseabold
Copy link
Contributor Author

jseabold commented Dec 7, 2020

Ha, yeah... I don't disagree about the UX. Everything about this smells - type dependent keywords, ambiguous desired behavior... Raising seems to be the only thing that refuses the temptation to guess though. In terms of lesser evils, it's good to have a guiding principle.

Maybe like ascending=False, observed=True will just have to be a thing I type many times / day.

I kind of figured Tom would have new types done by the time that this comes down to actually raising an error, though I haven't really thought through how all of that will go.

@github-actions
Copy link
Contributor

github-actions bot commented Jan 7, 2021

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Jan 7, 2021
@jseabold
Copy link
Contributor Author

jseabold commented Jan 7, 2021

This needs a decision. I'm going to put off merge conflicts chores until then.

@jseabold
Copy link
Contributor Author

Just blew my computer up again. Unstaling this PR.

@mroeschke mroeschke added the Needs Discussion Requires discussion from core team before further action label Apr 2, 2021
@mroeschke
Copy link
Member

Thanks for the PR @jseabold but it appears that this discussion and PR has sufficiently stalled. It appears that this issue should be addressed in a way but may be better for discussion to resume on the path forward in #30552 first. Closing but we can reopen this PR if this is the path the core devs decide on

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Categorical Categorical Data Type Deprecate Functionality to remove in pandas Needs Discussion Requires discussion from core team before further action Stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PERF: groupby with many empty groups memory blowup
7 participants