Deprecate groupby/pivot observed=False default #35967
hmm i thought we had an issue about |
Will see about avoiding First question is, do you want this PR? Does it need discussion? Second question is, what do y'all suggest I do about the warnings coming in the examples? Suppress them? Fix the examples to use the new keyword? |
We don't use pytest.warns; use tm.assert_produces_warning instead. I think the intent of the PR is good, but I haven't looked closely yet. |
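For reference, a minimal sketch of the testing pattern suggested above. `tm` is the pandas-internal `pandas._testing` module (private, so its interface may change between versions), and `deprecated_call` is a hypothetical stand-in for whatever code path would emit the deprecation warning; it is not code from this PR.

```python
import warnings

import pandas._testing as tm  # pandas-internal test helpers (private API)


def deprecated_call():
    # Hypothetical stand-in for a code path that emits the deprecation warning.
    warnings.warn("observed=False is deprecated", FutureWarning, stacklevel=2)
    return 1


# The context manager fails the test if no FutureWarning is raised inside it.
# check_stacklevel=False skips the check on where the warning points.
with tm.assert_produces_warning(FutureWarning, check_stacklevel=False):
    result = deprecated_call()
```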
#30552 is the related issue. I'm unsure about how to proceed here. We've overloaded
But that's also much more work than this PR, so like I said, I'm unsure how to proceed. |
Yeah, that makes sense as a better solution, and I was of two minds about whether to do this, but I struggled to find an example for when I'd ever want In the above example, I think I'd just never use Categoricals but I'd want all of their sugar for my Dictionary-encoded type. |
I started fixing up the doc warnings, but I think there's some more I need to think through with crosstab and |
Yeah, this is basically my position. |
Force-pushed from 56825f3 to ba00edb.
Rebased to get rid of the merge conflict. Not sure why the coverage tests are saying I added untested lines. |
Going to be a bit of (tedious) work to get the tests and doc builds passing on the warnings as errors runs, let me know whether this is likely to get merged, and I'll come back and fix the tests and look at the crosstab stuff. |
I'd say something like 50-80% probability of being accepted? As you say, it's only sometimes where this behavior is desired for (statistical) categorical columns, and it's never desired for the memory-savings purpose. cc @jankatins, since I suspect this goes back to the original categorical implementation, if you have thoughts. |
Yes, the original use case I had in mind was a survey with lots of Likert-like scales: "Strong Disagree ... neutral ... Strong Agree". The original categorical was also built around what R does for factors. All the "problems" started to surface when someone discovered that categoricals save memory and time when dealing with strings. The basic group by was "aggregator(num_col) per cat_column_y" and it should produce the same structure (ordering, number of rows) no matter if cat_column_y contained all values or not (so NA/0), e.g. to get nice plots which look structurally similar in a report. For the same reason I would guess it was decided that group bys with two cat columns should show all combinations. Categoricals' defaults are (or at least were at the beginning) all geared towards that use case. If stuff like this (and there were already others) takes over, it makes sense to simply rename it to something like "DictEncodedArrayBase" and add a new Categorical and a "StringDictEncodedArray" on top of it... :-) |
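To make the survey use case concrete, here is a small sketch with invented data (the column names `answer` and `n` are assumptions for illustration only). With `observed=False`, the result keeps every declared category in order, even those with no responses, which is the stable structure described above.

```python
import pandas as pd

# Invented Likert-style data; only "Neutral" and "Agree" actually occur.
levels = ["Strong Disagree", "Disagree", "Neutral", "Agree", "Strong Agree"]
answers = pd.Categorical(["Agree", "Neutral", "Agree"], categories=levels, ordered=True)
df = pd.DataFrame({"answer": answers, "n": [1, 1, 1]})

# observed=False keeps all five categories, filling unobserved ones with 0.
counts = df.groupby("answer", observed=False)["n"].sum()
print(counts)
```

Two reports built from different samples would then share the same index, so their plots line up.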
Just want to gently push back on the (maybe perceived) notion that this is somehow an abuse of categoricals. High-cardinality, non-independent categoricals/factors are definitely a thing, and the default can not only explode memory but also give nonsensical answers. I think the departure from (the flavors of) SQL(s I use) was more unexpected for me. Buut, as I mentioned in the issue thread, I've definitely wanted both behaviors of this just today. Is there some design philosophy here to fall back on for guidance (like prefer standard SQL semantics or refuse the temptation to guess or something)? I'm not sure just a type with sensible defaults is going to solve the issue for me. E.g., I want the All this said, now that I know and am thinking about these things I think I'll always have to specify observed. Part of me just wants to raise an error if |
For me this feels a bit like the "stringsAsFactors" saga on R: https://simplystatistics.org/2015/07/24/stringsasfactors-an-unauthorized-biography/ :-) |
Ha, indeed. Now you've got me rethinking everything. |
Been thinking about this a bit, given the comments and use cases. I have a proposal that may or may not be a good one. What about making the default None and, in the presence of a categorical, raising an error if the default isn't changed, with a message that makes users choose True or False? This would be noisy but would avoid the temptation to guess and whatever "my way is the most typical" bias that could creep in. It wouldn't force a "stringsAsFactors" situaish, just an "ugh SettingWithCopyWarning" situaish (which I can live with). It also could be temporary until there's another extension dtype ready that's more appropriate for situations where your strings aren't factors (or your factors aren't independent). Thoughts? |
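A rough sketch of what this proposal could look like from the user's side. `groupby_strict` is a hypothetical wrapper written only for illustration; it is not pandas API and not the implementation in this PR, and the error message is invented.

```python
import pandas as pd


def groupby_strict(df, keys, observed=None):
    """Hypothetical wrapper: refuse to guess `observed` for categorical keys."""
    keys = [keys] if isinstance(keys, str) else list(keys)
    has_categorical = any(
        isinstance(df[k].dtype, pd.CategoricalDtype) for k in keys
    )
    if has_categorical and observed is None:
        raise ValueError(
            "Grouping by a categorical: pass observed=True or observed=False."
        )
    # With no categorical keys, `observed` has no effect on the result.
    return df.groupby(keys, observed=bool(observed))


df = pd.DataFrame(
    {
        "city": pd.Categorical(["Austin", "Austin", "Boston"]),
        "zip": pd.Categorical(["78701", "78702", "02108"]),
        "sales": [10, 20, 30],
    }
)

try:
    groupby_strict(df, "city")
    raised = False
except ValueError:
    raised = True  # the caller is forced to choose explicitly

totals = groupby_strict(df, ["city", "zip"], observed=True)["sales"].sum()
```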
This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this. |
That sounds fine to me (with the default |
@jseabold yeah, sorry, haven't gotten to this, but if you can merge master and implement @TomAugspurger's suggestion, that would be great. |
Yeah, sounds good. |
@jseabold if you have a chance to merge master and fix this up |
Force-pushed from ffb3244 to faf8b70.
Force-pushed from cedf7b1 to 739833a.
I must have some global black config that conflicts with what y'all check with. Every time I save a file, it blackens it and I run into a conflict I need to fix. |
Force-pushed from 739833a to 3b59b23.
Force-pushed from 3b59b23 to 8526064.
Regarding the future behaviour we want: always raising whenever you have a categorical in your groupby keys might also not be such a great user experience. Thinking about some other possible alternatives:
|
Ha, yeah... I don't disagree about the UX. Everything about this smells - type dependent keywords, ambiguous desired behavior... Raising seems to be the only thing that refuses the temptation to guess though. In terms of lesser evils, it's good to have a guiding principle. Maybe like I kind of figured Tom would have new types done by the time that this comes down to actually raising an error, though I haven't really thought through how all of that will go. |
This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this. |
This needs a decision. I'm going to put off merge conflicts chores until then. |
Just blew my computer up again. Unstaling this PR. |
Thanks for the PR @jseabold, but it appears that this discussion and PR have stalled. The issue should be addressed in some way, but it may be better for discussion on the path forward to resume in #30552 first. Closing, but we can reopen this PR if this is the path the core devs decide on. |
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
A groupby sum on a relatively small 70k data frame blew up on me today, and this was the reason. I had something like zip codes and cities as categoricals and expected a SQL-like groupby, but instead got a cartesian product of 'cities' and 'zips'. Sounds like there was some previous desire to explore a new default.
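A toy reproduction of the behavior described above, using invented data (the column names and values are assumptions, not the original frame). With two categorical keys, `observed=False` yields one row per combination of categories, while `observed=True` yields only the combinations that occur in the data.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "city": pd.Categorical(["Austin", "Austin", "Boston"]),
        "zip": pd.Categorical(["78701", "78702", "02108"]),
        "sales": [10, 20, 30],
    }
)

# observed=False: cartesian product of categories (2 cities x 3 zips).
full = df.groupby(["city", "zip"], observed=False)["sales"].sum()

# observed=True: only the (city, zip) pairs present in the data, as in SQL.
seen = df.groupby(["city", "zip"], observed=True)["sales"].sum()

print(len(full), len(seen))  # 6 3
```

With high-cardinality keys the first result grows as the product of the category counts, which is how a modest frame can exhaust memory.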
Didn't try to do any wild stuff to keep up with the stacklevel depending on where this was called from.