BUG: Fix unexpected sort in groupby #17621
Codecov Report
@@ Coverage Diff @@
## master #17621 +/- ##
==========================================
- Coverage 91.2% 91.16% -0.04%
==========================================
Files 163 163
Lines 49637 49643 +6
==========================================
- Hits 45269 45259 -10
- Misses 4368 4384 +16
Codecov Report
@@ Coverage Diff @@
## master #17621 +/- ##
==========================================
- Coverage 91.27% 91.21% -0.06%
==========================================
Files 163 163
Lines 49765 49770 +5
==========================================
- Hits 45421 45399 -22
- Misses 4344 4371 +27
pandas/core/groupby.py
Outdated
@@ -2613,6 +2613,13 @@ def _get_grouper(obj, key=None, axis=0, level=None, sort=True,

        level = None
        key = group_axis
    elif key is None:
I hate adding logic here; this function is already impenetrable. Can you incorporate this into the existing logic?
assert_frame_equal(result0, expected0)
assert_frame_equal(result1, expected1)

# axis=1

result0 = frame.T.groupby(level=0, axis=1).sum()
result1 = frame.T.groupby(level=1, axis=1).sum()
for a couple of these that you changed can you also add the sort=True case (maybe parametrize on sort=)?
pls add a whatsnew note
Hello @Licht-T! Thanks for updating the PR. Cheers! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on September 29, 2017 at 11:42 Hours UTC
576a6cc to 5567ac1
@jreback Thanks for your review.
pandas/core/groupby.py
Outdated
@@ -2626,6 +2626,14 @@ def _get_grouper(obj, key=None, axis=0, level=None, sort=True,
    elif isinstance(key, BaseGrouper):
        return key, [], obj

    if key is None and isinstance(group_axis, MultiIndex):
can you simplify this. e.g. maybe put this coercion higher (before the giant if/then)
pandas/tests/groupby/test_groupby.py
Outdated
@@ -1791,18 +1791,19 @@ def aggfun(ser):
        agged2 = df.groupby(keys).aggregate(aggfun)
        assert len(agged2.columns) + 1 == len(df.columns)

    def test_groupby_level(self):
    @pytest.mark.parametrize('sort', [True, False])
    def test_groupby_level(self, sort):
        frame = self.mframe
add the issue number here as well as a comment
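The parametrization requested in this thread can be sketched as a standalone test. This is only an illustration, not the test that was merged: the frame, the function name `test_groupby_single_level`, and the expected values are made up here, whereas the PR's own test uses the shared `self.mframe` fixture. The issue number in the comment is the one cited in the whatsnew entry.

```python
import pandas as pd
import pytest


@pytest.mark.parametrize("sort", [True, False])
def test_groupby_single_level(sort):
    # GH 17537: selecting a single MultiIndex level must respect ``sort``.
    # Hypothetical data with first-level labels out of sorted order.
    index = pd.MultiIndex.from_tuples(
        [("b", 1), ("a", 1), ("b", 2), ("a", 2)], names=["first", "second"]
    )
    frame = pd.DataFrame({"x": [1, 2, 3, 4]}, index=index)

    result = frame.groupby(level="first", sort=sort).sum()

    # sort=True orders groups by label; sort=False keeps first-appearance order.
    expected_index = ["a", "b"] if sort else ["b", "a"]
    expected_values = [6, 4] if sort else [4, 6]
    assert list(result.index) == expected_index
    assert result["x"].tolist() == expected_values
```

Parametrizing on `sort` keeps one test body for both cases, so a regression in either ordering fails under its own test ID.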
@jreback Thank you for the comments. Fixed.
pandas/core/groupby.py
Outdated
@@ -2586,6 +2586,15 @@ def _get_grouper(obj, key=None, axis=0, level=None, sort=True,
    """
    group_axis = obj._get_axis(axis)

    if key is None and level is not None and \
            isinstance(group_axis, MultiIndex):
so maybe move this down a bit (under level is not None). I don't want this to be a bespoke condition. I think you can remove the isinstance check of MultiIndex.
pandas/core/groupby.py
Outdated
            level = level[0]

        if is_scalar(level):
            key = group_axis.get_level_values(level)
put a comment on what is going on here. maybe we can incorporate this below as well.
I am trying to remove as many special cases as possible.
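As a rough illustration of what the hunk above does, the sketch below uses only the public API: for a scalar level, the level's per-row labels from `get_level_values` serve as the grouping key. The frame and names are made up for illustration, and this is not the internal `_get_grouper` code path itself.

```python
import pandas as pd

# Hypothetical data: "outer" labels appear in the order b, b, a.
index = pd.MultiIndex.from_tuples(
    [("b", 1), ("b", 2), ("a", 1)], names=["outer", "inner"]
)
frame = pd.DataFrame({"value": [10, 20, 30]}, index=index)

# One label per row, in the original row order (not sorted, not deduplicated).
key = index.get_level_values("outer")

# Grouping by the extracted labels versus grouping by the level directly.
by_key = frame.groupby(key, sort=False).sum()
by_level = frame.groupby(level="outer", sort=False).sum()

print(by_key)
print(by_level)
```

Because the key is an ordinary array of labels, the usual `sort` handling applies to it with no separate code path for a single level.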
doc/source/whatsnew/v0.21.0.txt
Outdated
@@ -538,6 +538,7 @@ Groupby/Resample/Rolling
- Bug in ``Series.resample(...).apply()`` where an empty ``Series`` modified the source index and did not return the name of a ``Series`` (:issue:`14313`)
- Bug in ``.rolling(...).apply(...)`` with a ``DataFrame`` with a ``DatetimeIndex``, a ``window`` of a timedelta-convertible and ``min_periods >= 1` (:issue:`15305`)
- Bug in ``DataFrame.groupby`` where index and column keys were not recognized correctly when the number of keys equaled the number of elements on the groupby axis (:issue:`16859`)
- Bug in ``DataFrame.groupby`` where the single level selection from ``MultiIndex`` occurs unexpected index sorting (:issue:`17537`)
"occurs" -> "incurs"? Or maybe "causes"?
@TomAugspurger Okay. I'll do that.
doc/source/whatsnew/v0.21.0.txt
Outdated
@@ -538,6 +538,7 @@ Groupby/Resample/Rolling
- Bug in ``Series.resample(...).apply()`` where an empty ``Series`` modified the source index and did not return the name of a ``Series`` (:issue:`14313`)
- Bug in ``.rolling(...).apply(...)`` with a ``DataFrame`` with a ``DatetimeIndex``, a ``window`` of a timedelta-convertible and ``min_periods >= 1` (:issue:`15305`)
- Bug in ``DataFrame.groupby`` where index and column keys were not recognized correctly when the number of keys equaled the number of elements on the groupby axis (:issue:`16859`)
- Bug in ``DataFrame.groupby`` where the single level selection from ``MultiIndex`` occurs unexpected index sorting (:issue:`17537`)
where a single level selection from a MultiIndex unexpectedly sorts.
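A minimal sketch of the behaviour this whatsnew entry describes, assuming a pandas version that includes the fix. The data and names are made up for illustration: with `sort=False`, grouping on a single `MultiIndex` level should keep the groups in order of first appearance.

```python
import pandas as pd

# Hypothetical data: first-level labels are deliberately not in sorted order.
index = pd.MultiIndex.from_tuples(
    [("b", 1), ("b", 2), ("a", 1), ("a", 2)], names=["outer", "inner"]
)
frame = pd.DataFrame({"value": [1, 2, 3, 4]}, index=index)

# sort=False: groups follow the order in which labels first appear (b, then a).
unsorted_result = frame.groupby(level="outer", sort=False).sum()

# sort=True: groups are ordered by label (a, then b).
sorted_result = frame.groupby(level="outer", sort=True).sum()

print(unsorted_result)
print(sorted_result)
```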
24409e9 to e4cdd07
pandas/core/groupby.py
Outdated
    # axis of the object
    if level is not None:
        if not isinstance(group_axis, MultiIndex):
        # TODO: These two conditions are almost same.
ok for now. can you come back in a future PR and see what we can do with all the conditions in this section? It's getting pretty unwieldy (and document as much as possible).
@jreback Okay, I'll do that. I think these are too complicated to refactor in this PR.
pandas/core/groupby.py
Outdated
        # TODO: These two conditions are almost same.
        # We should combine two.
        if isinstance(group_axis, MultiIndex):
            if is_list_like(level) and len(level) == 1:
actually this condition I think you can pull out of the MultiIndex check here (as the else is the same condition)
@jreback I am aware of this, but it seems that some of the processing in the else branch applies only to a non-MultiIndex. We have to consider carefully whether it is also applicable to a MultiIndex.
https://github.com/pandas-dev/pandas/pull/17621/files/e4cdd0726e685b0216056ba224ed363bf1e836f9#diff-720d374f1a709d0075a1f0a02445cd65R2618
When these are applicable, we also have to check that there are no side effects on subsequent processing.
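The condition debated in this thread, unwrapping a one-element list-like `level` to a scalar, can be sketched in isolation. The helper name `normalize_single_level` is hypothetical and does not exist in pandas; `is_list_like` and `is_scalar` are the public functions from `pandas.api.types`. It leaves out the extra non-MultiIndex checks mentioned above.

```python
from pandas.api.types import is_list_like, is_scalar


def normalize_single_level(level):
    """Hypothetical helper: unwrap a one-element list-like level to a scalar."""
    if is_list_like(level) and len(level) == 1:
        return level[0]
    # Scalars and multi-element list-likes pass through unchanged.
    return level


print(normalize_single_level(["first"]))   # unwrapped to the scalar "first"
print(normalize_single_level("first"))     # already a scalar
print(normalize_single_level([0, 1]))      # more than one level: left as a list
print(is_scalar(normalize_single_level([0])))
```

Pulling the unwrapping into one place is the kind of consolidation the reviewer asks for. Whether the remaining non-MultiIndex handling can share it is the open question the author raises.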
can you rebase
455a60b to 9962d61
@jreback Rebased.
@Licht-T can you rebase and push once again? I want to get all green here.
9962d61 to 9b6a3da
@jreback Now all green!
thanks @Licht-T
git diff upstream/master -u -- "*.py" | flake8 --diff