BUG: DataFrame.sort_index broken if not both lexsorted and monotonic in levels #15694
The basic problem was that we were not sorting if a MultiIndex was lexsorted. But a lexsorted index does NOT imply that the levels are monotonic (intra-level); depending on the construction method, they might or might not be. So this forces a reconstruction of the MI when sorting (which we do in a myriad of places) to ensure that it is ordered correctly; the reconstruction is not actually expensive. xref #13431, for which I added a test (xfailing). That one is a tiny bit more complicated, and I think it may require modifying the internals a bit. |
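The difference between a lexsorted index and monotonic levels can be illustrated without pandas. The following is a minimal pure-Python sketch, not the implementation in this PR: it models a single level as a list of unique values (`levels`) plus integer positions into that list (`codes`, which the pandas version discussed here calls `labels`). The helper name `sort_level_monotonic` is hypothetical.

```python
def sort_level_monotonic(levels, codes):
    """Return (new_levels, new_codes) where new_levels is sorted and
    every row still maps to the same value as before."""
    # Positions of the old level values, listed in sorted-value order.
    order = sorted(range(len(levels)), key=lambda i: levels[i])
    new_levels = [levels[i] for i in order]
    # remap[old_code] gives the position of that value in new_levels.
    remap = {old: new for new, old in enumerate(order)}
    new_codes = [remap[c] for c in codes]
    return new_levels, new_codes


levels = ["bb", "aa"]   # not monotonic: "bb" is stored before "aa"
codes = [0, 1, 0, 1]    # rows hold bb, aa, bb, aa

new_levels, new_codes = sort_level_monotonic(levels, codes)

# The row values are unchanged, but the codes now compare the same way
# the values do, so sorting by codes is equivalent to sorting by value.
row_values = [levels[c] for c in codes]
assert [new_levels[c] for c in new_codes] == row_values
print(new_levels, new_codes)
```

Once the level is rebuilt this way, an indexer computed from the codes orders the rows by their actual values, which is the property the reconstruction is meant to guarantee.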
Codecov Report
@@ Coverage Diff @@
## master #15694 +/- ##
==========================================
+ Coverage 90.97% 90.99% +0.02%
==========================================
Files 145 145
Lines 49474 49519 +45
==========================================
+ Hits 45007 45060 +53
+ Misses 4467 4459 -8
|
@chris-b1 if you'd have a look. |
@@ -1807,6 +1807,13 @@ def get_group_levels(self):
        'ohlc': lambda *args: ['open', 'high', 'low', 'close']
    }

    def _is_builtin_func(self, arg):
ignore this, was actually an unrelated bug as this wasn't defined on BaseGrouper
Force-pushed from 698e05f to a6f352c.
@jreback - I only skimmed the implementation; it seems reasonable at first glance. I do think this needs a bigger note in the docs, and maybe it should even warn if the reconstruction re-sorts the levels, as this is an API change? I'm in favor of the behavior in this PR, but there could be existing code that takes advantage of the custom ordering possible with a MI, e.g.

In [21]: df = pd.DataFrame({'value': [1, 2, 3, 4]}, index=pd.MultiIndex(
    ...:     levels=[['a', 'b'], ['bb', 'aa']],
    ...:     labels=[[0, 0, 1, 1], [0, 1, 0, 1]]))
In [22]: df
Out[22]:
      value
a bb      1
  aa      2
b bb      3
  aa      4
In [23]: df.sort_index()
Out[23]:
      value
a bb      1
  aa      2
b bb      3
  aa      4 |
@chris-b1 FYI couple of recent pushes as I had some bug fixes. This only reconstructs to actually calculate the indexer. It should not be an API change, except that some sorting before just didn't work. |
@chris-b1 your example maybe with an older version
|
Maybe I'm misunderstanding, but won't
|
see [3] in my example (your index is right, but the values are not). It gets sorted. |
Sorry I mistyped the values. Pulled it down. this is the change in behavior - although

master / 0.19.2

In [25]: df.sort_index()
Out[25]:
      value
a bb      1
  aa      2
b bb      3
  aa      4

PR

In [3]: df.sort_index()
      value
a aa      2
  bb      1
b aa      4
  bb      3 |
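The two outputs above can be reproduced with a toy model of the same index. This is only an illustrative pure-Python sketch, not how pandas computes its indexer: rows are ordered once by their integer codes and once by the values those codes point to. The names `code_key` and `value_key` are hypothetical.

```python
# Same data as the example: second level stores 'bb' before 'aa'.
levels = [["a", "b"], ["bb", "aa"]]
codes = [[0, 0, 1, 1], [0, 1, 0, 1]]
values = [1, 2, 3, 4]

n_rows = len(values)


def code_key(i):
    # Tuple of integer codes for row i, one per level.
    return tuple(level_codes[i] for level_codes in codes)


def value_key(i):
    # Tuple of actual level values for row i, one per level.
    return tuple(lev[level_codes[i]] for lev, level_codes in zip(levels, codes))


# Ordering by codes: the codes are already lexsorted, so nothing moves.
by_codes = [values[i] for i in sorted(range(n_rows), key=code_key)]

# Ordering by values: 'aa' now comes before 'bb' within each outer key.
by_values = [values[i] for i in sorted(range(n_rows), key=value_key)]

print(by_codes)   # row order as in the master / 0.19.2 output
print(by_values)  # row order as in the PR output
```

In this model, the codes being lexsorted says nothing about the values being in order, which is why the frame on master looked "sorted" to pandas while the values were not.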
@chris-b1 right that's the bug, they thought it sorted but actually wasn't. ok will add this as a small sub-section to show it. |
So, just because it was cool :> I added support (internally) for removing unused level values, a la #2770, here: 50ac461. This is still not user-exposed, though it is pretty trivial to make a Further, I think we could actually call this from a higher level (e.g. in DataFrame / Groupby and such); it's pretty cheap as long as you don't actually have unused levels, with a tiny modification. This is for another issue though. |
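Removing unused level values can be sketched in the same toy model of a level (unique values plus integer codes). This is a hedged pure-Python illustration of the idea only, not the code in 50ac461; the helper name `remove_unused` is hypothetical.

```python
def remove_unused(levels, codes):
    """Drop level values that no row refers to and renumber the codes."""
    # Codes that actually occur, in ascending order so the relative
    # order of the surviving level values is preserved.
    used = sorted(set(codes))
    remap = {old: new for new, old in enumerate(used)}
    new_levels = [levels[i] for i in used]
    new_codes = [remap[c] for c in codes]
    return new_levels, new_codes


levels = ["a", "b", "c"]
codes = [0, 2, 2]        # no row uses "b"

new_levels, new_codes = remove_unused(levels, codes)

# Row values are identical before and after; only the storage shrinks.
assert [levels[c] for c in codes] == [new_levels[c] for c in new_codes]
print(new_levels, new_codes)
```

When every level value is in use, `used` covers all positions and the remapping is the identity, which is consistent with the remark that the operation is cheap when there is nothing to remove.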
Not to belabor the point, but what I was saying is that someone may have wanted that ordering; it was well-defined behavior, if surprising. It seems to have been removed from the current docs, but there used to be a line specifically explaining that lexsorting the index does not always mean lexsorting the level values. (To be clear, I am completely for changing this.)
|
revised to replace internal |
pandas/indexes/multi.py
Outdated
        return MultiIndex(levels=levels, labels=labels, sortorder=sortorder,
                          names=names)

    def sort_levels_monotonic(self):
If this is internal, let's then call it _sort_levels_monotonic?
done
pandas/indexes/multi.py
Outdated
        """
        .. versionadded:: 0.20.0

        create a new MultiIndex from the current that removesing
removesing -> removes
done
pandas/indexes/multi.py
Outdated

    def remove_unused_levels(self):
        """
        .. versionadded:: 0.20.0
Can you put this after the explanation? (the first sentence is what appears in api summary tables)
done
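The docstring ordering requested in this thread can be sketched as follows. This is a hypothetical stand-in function, not the final docstring merged in the PR: the one-sentence summary comes first and the `.. versionadded::` directive follows the explanation, so that a tool taking the first sentence for a summary table picks up the description and not the directive.

```python
def remove_unused_levels_sketch(index):
    """
    Create a new index from the current one, dropping unused level values.

    .. versionadded:: 0.20.0
    """
    # Placeholder body; only the docstring layout matters in this sketch.
    return index


# The first non-empty docstring line is the summary sentence.
summary = remove_unused_levels_sketch.__doc__.strip().splitlines()[0]
print(summary)
```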
fixed up. will merge tomorrow |
Good to merge. I am thinking we might need to fix #15797 at the same time with this change (I don't mean necessarily in this PR, but the same release). |
yep will address #15797 next week. |
closes #15622
closes #15687
closes #14015
closes #13431
nice bump on Series.sort_index for monotonic