BUG: DataFrame.sort_index broken if not both lexsorted and monotonic in levels #15694

jreback · 2017-03-15T23:09:07Z

closes #15622
closes #15687
closes #14015
closes #13431

nice bump on Series.sort_index for monotonic

    before     after       ratio
  [37e5f78b] [a6f352c0]
-    1.86ms   100.07μs      0.05  timeseries.TimeSeries.time_sort_index_monotonic

jreback · 2017-03-15T23:11:55Z

So the basic problem was that we were not sorting if a MultiIndex was already lexsorted. But a lexsorted index does NOT imply that the values within each level are monotonic. Depending on the construction method, they might or might not be.

So what this does is force a reconstruction of the MultiIndex whenever we sort (which we do in a myriad of places), to ensure that it is ordered correctly. The reconstruction is not actually expensive.

xref #13431, for which I added a test (currently xfailing). That case is a bit more complicated, and I think it may require modifying the internals a bit.
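
To make the distinction above concrete, here is a minimal plain-Python sketch. It is not the code in this PR: the list-based `levels`/`codes` layout and the helper names (`rows_as_values`, `is_sorted`, `sort_levels_monotonic`) are assumptions for illustration only. It shows an index whose integer codes are lexsorted while its values are not, and how rebuilding each level in sorted order yields codes that can be sorted directly.

```python
# Illustrative stand-ins for a two-level MultiIndex (not pandas objects).
levels = [["a", "b"], ["bb", "aa"]]   # second level is NOT monotonic
codes = [[0, 0, 1, 1], [0, 1, 0, 1]]  # integer positions into each level


def rows_as_values(levels, codes):
    """Materialize each row as a tuple of level values."""
    return [tuple(lev[c] for lev, c in zip(levels, row)) for row in zip(*codes)]


def is_sorted(seq):
    """True if seq is non-decreasing."""
    return all(a <= b for a, b in zip(seq, seq[1:]))


def sort_levels_monotonic(levels, codes):
    """Hypothetical reconstruction: sort each level, then remap its codes."""
    new_levels, new_codes = [], []
    for lev, lev_codes in zip(levels, codes):
        # Old positions listed in the order of their sorted values.
        order = sorted(range(len(lev)), key=lev.__getitem__)
        remap = {old: new for new, old in enumerate(order)}
        new_levels.append([lev[i] for i in order])
        new_codes.append([remap[c] for c in lev_codes])
    return new_levels, new_codes


lexsorted_on_codes = is_sorted(list(zip(*codes)))           # True
sorted_on_values = is_sorted(rows_as_values(levels, codes))  # False

new_levels, new_codes = sort_levels_monotonic(levels, codes)
# With monotonic levels, sorting rows by their codes sorts them by value.
indexer = sorted(range(len(codes[0])), key=lambda i: tuple(c[i] for c in new_codes))
print(lexsorted_on_codes, sorted_on_values, new_levels, indexer)
```

Applying `indexer` to the values `[1, 2, 3, 4]` gives `[2, 1, 4, 3]`, which matches the sorted output shown later in this thread for the same levels and labels.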

codecov · 2017-03-16T02:50:50Z

Codecov Report

Merging #15694 into master will increase coverage by 0.02%.
The diff coverage is 98.24%.

@@            Coverage Diff             @@
##           master   #15694      +/-   ##
==========================================
+ Coverage   90.97%   90.99%   +0.02%     
==========================================
  Files         145      145              
  Lines       49474    49519      +45     
==========================================
+ Hits        45007    45060      +53     
+ Misses       4467     4459       -8

Flag	Coverage Δ
#multiple	`88.75% <98.24%> (+0.02%)`	⬆️
#single	`40.61% <17.54%> (-0.07%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/sorting.py	`97.81% <100%> (+0.03%)`	⬆️
pandas/core/frame.py	`97.57% <100%> (ø)`	⬆️
pandas/indexes/multi.py	`96.7% <100%> (+0.1%)`	⬆️
pandas/core/reshape.py	`99.27% <100%> (-0.01%)`	⬇️
pandas/core/groupby.py	`95.54% <100%> (+0.51%)`	⬆️
pandas/core/series.py	`94.89% <85.71%> (-0.08%)`	⬇️
pandas/indexes/base.py	`96.09% <0%> (-0.06%)`	⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0cfc08c...bd17d2b. Read the comment docs.

jreback · 2017-03-16T12:40:26Z

@chris-b1 if you'd have a look.

jreback · 2017-03-16T12:41:05Z

pandas/core/groupby.py

@@ -1807,6 +1807,13 @@ def get_group_levels(self):
        'ohlc': lambda *args: ['open', 'high', 'low', 'close']
    }

+    def _is_builtin_func(self, arg):


Ignore this; it was actually an unrelated bug, as this wasn't defined on BaseGrouper.

chris-b1 · 2017-03-16T14:48:37Z

@jreback - I only skimmed the implementation, seems reasonable at first glance.

I do think this needs a bigger note in the docs, and maybe it should even warn if the reconstruction re-sorts the levels, as this is an API change? I'm in favor of the behavior in this PR, but there could be existing code that takes advantage of the custom ordering possible with a MultiIndex, e.g.

In [21]: df = pd.DataFrame({'value': [1, 2, 3, 4]}, index=pd.MultiIndex(
    ...:     levels=[['a', 'b'], ['bb', 'aa']],
    ...:     labels=[[0, 0, 1, 1], [0, 1, 0, 1]]))

In [22]: df
Out[22]: 
      value
a bb      1
  aa      2
b bb      3
  aa      4

In [23]: df.sort_index()
Out[23]: 
      value
a bb      1
  aa      2
b bb      3
  aa      4

jreback · 2017-03-16T14:53:53Z

@chris-b1 FYI couple of recent pushes as I had some bug fixes.

This only reconstructs to actually calculate the indexer. It should not be an API change, except that some sorting before just didn't work.

jreback · 2017-03-16T14:55:21Z

@chris-b1 your example may be from an older version:

In [1]: df = pd.DataFrame({'value': [1, 2, 3, 4]}, index=pd.MultiIndex(
   ...:     levels=[['a', 'b'], ['bb', 'aa']],
   ...:     labels=[[0, 0, 1, 1], [0, 1, 0, 1]]))

In [2]: df
Out[2]: 
      value
a bb      1
  aa      2
b bb      3
  aa      4

In [3]: df.sort_index()
Out[3]: 
      value
a aa      2
  bb      1
b aa      4
  bb      3

In [4]: df.index
Out[4]: 
MultiIndex(levels=[['a', 'b'], ['bb', 'aa']],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]])

In [5]: df.sort_index().index
Out[5]: 
MultiIndex(levels=[['a', 'b'], ['bb', 'aa']],
           labels=[[0, 0, 1, 1], [1, 0, 1, 0]])

chris-b1 · 2017-03-16T14:55:21Z

Maybe I'm misunderstanding, but won't [23] above now be this?

In [23]: df.sort_index()
Out[23]: 
      value
a aa      1
  bb      2
b aa      3
  bb      4

jreback · 2017-03-16T15:06:37Z

see [3] in my example (your index is right, but the values are not). It gets sorted.

chris-b1 · 2017-03-16T15:17:34Z

Sorry, I mistyped the values. I pulled the branch down, and this is the change in behavior. Although [25] (below) looks like a bug, my point was that someone could have been relying on it if they had specified the levels.

master / 0.19.2

In [25]: df.sort_index()
Out[25]: 
      value
a bb      1
  aa      2
b bb      3
  aa      4

PR

In [3]: df.sort_index()
Out[3]:
      value
a aa      2
  bb      1
b aa      4
  bb      3

jreback · 2017-03-16T15:19:47Z

@chris-b1 right, that's the bug: they thought it was sorted, but it actually wasn't. OK, I will add this as a small sub-section in the docs to show it.

jreback · 2017-03-16T15:34:24Z

so just because it was cool :>

I added support (internally) for removing unused level values, à la #2770, here: 50ac461

This is still not user-exposed, though it would be pretty trivial to make a .remove_unused_levels() function (which could just call this).

Further, I think we could actually call this from a higher level (e.g. in DataFrame / Groupby); with a tiny modification, it's pretty cheap as long as you don't actually have unused levels.

This is for another issue though.

cc @shoyer @wesm
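
For readers following along, the idea of dropping unused level values can be sketched in plain Python. This is not the code in 50ac461: the function below and its list-based `levels`/`codes` layout are assumptions for illustration, and missing-value codes (-1) are deliberately not handled.

```python
def remove_unused_levels(levels, codes):
    """Hypothetical sketch: keep only level values that some row references."""
    new_levels, new_codes = [], []
    for lev, lev_codes in zip(levels, codes):
        used = sorted(set(lev_codes))  # positions actually referenced
        remap = {old: new for new, old in enumerate(used)}
        new_levels.append([lev[i] for i in used])
        new_codes.append([remap[c] for c in lev_codes])
    return new_levels, new_codes


# Levels as in the example above, but only the rows for 'b' remain,
# so 'a' is still stored in the first level without being used.
levels = [["a", "b"], ["bb", "aa"]]
codes = [[1, 1], [0, 1]]

new_levels, new_codes = remove_unused_levels(levels, codes)
print(new_levels, new_codes)
```

When every position is already in use, `used` covers the whole level and the remapping is the identity, which is the cheap case described above.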

chris-b1 · 2017-03-16T16:03:44Z

> @chris-b1 right that's the bug,

Not to belabor the point, but what I was saying is that someone may have wanted that ordering; it was well-defined behavior, if surprising. It seems to have been removed in the current docs, but there used to be a line specifically explaining that lexsorting the index does not always mean lexsorting the level values. (To be clear, I am completely for changing this.)

> There is an important new method sort_index to sort an axis within a MultiIndex so that its labels are grouped and sorted by the original ordering of the associated factor at that level. Note that this does not necessarily mean the labels will be sorted lexicographically!

http://pandas.pydata.org/pandas-docs/version/0.18.1/advanced.html#the-need-for-sortedness-with-multiindex

jreback · 2017-04-04T15:34:07Z

revised to replace internal _reconstruct with .sort_monotonic() and .remove_unused_levels() (now public). I think this is cleaner; revised docs a bit as well.

@chris-b1 @jorisvandenbossche

jorisvandenbossche · 2017-04-05T20:27:28Z

pandas/indexes/multi.py


-        return MultiIndex(levels=levels, labels=labels, sortorder=sortorder,
-                          names=names)
+    def sort_levels_monotonic(self):


If this is internal, let's then call it _sort_levels_monotonic ?

jorisvandenbossche · 2017-04-05T20:29:33Z

pandas/indexes/multi.py

+        """
+        .. versionadded:: 0.20.0
+
+        create a new MultiIndex from the current that removesing


removesing -> removes

jorisvandenbossche · 2017-04-05T20:30:06Z

pandas/indexes/multi.py

+
+    def remove_unused_levels(self):
+        """
+        .. versionadded:: 0.20.0


Can you put this after the explanation? (the first sentence is what appears in api summary tables)

jreback · 2017-04-07T00:35:05Z

fixed up.

@chris-b1 @jorisvandenbossche

will merge tomorrow

jorisvandenbossche · 2017-04-07T14:07:31Z

Good to merge.

I am thinking we might need to fix #15797 at the same time with this change (I don't mean necessarily in this PR, but the same release).
For example, in the case of issue #15622 (which is said to be closed by this PR), you would end up with a now visually sorted (that was the bug report, so that is good), but no longer lexsorted frame. So that could lead to errors when indexing.
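
The concern can be illustrated with a small plain-Python sketch. It assumes that a lexsortedness check looks only at the integer codes, and it reuses the levels and labels printed earlier in this thread for `df.sort_index().index`; the variable names are illustrative, not pandas API.

```python
# Index of the sorted frame as printed earlier in the thread:
# the levels keep their original (non-monotonic) order.
levels = [["a", "b"], ["bb", "aa"]]
codes = [[0, 0, 1, 1], [1, 0, 1, 0]]

code_rows = list(zip(*codes))
value_rows = [tuple(lev[c] for lev, c in zip(levels, row)) for row in code_rows]

values_sorted = value_rows == sorted(value_rows)  # rows look sorted to the user
codes_sorted = code_rows == sorted(code_rows)     # but the codes are not lexsorted
print(values_sorted, codes_sorted)
```

Here the rows are in sorted order by value while the codes are not, which is the "visually sorted but no longer lexsorted" situation described above.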

jreback · 2017-04-07T15:17:16Z

yep will address #15797 next week.

jreback added the 2/3 Compat, Algos, API Design, Bug, MultiIndex, and Reshaping labels Mar 15, 2017

jreback added this to the 0.20.0 milestone Mar 15, 2017

jreback force-pushed the sort3 branch from 615a99f to e7c0c14 Compare March 15, 2017 23:27

jreback removed the 2/3 Compat, Algos, and API Design labels Mar 15, 2017

jreback force-pushed the sort3 branch from e7c0c14 to 72bc7d0 Compare March 15, 2017 23:46

jreback changed the title ~~BUG: construct MultiIndex identically from levels/labels when concatting~~ BUG: DataFrame.sort_index broken if not both lexsorted and monotonic in levels Mar 16, 2017

jreback commented Mar 16, 2017

jreback force-pushed the sort3 branch 2 times, most recently from 698e05f to a6f352c Compare March 16, 2017 14:41

jreback force-pushed the sort3 branch from a6f352c to 54c6e93 Compare March 16, 2017 14:51

jreback force-pushed the sort3 branch from ae6b9ec to 50ac461 Compare March 16, 2017 15:53

jreback force-pushed the sort3 branch from c1d7d4a to cf02fcf Compare March 25, 2017 18:02

jreback force-pushed the sort3 branch from cf02fcf to 675a0b6 Compare April 2, 2017 22:59

jreback mentioned this pull request Apr 4, 2017

na_position doesn't work for sort_index() with MultiIndex #15845

Closed

4 tasks

jreback force-pushed the sort3 branch from 675a0b6 to 0115cdb Compare April 4, 2017 15:33

jreback force-pushed the sort3 branch from ee8cb43 to f933366 Compare April 4, 2017 18:32

jorisvandenbossche reviewed Apr 5, 2017

jreback force-pushed the sort3 branch from f933366 to b5a7bfc Compare April 6, 2017 13:43

jreback added 11 commits April 6, 2017 20:17

BUG: construct MultiIndex identically from levels/labels when concatting

47c67d6

closes pandas-dev#15622 closes pandas-dev#15687 closes pandas-dev#14015 closes pandas-dev#13431

incorrectly raising KeyError rather than UnsortedIndexError, caught by doc-example

7be8941

support for removing unused levels (internally)

b234bdb

xref pandas-dev#2770

small doc updates

269cb3b

add degenerate test case

3c4ca22

replace _reconstruct with: sort_monotonic, and remove_unused_levels (public)

f2ddc9c

versionadded tags

520c9c1

simpler algo for remove_used_levels

527c3a6

add doc example

48249ab

add doc-strings, rename sort_monotonic -> sort_levels_monotonic

31097fc

rename sort_index_montonic -> _sort_index_monotonic

bd17d2b

doc fixups

jreback force-pushed the sort3 branch from b5a7bfc to bd17d2b Compare April 7, 2017 00:34

jreback closed this in f478e4f Apr 7, 2017

jreback mentioned this pull request Apr 20, 2017

Time-based .rolling() fails with .groupby() #13966

Closed

jreback added a commit to jreback/pandas that referenced this pull request Apr 22, 2017

BUG: fix degenerate MultiIndex sorting

5b382a4

xref pandas-dev#15694 closes pandas-dev#15797

jreback mentioned this pull request Apr 22, 2017

BUG: fix degenerate MultiIndex sorting #16092

Merged

jreback added a commit to jreback/pandas that referenced this pull request Apr 22, 2017

BUG: fix degenerate MultiIndex sorting

80516ff

xref pandas-dev#15694 closes pandas-dev#15797

jreback added a commit that referenced this pull request Apr 22, 2017

BUG: fix degenerate MultiIndex sorting (#16092)

c847884

xref #15694 closes #15797

pcluo pushed a commit to pcluo/pandas that referenced this pull request May 22, 2017

BUG: fix degenerate MultiIndex sorting (pandas-dev#16092)

ba6de64

xref pandas-dev#15694 closes pandas-dev#15797
