Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: DataFrame.sort_index broken if not both lexsorted and monotonic in levels #15694

Closed
wants to merge 11 commits into from

Conversation

jreback
Copy link
Contributor

@jreback jreback commented Mar 15, 2017

closes #15622
closes #15687
closes #14015
closes #13431

nice bump on Series.sort_index for monotonic

    before     after       ratio
  [37e5f78b] [a6f352c0]
-    1.86ms   100.07μs      0.05  timeseries.TimeSeries.time_sort_index_monotonic

@jreback jreback added 2/3 Compat Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff API Design Bug MultiIndex Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Mar 15, 2017
@jreback jreback added this to the 0.20.0 milestone Mar 15, 2017
@jreback
Copy link
Contributor Author

jreback commented Mar 15, 2017

so the basic problem was we were not sorting if a MultiIndex was lexsorted. But a lexsorted index, does NOT imply that the levels are monotonic (intra-level). Depending on the construction method they might or might not be.

So what this is does is will force a reconstruction (of the MI), which is not actually expensive to do; to ensure that it is ordered correctly when sorting. (which we do in a myriad of places).

xref #13431 which I added a test (xfailing). This is a tiny bit more complicated and I think may have to modify the internals a bit.

@jreback jreback removed 2/3 Compat Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff API Design labels Mar 15, 2017
@codecov
Copy link

codecov bot commented Mar 16, 2017

Codecov Report

Merging #15694 into master will increase coverage by 0.02%.
The diff coverage is 98.24%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #15694      +/-   ##
==========================================
+ Coverage   90.97%   90.99%   +0.02%     
==========================================
  Files         145      145              
  Lines       49474    49519      +45     
==========================================
+ Hits        45007    45060      +53     
+ Misses       4467     4459       -8
Flag Coverage Δ
#multiple 88.75% <98.24%> (+0.02%) ⬆️
#single 40.61% <17.54%> (-0.07%) ⬇️
Impacted Files Coverage Δ
pandas/core/sorting.py 97.81% <100%> (+0.03%) ⬆️
pandas/core/frame.py 97.57% <100%> (ø) ⬆️
pandas/indexes/multi.py 96.7% <100%> (+0.1%) ⬆️
pandas/core/reshape.py 99.27% <100%> (-0.01%) ⬇️
pandas/core/groupby.py 95.54% <100%> (+0.51%) ⬆️
pandas/core/series.py 94.89% <85.71%> (-0.08%) ⬇️
pandas/indexes/base.py 96.09% <0%> (-0.06%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0cfc08c...bd17d2b. Read the comment docs.

@jreback jreback changed the title BUG: construct MultiIndex identically from levels/labels when concatting BUG: DataFrame.sort_index broken if not both lexsorted and monotonic in levels Mar 16, 2017
@jreback
Copy link
Contributor Author

jreback commented Mar 16, 2017

@chris-b1 if you'd have a look.

@@ -1807,6 +1807,13 @@ def get_group_levels(self):
'ohlc': lambda *args: ['open', 'high', 'low', 'close']
}

def _is_builtin_func(self, arg):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ignore this, was actually an unrelated bug as this wasn't defined on BaseGrouper

@jreback jreback force-pushed the sort3 branch 2 times, most recently from 698e05f to a6f352c Compare March 16, 2017 14:41
@chris-b1
Copy link
Contributor

@jreback - I only skimmed the implementation, seems reasonable at first glance.

I do think this needs a bigger note in the docs, and maybe should even warn if the reconstruction re-sorts the levels as this is an API change? I'm in favor in the behavior in this PR, but there could be existing code that takes advantage of the customer ordering possible with a mi. e.g.

In [21]: df = pd.DataFrame({'value': [1, 2, 3, 4]}, index=pd.MultiIndex(
    ...:     levels=[['a', 'b'], ['bb', 'aa']],
    ...:     labels=[[0, 0, 1, 1], [0, 1, 0, 1]]))

In [22]: df
Out[22]: 
      value
a bb      1
  aa      2
b bb      3
  aa      4

In [23]: df.sort_index()
Out[23]: 
      value
a bb      1
  aa      2
b bb      3
  aa      4

@jreback
Copy link
Contributor Author

jreback commented Mar 16, 2017

@chris-b1 FYI couple of recent pushes as I had some bug fixes.

This only reconstructs to actually calculate the indexer. It should not be an API change, except that some sorting before just didn't work.

@jreback
Copy link
Contributor Author

jreback commented Mar 16, 2017

@chris-b1 your example maybe with an older version

In [1]: df = pd.DataFrame({'value': [1, 2, 3, 4]}, index=pd.MultiIndex(
   ...:     ...:     levels=[['a', 'b'], ['bb', 'aa']],
   ...:     ...:     labels=[[0, 0, 1, 1], [0, 1, 0, 1]]))
   ...: 

In [2]: df
Out[2]: 
      value
a bb      1
  aa      2
b bb      3
  aa      4

In [3]: df.sort_index()
Out[3]: 
      value
a aa      2
  bb      1
b aa      4
  bb      3

In [4]: df.index
Out[4]: 
MultiIndex(levels=[['a', 'b'], ['bb', 'aa']],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]])

In [5]: df.sort_index().index
Out[5]: 
MultiIndex(levels=[['a', 'b'], ['bb', 'aa']],
           labels=[[0, 0, 1, 1], [1, 0, 1, 0]])

@chris-b1
Copy link
Contributor

Maybe I'm misunderstanding, but won't [23] above now be this?

In [23]: df.sort_index()
Out[23]: 
      value
a aa      1
  bb      2
b aa      3
  bb      4

@jreback
Copy link
Contributor Author

jreback commented Mar 16, 2017

see [3] in my example (your index is right, but the values are not). It gets sorted.

@chris-b1
Copy link
Contributor

Sorry I mistyped the values. Pulled it down. this is the change in behavior - although [25] (below) looks like a bug, my point was that someone could have been relying on this if they had specified the levels.

master / 0.19.2

In [25]: df.sort_index()
Out[25]: 
      value
a bb      1
  aa      2
b bb      3
  aa      4

PR

In [3]: df.sort_index()

      value
a aa      2
  bb      1
b aa      4
  bb      3

@jreback
Copy link
Contributor Author

jreback commented Mar 16, 2017

@chris-b1 right that's the bug, they thought it sorted but actually wasn't. ok will add this as a small sub-section to show it.

@jreback
Copy link
Contributor Author

jreback commented Mar 16, 2017

so just because it was cool :>

I added support (internally) for removing unused level values, ala #2770 here: 50ac461

This is still not user exposed. Though pretty trivially to make a .remove_unused_levels() function (which could just call this).

Further I think we could actually call this (its pretty cheap as long as you don't actually have unused levels, with a tiny modification) from a higher level (e.g. in DataFrame / Groupby) and such.

This is for another issue though.

cc @shoyer @wesm

@chris-b1
Copy link
Contributor

@chris-b1 right that's the bug,

Not to belabor the point, but what I was saying is that someone may have wanted that ordering, it was well defined behavior, if surprising. Seems to have been removed the in current docs, but there used to be a line specifically explaining that lexsorting the index does not always mean lexsorting the level values. (to be clear, I am completely for changing this)

There is an important new method sort_index to sort an axis within a MultiIndex so that its labels are grouped and sorted by the original ordering of the associated factor at that level. Note that this does not necessarily mean the labels will be sorted lexicographically!

http://pandas.pydata.org/pandas-docs/version/0.18.1/advanced.html#the-need-for-sortedness-with-multiindex

@jreback
Copy link
Contributor Author

jreback commented Apr 4, 2017

revised to replace internal _reconstruct with .sort_monotonic() and .remove_unused_levels() (now public). I think this is cleaner; revised docs a bit as well.

@chris-b1 @jorisvandenbossche


return MultiIndex(levels=levels, labels=labels, sortorder=sortorder,
names=names)
def sort_levels_monotonic(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is internal, let's then call it _sort_levels_monotonic ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

"""
.. versionadded:: 0.20.0

create a new MultiIndex from the current that removesing
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removesing -> removes

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


def remove_unused_levels(self):
"""
.. versionadded:: 0.20.0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you put this after the explanation? (the first sentence is what appears in api summary tables)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@jreback
Copy link
Contributor Author

jreback commented Apr 7, 2017

fixed up.

@chris-b1 @jorisvandenbossche

will merge tomorrow

@jorisvandenbossche
Copy link
Member

Good to merge.

I am thinking we might need to fix #15797 at the same time with this change (I don't mean necessarily in this PR, but the same release).
For example, in the case of issue #15622 (which is said to be closed by this PR), you would end up with a now visually sorted (that was the bug report, so that is good), but no longer lexsorted frame. So that could lead to errors when indexing.

@jreback
Copy link
Contributor Author

jreback commented Apr 7, 2017

yep will address #15797 next week.

@jreback jreback closed this in f478e4f Apr 7, 2017
jreback added a commit to jreback/pandas that referenced this pull request Apr 22, 2017
jreback added a commit to jreback/pandas that referenced this pull request Apr 22, 2017
jreback added a commit that referenced this pull request Apr 22, 2017
pcluo pushed a commit to pcluo/pandas that referenced this pull request May 22, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug MultiIndex Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
4 participants