Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: multi-indexing sorting on axis=1 on >0 levels #14015

Closed
WindJunkie opened this issue Aug 16, 2016 · 12 comments
Closed

BUG: multi-indexing sorting on axis=1 on >0 levels #14015

WindJunkie opened this issue Aug 16, 2016 · 12 comments
Labels
Bug Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex
Milestone

Comments

@WindJunkie
Copy link

In a DataFrame with MultiIndex, sorting on the level with date values does not do anything (order remains unchanged). This happens for both row and column indexes. In my case I am starting with a string index, converting that to datetime index, have not tried with datetime values at the start.

Code Sample, a copy-pastable example if possible

df = pd.DataFrame([[1, 2], [6, 7]])
df.columns = pd.MultiIndex.from_tuples([(0, '8/11/2016 12:00:00 AM'), (0, '8/9/2016 12:00:00 AM')], names=['l1', 'Date'])
df.columns.set_levels(df.columns.levels[1].to_datetime(), level=1, inplace=True)
df.sort_index(axis=1, level=1)

Expected Output

Columns ordered by date is the expected output.

Date    2016-08-09 00:00:00 2016-08-11 00:00:00
0   2   1
1   7   6

Actual output are columns in the original sort order (not even lexicographically sorted).

output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.1
setuptools: 20.3
Cython: 0.23.4
numpy: 1.11.0
scipy: 0.17.1
statsmodels: 0.6.1
xarray: None
IPython: 4.1.2
sphinx: 1.3.1
patsy: 0.4.0
dateutil: 2.5.1
pytz: 2016.2
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.6.0
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.6.0
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.12
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.39.0
pandas_datareader: None

@jreback
Copy link
Contributor

jreback commented Aug 17, 2016

xref #13431

@jreback jreback added Bug Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex Difficulty Advanced labels Aug 17, 2016
@jreback jreback added this to the Next Major Release milestone Aug 17, 2016
@jreback
Copy link
Contributor

jreback commented Aug 17, 2016

I seem to remember this exact issue, but can't find ATM.

@jreback jreback changed the title Pandas does not sort on date values in multiindex BUG: multi-indexing sorting on axis=1 on >0 levels Aug 17, 2016
@chris-b1
Copy link
Contributor

chris-b1 commented Aug 17, 2016

As discussed in some of those issues, MultiIndex sorting is based on the ordering in the levels. The levels are sorted on construction, but not on re-assignment. So a couple workarounds would be to construct the mi with the converted levels, or the index also has a sort_values() method that sorts by values.

In [22]: df.reindex(columns = df.columns.sort_values())
Out[22]: 
l1            0           
Date 2016-08-09 2016-08-11
0             2          1
1             7          6

@jorisvandenbossche
Copy link
Member

MultiIndex sorting is based on the ordering in the levels.

Is this something we could consider changing?
In my opinion, this behaviour does not make much sense from a user perspective. If you want such behaviour, you can now use explicitly a CategoricalIndex. For most users of MultiIndex, the fact that it is implemented with label/levels (codes/categories) is only an implementation detail.

@shoyer
Copy link
Member

shoyer commented Aug 17, 2016

Rather than basing MultiIndex sorting on something other than sorted levels, what about requiring that each level always be sorted? I believe this is already done by default in every case where pandas constructs the MultiIndex levels, so this would only breaks cases where levels are provided explicitly in the MultiIndex constructor or set using set_levels.

@chris-b1
Copy link
Contributor

That seems fairly reasonable, although the sorting behavior is documented, and I'm sure it would break somebody's code, though probably not too painful to detect/deprecate in 0.19, fix in 0.20 / 1.0?

@shoyer
Copy link
Member

shoyer commented Aug 17, 2016

Yes, assuredly someone relies on the existing behavior, but we could probably deprecate it. In my opinion the fact that levels and labels can be sorted differently is a major source of confusion.

One option with set_level would be to automatically factorize new levels and change the underlying integer codes, too. That's probably not a good idea if someone explicitly wrote MultiIndex(levels, labels), though.

@toobaz
Copy link
Member

toobaz commented Aug 21, 2016

@chris-b1 : are you referring to the sentence "the present implementation of MultiIndex requires that the labels be sorted for some of the slicing"? Actually, I think most people (including me, until few minutes ago) interpret "the labels be sorted" as "the MultiIndex be sorted", not "the initialization arrays for .levels be sorted". And if instead they give the second interpretation, well then they should already be following the design proposed by @shoyer .

So if I'm not missing anything, it would be possible and great to have already in 0.19 the following behaviour: for each component of .levels passed at initialization, if it is not sorted, sort it, rearranging the correspondent component of .labels.

By the way: another case in which bad things currently happen is in a .join() of unsorted dataframes:

In [1]: import pandas as pd
   ...: labels = ['c', 'b']
   ...: comp = pd.DataFrame(index=pd.MultiIndex.from_product([labels, labels],
   ...:                                names =['uid', 'oth']))
   ...: idists = pd.Series(0, index=labels, name='Charles')
   ...: idists.index.name = 'uid'
   ...: cc = comp.join(idists, how='inner').sort_index()
   ...: len(cc.index.levels), cc.index.lexsort_depth, cc.index.is_monotonic
   ...: 
Out[1]: (2, 0, True)

@chris-b1
Copy link
Contributor

I meant this note at the end of paragraph, though I agree that isn't super clear either.

... labels are grouped and sorted by the original ordering of the associated factor at that level. Note that this does not necessarily mean the labels will be sorted lexicographically!

@jorisvandenbossche
Copy link
Member

Yes, it is rather hidden in the docs that sorting does sort according to the order of the levels.

I am +1 to change the behaviour of sort to actually sort. But, if we do this by sorting the levels (on initialization, or when sorting), how many people would rely on the actual order of the levels? Eg if you use set_levels, you implicitly rely on the order ...

@shoyer
Copy link
Member

shoyer commented Aug 22, 2016

Looking at the MultiIndex docs, it looks like sort_index() was originally written to ensure that a MultiIndex is "lexsorted" in the way that MultiIndex needs for efficient operations (sorted integer labels a.k.a. codes #13443). I would be much happier using something more explicit like sort_index_codes() for that, though, and reserving sort_index() for actually sorting the index.

I'm less certain now that it's the right thing to always require that levels be sorted. There are some cases where this lets you do different types of indexing efficiently, and right now we expose most of the MultiIndex implementation directly as public API, so people are indeed probably making use of this.

@toobaz
Copy link
Member

toobaz commented Aug 23, 2016

While I understand that both behaviours can be useful, to my eyes it is a bit unnecessary to provide another way (than Categoricals) to impose an order on labels in a level. I'd rather have nice parameters that, for instance, create the required Categoricals on the fly when creating a MultiIndex.from, if a specific ordering is desired.

But if supporting the two different behaviours is considered worth the effort, then maybe a parameter (e.g. codes_sort=False) in sort_index() is sufficient, rather than adding a new method?

What I strongly agree on is that the default behaviour should be changed, despite the backwards incompatibility.

@jreback jreback closed this as completed in f478e4f Apr 7, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants