`sort=False` option to stack/unstack/pivot #15105

naught101 · 2017-01-11T06:24:45Z

It would be really nice if there was a sort=False option on stack/unstack and pivot. (Preferably the default)

It is reasonably common to have data in non-standard order that actually provides information (in my case, I have model names, and the order of the names denotes complexity of the models). Stacking or unstacking currently loses all of this information, with no way to retrieve it. That does not seem like a sensible default to me.

It would be relatively easy to work around a non-sorted stack/unstack method (using .sort_index). To go the other way is less trivial, requiring the user to store a list of the values in the necessary order.

I actually find it hard to think of a situation where a sort on unstack would be more useful...

The text was updated successfully, but these errors were encountered:

jreback · 2017-01-11T13:44:33Z

can you show an example. These are ordered by the index and not sorted.

shoyer · 2017-01-11T16:56:19Z

Indeed, as @jreback points out we don't sort when stacking or unstacking. Rather, levels are sorted internally in a MultiIndex when a MultiIndex is constructed (e.g., with from_arrays or from_product).

This is a confusing implementation detail that leaks into the public API. The levels of a MultiIndex only look sorted if you look at the low-level MultiIndex repr, not the values in a series/dataframe:

In [4]: index = pd.MultiIndex.from_arrays([['b', 'a'], [1, 0]])

In [5]: index
Out[5]:
MultiIndex(levels=[['a', 'b'], [0, 1]],
           labels=[[1, 0], [1, 0]])

In [6]: index.to_series()
Out[6]:
b  1    (b, 1)
a  0    (a, 0)
dtype: object

See #14903 and #14672 for related discussion.

I see a few alternatives for cleaning this up:

Add a sort_levels=False argument to from_arrays and from_product, to allow not sorting levels at MultiIndex construction time. The downside is that this means that some MultiIndex indexing operations will be slow, which defeats part of the purpose of a MultiIndex, but it could still be nice to have the option.
Add a sort=False or reorder=False option (maybe make this the default?) to stack/unstack/pivot, which would ensure that unstacked columns appear in order of appearance in the MultiIndex rather than sorted order (as you suggest here).
Alternatively, we could do nothing and encourage using Categorical dtype to preserve level order, e.g., use pd.Categorical(['b', 'a'], categories=['b', 'a']) rather than just ['b', 'a']. But this is pretty cumbersome and doesn't make the default behavior any more intuitive.

naught101 · 2017-01-13T00:03:42Z

2 - that would basically imply that the stack/unstack/pivot operation would record the order of index elements of the input, and re-order the output based on that, is that correct? That would be fine, I think.

shoyer · 2017-01-13T00:06:04Z

2 - that would basically imply that the stack/unstack/pivot operation would record the order of index elements of the input, and re-order the output based on that, is that correct? That would be fine, I think.

Correct, yes. One downside of this approach is that it is slightly slower to construct the new labels. It requires a pass over the full index using pd.unique, rather than just using the unique set of labels.

naught101 · 2017-01-18T23:01:09Z

There's a good example of this here, I think: https://stackoverflow.com/questions/28686053/pandas-pivot-table-reoganize-order-of-multi-index

jreback · 2017-01-19T01:26:31Z

So this is quite straightforward to provide categorical orderings.
This currently has a small bug, see #15058, but does work (even though indexing generally requires lexsorting, this is compatible). So This is a reasonable soln.

In [31]: index = pd.MultiIndex.from_tuples(list(zip(['a', 'a', 'a', 'b', 'b', 'b'],
    ...:                                       [0, 0, 0, 1, 1, 1],
    ...:                                       ['x', 'xx', 'xxx', 'x', 'xx', 'xxx'])),
    ...:                                       names=['A', 'B', 'C'])
    ...: df = pd.DataFrame(np.random.rand(6, 3), index = index)
    ...: 
    ...: 

In [32]: df
Out[32]: 
                0         1         2
A B C                                
a 0 x    0.179126  0.320740  0.048524
    xx   0.448321  0.755726  0.685202
    xxx  0.058184  0.984778  0.432928
b 1 x    0.279140  0.876715  0.249856
    xx   0.185775  0.403420  0.729933
    xxx  0.133858  0.652050  0.960392

In [33]: df = df.reset_index()

In [34]: df = df.assign(C=df.C.astype('category', categories=['xxx', 'xx', 'x']))

In [35]: df
Out[35]: 
   A  B    C         0         1         2
0  a  0    x  0.179126  0.320740  0.048524
1  a  0   xx  0.448321  0.755726  0.685202
2  a  0  xxx  0.058184  0.984778  0.432928
3  b  1    x  0.279140  0.876715  0.249856
4  b  1   xx  0.185775  0.403420  0.729933
5  b  1  xxx  0.133858  0.652050  0.960392

In [36]: df.dtypes
Out[36]: 
A      object
B       int64
C    category
0     float64
1     float64
2     float64
dtype: object

In [37]: df.sort_values(['A', 'B', 'C'])
Out[37]: 
   A  B    C         0         1         2
2  a  0  xxx  0.058184  0.984778  0.432928
1  a  0   xx  0.448321  0.755726  0.685202
0  a  0    x  0.179126  0.320740  0.048524
5  b  1  xxx  0.133858  0.652050  0.960392
4  b  1   xx  0.185775  0.403420  0.729933
3  b  1    x  0.279140  0.876715  0.249856

michalkahle · 2019-05-24T14:09:28Z

The point is that stack/unstack really sorts the the index. Let's continue with the session by @jreback .

In [7]: df = df.sort_values(['A', 'B', 'C'], ascending=False) 

In [8]: df    
Out[8]: 
   A  B    C         0         1         2
3  b  1    x  0.674635  0.607875  0.356187
4  b  1   xx  0.062986  0.675858  0.895372
5  b  1  xxx  0.974950  0.448647  0.178389
0  a  0    x  0.683399  0.652522  0.549847
1  a  0   xx  0.168813  0.891502  0.385490
2  a  0  xxx  0.881862  0.143344  0.867280

In [9]: df = df.set_index(['A', 'B', 'C'])  

In [10]: df  
Out[10]: 
                0         1         2
A B C                                
b 1 x    0.674635  0.607875  0.356187
    xx   0.062986  0.675858  0.895372
    xxx  0.974950  0.448647  0.178389
a 0 x    0.683399  0.652522  0.549847
    xx   0.168813  0.891502  0.385490
    xxx  0.881862  0.143344  0.867280

In [11]: df.index     
Out[11]: 
MultiIndex(levels=[['a', 'b'], [0, 1], ['xxx', 'xx', 'x']],
           codes=[[1, 1, 1, 0, 0, 0], [1, 1, 1, 0, 0, 0], [2, 1, 0, 2, 1, 0]],
           names=['A', 'B', 'C'])

Now, this is the bug mentioned above. Never mind this, let's continue.

In [12]: df = df.unstack()  

In [13]: df    
Out[13]: 
            0                             1                             2                    
C         xxx        xx         x       xxx        xx         x       xxx        xx         x
A B                                                                                          
a 0  0.881862  0.168813  0.683399  0.143344  0.891502  0.652522  0.867280  0.385490  0.549847
b 1  0.974950  0.062986  0.674635  0.448647  0.675858  0.607875  0.178389  0.895372  0.356187

In [14]: df.index  
Out[14]: 
MultiIndex(levels=[['a', 'b'], [0, 1]],
           codes=[[0, 1], [0, 1]],
           names=['A', 'B'])

Now the data frame is sorted by the remaining levels in the index!

This is undocumented behavior. The documentation says only: "The level involved will automatically get sorted."

Also requested in multiple SO questions.

naught101 · 2019-10-14T03:36:45Z

This is quite aggravating with column multiindex. It doesn't seem like there is a work around.

Firstly,there doesn't seem to be an easy way to convert an existing multindex into a categorical one. The best I could come up with is to select a df row, reset_index on the resultant series, and manually create a categorical multi-index index out of all the relevant columns, and then re-assign that to the original. This works, but it would be nice if there was an easy way to do something like df.columns['blah'] = pd.CategoricalIndex(df.columns['blah'], ordered=True) or something.

Secondly, even with a categorical column multi index, the categories are completely ignored. For example, I would like to stack the 'year' level of the columns, and retain the order of the other levels, but the Categoricalness gets lost in transit:

In [1]: final_df.columns                                                                                  
Out[1]: 
MultiIndex([('value',       'sum_risk_cost', 2020, 'national'),
            ('value',       'sum_risk_cost', 2100, 'national'),
            ('value',   'avg_risk_fraction', 2020, 'national'),
            ('value',   'avg_risk_fraction', 2100, 'national'),
            ('value',   'count_uninsurable', 2020, 'national'),
            ('value',   'count_uninsurable', 2100, 'national'),
            ('value', 'percent_uninsurable', 2020, 'national'),
            ('value', 'percent_uninsurable', 2100, 'national'),
            ( 'rank',       'sum_risk_cost', 2020, 'national'),
            ( 'rank',       'sum_risk_cost', 2020,  'nat_10k'),
            ( 'rank',       'sum_risk_cost', 2020,    'state'),
            ( 'rank',       'sum_risk_cost', 2100, 'national'),
            ( 'rank',       'sum_risk_cost', 2100,  'nat_10k'),
            ( 'rank',       'sum_risk_cost', 2100,    'state'),
            ( 'rank',   'avg_risk_fraction', 2020, 'national'),
            ( 'rank',   'avg_risk_fraction', 2020,  'nat_10k'),
            ( 'rank',   'avg_risk_fraction', 2020,    'state'),
            ( 'rank',   'avg_risk_fraction', 2100, 'national'),
            ( 'rank',   'avg_risk_fraction', 2100,  'nat_10k'),
            ( 'rank',   'avg_risk_fraction', 2100,    'state'),
            ( 'rank',   'count_uninsurable', 2020, 'national'),
            ( 'rank',   'count_uninsurable', 2020,  'nat_10k'),
            ( 'rank',   'count_uninsurable', 2020,    'state'),
            ( 'rank',   'count_uninsurable', 2100, 'national'),
            ( 'rank',   'count_uninsurable', 2100,  'nat_10k'),
            ( 'rank',   'count_uninsurable', 2100,    'state'),
            ( 'rank', 'percent_uninsurable', 2020, 'national'),
            ( 'rank', 'percent_uninsurable', 2020,  'nat_10k'),
            ( 'rank', 'percent_uninsurable', 2020,    'state'),
            ( 'rank', 'percent_uninsurable', 2100, 'national'),
            ( 'rank', 'percent_uninsurable', 2100,  'nat_10k'),
            ( 'rank', 'percent_uninsurable', 2100,    'state')],
           names=['type', 'stat', 'year', 'subset'])

In [2]: final_df.columns.get_level_values('type')                                                         
Out[2]: 
CategoricalIndex(['value', 'value', 'value', 'value', 'value', 'value',
                  'value', 'value', 'rank', 'rank', 'rank', 'rank', 'rank',
                  'rank', 'rank', 'rank', 'rank', 'rank', 'rank', 'rank',
                  'rank', 'rank', 'rank', 'rank', 'rank', 'rank', 'rank',
                  'rank', 'rank', 'rank', 'rank', 'rank'],
                 categories=['rank', 'value'], ordered=True, name='type', dtype='category')

In [3]: final_df = final_df.stack('year').sort_index()                                                    

In [4]: final_df.columns.get_level_values('type')                                                         
Out[4]: 
Index(['rank', 'rank', 'rank', 'rank', 'rank', 'rank', 'rank', 'rank', 'rank',
       'rank', 'rank', 'rank', 'value', 'value', 'value', 'value'],
      dtype='object', name='type')

It also doesn't seem easy to manually store the column orders and re-use them afterwards, die to the missing 'year' level.

Anyone have a suggested work-around for this?

cemanughian · 2021-01-09T00:35:28Z

Is anyone going to fix this?

jreback · 2021-01-09T01:00:20Z

@cemanughian pandas is all volunteer and there are quite a number of open issues

you are welcome to do a pill request - core devs can provide review

DriesSchaumont · 2021-04-23T07:38:49Z

@jreback I am wondering if we need to add a new argument for this? I think we can advice using droplevel and reindex?
If we do want to add this, I can have a look.

import pandas as pd

>>> tuples = list(zip(['zzz', 'xxx', 'ddd', 'zzz', 'aaa', 'zzz', 'aaa'], ['z', 'z', 'z', 'a', 'z', 'x', 'a']))
>>> index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
>>> df = pd.DataFrame({'A': [2, 5, 3, 1, 1, 6, 7], 'B': [4, 3, 5, 10, 5, 3, 8]}, index=index)
>>> df
              A   B
first second
zzz   z       2   4
xxx   z       5   3
ddd   z       3   5
zzz   a       1  10
aaa   z       1   5
zzz   x       6   3
aaa   a       7   8
>>> unstacked = df.unstack()
>>> unstacked
          A               B
second    a    x    z     a    x    z
first
aaa     7.0  NaN  1.0   8.0  NaN  5.0
ddd     NaN  NaN  3.0   NaN  NaN  5.0
xxx     NaN  NaN  5.0   NaN  NaN  3.0
zzz     1.0  6.0  2.0  10.0  3.0  4.0
>>> new_index = index.droplevel(-1).unique()
>>> unstacked.reindex(new_index)
          A               B
second    a    x    z     a    x    z
first
zzz     1.0  6.0  2.0  10.0  3.0  4.0
xxx     NaN  NaN  5.0   NaN  NaN  3.0
ddd     NaN  NaN  3.0   NaN  NaN  5.0
aaa     7.0  NaN  1.0   8.0  NaN  5.0

jreback added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label Jan 11, 2017

jzwinck mentioned this issue Jun 29, 2018

BUG: unstack() does not always sort index in 0.23 #21675

Closed

MarcoGorelli mentioned this issue May 11, 2020

BUG: unstack() sorts automatically, default is preferably unsorted? #34112

Closed

simonjayhawkins mentioned this issue Jul 19, 2020

ENH: change sort behavior in stack() so it's user-directed #35343

Closed

simonjayhawkins added API Design Bug labels Jul 19, 2020

simonjayhawkins added this to the Contributions Welcome milestone Jul 19, 2020

simonjayhawkins added the MultiIndex label Jul 19, 2020

simonjayhawkins mentioned this issue Jul 19, 2020

BUG: DataFrame.unstack() does not properly sort list of levels #9514

Open

phofl mentioned this issue Apr 6, 2021

QST:Why DataFrame.pivot() will change the order of the rows of DataFrame? #39143

Closed

2 tasks

phofl mentioned this issue Apr 21, 2021

Add keyword sort to pivot_table #40954

Merged

4 tasks

mroeschke added Enhancement and removed API Design Bug labels May 8, 2021

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022

This was referenced May 17, 2023

ENH: Add sort keyword to stack #53282

Merged

ENH: Add sort keyword to unstack #53298

Merged

mroeschke closed this as completed in #53298 Jun 1, 2023

mroeschke mentioned this issue Jun 20, 2023

BUG: DataFrame.stack with sort=True and unsorted MultiIndex levels #53637

Merged

5 tasks

This was referenced Jun 25, 2023

BUG: DataFrame.stack sometimes sorting the resulting index #53825

Merged

DEPR: sort=True in DataFrame.unstack and Series.unstack #53915

Open

REGR: DataFrame.stack was sometimes sorting resulting index #53969

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`sort=False` option to stack/unstack/pivot #15105

`sort=False` option to stack/unstack/pivot #15105

naught101 commented Jan 11, 2017

jreback commented Jan 11, 2017

shoyer commented Jan 11, 2017

naught101 commented Jan 13, 2017

shoyer commented Jan 13, 2017

naught101 commented Jan 18, 2017

jreback commented Jan 19, 2017

michalkahle commented May 24, 2019

naught101 commented Oct 14, 2019 •

edited

Loading

cemanughian commented Jan 9, 2021

jreback commented Jan 9, 2021

DriesSchaumont commented Apr 23, 2021

sort=False option to stack/unstack/pivot #15105

sort=False option to stack/unstack/pivot #15105

Comments

naught101 commented Jan 11, 2017

jreback commented Jan 11, 2017

shoyer commented Jan 11, 2017

naught101 commented Jan 13, 2017

shoyer commented Jan 13, 2017

naught101 commented Jan 18, 2017

jreback commented Jan 19, 2017

michalkahle commented May 24, 2019

naught101 commented Oct 14, 2019 • edited Loading

cemanughian commented Jan 9, 2021

jreback commented Jan 9, 2021

DriesSchaumont commented Apr 23, 2021

`sort=False` option to stack/unstack/pivot #15105

`sort=False` option to stack/unstack/pivot #15105

naught101 commented Oct 14, 2019 •

edited

Loading