Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sort=False option to stack/unstack/pivot #15105

Closed
naught101 opened this issue Jan 11, 2017 · 11 comments · Fixed by #53298
Closed

sort=False option to stack/unstack/pivot #15105

naught101 opened this issue Jan 11, 2017 · 11 comments · Fixed by #53298
Labels
Enhancement MultiIndex Reshaping Concat, Merge/Join, Stack/Unstack, Explode

Comments

@naught101
Copy link

It would be really nice if there was a sort=False option on stack/unstack and pivot. (Preferably the default)

It is reasonably common to have data in non-standard order that actually provides information (in my case, I have model names, and the order of the names denotes complexity of the models). Stacking or unstacking currently loses all of this information, with no way to retrieve it. That does not seem like a sensible default to me.

It would be relatively easy to work around a non-sorted stack/unstack method (using .sort_index). To go the other way is less trivial, requiring the user to store a list of the values in the necessary order.

I actually find it hard to think of a situation where a sort on unstack would be more useful...

@jreback
Copy link
Contributor

jreback commented Jan 11, 2017

can you show an example. These are ordered by the index and not sorted.

@jreback jreback added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label Jan 11, 2017
@shoyer
Copy link
Member

shoyer commented Jan 11, 2017

Indeed, as @jreback points out we don't sort when stacking or unstacking. Rather, levels are sorted internally in a MultiIndex when a MultiIndex is constructed (e.g., with from_arrays or from_product).

This is a confusing implementation detail that leaks into the public API. The levels of a MultiIndex only look sorted if you look at the low-level MultiIndex repr, not the values in a series/dataframe:

In [4]: index = pd.MultiIndex.from_arrays([['b', 'a'], [1, 0]])

In [5]: index
Out[5]:
MultiIndex(levels=[['a', 'b'], [0, 1]],
           labels=[[1, 0], [1, 0]])

In [6]: index.to_series()
Out[6]:
b  1    (b, 1)
a  0    (a, 0)
dtype: object

See #14903 and #14672 for related discussion.

I see a few alternatives for cleaning this up:

  1. Add a sort_levels=False argument to from_arrays and from_product, to allow not sorting levels at MultiIndex construction time. The downside is that this means that some MultiIndex indexing operations will be slow, which defeats part of the purpose of a MultiIndex, but it could still be nice to have the option.
  2. Add a sort=False or reorder=False option (maybe make this the default?) to stack/unstack/pivot, which would ensure that unstacked columns appear in order of appearance in the MultiIndex rather than sorted order (as you suggest here).
  3. Alternatively, we could do nothing and encourage using Categorical dtype to preserve level order, e.g., use pd.Categorical(['b', 'a'], categories=['b', 'a']) rather than just ['b', 'a']. But this is pretty cumbersome and doesn't make the default behavior any more intuitive.

@naught101
Copy link
Author

2 - that would basically imply that the stack/unstack/pivot operation would record the order of index elements of the input, and re-order the output based on that, is that correct? That would be fine, I think.

@shoyer
Copy link
Member

shoyer commented Jan 13, 2017

2 - that would basically imply that the stack/unstack/pivot operation would record the order of index elements of the input, and re-order the output based on that, is that correct? That would be fine, I think.

Correct, yes. One downside of this approach is that it is slightly slower to construct the new labels. It requires a pass over the full index using pd.unique, rather than just using the unique set of labels.

@naught101
Copy link
Author

@jreback
Copy link
Contributor

jreback commented Jan 19, 2017

So this is quite straightforward to provide categorical orderings.
This currently has a small bug, see #15058, but does work (even though indexing generally requires lexsorting, this is compatible). So This is a reasonable soln.

In [31]: index = pd.MultiIndex.from_tuples(list(zip(['a', 'a', 'a', 'b', 'b', 'b'],
    ...:                                       [0, 0, 0, 1, 1, 1],
    ...:                                       ['x', 'xx', 'xxx', 'x', 'xx', 'xxx'])),
    ...:                                       names=['A', 'B', 'C'])
    ...: df = pd.DataFrame(np.random.rand(6, 3), index = index)
    ...: 
    ...: 

In [32]: df
Out[32]: 
                0         1         2
A B C                                
a 0 x    0.179126  0.320740  0.048524
    xx   0.448321  0.755726  0.685202
    xxx  0.058184  0.984778  0.432928
b 1 x    0.279140  0.876715  0.249856
    xx   0.185775  0.403420  0.729933
    xxx  0.133858  0.652050  0.960392

In [33]: df = df.reset_index()

In [34]: df = df.assign(C=df.C.astype('category', categories=['xxx', 'xx', 'x']))

In [35]: df
Out[35]: 
   A  B    C         0         1         2
0  a  0    x  0.179126  0.320740  0.048524
1  a  0   xx  0.448321  0.755726  0.685202
2  a  0  xxx  0.058184  0.984778  0.432928
3  b  1    x  0.279140  0.876715  0.249856
4  b  1   xx  0.185775  0.403420  0.729933
5  b  1  xxx  0.133858  0.652050  0.960392

In [36]: df.dtypes
Out[36]: 
A      object
B       int64
C    category
0     float64
1     float64
2     float64
dtype: object

In [37]: df.sort_values(['A', 'B', 'C'])
Out[37]: 
   A  B    C         0         1         2
2  a  0  xxx  0.058184  0.984778  0.432928
1  a  0   xx  0.448321  0.755726  0.685202
0  a  0    x  0.179126  0.320740  0.048524
5  b  1  xxx  0.133858  0.652050  0.960392
4  b  1   xx  0.185775  0.403420  0.729933
3  b  1    x  0.279140  0.876715  0.249856

@michalkahle
Copy link

The point is that stack/unstack really sorts the the index. Let's continue with the session by @jreback .

In [7]: df = df.sort_values(['A', 'B', 'C'], ascending=False) 

In [8]: df    
Out[8]: 
   A  B    C         0         1         2
3  b  1    x  0.674635  0.607875  0.356187
4  b  1   xx  0.062986  0.675858  0.895372
5  b  1  xxx  0.974950  0.448647  0.178389
0  a  0    x  0.683399  0.652522  0.549847
1  a  0   xx  0.168813  0.891502  0.385490
2  a  0  xxx  0.881862  0.143344  0.867280

In [9]: df = df.set_index(['A', 'B', 'C'])  

In [10]: df  
Out[10]: 
                0         1         2
A B C                                
b 1 x    0.674635  0.607875  0.356187
    xx   0.062986  0.675858  0.895372
    xxx  0.974950  0.448647  0.178389
a 0 x    0.683399  0.652522  0.549847
    xx   0.168813  0.891502  0.385490
    xxx  0.881862  0.143344  0.867280

In [11]: df.index     
Out[11]: 
MultiIndex(levels=[['a', 'b'], [0, 1], ['xxx', 'xx', 'x']],
           codes=[[1, 1, 1, 0, 0, 0], [1, 1, 1, 0, 0, 0], [2, 1, 0, 2, 1, 0]],
           names=['A', 'B', 'C'])

Now, this is the bug mentioned above. Never mind this, let's continue.

In [12]: df = df.unstack()  

In [13]: df    
Out[13]: 
            0                             1                             2                    
C         xxx        xx         x       xxx        xx         x       xxx        xx         x
A B                                                                                          
a 0  0.881862  0.168813  0.683399  0.143344  0.891502  0.652522  0.867280  0.385490  0.549847
b 1  0.974950  0.062986  0.674635  0.448647  0.675858  0.607875  0.178389  0.895372  0.356187

In [14]: df.index  
Out[14]: 
MultiIndex(levels=[['a', 'b'], [0, 1]],
           codes=[[0, 1], [0, 1]],
           names=['A', 'B'])

Now the data frame is sorted by the remaining levels in the index!

This is undocumented behavior. The documentation says only: "The level involved will automatically get sorted."

Also requested in multiple SO questions.

@naught101
Copy link
Author

naught101 commented Oct 14, 2019

This is quite aggravating with column multiindex. It doesn't seem like there is a work around.

Firstly,there doesn't seem to be an easy way to convert an existing multindex into a categorical one. The best I could come up with is to select a df row, reset_index on the resultant series, and manually create a categorical multi-index index out of all the relevant columns, and then re-assign that to the original. This works, but it would be nice if there was an easy way to do something like df.columns['blah'] = pd.CategoricalIndex(df.columns['blah'], ordered=True) or something.

Secondly, even with a categorical column multi index, the categories are completely ignored. For example, I would like to stack the 'year' level of the columns, and retain the order of the other levels, but the Categoricalness gets lost in transit:

In [1]: final_df.columns                                                                                  
Out[1]: 
MultiIndex([('value',       'sum_risk_cost', 2020, 'national'),
            ('value',       'sum_risk_cost', 2100, 'national'),
            ('value',   'avg_risk_fraction', 2020, 'national'),
            ('value',   'avg_risk_fraction', 2100, 'national'),
            ('value',   'count_uninsurable', 2020, 'national'),
            ('value',   'count_uninsurable', 2100, 'national'),
            ('value', 'percent_uninsurable', 2020, 'national'),
            ('value', 'percent_uninsurable', 2100, 'national'),
            ( 'rank',       'sum_risk_cost', 2020, 'national'),
            ( 'rank',       'sum_risk_cost', 2020,  'nat_10k'),
            ( 'rank',       'sum_risk_cost', 2020,    'state'),
            ( 'rank',       'sum_risk_cost', 2100, 'national'),
            ( 'rank',       'sum_risk_cost', 2100,  'nat_10k'),
            ( 'rank',       'sum_risk_cost', 2100,    'state'),
            ( 'rank',   'avg_risk_fraction', 2020, 'national'),
            ( 'rank',   'avg_risk_fraction', 2020,  'nat_10k'),
            ( 'rank',   'avg_risk_fraction', 2020,    'state'),
            ( 'rank',   'avg_risk_fraction', 2100, 'national'),
            ( 'rank',   'avg_risk_fraction', 2100,  'nat_10k'),
            ( 'rank',   'avg_risk_fraction', 2100,    'state'),
            ( 'rank',   'count_uninsurable', 2020, 'national'),
            ( 'rank',   'count_uninsurable', 2020,  'nat_10k'),
            ( 'rank',   'count_uninsurable', 2020,    'state'),
            ( 'rank',   'count_uninsurable', 2100, 'national'),
            ( 'rank',   'count_uninsurable', 2100,  'nat_10k'),
            ( 'rank',   'count_uninsurable', 2100,    'state'),
            ( 'rank', 'percent_uninsurable', 2020, 'national'),
            ( 'rank', 'percent_uninsurable', 2020,  'nat_10k'),
            ( 'rank', 'percent_uninsurable', 2020,    'state'),
            ( 'rank', 'percent_uninsurable', 2100, 'national'),
            ( 'rank', 'percent_uninsurable', 2100,  'nat_10k'),
            ( 'rank', 'percent_uninsurable', 2100,    'state')],
           names=['type', 'stat', 'year', 'subset'])

In [2]: final_df.columns.get_level_values('type')                                                         
Out[2]: 
CategoricalIndex(['value', 'value', 'value', 'value', 'value', 'value',
                  'value', 'value', 'rank', 'rank', 'rank', 'rank', 'rank',
                  'rank', 'rank', 'rank', 'rank', 'rank', 'rank', 'rank',
                  'rank', 'rank', 'rank', 'rank', 'rank', 'rank', 'rank',
                  'rank', 'rank', 'rank', 'rank', 'rank'],
                 categories=['rank', 'value'], ordered=True, name='type', dtype='category')

In [3]: final_df = final_df.stack('year').sort_index()                                                    

In [4]: final_df.columns.get_level_values('type')                                                         
Out[4]: 
Index(['rank', 'rank', 'rank', 'rank', 'rank', 'rank', 'rank', 'rank', 'rank',
       'rank', 'rank', 'rank', 'value', 'value', 'value', 'value'],
      dtype='object', name='type')

It also doesn't seem easy to manually store the column orders and re-use them afterwards, die to the missing 'year' level.

Anyone have a suggested work-around for this?

@cemanughian
Copy link

Is anyone going to fix this?

@jreback
Copy link
Contributor

jreback commented Jan 9, 2021

@cemanughian pandas is all volunteer and there are quite a number of open issues

you are welcome to do a pill request - core devs can provide review

@DriesSchaumont
Copy link
Member

@jreback I am wondering if we need to add a new argument for this? I think we can advice using droplevel and reindex?
If we do want to add this, I can have a look.

import pandas as pd

>>> tuples = list(zip(['zzz', 'xxx', 'ddd', 'zzz', 'aaa', 'zzz', 'aaa'], ['z', 'z', 'z', 'a', 'z', 'x', 'a']))
>>> index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
>>> df = pd.DataFrame({'A': [2, 5, 3, 1, 1, 6, 7], 'B': [4, 3, 5, 10, 5, 3, 8]}, index=index)
>>> df
              A   B
first second
zzz   z       2   4
xxx   z       5   3
ddd   z       3   5
zzz   a       1  10
aaa   z       1   5
zzz   x       6   3
aaa   a       7   8
>>> unstacked = df.unstack()
>>> unstacked
          A               B
second    a    x    z     a    x    z
first
aaa     7.0  NaN  1.0   8.0  NaN  5.0
ddd     NaN  NaN  3.0   NaN  NaN  5.0
xxx     NaN  NaN  5.0   NaN  NaN  3.0
zzz     1.0  6.0  2.0  10.0  3.0  4.0
>>> new_index = index.droplevel(-1).unique()
>>> unstacked.reindex(new_index)
          A               B
second    a    x    z     a    x    z
first
zzz     1.0  6.0  2.0  10.0  3.0  4.0
xxx     NaN  NaN  5.0   NaN  NaN  3.0
ddd     NaN  NaN  3.0   NaN  NaN  5.0
aaa     7.0  NaN  1.0   8.0  NaN  5.0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement MultiIndex Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants