sort=False option to stack/unstack/pivot #15105
Comments
Can you show an example? These are ordered by the index, not sorted. |
Indeed, as @jreback points out we don't sort when stacking or unstacking. Rather, levels are sorted internally in a MultiIndex when a MultiIndex is constructed (e.g., with This is a confusing implementation detail that leaks into the public API. The
See #14903 and #14672 for related discussion. I see a few alternatives for cleaning this up:
|
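To illustrate the point about levels being sorted internally, here is a minimal sketch (the sample labels are made up for illustration). It only shows that `MultiIndex.levels` holds sorted unique values while the index itself still reports its values in input order:

```python
import pandas as pd

# Labels deliberately given in non-alphabetical order (illustrative data).
tuples = [("zzz", "z"), ("xxx", "z"), ("ddd", "z"), ("aaa", "a")]
index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])

# The stored level is sorted, regardless of the order of appearance...
first_level = list(index.levels[0])

# ...while the values exposed by the index still follow the input order.
first_values = list(index.get_level_values("first"))

print(first_level)
print(first_values)
```

Reshaping operations build their output from the levels, which is presumably why the result comes out in sorted order even though nothing calls a sort explicitly.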
2 - that would basically imply that the stack/unstack/pivot operation would record the order of index elements of the input, and re-order the output based on that, is that correct? That would be fine, I think. |
Correct, yes. One downside of this approach is that it is slightly slower to construct the new labels. It requires a pass over the full index using |
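A rough sketch of what "record the order of the input" could look like at the label level. The comment above does not say which function would be used; `pd.factorize` is my assumption here, chosen because it numbers values in order of first appearance by default:

```python
import pandas as pd

# Illustrative labels with a repeated value.
labels = pd.Index(["zzz", "xxx", "ddd", "zzz", "aaa"])

# Default behaviour: uniques are kept in order of first appearance.
codes, uniques = pd.factorize(labels)

# sort=True sorts the uniques first, which mirrors how MultiIndex levels are stored.
sorted_codes, sorted_uniques = pd.factorize(labels, sort=True)

print(list(uniques), list(codes))
print(list(sorted_uniques), list(sorted_codes))
```

Either way this is one full pass over the index, which is the extra cost mentioned above.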
There's a good example of this here, I think: https://stackoverflow.com/questions/28686053/pandas-pivot-table-reoganize-order-of-multi-index |
So this is quite straightforward to provide categorical orderings.
|
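For readers landing here, a small sketch of the categorical approach. The data and the names `model_order` and `wide` are hypothetical, not taken from the thread. It assumes the order you care about can be declared up front as the category order:

```python
import pandas as pd

# Hypothetical long-format data: the model order encodes complexity.
model_order = ["simple", "medium", "complex"]
df = pd.DataFrame({
    "model": ["simple", "simple", "medium", "medium", "complex", "complex"],
    "metric": ["rmse", "bias"] * 3,
    "value": [1.0, 0.1, 0.8, 0.2, 0.5, 0.3],
})

# An ordered categorical carries the intended order along with the column.
df["model"] = pd.Categorical(df["model"], categories=model_order, ordered=True)

# The pivoted index follows the category order rather than alphabetical order.
wide = df.pivot(index="model", columns="metric", values="value")
print(wide)
```

Note that the plain string column (`metric` here) still comes out sorted, so each axis that needs a custom order has to be made categorical separately.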
The point is that
Now, this is the bug mentioned above. Never mind this, let's continue.
Now the data frame is sorted by the remaining levels in the index! This is undocumented behavior. The documentation says only: "The level involved will automatically get sorted." |
This is quite aggravating with a column MultiIndex, and there doesn't seem to be a workaround.
Firstly, there doesn't seem to be an easy way to convert an existing MultiIndex into a categorical one. The best I could come up with is to select a df row, call reset_index on the resulting Series, manually create a categorical MultiIndex out of all the relevant columns, and then re-assign that to the original. This works, but it would be nice if there was an easy way to do something like
Secondly, even with a categorical column MultiIndex, the categories are completely ignored. For example, I would like to stack the 'year' level of the columns and retain the order of the other levels, but the categorical-ness gets lost in transit:
In [1]: final_df.columns
Out[1]:
MultiIndex([('value', 'sum_risk_cost', 2020, 'national'),
('value', 'sum_risk_cost', 2100, 'national'),
('value', 'avg_risk_fraction', 2020, 'national'),
('value', 'avg_risk_fraction', 2100, 'national'),
('value', 'count_uninsurable', 2020, 'national'),
('value', 'count_uninsurable', 2100, 'national'),
('value', 'percent_uninsurable', 2020, 'national'),
('value', 'percent_uninsurable', 2100, 'national'),
( 'rank', 'sum_risk_cost', 2020, 'national'),
( 'rank', 'sum_risk_cost', 2020, 'nat_10k'),
( 'rank', 'sum_risk_cost', 2020, 'state'),
( 'rank', 'sum_risk_cost', 2100, 'national'),
( 'rank', 'sum_risk_cost', 2100, 'nat_10k'),
( 'rank', 'sum_risk_cost', 2100, 'state'),
( 'rank', 'avg_risk_fraction', 2020, 'national'),
( 'rank', 'avg_risk_fraction', 2020, 'nat_10k'),
( 'rank', 'avg_risk_fraction', 2020, 'state'),
( 'rank', 'avg_risk_fraction', 2100, 'national'),
( 'rank', 'avg_risk_fraction', 2100, 'nat_10k'),
( 'rank', 'avg_risk_fraction', 2100, 'state'),
( 'rank', 'count_uninsurable', 2020, 'national'),
( 'rank', 'count_uninsurable', 2020, 'nat_10k'),
( 'rank', 'count_uninsurable', 2020, 'state'),
( 'rank', 'count_uninsurable', 2100, 'national'),
( 'rank', 'count_uninsurable', 2100, 'nat_10k'),
( 'rank', 'count_uninsurable', 2100, 'state'),
( 'rank', 'percent_uninsurable', 2020, 'national'),
( 'rank', 'percent_uninsurable', 2020, 'nat_10k'),
( 'rank', 'percent_uninsurable', 2020, 'state'),
( 'rank', 'percent_uninsurable', 2100, 'national'),
( 'rank', 'percent_uninsurable', 2100, 'nat_10k'),
( 'rank', 'percent_uninsurable', 2100, 'state')],
names=['type', 'stat', 'year', 'subset'])
In [2]: final_df.columns.get_level_values('type')
Out[2]:
CategoricalIndex(['value', 'value', 'value', 'value', 'value', 'value',
'value', 'value', 'rank', 'rank', 'rank', 'rank', 'rank',
'rank', 'rank', 'rank', 'rank', 'rank', 'rank', 'rank',
'rank', 'rank', 'rank', 'rank', 'rank', 'rank', 'rank',
'rank', 'rank', 'rank', 'rank', 'rank'],
categories=['rank', 'value'], ordered=True, name='type', dtype='category')
In [3]: final_df = final_df.stack('year').sort_index()
In [4]: final_df.columns.get_level_values('type')
Out[4]:
Index(['rank', 'rank', 'rank', 'rank', 'rank', 'rank', 'rank', 'rank', 'rank',
'rank', 'rank', 'rank', 'value', 'value', 'value', 'value'],
dtype='object', name='type')
It also doesn't seem easy to manually store the column orders and re-use them afterwards, due to the missing 'year' level. Does anyone have a suggested workaround for this? |
Is anyone going to fix this? |
@cemanughian pandas is all volunteer, and there are quite a number of open issues. You are welcome to do a pull request; core devs can provide review. |
@jreback I am wondering if we need to add a new argument for this? I think we can advise using the approach below:
>>> import pandas as pd
>>> tuples = list(zip(['zzz', 'xxx', 'ddd', 'zzz', 'aaa', 'zzz', 'aaa'], ['z', 'z', 'z', 'a', 'z', 'x', 'a']))
>>> index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
>>> df = pd.DataFrame({'A': [2, 5, 3, 1, 1, 6, 7], 'B': [4, 3, 5, 10, 5, 3, 8]}, index=index)
>>> df
A B
first second
zzz z 2 4
xxx z 5 3
ddd z 3 5
zzz a 1 10
aaa z 1 5
zzz x 6 3
aaa a 7 8
>>> unstacked = df.unstack()
>>> unstacked
A B
second a x z a x z
first
aaa 7.0 NaN 1.0 8.0 NaN 5.0
ddd NaN NaN 3.0 NaN NaN 5.0
xxx NaN NaN 5.0 NaN NaN 3.0
zzz 1.0 6.0 2.0 10.0 3.0 4.0
>>> new_index = index.droplevel(-1).unique()
>>> unstacked.reindex(new_index)
A B
second a x z a x z
first
zzz 1.0 6.0 2.0 10.0 3.0 4.0
xxx NaN NaN 5.0 NaN NaN 3.0
ddd NaN NaN 3.0 NaN NaN 5.0
aaa 7.0 NaN 1.0 8.0 NaN 5.0 |
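Building on the reindex idea above, here is a sketch of a small helper that restores first-appearance order on both axes. `unstack_keep_order` is a hypothetical name, not a pandas API, and the sketch assumes the frame has flat (single-level) columns before unstacking:

```python
import pandas as pd

def unstack_keep_order(df, level=-1):
    """Unstack `level`, then restore first-appearance order on rows and columns.

    Hypothetical helper; assumes `df.columns` is a flat Index.
    """
    # Order of the remaining index labels, as they first appear in the input.
    row_order = df.index.droplevel(level).unique()
    # Order of the labels that become the new inner column level.
    col_order = df.index.get_level_values(level).unique()

    out = df.unstack(level)
    # Unstacking flat columns yields (original column, unstacked label) pairs.
    new_cols = pd.MultiIndex.from_product([df.columns, col_order])
    return out.reindex(index=row_order, columns=new_cols)

tuples = list(zip(["zzz", "xxx", "ddd", "zzz", "aaa", "zzz", "aaa"],
                  ["z", "z", "z", "a", "z", "x", "a"]))
index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])
df = pd.DataFrame({"A": [2, 5, 3, 1, 1, 6, 7],
                   "B": [4, 3, 5, 10, 5, 3, 8]}, index=index)

result = unstack_keep_order(df)
print(result)
```

The trade-off is the same as in the manual version: the order has to be captured before the reshape, because it cannot be recovered from the unstacked frame afterwards.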
It would be really nice if there was a sort=False option on stack/unstack and pivot (preferably as the default).
It is reasonably common to have data in non-standard order that actually provides information (in my case, I have model names, and the order of the names denotes complexity of the models). Stacking or unstacking currently loses all of this information, with no way to retrieve it. That does not seem like a sensible default to me.
It would be relatively easy to work around a non-sorted stack/unstack method (using .sort_index). To go the other way is less trivial, requiring the user to store a list of the values in the necessary order.
I actually find it hard to think of a situation where a sort on unstack would be more useful...