BUG: unstack() does not always sort index in 0.23 #21675

jzwinck · 2018-06-29T07:18:15Z

Code Sample

import pandas as pd
index = pd.MultiIndex(levels=[['A','B','C','D','E']] * 2, labels=[[4,4,4,3], [4,2,0,1]])
pd.Series(0, index).unstack()

Problem description

In Pandas 0.20, 0.21, and 0.22, this gave the expected result:

     A    B    C    E
D  NaN  0.0  NaN  NaN
E  0.0  NaN  0.0  0.0

But in Pandas 0.23, the result is not sorted:

     E    C    A    B
E  0.0  0.0  0.0  NaN
D  NaN  NaN  NaN  0.0

The documentation says "The level involved will automatically get sorted", and while I've seen the explanation of confusing implementation details leaking out in #15105 and some other outright bugs in #9514, this seems to be a different bug, and a regression.

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.5.4.final.0
python-bits: 64
OS: Linux
machine: x86_64
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.21.1
numpy: 1.13.1

The text was updated successfully, but these errors were encountered:

WillAyd · 2018-06-29T22:43:27Z

As mentioned in the first linked comment there's not any explicit ordering going on with an unstack operation and I think it's preferable NOT to do that (i.e. previous behavior was wrong). You as the end user can easily sort after the fact if you wanted to, but wouldn't as easily be able to maintain the original ordering if this operation implicitly sorted for you.

I certainly see where the documentation is misleading on that though - any interest in submitting a PR to update the docs and add a test case to ensure this behavior doesn't regress going forward?

jzwinck · 2018-06-30T00:22:56Z

@WillAyd, #15105 asks for a new sort option to be added to unstack. The documented, specified, and actual behaviors have for many years been that it sorts the index. Suddenly in Pandas 0.23, the behavior changed, without a FutureWarning, without a documentation change, and without a mention in #15105, which suggests the change was an accident. Significant functional changes should be made deliberately, with discussion, not slipped in without notice and then documented a few releases later.

I suggest that the longstanding and clearly documented behavior (sorted unstack) should be restored, and then #15105 can continue to explore new ideas such as adding an option to not sort. If the default behavior is to be changed, a FutureWarning could be used to help users transition.

You have marked this as a Docs issue. But it is a regression in Pandas 0.23, and a functional bug in the code, not a cosmetic one in the docs.

deisdenis · 2018-08-02T22:08:57Z

Hi, @jzwinck @WillAyd
I tried to dig into this issue and realized the following:

actual behaviour of unstack() was not to sort, it was to keep order consistent with declaration of levels, e.g. pandas 0.22.0

import pandas as pd
index = pd.MultiIndex(levels=[['B','A','C','D','E']] * 2, labels=[[4,4,4,3], [4,2,0,1]])
pd.Series(0, index).unstack()

will result in

     B    A    C    E
D  NaN  0.0  NaN  NaN
E  0.0  NaN  0.0  0.0

in the previous version empty columns were removed in the _Unstacker.get_result()

        # filter out missing levels
        if values.shape[1] > 0:
            col_inds, obs_ids = compress_group_index(self.sorted_labels[-1])
            # rare case, level values not observed
            if len(obs_ids) < self.full_shape[1]:
                inds = (value_mask.sum(0) > 0).nonzero()[0]
                values = algos.take_nd(values, inds, axis=1)
                columns = columns[inds]

columns order has been kept consistent, but in pandas 0.23.0 empty columns removing is a part of Unstacker.init()

self.index = index.remove_unused_levels()

and here is the issue: this method will place rows and columns in the order of provided labels, e.g.

import pandas as pd
index = pd.MultiIndex(levels=[['A','B','C','D','E']] * 2, labels=[[4,4,3,4], [0,1,2,4]])
s = pd.Series(0, index).unstack()

will result in

     A    B    C    E
E  0.0  0.0  NaN  0.0
D  NaN  NaN  0.0  NaN

rows are not consistent any more because labels[0] starts from 4, but columns looks sorted because labels[1] is sorted.

I think that this behaviour is really a regression issue and should be fixed as a bug. Initially this code was introduced in #18460 as a part of bug fix solution. I would like to work on this issue, and I'll appreciate any recommendations. As of now I see the following options:

To rollback from using remove_unused_levels(), looks less efficient but more safe, will require to find another way to fix solved bugs.
To update remove_unused_levels(), could lead to behavior changes in other cases, however reshape.py is the only file in pandas lib using this method.

jzwinck · 2019-05-30T07:55:10Z

@jreback Would you mind weighing in on @deisdenis's question just above? Should we rollback from using remove_unused_levels() to fix this regression, or change its implementation?

jreback · 2019-05-30T09:37:07Z

a lot has changed since 0.23
try this on master and see

jiangyue12392 · 2019-06-26T07:13:06Z

@jreback The behaviour is still the same as 0.23 on master

jiangyue12392 · 2019-06-27T06:34:14Z

Hi @deisdenis, I have also looked into this issue. It seems that the function descriptor of remove_unused_levels specifies that the multiIndex order needs to be preserved so no sorting should be done in this function. Maybe an alternative is to add a sort argument for unstack. Are you still interested in this issue? What do you think?

deisdenis · 2019-06-28T19:13:43Z

Hi @jiangyue12392, sounds cool, I like your idea! I'll dig into it.

jzwinck · 2020-04-28T05:53:02Z

@deisdenis Did you make any progress?

deisdenis · 2020-05-09T05:36:40Z

Hey @jzwinck , I'm sorry, but I don't think we need to complicate unstack with an additional parameter after all. I think it's better to keep the initial order and use df.sort_index() explicitly afterward. This approach is also covered by the existing documentation.

WillAyd added Testing pandas testing functions or related to the test suite Docs good first issue labels Jun 29, 2018

jzwinck changed the title ~~unstack() does not always sort index in 0.23~~ BUG: unstack() does not always sort index in 0.23 Jul 2, 2018

pdnm mentioned this issue Jun 22, 2020

Fix DataFrame/Series stack/unstack docs #34927

Merged

5 tasks

WillAyd closed this as completed in #34927 Jun 23, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: unstack() does not always sort index in 0.23 #21675

BUG: unstack() does not always sort index in 0.23 #21675

jzwinck commented Jun 29, 2018 •

edited

Loading

INSTALLED VERSIONS

WillAyd commented Jun 29, 2018

jzwinck commented Jun 30, 2018 •

edited

Loading

deisdenis commented Aug 2, 2018

jzwinck commented May 30, 2019

jreback commented May 30, 2019

jiangyue12392 commented Jun 26, 2019

jiangyue12392 commented Jun 27, 2019

deisdenis commented Jun 28, 2019

jzwinck commented Apr 28, 2020

deisdenis commented May 9, 2020

BUG: unstack() does not always sort index in 0.23 #21675

BUG: unstack() does not always sort index in 0.23 #21675

Comments

jzwinck commented Jun 29, 2018 • edited Loading

Code Sample

Problem description

Output of pd.show_versions()

INSTALLED VERSIONS

WillAyd commented Jun 29, 2018

jzwinck commented Jun 30, 2018 • edited Loading

deisdenis commented Aug 2, 2018

jzwinck commented May 30, 2019

jreback commented May 30, 2019

jiangyue12392 commented Jun 26, 2019

jiangyue12392 commented Jun 27, 2019

deisdenis commented Jun 28, 2019

jzwinck commented Apr 28, 2020

deisdenis commented May 9, 2020

jzwinck commented Jun 29, 2018 •

edited

Loading

Output of `pd.show_versions()`

jzwinck commented Jun 30, 2018 •

edited

Loading