Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: unstack() does not always sort index in 0.23 #21675

Closed
jzwinck opened this issue Jun 29, 2018 · 10 comments · Fixed by #34927
Closed

BUG: unstack() does not always sort index in 0.23 #21675

jzwinck opened this issue Jun 29, 2018 · 10 comments · Fixed by #34927
Labels
Docs good first issue Testing pandas testing functions or related to the test suite

Comments

@jzwinck
Copy link
Contributor

jzwinck commented Jun 29, 2018

Code Sample

import pandas as pd
index = pd.MultiIndex(levels=[['A','B','C','D','E']] * 2, labels=[[4,4,4,3], [4,2,0,1]])
pd.Series(0, index).unstack()

Problem description

In Pandas 0.20, 0.21, and 0.22, this gave the expected result:

     A    B    C    E
D  NaN  0.0  NaN  NaN
E  0.0  NaN  0.0  0.0

But in Pandas 0.23, the result is not sorted:

     E    C    A    B
E  0.0  0.0  0.0  NaN
D  NaN  NaN  NaN  0.0

The documentation says "The level involved will automatically get sorted", and while I've seen the explanation of confusing implementation details leaking out in #15105 and some other outright bugs in #9514, this seems to be a different bug, and a regression.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.4.final.0
python-bits: 64
OS: Linux
machine: x86_64
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.21.1
numpy: 1.13.1

@WillAyd
Copy link
Member

WillAyd commented Jun 29, 2018

As mentioned in the first linked comment there's not any explicit ordering going on with an unstack operation and I think it's preferable NOT to do that (i.e. previous behavior was wrong). You as the end user can easily sort after the fact if you wanted to, but wouldn't as easily be able to maintain the original ordering if this operation implicitly sorted for you.

I certainly see where the documentation is misleading on that though - any interest in submitting a PR to update the docs and add a test case to ensure this behavior doesn't regress going forward?

@WillAyd WillAyd added Testing pandas testing functions or related to the test suite Docs good first issue labels Jun 29, 2018
@jzwinck
Copy link
Contributor Author

jzwinck commented Jun 30, 2018

@WillAyd, #15105 asks for a new sort option to be added to unstack. The documented, specified, and actual behaviors have for many years been that it sorts the index. Suddenly in Pandas 0.23, the behavior changed, without a FutureWarning, without a documentation change, and without a mention in #15105, which suggests the change was an accident. Significant functional changes should be made deliberately, with discussion, not slipped in without notice and then documented a few releases later.

I suggest that the longstanding and clearly documented behavior (sorted unstack) should be restored, and then #15105 can continue to explore new ideas such as adding an option to not sort. If the default behavior is to be changed, a FutureWarning could be used to help users transition.

You have marked this as a Docs issue. But it is a regression in Pandas 0.23, and a functional bug in the code, not a cosmetic one in the docs.

@jzwinck jzwinck changed the title unstack() does not always sort index in 0.23 BUG: unstack() does not always sort index in 0.23 Jul 2, 2018
@deisdenis
Copy link

Hi, @jzwinck @WillAyd
I tried to dig into this issue and realized the following:

  1. actual behaviour of unstack() was not to sort, it was to keep order consistent with declaration of levels, e.g. pandas 0.22.0
import pandas as pd
index = pd.MultiIndex(levels=[['B','A','C','D','E']] * 2, labels=[[4,4,4,3], [4,2,0,1]])
pd.Series(0, index).unstack()

will result in

     B    A    C    E
D  NaN  0.0  NaN  NaN
E  0.0  NaN  0.0  0.0
  1. in the previous version empty columns were removed in the _Unstacker.get_result()
        # filter out missing levels
        if values.shape[1] > 0:
            col_inds, obs_ids = compress_group_index(self.sorted_labels[-1])
            # rare case, level values not observed
            if len(obs_ids) < self.full_shape[1]:
                inds = (value_mask.sum(0) > 0).nonzero()[0]
                values = algos.take_nd(values, inds, axis=1)
                columns = columns[inds]

columns order has been kept consistent, but in pandas 0.23.0 empty columns removing is a part of Unstacker.init()

self.index = index.remove_unused_levels()

and here is the issue: this method will place rows and columns in the order of provided labels, e.g.

import pandas as pd
index = pd.MultiIndex(levels=[['A','B','C','D','E']] * 2, labels=[[4,4,3,4], [0,1,2,4]])
s = pd.Series(0, index).unstack()

will result in

     A    B    C    E
E  0.0  0.0  NaN  0.0
D  NaN  NaN  0.0  NaN

rows are not consistent any more because labels[0] starts from 4, but columns looks sorted because labels[1] is sorted.

I think that this behaviour is really a regression issue and should be fixed as a bug. Initially this code was introduced in #18460 as a part of bug fix solution. I would like to work on this issue, and I'll appreciate any recommendations. As of now I see the following options:

  1. To rollback from using remove_unused_levels(), looks less efficient but more safe, will require to find another way to fix solved bugs.
  2. To update remove_unused_levels(), could lead to behavior changes in other cases, however reshape.py is the only file in pandas lib using this method.

@jzwinck
Copy link
Contributor Author

jzwinck commented May 30, 2019

@jreback Would you mind weighing in on @deisdenis's question just above? Should we rollback from using remove_unused_levels() to fix this regression, or change its implementation?

@jreback
Copy link
Contributor

jreback commented May 30, 2019

a lot has changed since 0.23
try this on master and see

@jiangyue12392
Copy link
Contributor

@jreback The behaviour is still the same as 0.23 on master

@jiangyue12392
Copy link
Contributor

Hi @deisdenis, I have also looked into this issue. It seems that the function descriptor of remove_unused_levels specifies that the multiIndex order needs to be preserved so no sorting should be done in this function. Maybe an alternative is to add a sort argument for unstack. Are you still interested in this issue? What do you think?

@deisdenis
Copy link

Hi @jiangyue12392, sounds cool, I like your idea! I'll dig into it.

@jzwinck
Copy link
Contributor Author

jzwinck commented Apr 28, 2020

@deisdenis Did you make any progress?

@deisdenis
Copy link

Hey @jzwinck , I'm sorry, but I don't think we need to complicate unstack with an additional parameter after all. I think it's better to keep the initial order and use df.sort_index() explicitly afterward. This approach is also covered by the existing documentation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Docs good first issue Testing pandas testing functions or related to the test suite
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants