BUG: Calling DataFrame.stack on an out-of-order column MultiIndex leads to swapped values #18265

Closed
tudorprodan opened this issue Nov 13, 2017 · 5 comments · Fixed by #31196
Labels
good first issue Needs Tests Unit test(s) needed to prevent regressions

@tudorprodan

tudorprodan commented Nov 13, 2017

Please run the code below.
Notice how the column values are swapped to the wrong labels.
This is due to stack() failing to preserve the order in the MultiIndex.

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd

values = np.arange(5)
data = np.vstack([['b{}'.format(x) for x in values],   # b0, b1, ..
                  ['a{}'.format(x) for x in values]])  # a0, a1, ..
df = pd.DataFrame(data.T, columns=['b', 'a'])
df.columns.name = 'first'

# Call pd.concat to get the 2-level MultiIndex *unsorted* columns.
# The bug seems to happen when having one of these unsorted MultiIndexes.
second_level_dict = {'x': df}
multi_level_df = pd.concat(second_level_dict, axis=1)
multi_level_df.columns.names = ['second', 'first']

# Sort the columns, i.e. [a, b] instead of [b, a].
sorted_cols_df = multi_level_df.reindex(sorted(multi_level_df.columns), axis=1)

print('Before the restack:')
print(sorted_cols_df)

# Stack and unstack, should be the same.
# This is what causes the bug. sorted_cols_df.stack() also exposes the problem
restacked = sorted_cols_df.stack(['first', 'second']).unstack(['first', 'second'])

print()
print('Restacked:')
print(restacked)
print('(Notice the swapped column values)')

Output

$ python pandas_bug.py
Before the restack:
second   x
first    a   b
0       a0  b0
1       a1  b1
2       a2  b2
3       a3  b3
4       a4  b4

Restacked:
first    a   b
second   x   x
0       b0  a0  <-- notice the swapped values
1       b1  a1
2       b2  a2
3       b3  a3
4       b4  a4

Output of pd.show_versions()

I've reproduced this on both 0.21 and 0.20.

In [2]: pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-97-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.21.0
pytest: 3.2.1
pip: 9.0.1
setuptools: 36.6.0
Cython: 0.26.1
numpy: 1.13.3
scipy: 0.19.1
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.6.3
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.1.0
openpyxl: 2.4.8
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.0
bs4: 4.6.0
html5lib: 0.999999999
sqlalchemy: 1.1.13
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@tudorprodan tudorprodan changed the title Calling DataFrame.stack on an out-of-order column MultiIndex leads to swapped values BUG: Calling DataFrame.stack on an out-of-order column MultiIndex leads to swapped values Nov 14, 2017
@gfyoung gfyoung added Reshaping Concat, Merge/Join, Stack/Unstack, Explode Bug labels Nov 16, 2017
@gfyoung
Member

gfyoung commented Nov 16, 2017

@tudorprodan : Thanks for reporting this! Yeah, that does look quite odd. An investigation and PR to patch are welcome!

@tudorprodan
Author

@gfyoung : I did start looking into why it's happening, but am not sure what the right way to patch is because:

  • df.stack assumes that the multi-index it's stacking from is sorted, so never checks before rebuilding the values.
  • pd.concat(dict) seems to be the only way I can get a non-value-sorted MultiIndex. If I try to recreate the same index using MultiIndex.from_product for example, it sorts the index values automatically.

So is stack's assumption wrong? Or is a MultiIndex always meant to be sorted, with that extra step simply not performed in concat(dict)?
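
To make the comparison above concrete, here is a minimal sketch that builds the same column labels two ways and prints how each MultiIndex stores them internally. The small two-row frame is an assumption made for brevity and is not the data from the report. The stored `levels` and `codes` are where the two constructions may differ, and the exact output can vary between pandas versions.

```python
import pandas as pd

# Illustrative frame (not the one from the report): columns in the order b, a.
df = pd.DataFrame({'b': ['b0', 'b1'], 'a': ['a0', 'a1']})

# Columns produced by pd.concat on a dict, as in the report.
concat_cols = pd.concat({'x': df}, axis=1).columns

# Columns produced by MultiIndex.from_product with the same label order.
product_cols = pd.MultiIndex.from_product([['x'], ['b', 'a']])

# Both indexes present the same labels in the same order.
print(list(concat_cols))
print(list(product_cols))

# The internal representation (levels plus integer codes) can still differ.
for name, cols in [('concat', concat_cols), ('from_product', product_cols)]:
    levels = [list(level) for level in cols.levels]
    codes = [list(code) for code in cols.codes]
    print(name, 'levels:', levels, 'codes:', codes)
```

Comparing the printed `levels` for the two indexes shows whether the construction sorted the level values even though the visible label order is unchanged.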

@jreback
Contributor

jreback commented Nov 16, 2017

this is a dupe of #16925

@grauscher

grauscher commented Aug 4, 2018

This error still exists in version 0.22.0

@tudorprodan makes a good point.

It seems to me that one alternative is for df.stack to check whether df.columns.is_monotonic or df.columns.is_monotonic_decreasing is True.
If neither is, it would call df.sort_index(axis=1) before doing the actual stack operation.

Note: I tried using df.columns.is_lexsorted(), but it returned True even when both df.columns.is_monotonic and df.columns.is_monotonic_decreasing returned False.
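
The proposal above can be sketched as a user-side wrapper. This is only an illustration of the suggestion and not the change that was eventually made in pandas. The helper name `stack_with_sorted_columns` and the three-column frame are hypothetical, and `is_monotonic_increasing` is used in place of `is_monotonic` because it is the spelling available in recent pandas releases. The thread does not establish whether this check catches the frame from the original report.

```python
import pandas as pd


def stack_with_sorted_columns(df, level):
    """Hypothetical helper: sort the columns if they are neither increasing
    nor decreasing, then delegate to DataFrame.stack."""
    cols = df.columns
    if not (cols.is_monotonic_increasing or cols.is_monotonic_decreasing):
        df = df.sort_index(axis=1)
    return df.stack(level)


# Illustrative frame whose column order (b, a, c) is not monotonic.
columns = pd.MultiIndex.from_tuples(
    [('x', 'b'), ('x', 'a'), ('x', 'c')], names=['second', 'first']
)
frame = pd.DataFrame([['b0', 'a0', 'c0']], columns=columns)

stacked = stack_with_sorted_columns(frame, 'first')
print(stacked)
```

Sorting only when the check fails leaves already ordered frames untouched, at the cost of changing the column order for the others.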

@mroeschke
Member

This looks fixed on master. Could use a test

Before the restack:
second   x
first    a   b
0       a0  b0
1       a1  b1
2       a2  b2
3       a3  b3
4       a4  b4

Restacked:
first    a   b
second   x   x
0       a0  b0
1       a1  b1
2       a2  b2
3       a3  b3
4       a4  b4
(Notice the swapped column values)

In [4]: pd.__version__
Out[4]: '0.26.0.dev0+565.g8c5941cd5'
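
A regression test along the lines requested above might look like the following sketch. It reuses the reproduction from the report; the function names are hypothetical, and this is not the test that was added to the pandas test suite. It assumes a pandas version in which the round trip behaves as in the output above.

```python
import numpy as np
import pandas as pd


def build_sorted_cols_df():
    """Rebuild the frame from the report: concat a dict, then sort the columns."""
    values = np.arange(5)
    data = np.vstack([['b{}'.format(x) for x in values],
                      ['a{}'.format(x) for x in values]])
    df = pd.DataFrame(data.T, columns=['b', 'a'])
    multi = pd.concat({'x': df}, axis=1)
    multi = multi.reindex(sorted(multi.columns), axis=1)
    # Name the levels after sorting so the names are set regardless of
    # how reindex treats them.
    multi.columns.names = ['second', 'first']
    return multi


def test_stack_unstack_roundtrip_keeps_values_under_labels():
    before = build_sorted_cols_df()
    after = before.stack(['first', 'second']).unstack(['first', 'second'])
    # The level order differs after the round trip, so compare per label.
    for first in ['a', 'b']:
        assert after[(first, 'x')].tolist() == before[('x', first)].tolist()
```

Comparing the values label by label avoids depending on the column level order, which is ('second', 'first') before the round trip and ('first', 'second') after it.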

@mroeschke mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Oct 15, 2019
@simonjayhawkins simonjayhawkins added this to the 1.1 milestone Jan 22, 2020