BUG: Calling DataFrame.stack on an out-of-order column MultiIndex leads to swapped values #18265

tudorprodan · 2017-11-13T18:38:40Z

Please run the code below.
Notice how the column values are swapped to the wrong labels.
This is due to stack() failing to preserve the order in the MultiIndex.

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd

values = np.arange(5)
data = np.vstack([['b{}'.format(x) for x in values],   # b0, b1, ..
                  ['a{}'.format(x) for x in values]])  # a0, a1, ..
df = pd.DataFrame(data.T, columns=['b', 'a'])
df.columns.name = 'first'

# Call pd.concat to get the 2-level MultiIndex *unsorted* columns.
# The bug seems to happen when having one of these unsorted MultiIndexes.
second_level_dict = {'x': df}
multi_level_df = pd.concat(second_level_dict, axis=1)
multi_level_df.columns.names = ['second', 'first']

# Sort the columns, i.e. [a, b] instead of [b, a].
sorted_cols_df = multi_level_df.reindex(sorted(multi_level_df.columns), axis=1)

print('Before the restack:')
print(sorted_cols_df)

# Stack and unstack, should be the same.
# This is what causes the bug. sorted_cols_df.stack() also exposes the problem
restacked = sorted_cols_df.stack(['first', 'second']).unstack(['first', 'second'])

print()
print('Restacked:')
print(restacked)
print('(Notice the swapped column values)')

Output

$ python pandas_bug.py
Before the restack:
second   x
first    a   b
0       a0  b0
1       a1  b1
2       a2  b2
3       a3  b3
4       a4  b4

Restacked:
first    a   b
second   x   x
0       b0  a0  <-- notice the swapped values
1       b1  a1
2       b2  a2
3       b3  a3
4       b4  a4

Output of `pd.show_versions()`

I've reproduced this on both 0.21 and 0.20.

In [2]: pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-97-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.21.0
pytest: 3.2.1
pip: 9.0.1
setuptools: 36.6.0
Cython: 0.26.1
numpy: 1.13.3
scipy: 0.19.1
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.6.3
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.1.0
openpyxl: 2.4.8
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.0
bs4: 4.6.0
html5lib: 0.999999999
sqlalchemy: 1.1.13
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None


gfyoung · 2017-11-16T06:25:15Z

@tudorprodan : Thanks for reporting this! Yeah, that does look quite odd. An investigation and PR to patch are welcome!

tudorprodan · 2017-11-16T11:04:40Z

@gfyoung : I did start looking into why it's happening, but am not sure what the right way to patch is because:

df.stack assumes that the MultiIndex it's stacking from is sorted, so it never checks before rebuilding the values.
pd.concat(dict) seems to be the only way I can get a non-value-sorted MultiIndex. If I try to recreate the same index using MultiIndex.from_product, for example, it sorts the index values automatically.

So is stack's assumption wrong? Or is a MultiIndex always meant to be sorted, with that extra step simply not being performed in concat(dict)?
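
One way to look at the difference described above is to build the same column labels both ways and compare how each MultiIndex stores its levels and codes. This is only a small inspection sketch; the two-row frame and the variable names are made up for illustration, and what concat stores may differ between pandas versions.

```python
import pandas as pd

# Same labels built two ways; 'b' deliberately comes before 'a'.
df = pd.DataFrame({'b': ['b0', 'b1'], 'a': ['a0', 'a1']})

concat_cols = pd.concat({'x': df}, axis=1).columns
product_cols = pd.MultiIndex.from_product([['x'], ['b', 'a']])

# Both indexes hold the same tuples in the same order ...
print(list(concat_cols))
print(list(product_cols))

# ... but the internal storage can differ. from_product factorizes each
# input, so its level values are sorted and the codes carry the order.
print(product_cols.levels[1], product_cols.codes[1])

# Compare with what concat stored on the installed pandas version.
print(concat_cols.levels[1], concat_cols.codes[1])
```

If the last two print calls show different level orders, that is the "non-value-sorted" state the report refers to.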

jreback · 2017-11-16T11:41:58Z

this is a dupe of #16925

grauscher · 2018-08-04T20:34:47Z

This error still exists in version 0.22.0

@tudorprodan makes a good point.

It seems to me that an alternative would be for df.stack to check whether df.columns.is_monotonic or df.columns.is_monotonic_decreasing is True.
If neither is, it would call df.sort_index(axis=1) before doing the actual stack operation.

Note: I tried using df.columns.is_lexsorted(), but even when df.columns.is_monotonic and df.columns.is_monotonic_decreasing returned False, is_lexsorted() returned True.
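
As a user-side workaround, the check suggested above could be sketched roughly like this. The helper name sort_columns_if_needed is hypothetical and is not part of pandas. It uses is_monotonic_increasing, the spelling available in current pandas in place of is_monotonic, and for simplicity it sorts whenever the columns are not increasing, so a decreasing column order gets sorted as well.

```python
import pandas as pd


def sort_columns_if_needed(df):
    """Hypothetical helper: return df with its columns in sorted order.

    Sorting is skipped when the columns are already monotonic increasing.
    """
    if df.columns.is_monotonic_increasing:
        return df
    return df.sort_index(axis=1)


# Build a small frame whose column MultiIndex starts out as [b, a].
df = pd.DataFrame({'b': ['b0', 'b1', 'b2'], 'a': ['a0', 'a1', 'a2']})
df = df.rename_axis(columns='first')
wide = pd.concat({'x': df}, axis=1)
wide.columns = wide.columns.set_names(['second', 'first'])

ready = sort_columns_if_needed(wide)
stacked = ready.stack(['first', 'second'])
print(stacked)
```

Because sort_index returns a new frame, the caller's original column order is left untouched.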

mroeschke · 2019-10-15T03:31:30Z

This looks fixed on master. Could use a test

Before the restack:
second   x
first    a   b
0       a0  b0
1       a1  b1
2       a2  b2
3       a3  b3
4       a4  b4

Restacked:
first    a   b
second   x   x
0       a0  b0
1       a1  b1
2       a2  b2
3       a3  b3
4       a4  b4
(Notice the swapped column values)

In [4]: pd.__version__
Out[4]: '0.26.0.dev0+565.g8c5941cd5'
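
A regression test for this could look roughly like the sketch below. The test name is hypothetical and this is not the test that was eventually merged; it only replays the round trip from the original report on a smaller frame and compares values by column label.

```python
import pandas as pd


def test_stack_unstack_roundtrip_reordered_columns():
    # Hypothetical test name; a sketch based on the report above.
    df = pd.DataFrame({'b': ['b0', 'b1', 'b2'], 'a': ['a0', 'a1', 'a2']})
    df = df.rename_axis(columns='first')
    wide = pd.concat({'x': df}, axis=1)
    wide.columns = wide.columns.set_names(['second', 'first'])

    # Reorder the columns to [a, b], as in the report.
    sorted_cols = wide.reindex(sorted(wide.columns), axis=1)

    restacked = (
        sorted_cols.stack(['first', 'second']).unstack(['first', 'second'])
    )

    # Compare by label so the check does not depend on column order.
    for first in ['a', 'b']:
        expected = sorted_cols[('x', first)].tolist()
        result = restacked[(first, 'x')].tolist()
        assert result == expected, (first, result, expected)


test_stack_unstack_roundtrip_reordered_columns()
```

On a pandas version that still has the bug, the assertion would be expected to fail with the swapped values shown in the message.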

tudorprodan changed the title ~~Calling DataFrame.stack on an out-of-order column MultiIndex leads to swapped values~~ BUG: Calling DataFrame.stack on an out-of-order column MultiIndex leads to swapped values Nov 14, 2017

gfyoung added the Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) and Bug labels Nov 16, 2017

mroeschke added the good first issue and Needs Tests (Unit test(s) needed to prevent regressions) labels and removed the Bug and Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) labels Oct 15, 2019

mroeschke mentioned this issue Jan 22, 2020: TST: More regression tests #31196 (merged)

simonjayhawkins added this to the 1.1 milestone Jan 22, 2020

WillAyd closed this as completed in #31196 Jan 24, 2020

pmberkeley mentioned this issue Jul 19, 2020: ENH: change sort behavior in stack() so it's user-directed #35343 (closed)
