When running set_index on a categorical to a MultiIndex, it gets coerced to a string. #15058

thequackdaddy · 2017-01-04T17:39:51Z

Hello!

I apologize if this is expected behavior. This is quite similar to this StackOverflow question.

Code Sample, a copy-pastable example if possible

import pandas as pd

x = pd.Categorical(['apples', 'dairy', 'chicken', 'beef', 'apples', 'dairy', 'chicken'], categories=['apples', 'dairy', 'beef', 'chicken'])
y = pd.Series([1, 2, 1, 2, 1, 2, 1])
z = pd.Series([3, 4, 2, 1, 3, 2, 1])

df = pd.DataFrame({'z': z, 'x': x, 'y':y})
df.set_index(['x', 'y']).sort_index()
df.sort_values('x')

Problem description

I would like to sort and group-by a column in a custom way. In the example above, I've ordered a categorical (it could be a string) in a way that makes intuitive sense. In this example, I want fruits first, followed by dairy, followed by meats.

Expected Output

When the categorical is in a MultiIndex, set_index seems to coerce the categorical to a string before adding it to the index. It would be nicer if pandas kept the categorical ordering for the index.

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 27.2.0
Cython: 0.24.1
numpy: 1.11.1
scipy: 0.18.1
statsmodels: 0.8.0.dev0+7e6b94b
xarray: None
IPython: 5.1.0
sphinx: 1.4.6
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: 1.1.0
tables: 3.2.2
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.3
lxml: 3.6.4
bs4: 4.5.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: 0.7.9.None
psycopg2: None
jinja2: 2.8
boto: 2.42.0
pandas_datareader: None


jreback · 2017-01-04T20:25:04Z

so you have a somewhat older version of pandas. so not really sure what you are expecting.

thequackdaddy · 2017-01-04T20:47:50Z

Sorry for confusion! This happens on newer pandas too... made a quick env to demonstrate... This might be a "we don't do that intentionally" question.

I'm trying to sort a table by a categorical so that the rows appear in a more intuitive order. The data is stored as a few string codes, and by setting up a categorical I can sort the table by those codes in that order rather than alphabetically.

So...

In [1]: import pandas as pd
   ...:
   ...: x = pd.Categorical(['apples', 'dairy', 'chicken', 'beef', 'apples', 'dairy', 'chicken'], categories=['apples', 'dairy', 'beef', 'chicken'])
   ...: y = pd.Series([1, 2, 1, 2, 1, 2, 1])
   ...: z = pd.Series([3, 4, 2, 1, 3, 2, 1])
   ...:

In [2]: df = pd.DataFrame({'x': x, 'y':y, 'z': z})

In [3]: df.sort_values('x')
Out[3]:
         x  y  z
0   apples  1  3
4   apples  1  3
1    dairy  2  4
5    dairy  2  2
3     beef  2  1
2  chicken  1  2
6  chicken  1  1

Notice here that when I sorted the table, the categoricals sorted as I wanted them to: apples, then dairy, then beef, then chicken.

In [4]: df =  df.set_index(['x', 'y'])

In [5]: df.sort_index()
Out[5]:
           z
x       y
apples  1  3
        1  3
beef    2  1
chicken 1  2
        1  1
dairy   2  4
        2  2

When I set the index, it now sorts alphabetically (apples, beef, chicken, dairy)... it has essentially forgotten that I set it to categorical with a special order.

The pandas version is

In [6]: pd.__version__
Out[6]: '0.19.2+0.g825876c.dirty'

commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.19.2+0.g825876c.dirty
nose: None
pip: 9.0.1
setuptools: 27.2.0
Cython: None
numpy: 1.11.2
scipy: None
statsmodels: None
xarray: None
IPython: 5.1.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
boto: None
pandas_datareader: None

thequackdaddy · 2017-01-04T21:00:59Z

A simpler example...

In [1]: import pandas as pd
   ...: 
   ...: x = pd.Categorical(['apples', 'dairy', 'chicken', 'beef', 'apples', 'dairy', 'chicken'], categories=['apples', 'dairy', 'beef', 'chicken'])
   ...: y = pd.Series([1, 2, 1, 2, 1, 2, 1])
   ...: z = pd.Series([3, 4, 2, 1, 3, 2, 1])
   ...: 
   ...: df = pd.DataFrame({'z': z, 'x': x, 'y': y})
   ...: df.x.dtype
Out[1]: category

In [2]: df = df.set_index(['x', 'y']).reset_index()

In [3]: df.x.dtype
Out[3]: dtype('O')

So x went from categorical to O because I put it in a MultiIndex.
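
A quick way to check this round trip on whichever pandas version is installed is sketched below. It assumes a reasonably recent pandas in which pd.CategoricalDtype is available at the top level; the names round_tripped, still_categorical and categories_after are illustrative only and do not come from the thread.

```python
import pandas as pd

x = pd.Categorical(
    ['apples', 'dairy', 'chicken', 'beef', 'apples', 'dairy', 'chicken'],
    categories=['apples', 'dairy', 'beef', 'chicken'],
)
df = pd.DataFrame({'z': [3, 4, 2, 1, 3, 2, 1], 'x': x, 'y': [1, 2, 1, 2, 1, 2, 1]})

# Move x and y into a MultiIndex and straight back out again.
round_tripped = df.set_index(['x', 'y']).reset_index()

# True if x is still categorical after the round trip.
still_categorical = isinstance(round_tripped['x'].dtype, pd.CategoricalDtype)

# The category order that survived the round trip (None if x became object).
categories_after = (
    list(round_tripped['x'].cat.categories) if still_categorical else None
)
print(still_categorical, categories_after)
```

If still_categorical comes back False, the installed version shows the coercion described in this comment.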

jreback · 2017-01-04T21:07:49Z

lots of fixes in 0.19.x; I encourage you to upgrade

In [114]: x = pd.Categorical(['apples', 'dairy', 'chicken', 'beef', 'apples', 'dairy', 'chicken'], categories=['apples', 'dairy', 'beef', 'chicken'])
     ...: y = pd.Series([1, 2, 1, 2, 1, 2, 1])
     ...: z = pd.Series([3, 4, 2, 1, 3, 2, 1])
     ...:
     ...: df = pd.DataFrame({'z': z, 'x': x, 'y': y})
     ...: df.x.dtype
     ...:
Out[114]: category

In [115]: df = df.set_index(['x', 'y']).reset_index()

In [116]: df.x.dtype
Out[116]: category

In [117]: pd.__version__
Out[117]: '0.19.0+307.g788b6ff.dirty'

jreback · 2017-01-04T21:44:38Z

http://pandas.pydata.org/pandas-docs/stable/categorical.html#sorting-and-order

TomAugspurger · 2017-01-04T22:27:44Z

Is this incorrect?

In [30]: import pandas as pd
    ...:
    ...: x = pd.Categorical(['apples', 'dairy', 'chicken', 'beef', 'apples', 'dairy', 'chicken'], categories=['apples', 'dairy', 'beef', 'chicken'], ordered=True)
    ...: y = pd.Series([1, 2, 1, 2, 1, 2, 1])
    ...: z = pd.Series([3, 4, 2, 1, 3, 2, 1])
    ...:
    ...: df = pd.DataFrame({'z': z, 'x': x, 'y':y})
    ...: df.set_index(['x', 'y']).sort_index()
    ...:
Out[30]:
           z
x       y
apples  1  3
        1  3
beef    2  1
chicken 1  2
        1  1
dairy   2  4
        2  2

Notice the ordered=True when creating the Categorical, which differs from the original post. The index is lex-sorted, when (maybe) it should follow the categorical ordering:

           z
x       y
apples  1  3
        1  3
dairy   2  4
        2  2
beef    2  1
chicken 1  2
        1  1
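
The expected ordering above can be turned into a small check. This is only a sketch: it assumes a pandas release in which the DataFrame sorting has been fixed (a later comment in this thread reports 0.24.2), and x_order is an illustrative name.

```python
import pandas as pd

x = pd.Categorical(
    ['apples', 'dairy', 'chicken', 'beef', 'apples', 'dairy', 'chicken'],
    categories=['apples', 'dairy', 'beef', 'chicken'],
    ordered=True,
)
df = pd.DataFrame({'z': [3, 4, 2, 1, 3, 2, 1], 'x': x, 'y': [1, 2, 1, 2, 1, 2, 1]})

# Sort by the MultiIndex and read back the order of the categorical level.
result = df.set_index(['x', 'y']).sort_index()
x_order = result.index.get_level_values('x').tolist()
print(x_order)
```

On a version that respects the categories, x_order lists apples, dairy, beef, chicken in that order; on the versions discussed above it would come back alphabetical.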

jreback · 2017-01-09T15:24:14Z

@TomAugspurger I agree that an ordered Categorical should be respected in .sort_index. Note that this might do all kinds of odd things when you actually try to index into this (as a MultiIndex actually requires lexsorting), xref #14015. So please open a separate issue for this (as this one is not about an ordered categorical, but an unordered one).

jorisvandenbossche · 2017-01-09T20:33:09Z

Ordered categorical or not, it should still respect the order of the categories when sorting, so that distinction does not really matter here:


In [20]: c1 = pd.Series(list('adcb')).astype('category', categories=list('acbd'))

In [21]: c1.sort_values()
Out[21]: 
0    a
2    c
3    b
1    d
dtype: category
Categories (4, object): [a, c, b, d]

In [22]: c2 = pd.Series(list('adcb')).astype('category', categories=list('acbd'), ordered=True)

In [23]: c2.sort_values()
Out[23]: 
0    a
2    c
3    b
1    d
dtype: category
Categories (4, object): [a < c < b < d]
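
The astype('category', categories=...) spelling used above belongs to the pandas versions current at the time of this thread. As a sketch of the same comparison on newer releases, assuming pd.CategoricalDtype is available (the variable names are illustrative):

```python
import pandas as pd

# Same categories, differing only in the ordered flag.
unordered = pd.CategoricalDtype(categories=list('acbd'), ordered=False)
ordered = pd.CategoricalDtype(categories=list('acbd'), ordered=True)

c1 = pd.Series(list('adcb')).astype(unordered)
c2 = pd.Series(list('adcb')).astype(ordered)

# sort_values follows the category order in both cases.
sorted_unordered = c1.sort_values().tolist()
sorted_ordered = c2.sort_values().tolist()
print(sorted_unordered, sorted_ordered)
```

Both lists follow the category order a, c, b, d, which is the point being made: the ordered flag does not change how values sort.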

In any case, the example Tom showed in his last post is incorrect AFAIK:

In [30]: df2 = df.set_index(['x', 'y'])

In [31]: df2.index
Out[31]: 
MultiIndex(levels=[['apples', 'dairy', 'beef', 'chicken'], [1, 2]],
           labels=[[0, 1, 3, 2, 0, 1, 3], [0, 1, 0, 1, 0, 1, 0]],
           names=['x', 'y'])

In [32]: df2.sort_index()
Out[32]: 
           z
x       y   
apples  1  3
        1  3
beef    2  1
chicken 1  2
        1  1
dairy   2  4
        2  2

In [33]: df2.sort_index().index
Out[33]: 
MultiIndex(levels=[['apples', 'dairy', 'beef', 'chicken'], [1, 2]],
           labels=[[0, 0, 2, 3, 3, 1, 1], [0, 0, 1, 0, 0, 1, 1]],
           names=['x', 'y'])

So even after sorting, the labels of the index are not sorted?
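
The same inspection can be scripted. The sketch below assumes a newer pandas, where the MultiIndex attribute shown as labels above is exposed as codes; level0_codes and level0_is_sorted are illustrative names.

```python
import pandas as pd

x = pd.Categorical(
    ['apples', 'dairy', 'chicken', 'beef', 'apples', 'dairy', 'chicken'],
    categories=['apples', 'dairy', 'beef', 'chicken'],
    ordered=True,
)
df = pd.DataFrame({'z': [3, 4, 2, 1, 3, 2, 1], 'x': x, 'y': [1, 2, 1, 2, 1, 2, 1]})

sorted_index = df.set_index(['x', 'y']).sort_index().index

# Integer codes of the first level after sorting; a correctly sorted
# index should have these in non-decreasing order.
level0_codes = sorted_index.codes[0].tolist()
level0_is_sorted = level0_codes == sorted(level0_codes)
print(level0_codes, level0_is_sorted)
```

A False here would reproduce the symptom in Out[33]: the frame claims to be sorted while the first-level codes are not.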

jreback · 2017-01-10T00:59:34Z

so this patch fixes this (and it is still considered lexsorted), but will break a few tests.
I think this 'conversion to lexsortedness' needs to happen inside the index via a method call, so that we are not willy-nilly just converting things.

diff --git a/pandas/core/frame.py b/pandas/core/frame.py
index b9290c0..d0e8b8a 100644
--- a/pandas/core/frame.py
+++ b/pandas/core/frame.py
@@ -3304,8 +3304,8 @@ class DataFrame(NDFrame):
 
             # make sure that the axis is lexsorted to start
             # if not we need to reconstruct to get the correct indexer
-            if not labels.is_lexsorted():
-                labels = MultiIndex.from_tuples(labels.values)
+            #if not labels.is_lexsorted():
+            #    labels = MultiIndex.from_tuples(labels.values)
 
             indexer = _lexsort_indexer(labels.labels, orders=ascending,
                                        na_position=na_position)

In [1]: x = pd.Categorical(['apples', 'dairy', 'chicken', 'beef', 'apples', 'dairy', 'chicken'], categories=['apples', 'dairy', 'beef', 'chicken'], ordered=True)
   ...: y = pd.Series([1, 2, 1, 2, 1, 2, 1])
   ...: z = pd.Series([3, 4, 2, 1, 3, 2, 1])
   ...: df = pd.DataFrame({'z': z, 'x': x, 'y':y})
   ...: 

In [2]: df.set_index(['x','y']).sort_index()
Out[2]: 
           z
x       y   
apples  1  3
        1  3
dairy   2  4
        2  2
beef    2  1
chicken 1  2
        1  1

In [3]: df.set_index(['x','y']).sort_index().index.is_lexsorted()
Out[3]: True

TomAugspurger · 2017-01-10T01:01:24Z

@jreback writing a longer post, but did you notice that it's only broken for DataFrame.sort_index?

In [32]: df = pd.DataFrame({'a': np.arange(6), 'l1': pd.Categorical(['a', 'a', 'b', 'b', 'c', 'c'], categories=['c', 'a', 'b'], ordered=True), 'l2': [0, 1, 0, 1, 0, 1]})

In [30]: df.set_index(['l1', 'l2']).a.sort_index()  # Series, correct
Out[30]:
l1  l2
c   0     4
    1     5
a   0     0
    1     1
b   0     2
    1     3
Name: a, dtype: int64

In [31]: df.set_index(['l1', 'l2']).sort_index()  # dataFrame, wrong
Out[31]:
       a
l1 l2
a  0   0
   1   1
b  0   2
   1   3
c  0   4
   1   5

jreback · 2017-01-10T01:02:22Z

yes, the sorting routines are somewhat different for dataframe & series. They should be more unified.

jreback · 2017-01-10T01:04:21Z

actually they are almost identical (except for that change I just made). I think we should simply combine them (this just for sort_index).
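
A minimal sketch of a check that the Series and DataFrame paths agree, built from the example a few comments up. It assumes a pandas release in which the DataFrame path has been fixed; same_order and frame_l1 are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': np.arange(6),
    'l1': pd.Categorical(['a', 'a', 'b', 'b', 'c', 'c'],
                         categories=['c', 'a', 'b'], ordered=True),
    'l2': [0, 1, 0, 1, 0, 1],
})
indexed = df.set_index(['l1', 'l2'])

# Sort through both code paths.
series_sorted = indexed['a'].sort_index()
frame_sorted = indexed.sort_index()

# Both should produce the same index, following the category order c, a, b.
same_order = series_sorted.index.equals(frame_sorted.index)
frame_l1 = frame_sorted.index.get_level_values('l1').tolist()
print(same_order, frame_l1)
```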

arobrien · 2019-04-08T07:25:26Z

The sorting for DataFrame appears to be solved in version 0.24.2:

df = pd.DataFrame({'a': [2,2,1,1], 
                   'b': pd.Categorical(['prime','alternate','alternate','prime'],
                                       categories=['prime','alternate'],ordered=True), 
                   'c': [1,2,3,4]})
df2 = df.set_index(['a','b'])
df2.sort_index()
                   c
a      b             
1      prime       4
       alternate   3
2      prime       1
       alternate   2

I can also get similar results when changing an existing index level to categorical (is there a simpler way to do this?)

df.columns = pd.MultiIndex.from_arrays([
    df.columns.get_level_values(0),
    pd.CategoricalIndex(df.columns.get_level_values(1),categories=['Target','Model','Error'],ordered=True),
])
df.sort_index(axis=1)
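
A self-contained version of the column example, for anyone who wants to try it. The frame below is hypothetical (the 'run' label and the numbers are made up for illustration), and the sketch is not claimed to be simpler than the approach above, only runnable on its own.

```python
import pandas as pd

# Hypothetical frame with two column levels; the second level starts as plain strings.
df = pd.DataFrame(
    [[1.0, 1.5, 0.5], [2.0, 1.0, -1.0]],
    columns=pd.MultiIndex.from_arrays(
        [['run', 'run', 'run'], ['Error', 'Target', 'Model']]
    ),
)

# Rebuild the columns with the second level as an ordered categorical.
desired = ['Target', 'Model', 'Error']
df.columns = pd.MultiIndex.from_arrays([
    df.columns.get_level_values(0),
    pd.CategoricalIndex(df.columns.get_level_values(1),
                        categories=desired, ordered=True),
])

# Sorting the columns should now follow the category order.
sorted_cols = df.sort_index(axis=1).columns.get_level_values(1).tolist()
print(sorted_cols)
```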

arobrien · 2019-04-08T07:30:34Z

@TomAugspurger your DataFrame example now works:

jreback · 2019-10-02T11:37:43Z

This issue just needs a validation test.

jreback closed this as completed Jan 4, 2017

jreback added the Categorical and Duplicate Report labels Jan 4, 2017

jreback added this to the No action milestone Jan 4, 2017

jreback added the Usage Question label Jan 4, 2017

TomAugspurger mentioned this issue Jan 10, 2017

MultiIndex with ordered Categorical level should (maybe) respect ordered #15087

Closed

TomAugspurger reopened this Jan 10, 2017

jreback mentioned this issue Jan 19, 2017

sort=False option to stack/unstack/pivot #15105

Closed

jreback removed this from the No action milestone Oct 2, 2019

jreback added the good first issue label Oct 2, 2019

mroeschke added the Needs Tests label and removed the Categorical, Duplicate Report, and Usage Question labels Oct 9, 2019

mroeschke mentioned this issue Jan 20, 2020

TST: Add regression tests for fixed issues #31161

Merged

10 tasks

jreback added this to the 1.1 milestone Jan 20, 2020

mroeschke closed this as completed in #31161 Jan 21, 2020
