When running set_index on a categorical to a MultiIndex, it gets coerced to a string. #15058

Closed
thequackdaddy opened this issue Jan 4, 2017 · 15 comments · Fixed by #31161
Labels: good first issue, Needs Tests (unit tests needed to prevent regressions)

@thequackdaddy
Contributor

Hello!

I apologize if this is expected behavior. This is similar to this StackOverflow question.

Code Sample, a copy-pastable example if possible

import pandas as pd

x = pd.Categorical(['apples', 'dairy', 'chicken', 'beef', 'apples', 'dairy', 'chicken'], categories=['apples', 'dairy', 'beef', 'chicken'])
y = pd.Series([1, 2, 1, 2, 1, 2, 1])
z = pd.Series([3, 4, 2, 1, 3, 2, 1])

df = pd.DataFrame({'z': z, 'x': x, 'y':y})
df.set_index(['x', 'y']).sort_index()
df.sort_values('x')

Problem description

I would like to sort and group-by a column in a custom way. In the example above, I've ordered a categorical (it could be a string) in a way that makes intuitive sense. In this example, I want fruits first, followed by dairy, followed by meats.
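As a minimal sketch of the custom ordering described above (not code from the issue itself), the category list can double as the sort key. The variable names `order`, `sorted_x`, and `totals` are illustrative assumptions.

```python
import pandas as pd

# The category order is the desired sort order: fruit, then dairy, then meats.
order = ['apples', 'dairy', 'beef', 'chicken']

df = pd.DataFrame({
    'x': pd.Categorical(
        ['apples', 'dairy', 'chicken', 'beef', 'apples', 'dairy', 'chicken'],
        categories=order,
    ),
    'y': [1, 2, 1, 2, 1, 2, 1],
    'z': [3, 4, 2, 1, 3, 2, 1],
})

# sort_values on a categorical column follows the category order,
# not the alphabetical order of the strings.
sorted_x = df.sort_values('x', kind='mergesort')['x'].tolist()

# Grouping on the same column also reports groups in category order.
totals = df.groupby('x', observed=True)['z'].sum()
```

Both steps work on plain columns; the rest of the thread is about what happens once `x` is moved into a MultiIndex.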

Expected Output

When the categorical is in a MultiIndex, set_index seems to coerce the categorical to a string before adding it to the index. It would be nicer if pandas kept the categorical ordering for the index.
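One way to check whether a given pandas installation keeps the categorical is to inspect the index level directly. This is a sketch under the assumption that the installed release stores categorical levels as a `CategoricalIndex`; on the 0.18.1 release reported here the level was apparently stored as plain strings instead.

```python
import pandas as pd

x = pd.Categorical(
    ['apples', 'dairy', 'chicken', 'beef', 'apples', 'dairy', 'chicken'],
    categories=['apples', 'dairy', 'beef', 'chicken'],
)
df = pd.DataFrame({'z': [3, 4, 2, 1, 3, 2, 1], 'x': x, 'y': [1, 2, 1, 2, 1, 2, 1]})

indexed = df.set_index(['x', 'y'])

# The first level of the MultiIndex holds the distinct values of 'x'.
level_x = indexed.index.levels[0]

# True when the categorical (and its category order) survived set_index.
kept_categorical = isinstance(level_x, pd.CategoricalIndex)
```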

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 27.2.0
Cython: 0.24.1
numpy: 1.11.1
scipy: 0.18.1
statsmodels: 0.8.0.dev0+7e6b94b
xarray: None
IPython: 5.1.0
sphinx: 1.4.6
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: 1.1.0
tables: 3.2.2
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.3
lxml: 3.6.4
bs4: 4.5.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: 0.7.9.None
psycopg2: None
jinja2: 2.8
boto: 2.42.0
pandas_datareader: None

@jreback
Contributor

jreback commented Jan 4, 2017

You have a somewhat older version of pandas, so I'm not really sure what you are expecting.

@thequackdaddy
Contributor Author

thequackdaddy commented Jan 4, 2017

Sorry for the confusion! This happens on newer pandas too. I made a quick env to demonstrate. This might be a "we don't do that intentionally" question.

I'm trying to sort a table by a categorical so that the rows appear in a more intuitive order. The data is stored as a few string codes, and by setting up a categorical I can sort the table by category order rather than alphabetically.

So...

In [1]: import pandas as pd
   ...:
   ...: x = pd.Categorical(['apples', 'dairy', 'chicken', 'beef', 'apples', 'dairy', 'chicken'], categories=['apples', 'dairy', 'beef', 'chicken'])
   ...: y = pd.Series([1, 2, 1, 2, 1, 2, 1])
   ...: z = pd.Series([3, 4, 2, 1, 3, 2, 1])
   ...:

In [2]: df = pd.DataFrame({'x': x, 'y':y, 'z': z})

In [3]: df.sort_values('x')
Out[3]:
         x  y  z
0   apples  1  3
4   apples  1  3
1    dairy  2  4
5    dairy  2  2
3     beef  2  1
2  chicken  1  2
6  chicken  1  1

Notice here that when I sorted the table, the categoricals sorted as I wanted them to: apples, then dairy, then beef, then chicken.

In [4]: df =  df.set_index(['x', 'y'])

In [5]: df.sort_index()
Out[5]:
           z
x       y
apples  1  3
        1  3
beef    2  1
chicken 1  2
        1  1
dairy   2  4
        2  2

When I set the index, it now sorts alphabetically (apples, beef, chicken, dairy). It has essentially forgotten that I made the column categorical with a special order.
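A possible workaround on affected versions, which is an assumption of this note and not something proposed in the thread, is to sort on the columns first and only then build the index. `sort_values` follows the category order, and `set_index` keeps whatever row order it is given.

```python
import pandas as pd

x = pd.Categorical(
    ['apples', 'dairy', 'chicken', 'beef', 'apples', 'dairy', 'chicken'],
    categories=['apples', 'dairy', 'beef', 'chicken'],
)
df = pd.DataFrame({'x': x, 'y': [1, 2, 1, 2, 1, 2, 1], 'z': [3, 4, 2, 1, 3, 2, 1]})

# Sort while 'x' is still a column, then move the sorted columns into the index.
presorted = df.sort_values(['x', 'y']).set_index(['x', 'y'])

# Row order of the first index level after the workaround.
level_x_values = presorted.index.get_level_values('x').tolist()
```

The catch is that any later call to `sort_index` on an affected version would undo the ordering, so this only helps when the frame is not re-sorted afterwards.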

The pandas version is

In [6]: pd.__version__
Out[6]: '0.19.2+0.g825876c.dirty'
commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.19.2+0.g825876c.dirty
nose: None
pip: 9.0.1
setuptools: 27.2.0
Cython: None
numpy: 1.11.2
scipy: None
statsmodels: None
xarray: None
IPython: 5.1.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
boto: None
pandas_datareader: None

@thequackdaddy
Contributor Author

thequackdaddy commented Jan 4, 2017

A simpler example...

In [1]: import pandas as pd
   ...: 
   ...: x = pd.Categorical(['apples', 'dairy', 'chicken', 'beef', 'apples', 'dairy', 'chicken'], categories=['apples', 'dairy', 'beef', 'chicken'])
   ...: y = pd.Series([1, 2, 1, 2, 1, 2, 1])
   ...: z = pd.Series([3, 4, 2, 1, 3, 2, 1])
   ...: 
   ...: df = pd.DataFrame({'z': z, 'x': x, 'y': y})
   ...: df.x.dtype
Out[1]: category

In [2]: df = df.set_index(['x', 'y']).reset_index()

In [3]: df.x.dtype
Out[3]: dtype('O')

So x went from category dtype to object (O) dtype because I put it in a MultiIndex.
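The round trip above can be wrapped in a small check that is easy to rerun across pandas versions. The helper name `dtype_after_roundtrip` is hypothetical, introduced here only for illustration.

```python
import pandas as pd


def dtype_after_roundtrip(frame, keys, column):
    """Return (dtype before, dtype after) a set_index/reset_index round trip."""
    before = frame[column].dtype
    after = frame.set_index(keys).reset_index()[column].dtype
    return before, after


x = pd.Categorical(
    ['apples', 'dairy', 'chicken', 'beef', 'apples', 'dairy', 'chicken'],
    categories=['apples', 'dairy', 'beef', 'chicken'],
)
df = pd.DataFrame({'z': [3, 4, 2, 1, 3, 2, 1], 'x': x, 'y': [1, 2, 1, 2, 1, 2, 1]})

before, after = dtype_after_roundtrip(df, ['x', 'y'], 'x')
```

If `after` prints as `object` rather than `category`, the installed version shows the behavior reported here.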

@jreback
Contributor

jreback commented Jan 4, 2017

Lots of fixes in 0.19.x; I encourage you to upgrade.

In [114]: x = pd.Categorical(['apples', 'dairy', 'chicken', 'beef', 'apples', 'dairy', 'chicken'], categories=['apples', 'dairy', 'beef', 'chicken'])
     ...: y = pd.Series([1, 2, 1, 2, 1, 2, 1])
     ...: z = pd.Series([3, 4, 2, 1, 3, 2, 1])
     ...:
     ...: df = pd.DataFrame({'z': z, 'x': x, 'y': y})
     ...: df.x.dtype
     ...:
Out[114]: category

In [115]: df = df.set_index(['x', 'y']).reset_index()

In [116]: df.x.dtype
Out[116]: category

In [117]: pd.__version__
Out[117]: '0.19.0+307.g788b6ff.dirty'

jreback closed this as completed Jan 4, 2017
jreback added the Categorical and Duplicate Report labels Jan 4, 2017
jreback added this to the No action milestone Jan 4, 2017
@jreback
Contributor

jreback commented Jan 4, 2017

@TomAugspurger
Contributor

Is this incorrect?

In [30]: import pandas as pd
    ...:
    ...: x = pd.Categorical(['apples', 'dairy', 'chicken', 'beef', 'apples', 'dairy', 'chicken'], categories=['apples', 'dairy', 'beef', 'chicken'], ordered=True)
    ...: y = pd.Series([1, 2, 1, 2, 1, 2, 1])
    ...: z = pd.Series([3, 4, 2, 1, 3, 2, 1])
    ...:
    ...: df = pd.DataFrame({'z': z, 'x': x, 'y':y})
    ...: df.set_index(['x', 'y']).sort_index()
    ...:
Out[30]:
           z
x       y
apples  1  3
        1  3
beef    2  1
chicken 1  2
        1  1
dairy   2  4
        2  2

Notice the ordered=True when creating the Categorical, which differs from the original post. The index is lex-sorted, when (maybe) it should follow the categorical ordering:

           z
x       y
apples  1  3
        1  3
dairy   2  4
        2  2
beef    2  1
chicken 1  2
        1  1
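The ordering proposed above can be reproduced independently of `sort_index` by sorting on the category codes. This is a sketch for comparison only, not how pandas implements the sort; `order` and `expected` are illustrative names.

```python
import numpy as np
import pandas as pd

x = pd.Categorical(
    ['apples', 'dairy', 'chicken', 'beef', 'apples', 'dairy', 'chicken'],
    categories=['apples', 'dairy', 'beef', 'chicken'],
    ordered=True,
)
df = pd.DataFrame({'z': [3, 4, 2, 1, 3, 2, 1], 'x': x, 'y': [1, 2, 1, 2, 1, 2, 1]})

# np.lexsort treats the LAST key as the primary one, so the category codes
# of 'x' are the primary key and 'y' breaks ties.
order = np.lexsort((df['y'].to_numpy(), df['x'].cat.codes.to_numpy()))

# Reorder the rows first, then build the index from the already sorted columns.
expected = df.iloc[order].set_index(['x', 'y'])
```

Comparing `expected` with `df.set_index(['x', 'y']).sort_index()` shows whether a given version sorts by category order or by the strings.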

@jreback
Contributor

jreback commented Jan 9, 2017

@TomAugspurger I agree an ordered Cat should be respected in the .sort_index. Note that this might do all kinds of odd things when you actually try to index into this (as a MultiIndex actually requires lexsorting), xref #14015. So please open a separate issue for this (as this one is not about an ordered categorical, but an unordered one).

@jorisvandenbossche
Member

Ordered categorical or not, it should still respect the order of the categories when sorting, so that distinction does not really matter here:


In [20]: c1 = pd.Series(list('adcb')).astype('category', categories=list('acbd'))

In [21]: c1.sort_values()
Out[21]: 
0    a
2    c
3    b
1    d
dtype: category
Categories (4, object): [a, c, b, d]

In [22]: c2 = pd.Series(list('adcb')).astype('category', categories=list('acbd'), ordered=True)

In [23]: c2.sort_values()
Out[23]: 
0    a
2    c
3    b
1    d
dtype: category
Categories (4, object): [a < c < b < d]
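The `astype('category', categories=...)` spelling used above was later deprecated in favor of passing a `CategoricalDtype`. A sketch of the same comparison in that form, assuming a pandas release that provides `pandas.api.types.CategoricalDtype`:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

values = pd.Series(list('adcb'))

# Same categories in both cases; only the 'ordered' flag differs.
unordered = values.astype(CategoricalDtype(categories=list('acbd')))
ordered = values.astype(CategoricalDtype(categories=list('acbd'), ordered=True))

# Both sort by the position of each value in the category list.
unordered_sorted = unordered.sort_values()
ordered_sorted = ordered.sort_values()
```

The `ordered` flag controls whether comparisons such as `<` and `min()` are allowed, while sorting follows the category order either way.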

In any case, the behavior in the example Tom showed in his last post is incorrect AFAIK:

In [30]: df2 = df.set_index(['x', 'y'])

In [31]: df2.index
Out[31]: 
MultiIndex(levels=[['apples', 'dairy', 'beef', 'chicken'], [1, 2]],
           labels=[[0, 1, 3, 2, 0, 1, 3], [0, 1, 0, 1, 0, 1, 0]],
           names=['x', 'y'])

In [32]: df2.sort_index()
Out[32]: 
           z
x       y   
apples  1  3
        1  3
beef    2  1
chicken 1  2
        1  1
dairy   2  4
        2  2

In [33]: df2.sort_index().index
Out[33]: 
MultiIndex(levels=[['apples', 'dairy', 'beef', 'chicken'], [1, 2]],
           labels=[[0, 0, 2, 3, 3, 1, 1], [0, 0, 1, 0, 0, 1, 1]],
           names=['x', 'y'])

So even after sorting, the labels of the index are not sorted?
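The observation above can be checked mechanically by testing whether the integer codes are in non-decreasing lexicographic order. This sketch assumes a pandas release where the `labels` attribute shown in the reprs above is named `codes` (the rename happened in 0.24); the helper `codes_are_lexsorted` is hypothetical.

```python
import pandas as pd


def codes_are_lexsorted(index):
    """True if the rows of integer codes are in non-decreasing order."""
    rows = list(zip(*[codes.tolist() for codes in index.codes]))
    return all(a <= b for a, b in zip(rows, rows[1:]))


levels = [['apples', 'dairy', 'beef', 'chicken'], [1, 2]]

# The index from Out[33]: sorted alphabetically, so the codes jump around.
alphabetical = pd.MultiIndex(
    levels=levels,
    codes=[[0, 0, 2, 3, 3, 1, 1], [0, 0, 1, 0, 0, 1, 1]],
    names=['x', 'y'],
)

# The same rows in category order: the codes are now non-decreasing.
by_category = pd.MultiIndex(
    levels=levels,
    codes=[[0, 0, 1, 1, 2, 3, 3], [0, 0, 1, 1, 1, 0, 0]],
    names=['x', 'y'],
)

alphabetical_ok = codes_are_lexsorted(alphabetical)
by_category_ok = codes_are_lexsorted(by_category)
```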

@jreback
Contributor

jreback commented Jan 10, 2017

This patch fixes it (and the result is still considered lexsorted), but it will break a few tests.
I think this 'conversion to lexsortedness' needs to happen inside the index via a method call, so that we are not willy-nilly just converting things.

diff --git a/pandas/core/frame.py b/pandas/core/frame.py
index b9290c0..d0e8b8a 100644
--- a/pandas/core/frame.py
+++ b/pandas/core/frame.py
@@ -3304,8 +3304,8 @@ class DataFrame(NDFrame):
 
             # make sure that the axis is lexsorted to start
             # if not we need to reconstruct to get the correct indexer
-            if not labels.is_lexsorted():
-                labels = MultiIndex.from_tuples(labels.values)
+            #if not labels.is_lexsorted():
+            #    labels = MultiIndex.from_tuples(labels.values)
 
             indexer = _lexsort_indexer(labels.labels, orders=ascending,
                                        na_position=na_position)

In [1]: x = pd.Categorical(['apples', 'dairy', 'chicken', 'beef', 'apples', 'dairy', 'chicken'], categories=['apples', 'dairy', 'beef', 'chicken'], ordered=True)
   ...: y = pd.Series([1, 2, 1, 2, 1, 2, 1])
   ...: z = pd.Series([3, 4, 2, 1, 3, 2, 1])
   ...: df = pd.DataFrame({'z': z, 'x': x, 'y':y})
   ...: 

In [2]: df.set_index(['x','y']).sort_index()
Out[2]: 
           z
x       y   
apples  1  3
        1  3
dairy   2  4
        2  2
beef    2  1
chicken 1  2
        1  1

In [3]: df.set_index(['x','y']).sort_index().index.is_lexsorted()
Out[3]: True

@TomAugspurger
Contributor

@jreback writing a longer post, but did you notice that it's only broken for DataFrame.sort_index?

In [32]: df = pd.DataFrame({'a': np.arange(6), 'l1': pd.Categorical(['a', 'a', 'b', 'b', 'c', 'c'], categories=['c', 'a', 'b'], ordered=True), 'l2': [0, 1, 0, 1, 0, 1]})

In [30]: df.set_index(['l1', 'l2']).a.sort_index()  # Series, correct
Out[30]:
l1  l2
c   0     4
    1     5
a   0     0
    1     1
b   0     2
    1     3
Name: a, dtype: int64

In [31]: df.set_index(['l1', 'l2']).sort_index()  # DataFrame, wrong
Out[31]:
       a
l1 l2
a  0   0
   1   1
b  0   2
   1   3
c  0   4
   1   5
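The Series/DataFrame discrepancy shown above can be turned into a direct comparison. This is a sketch that assumes numpy is imported as `np`, which the snippet above relies on without showing; on the versions discussed here the two orders differed, and on a release that includes the fix they should agree.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': np.arange(6),
    'l1': pd.Categorical(
        ['a', 'a', 'b', 'b', 'c', 'c'],
        categories=['c', 'a', 'b'],
        ordered=True,
    ),
    'l2': [0, 1, 0, 1, 0, 1],
})
indexed = df.set_index(['l1', 'l2'])

# Sort the same MultiIndex through the Series path and the DataFrame path.
series_order = indexed['a'].sort_index().index.tolist()
frame_order = indexed.sort_index().index.tolist()

paths_agree = series_order == frame_order
```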

@jreback
Contributor

jreback commented Jan 10, 2017

Yes, the sorting routines are somewhat different for DataFrame & Series. They should be more unified.

@jreback
Contributor

jreback commented Jan 10, 2017

Actually they are almost identical (except for that change I just made). I think we should simply combine them (this is just for sort_index).

@arobrien

arobrien commented Apr 8, 2019

The sorting for DataFrame appears to be solved in version 0.24.2:

df = pd.DataFrame({'a': [2,2,1,1], 
                   'b': pd.Categorical(['prime','alternate','alternate','prime'],
                                       categories=['prime','alternate'],ordered=True), 
                   'c': [1,2,3,4]})
df2 = df.set_index(['a','b'])
df2.sort_index()
                   c
a      b             
1      prime       4
       alternate   3
2      prime       1
       alternate   2

I can also get similar results when changing an existing index level to categorical (is there a simpler way to do this?)

df.columns = pd.MultiIndex.from_arrays([
    df.columns.get_level_values(0),
    pd.CategoricalIndex(df.columns.get_level_values(1),categories=['Target','Model','Error'],ordered=True),
])
df.sort_index(axis=1)
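The thread does not answer whether a simpler way exists. The sketch below only packages the same `from_arrays` approach into a reusable helper; the name `with_categorical_level` and the sample column labels under `'A'` are assumptions made for illustration.

```python
import pandas as pd


def with_categorical_level(index, level, categories):
    """Rebuild a MultiIndex with one level as an ordered CategoricalIndex."""
    arrays = [index.get_level_values(i) for i in range(index.nlevels)]
    arrays[level] = pd.CategoricalIndex(
        arrays[level], categories=categories, ordered=True
    )
    return pd.MultiIndex.from_arrays(arrays, names=index.names)


frame = pd.DataFrame(
    [[1, 2, 3]],
    columns=pd.MultiIndex.from_tuples(
        [('A', 'Model'), ('A', 'Error'), ('A', 'Target')]
    ),
)

# Convert the second column level, then sort the columns by category order.
frame.columns = with_categorical_level(frame.columns, 1, ['Target', 'Model', 'Error'])
sorted_labels = frame.sort_index(axis=1).columns.get_level_values(1).tolist()
```

The data stays attached to its label through the rebuild, because only the index object is replaced and the column positions do not change.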

@arobrien

arobrien commented Apr 8, 2019

@TomAugspurger your DataFrame example now works:

        a
l1  l2	
c   0   4
    1   5
a   0   0
    1   1
b   0   2
    1   3

jreback removed this from the No action milestone Oct 2, 2019
@jreback
Contributor

jreback commented Oct 2, 2019

This issue just needs a validation test.

mroeschke added the Needs Tests label and removed the Categorical, Duplicate Report, and Usage Question labels Oct 9, 2019
jreback added this to the 1.1 milestone Jan 20, 2020

6 participants