When running set_index on a categorical to a MultiIndex, it gets coerced to a string. #15058
Comments
So you have a somewhat older version of pandas, so I'm not really sure what you are expecting.
Sorry for the confusion! This happens on newer pandas too; I made a quick env to demonstrate. This might be a "we don't do that intentionally" question. I'm trying to sort a table by a categorical so that things appear in a more intuitive order. The data is stored as a few string codes, but by setting up a categorical I can sort the table by the code in a more intuitive order rather than alphabetically. So...
In [1]: import pandas as pd
...:
...: x = pd.Categorical(['apples', 'dairy', 'chicken', 'beef', 'apples', 'dairy', 'chicken'], categories=['apples', 'dairy', 'beef', 'chicken'])
...: y = pd.Series([1, 2, 1, 2, 1, 2, 1])
...: z = pd.Series([3, 4, 2, 1, 3, 2, 1])
...:
In [2]: df = pd.DataFrame({'x': x, 'y':y, 'z': z})
In [3]: df.sort_values('x')
Out[3]:
x y z
0 apples 1 3
4 apples 1 3
1 dairy 2 4
5 dairy 2 2
3 beef 2 1
2 chicken 1 2
6 chicken 1 1
Notice here that when I sorted the table, the categoricals sorted as I wanted them to: apples, then dairy, then beef, then chicken.
In [4]: df = df.set_index(['x', 'y'])
In [5]: df.sort_index()
Out[5]:
z
x y
apples 1 3
1 3
beef 2 1
chicken 1 2
1 1
dairy 2 4
2 2
When I set the index, it now sorts alphabetically (apples, beef, chicken, dairy). It has essentially forgotten that I set it to categorical with a special order. The pandas version is
In [6]: pd.__version__
Out[6]: '0.19.2+0.g825876c.dirty'
commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.19.2+0.g825876c.dirty
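One way to check whether the categorical survived set_index is to inspect the dtype of the index level directly, rather than inferring it from the sort order. The following is a minimal sketch using the data above; the variable names (indexed, level_dtype, sorted_labels, round_trip_dtype) are illustrative, and the results depend on the installed pandas version, since this thread reports the string coercion on 0.18.1 and 0.19.2.

```python
import pandas as pd

x = pd.Categorical(
    ['apples', 'dairy', 'chicken', 'beef', 'apples', 'dairy', 'chicken'],
    categories=['apples', 'dairy', 'beef', 'chicken'],
)
df = pd.DataFrame({'x': x, 'y': [1, 2, 1, 2, 1, 2, 1], 'z': [3, 4, 2, 1, 3, 2, 1]})

indexed = df.set_index(['x', 'y'])

# dtype of the first MultiIndex level: 'category' if the categorical was kept,
# 'object' if it was coerced to plain strings.
level_dtype = indexed.index.levels[0].dtype

# Labels of the 'x' level after sorting. Category order would be
# apples, dairy, beef, chicken; alphabetical order would be
# apples, beef, chicken, dairy.
sorted_labels = list(indexed.sort_index().index.get_level_values('x'))

# dtype of 'x' after a set_index / reset_index round trip.
round_trip_dtype = indexed.reset_index()['x'].dtype

print(level_dtype)
print(sorted_labels)
print(round_trip_dtype)
```

If the first and last lines print object instead of category, the installed version still shows the behavior described in this issue.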
A simpler example...
In [1]: import pandas as pd
...:
...: x = pd.Categorical(['apples', 'dairy', 'chicken', 'beef', 'apples', 'dairy', 'chicken'], categories=['apples', 'dairy', 'beef', 'chicken'])
...: y = pd.Series([1, 2, 1, 2, 1, 2, 1])
...: z = pd.Series([3, 4, 2, 1, 3, 2, 1])
...:
...: df = pd.DataFrame({'z': z, 'x': x, 'y': y})
...: df.x.dtype
Out[1]: category
In [2]: df = df.set_index(['x', 'y']).reset_index()
In [3]: df.x.dtype
Out[3]: dtype('O')
So
lots of fixes in 0.19.x, I'd encourage you to upgrade
Is this incorrect?
In [30]: import pandas as pd
...:
...: x = pd.Categorical(['apples', 'dairy', 'chicken', 'beef', 'apples', 'dairy', 'chicken'], categories=['apples', 'dairy', 'beef', 'chicken'], ordered=True)
...: y = pd.Series([1, 2, 1, 2, 1, 2, 1])
...: z = pd.Series([3, 4, 2, 1, 3, 2, 1])
...:
...: df = pd.DataFrame({'z': z, 'x': x, 'y':y})
...: df.set_index(['x', 'y']).sort_index()
...:
Out[30]:
z
x y
apples 1 3
1 3
beef 2 1
chicken 1 2
1 1
dairy 2 4
2 2
Notice the
z
x y
apples 1 3
1 3
dairy 2 4
2 2
beef 2 1
chicken 1 2
1 1
@TomAugspurger I agree an ordered Cat should be respected in the
Ordered categorical or not, it should still respect the order of the categories when sorting, so that distinction does not really matter here:
In any case, the example Tom showed in his last post is incorrect AFAIK:
So even after sorting, the labels of the index are not sorted?
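The point that sorting follows the category order whether or not the categorical is ordered can be illustrated on a plain Series. This is a small sketch, not code from the thread; the names labels, categories, unordered and ordered are made up for the illustration.

```python
import pandas as pd

labels = ['apples', 'dairy', 'chicken', 'beef']
categories = ['apples', 'dairy', 'beef', 'chicken']

# Same values and same category order; only the `ordered` flag differs.
unordered = pd.Series(pd.Categorical(labels, categories=categories))
ordered = pd.Series(pd.Categorical(labels, categories=categories, ordered=True))

# sort_values sorts by the position of each value in `categories`,
# not lexically, in both cases.
unordered_sorted = list(unordered.sort_values())
ordered_sorted = list(ordered.sort_values())

print(unordered_sorted)
print(ordered_sorted)

# The `ordered` flag matters for operations such as min/max and comparisons.
largest = ordered.max()
print(largest)
```

So the ordered flag changes which operations are allowed, but it is not what decides the sort order.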
so this patch fixes this (and it is still considered lexsorted), but will break a few tests.
@jreback writing a longer post, but did you notice that it's only broken for
In [32]: df = pd.DataFrame({'a': np.arange(6), 'l1': pd.Categorical(['a', 'a', 'b', 'b', 'c', 'c'], categories=['c', 'a', 'b'], ordered=True), 'l2': [0, 1, 0, 1, 0, 1]})
In [30]: df.set_index(['l1', 'l2']).a.sort_index() # Series, correct
Out[30]:
l1 l2
c 0 4
1 5
a 0 0
1 1
b 0 2
1 3
Name: a, dtype: int64
In [31]: df.set_index(['l1', 'l2']).sort_index() # DataFrame, wrong
Out[31]:
a
l1 l2
a 0 0
1 1
b 0 2
1 3
c 0 4
1 5
yes, the sorting routines are somewhat different for DataFrame & Series. They should be more unified.
actually they are almost identical (except for that change I just made). I think we should simply combine them (this just for
The sorting for DataFrame appears to be solved in version 0.24.2:
df = pd.DataFrame({'a': [2,2,1,1],
                   'b': pd.Categorical(['prime','alternate','alternate','prime'],
                                       categories=['prime','alternate'],ordered=True),
                   'c': [1,2,3,4]})
df2 = df.set_index(['a','b'])
df2.sort_index()
c
a b
1 prime 4
alternate 3
2 prime 1
alternate 2
I can also get similar results when changing an existing index level to categorical (is there a simpler way to do this?)
df.columns = pd.MultiIndex.from_arrays([
    df.columns.get_level_values(0),
    pd.CategoricalIndex(df.columns.get_level_values(1),categories=['Target','Model','Error'],ordered=True),
])
df.sort_index(axis=1)
@TomAugspurger your DataFrame example now works:
This issue just needs a validation test.
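A validation test along these lines might look like the sketch below. It reuses the Series/DataFrame comparison data from earlier in the thread; the function name is hypothetical and this is not the test that was added to pandas. It assumes a pandas version in which the DataFrame path is fixed, which the thread reports from 0.24.2.

```python
import numpy as np
import pandas as pd


def test_sort_index_multiindex_categorical():
    # Hypothetical test name; data taken from the comparison above.
    df = pd.DataFrame({
        'a': np.arange(6),
        'l1': pd.Categorical(['a', 'a', 'b', 'b', 'c', 'c'],
                             categories=['c', 'a', 'b'], ordered=True),
        'l2': [0, 1, 0, 1, 0, 1],
    })
    indexed = df.set_index(['l1', 'l2'])

    frame_result = indexed.sort_index()
    series_result = indexed['a'].sort_index()

    # Both paths should follow the category order c, a, b.
    expected_l1 = ['c', 'c', 'a', 'a', 'b', 'b']
    assert list(frame_result.index.get_level_values('l1')) == expected_l1
    assert list(series_result.index.get_level_values('l1')) == expected_l1

    # Values travel with their labels: the 'c' rows originally held 4 and 5.
    expected_a = [4, 5, 0, 1, 2, 3]
    assert list(frame_result['a']) == expected_a
    assert list(series_result) == expected_a
    return frame_result


result = test_sort_index_multiindex_categorical()
print(result)
```

Checking the Series and the DataFrame in the same test guards against the two sorting paths diverging again.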
Hello!
I apologize if this is expected behavior. This is relatively similar to this StackOverflow question.
Code Sample, a copy-pastable example if possible
Problem description
I would like to sort and group-by a column in a custom way. In the example above, I've ordered a categorical (it could be a string) in a way that makes intuitive sense. In this example, I want fruits first, followed by dairy, followed by meats.
Expected Output
When the categorical is in a MultiIndex, set_index seems to coerce the categorical to a string before adding it to the index. It would be nicer if pandas kept the categorical ordering for the index.
Output of pd.show_versions()
pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 27.2.0
Cython: 0.24.1
numpy: 1.11.1
scipy: 0.18.1
statsmodels: 0.8.0.dev0+7e6b94b
xarray: None
IPython: 5.1.0
sphinx: 1.4.6
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: 1.1.0
tables: 3.2.2
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.3
lxml: 3.6.4
bs4: 4.5.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: 0.7.9.None
psycopg2: None
jinja2: 2.8
boto: 2.42.0
pandas_datareader: None