-
-
Notifications
You must be signed in to change notification settings - Fork 18.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Any categorical dtype object's .unique() changes categories #18291
Comments
You can maintain the order of used categories by passing pandas/pandas/core/categorical.py Lines 2018 to 2026 in 63e8527
Seems like this could be better documented. I don't think that docstring is displayed anywhere in the documentation/api reference? The best I could find was an example in a note at the end of the working with categories section of the documentation, and a couple examples in the api reference for |
Indeed, I should have clarified that I'm not claiming the behavior is in contrast with the docs: I just see no justification for it, and - unless there is any which I'm missing - I suggest to change it. For the records: this came from a test in #17897, and in particular from the fact that |
By the way: I think "Return the |
It seems this change was done in #10508 without much discussion. So we should check the interaction with groupby on categorical columns |
I actually raised this before here #15939 (comment), but that discussion was mainly closed by referring to the docs and saying that we could further discuss it in a new issue, which you now opened :-) This one might also be relevant to look at: #15439 (as I said above, we should explore the interaction with groupby) |
Sure this might be a good occasion to discuss (more for consistency than for the actual implementation, which is relatively simple to adapt). Current behavior.
While I wouldn't change the behavior with
1 seems to me a bit inconsistent, and I have a slight preference (for coherence with other dtypes) for 2 over 3 (despite 3. being potentially slightly more efficient). Anyway, in those cases (if any) in which |
Uhm... 3. would reopen #8868 , so we probably want to avoid it. |
Code Sample, a copy-pastable example if possible
same happens with
CategoricalIndex
andSeries
ofdtype=category
(as long asordered=False
).Problem description
The general pandas approach to categoricals seems to change the
.categories
field as seldom as possible. So I see no reason whyunique()
, which by definition does not affect the set of distinct values, should change the categories (reordering them based on appearance, and dropping unused ones).Expected Output
Output of
pd.show_versions()
In [3]: pd.show_versions()
INSTALLED VERSIONS
commit: 63e8527
python: 3.5.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.0-3-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: it_IT.UTF-8
LOCALE: it_IT.UTF-8
pandas: 0.22.0.dev0+131.g63e8527d3
pytest: 3.2.3
pip: 9.0.1
setuptools: 36.7.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.0dev
tables: 3.3.0
numexpr: 2.6.1
feather: 0.3.1
matplotlib: 2.0.0
openpyxl: None
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.6
lxml: None
bs4: 4.5.3
html5lib: 0.999999999
sqlalchemy: 1.0.15
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: 0.2.1
The text was updated successfully, but these errors were encountered: