BUG: Categorical comparison with unordered #16339
Codecov Report
@@            Coverage Diff             @@
##           master   #16339      +/-   ##
==========================================
- Coverage   90.36%   90.34%   -0.02%
==========================================
  Files         161      161
  Lines       50897    50905       +8
==========================================
- Hits        45993    45992       -1
- Misses       4904     4913       +9
Codecov Report
@@            Coverage Diff            @@
##             master   #16339   +/-   ##
=========================================
  Coverage          ?    90.4%
=========================================
  Files             ?      161
  Lines             ?    50975
  Branches          ?        0
=========================================
  Hits              ?    46086
  Misses            ?     4889
  Partials          ?        0
Looks good!
pandas/tests/test_categorical.py
Outdated
result = c1 == c2

assert result[0]
assert not result[1]
I personally find tm.assert_array_equal(c1 == c2, np.array([True, False]))
clearer in this case (but that is maybe taste :-))
pandas/tests/test_categorical.py
Outdated
def test_unordered_different_order_equal(self):
    # https://github.com/pandas-dev/pandas/issues/16014
    c1 = Categorical(['a', 'b'], categories=['a', 'b'], ordered=False)
    c2 = Categorical(['a', 'b'], categories=['b', 'a'], ordered=False)
A lot of the tests above are with Series instead of Categorical.
Don't need to change all, but maybe good to test this with a Series as well
Hopefully the new version isn't too weird: https://github.com/pandas-dev/pandas/pull/16339/files#diff-ed4f442894a2f521dfac3193a3a8d8a0R3825
minor comments, otherwise lgtm.
@@ -453,6 +453,14 @@ the original values:

np.asarray(cat) > base

When you compare two unordered categoricals with the same categories, the order is not considered:
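A minimal sketch of the behaviour this documentation line describes, assuming a pandas version that includes this fix. The values in `c1` and `c2` are illustrative and are not taken from the PR's tests, and `np.testing.assert_array_equal` is used here as a stand-in for the pandas-internal `tm` helper mentioned in the review.

```python
import numpy as np
import pandas as pd

# Same set of categories, listed in a different order; both unordered.
c1 = pd.Categorical(["a", "b"], categories=["a", "b"], ordered=False)
c2 = pd.Categorical(["a", "a"], categories=["b", "a"], ordered=False)

# Element-wise comparison looks at the values, not at category positions.
result = c1 == c2
np.testing.assert_array_equal(result, np.array([True, False]))

# The same comparison through Series, as suggested in the review.
s_result = pd.Series(c1) == pd.Series(c2)
print(result, s_result.tolist())
```

Comparing the whole result against an expected array keeps the assertion in one place, which is the style the reviewer preferred over indexing into `result`.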
versionadded tag
I think not necessary since it's a bugfix.
pandas/core/categorical.py
Outdated
msg = ("Categoricals can only be compared if "
       "'categories' are the same")
if len(self.categories) != len(other.categories):
    raise TypeError(msg)
Maybe this could be a more specialized message (e.g. mentioning the len comparison).
2d47107 to 9910de4
pandas/tests/test_categorical.py
Outdated
c2 = Categorical(['a', 'c'], categories=['c', 'a'], ordered=False)
with pytest.raises(TypeError) as rec:
    c1 == c2
assert rec.match("Categoricals can only be compared")
I think we should pick one way to do this: either this way or tm.assert_raises_regex.
pandas/core/categorical.py
Outdated
na_mask = (self._codes == -1) | (other._codes == -1)
if not self.ordered:
    # Comparison uses codes, so align theirs to ours
    other_codes = _get_codes_for_values(other, self.categories)
This is only needed when the categories are not equal?
Might be good to do a quick perf check for the basic case of comparing categoricals with equal categories
Good catch. Just pushed a commit fixing this to only do the recoding when the categories don't match.
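The idea discussed in this thread can be sketched in plain Python. This is a simplified model using lists, not the pandas implementation: the helper names `recode` and `unordered_eq` are hypothetical, and `-1` stands for a missing value as in the `na_mask` line of the diff.

```python
def recode(codes, from_categories, to_categories):
    """Re-express integer codes against a different ordering of the same categories."""
    position = {cat: i for i, cat in enumerate(to_categories)}
    # -1 marks a missing value and is passed through unchanged.
    return [-1 if c == -1 else position[from_categories[c]] for c in codes]


def unordered_eq(codes_a, cats_a, codes_b, cats_b):
    """Element-wise equality of two unordered categoricals given as codes."""
    # Only pay for the recoding when the category order actually differs.
    if list(cats_a) != list(cats_b):
        codes_b = recode(codes_b, cats_b, cats_a)
    # A missing value on either side never compares equal.
    return [a == b and a != -1 for a, b in zip(codes_a, codes_b)]


# Values a, b, NA against values a, a, NA with the categories reversed.
result = unordered_eq([0, 1, -1], ["a", "b"], [1, 1, -1], ["b", "a"])
print(result)
```

Skipping the recode when the categories already match keeps the common case (identical categories) as cheap as a direct comparison of codes, which is the performance concern raised in the review.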
msg = ("Categoricals can only be compared if "
       "'categories' are the same.")
if len(self.categories) != len(other.categories):
    raise TypeError(msg + " Categories are different lengths")
I think you are including the original message here, but it's a little awkward.
What did you have in mind? I like how it's currently "Categoricals can only be compared if 'categories' are the same. Categories are different lengths", since it gives the general problem (different categories) and a specific hint as to what's wrong (different lengths).
actually this is fine.
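A small sketch of the "general message plus specific hint" pattern agreed on here, in plain Python. The function name `check_comparable` is hypothetical, and the set-based check for differing category values is an assumption added for illustration; only the length check and the two message strings come from the diff above.

```python
def check_comparable(cats_a, cats_b):
    """Raise TypeError when two category lists cannot be compared."""
    msg = ("Categoricals can only be compared if "
           "'categories' are the same.")
    if len(cats_a) != len(cats_b):
        # General problem first, then the specific hint.
        raise TypeError(msg + " Categories are different lengths")
    if set(cats_a) != set(cats_b):
        # Assumed fallback: same length but different values.
        raise TypeError(msg)


try:
    check_comparable(["a", "b"], ["a", "b", "c"])
except TypeError as err:
    length_error = str(err)

print(length_error)
```

A test that matches only the shared prefix, as in `rec.match("Categoricals can only be compared")`, keeps passing whichever branch raises.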
minor comment, lgtm.
pandas/core/categorical.py
Outdated
if not (self.ordered == other.ordered):
    raise TypeError("Categoricals can only be compared if "
                    "'ordered' is the same")
na_mask = (self._codes == -1) | (other._codes == -1)
if not self.categories.equals(other.categories):
Maybe you can do `if not self.ordered and not self.categories.equals(..)`
to avoid doing this check in the ordered case (where the check is not needed, as it was already done above).
Fixes categorical comparison operations improperly considering ordering when two unordered categoricals are compared. Closes pandas-dev#16014
d619a76 to 4ec26d4
* upstream/master: (48 commits)
  BUG: Categorical comparison with unordered (pandas-dev#16339)
  ENH: Adding 'protocol' parameter to 'to_pickle'.
  PERF: improve MultiIndex get_loc performance (pandas-dev#16346)
  TST: remove pandas-datareader xfail as 0.4.0 works (pandas-dev#16374)
  TST: followup to pandas-dev#16364, catch errstate warnings (pandas-dev#16373)
  DOC: new oauth token
  TST: Add test for clip-na (pandas-dev#16369)
  ENH: Draft metadata specification doc for Apache Parquet (pandas-dev#16315)
  MAINT: Add .iml to .gitignore (pandas-dev#16368)
  BUG/API: Categorical constructor scalar categories (pandas-dev#16340)
  ENH: Provide dict object for to_dict() pandas-dev#16122 (pandas-dev#16220)
  PERF: improved clip performance (pandas-dev#16364)
  DOC: try new token for docs
  DOC: try with new secure token
  DOC: add developer section to the docs
  DEPS: Drop Python 3.4 support (pandas-dev#16303)
  DOC: remove credential helper
  DOC: force fetch on build docs
  DOC: redo dev docs access token
  DOC: add dataframe construction in merge_asof example (pandas-dev#16348)
  ...
Fixes categorical comparison operations improperly considering ordering when two unordered categoricals are compared. Closes pandas-dev#16014 (cherry picked from commit 91e9e52)
Fixes categorical comparison operations improperly considering
ordering when two unordered categoricals are compared.
Closes #16014