Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Categorical comparison with unordered #16339

Merged
merged 1 commit into from
May 18, 2017

Conversation

TomAugspurger
Copy link
Contributor

Fixes categorical comparison operations improperly considering
ordering when two unordered categoricals are compared.

Closes #16014

@TomAugspurger TomAugspurger added the Categorical Categorical Data Type label May 12, 2017
@TomAugspurger TomAugspurger added this to the 0.20.2 milestone May 12, 2017
@codecov
Copy link

codecov bot commented May 12, 2017

Codecov Report

Merging #16339 into master will decrease coverage by 0.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #16339      +/-   ##
==========================================
- Coverage   90.36%   90.34%   -0.02%     
==========================================
  Files         161      161              
  Lines       50897    50905       +8     
==========================================
- Hits        45993    45992       -1     
- Misses       4904     4913       +9
Flag Coverage Δ
#multiple 88.13% <100%> (ø) ⬆️
#single 40.18% <0%> (-0.12%) ⬇️
Impacted Files Coverage Δ
pandas/core/categorical.py 95.9% <100%> (+0.04%) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.59% <0%> (-0.1%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 50e95e0...2d47107. Read the comment docs.

@codecov
Copy link

codecov bot commented May 12, 2017

Codecov Report

❗ No coverage uploaded for pull request base (master@539de79). Click here to learn what that means.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master   #16339   +/-   ##
=========================================
  Coverage          ?    90.4%           
=========================================
  Files             ?      161           
  Lines             ?    50975           
  Branches          ?        0           
=========================================
  Hits              ?    46086           
  Misses            ?     4889           
  Partials          ?        0
Flag Coverage Δ
#multiple 88.18% <100%> (?)
#single 40.19% <0%> (?)
Impacted Files Coverage Δ
pandas/core/categorical.py 95.91% <100%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 539de79...4ec26d4. Read the comment docs.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

result = c1 == c2

assert result[0]
assert not result[1]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally find tm.assert_array_equal(c1 == c2, np.array([True, False])) clearer in this case (but that is maybe taste :-))

def test_unordered_different_order_equal(self):
# https://github.com/pandas-dev/pandas/issues/16014
c1 = Categorical(['a', 'b'], categories=['a', 'b'], ordered=False)
c2 = Categorical(['a', 'b'], categories=['b', 'a'], ordered=False)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A lot of the tests above are with Series instead of Categorical.
Don't need to change all, but maybe good to test this with a Series as well

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor comments, otherwise lgtm.

@@ -453,6 +453,14 @@ the original values:

np.asarray(cat) > base

When you compare two unordered categoricals with the same categories, the order is not considered:

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

versionadded tag

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think not necessary since it's a bugfix.

msg = ("Categoricals can only be compared if "
"'categories' are the same")
if len(self.categories) != len(other.categories):
raise TypeError(msg)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe this could be a more specialized message (e.g the len comparison)

c2 = Categorical(['a', 'c'], categories=['c', 'a'], ordered=False)
with pytest.raises(TypeError) as rec:
c1 == c2
assert rec.match("Categoricals can only be compared")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should choose a way to do this, this way or the tm.assert_raises_regex

na_mask = (self._codes == -1) | (other._codes == -1)
if not self.ordered:
# Comparison uses codes, so align theirs to ours
other_codes = _get_codes_for_values(other, self.categories)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only needed when the categories are not equal?

Might be good to do a quick perf check for the basic case of comparing categoricals with equal categories

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch. Just pushed a commit fixing this to only do the recoding when the categories don't match.

msg = ("Categoricals can only be compared if "
"'categories' are the same.")
if len(self.categories) != len(other.categories):
raise TypeError(msg + " Categories are different lengths")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you are including the origninal message here, but it its a little awkward

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What did you have in mind? I like how it's currently "Categoricals can only be compared if 'categories' are the same. Categories are different lengths." Since it's the general problem (different categories) and a specific hint
as to what's wrong (different lengths)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually this is fine.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor comment, lgtm.

if not (self.ordered == other.ordered):
raise TypeError("Categoricals can only be compared if "
"'ordered' is the same")
na_mask = (self._codes == -1) | (other._codes == -1)
if not self.categories.equals(other.categories):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe you can do a if not self.ordered and and not self.categories.equals(..) to avoid doing this check if the ordered case (when the check is not needed, as it was already checked above)

Fixes categorical comparison operations improperly considering
ordering when two unordered categoricals are compared.

Closes pandas-dev#16014
@TomAugspurger TomAugspurger merged commit 91e9e52 into pandas-dev:master May 18, 2017
pawroman added a commit to pawroman/pandas that referenced this pull request May 18, 2017
* upstream/master: (48 commits)
  BUG: Categorical comparison with unordered (pandas-dev#16339)
  ENH: Adding 'protocol' parameter to 'to_pickle'.
  PERF: improve MultiIndex get_loc performance (pandas-dev#16346)
  TST: remove pandas-datareader xfail as 0.4.0 works (pandas-dev#16374)
  TST: followup to pandas-dev#16364, catch errstate warnings (pandas-dev#16373)
  DOC: new oauth token
  TST: Add test for clip-na (pandas-dev#16369)
  ENH: Draft metadata specification doc for Apache Parquet (pandas-dev#16315)
  MAINT: Add .iml to .gitignore (pandas-dev#16368)
  BUG/API: Categorical constructor scalar categories (pandas-dev#16340)
  ENH: Provide dict object for to_dict() pandas-dev#16122 (pandas-dev#16220)
  PERF: improved clip performance (pandas-dev#16364)
  DOC: try new token for docs
  DOC: try with new secure token
  DOC: add developer section to the docs
  DEPS: Drop Python 3.4 support (pandas-dev#16303)
  DOC: remove credential helper
  DOC: force fetch on build docs
  DOC: redo dev docs access token
  DOC: add dataframe construction in merge_asof example (pandas-dev#16348)
  ...
pcluo pushed a commit to pcluo/pandas that referenced this pull request May 22, 2017
Fixes categorical comparison operations improperly considering
ordering when two unordered categoricals are compared.

Closes pandas-dev#16014
TomAugspurger added a commit to TomAugspurger/pandas that referenced this pull request May 29, 2017
Fixes categorical comparison operations improperly considering
ordering when two unordered categoricals are compared.

Closes pandas-dev#16014
(cherry picked from commit 91e9e52)
@TomAugspurger TomAugspurger deleted the unordered-equal branch May 29, 2017 20:44
TomAugspurger added a commit that referenced this pull request May 30, 2017
Fixes categorical comparison operations improperly considering
ordering when two unordered categoricals are compared.

Closes #16014
(cherry picked from commit 91e9e52)
stangirala pushed a commit to stangirala/pandas that referenced this pull request Jun 11, 2017
Fixes categorical comparison operations improperly considering
ordering when two unordered categoricals are compared.

Closes pandas-dev#16014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Categorical Categorical Data Type
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants