Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Datetimelike equality comparisons with Categorical #30699

Closed
jschendel opened this issue Jan 4, 2020 · 12 comments · Fixed by #38986
Closed

BUG: Datetimelike equality comparisons with Categorical #30699

jschendel opened this issue Jan 4, 2020 · 12 comments · Fixed by #38986
Assignees
Labels
good first issue Needs Tests Unit test(s) needed to prevent regressions Numeric Operations Arithmetic, Comparison, and Logical operations
Milestone

Comments

@jschendel
Copy link
Member

Code Sample, a copy-pastable example if possible

Consider the following setup on master:

In [1]: import pandas as pd; pd.__version__
Out[1]: '0.26.0.dev0+1576.gdd0d353fb'

In [2]: dti = pd.date_range("2020", periods=3) 
   ...: dti_tz = pd.date_range("2020", periods=3, tz="UTC") 
   ...: tdi = pd.timedelta_range("0 days", periods=3) 
   ...: pi = pd.period_range("2020Q1", periods=3, freq="Q")

Equality comparisons with an equivalent Categorical are incorrect for DatetimeIndex:

In [3]: dti == pd.Categorical(dti)
Out[3]: array([False, False, False])

In [4]: dti_tz == pd.Categorical(dti_tz)
Out[4]: array([False, False, False])

Equality comparisons raise for PeriodIndex:

In [5]: pi == pd.Categorical(pi)
---------------------------------------------------------------------------
ValueError: Value must be Period, string, integer, or datetime

Looks good for TimedeltaIndex:

In [6]: tdi == pd.Categorical(tdi)
Out[6]: array([ True,  True,  True])

The incorrect behavior above is generally consistent when replacing the index with its extension array/Series equivalent, Categorical with CategoricalIndex/Series[Categorical], and == with !=.

The only exception appears to be that a couple cases work when when you have a Categorical/ CategoricalIndex on the RHS and an extension array on the LHS:

In [7]: pd.Categorical(dti) == dti.array
Out[7]: array([ True,  True,  True])

In [8]: pd.CategoricalIndex(pi) == pi.array
Out[8]: array([ True,  True,  True])

Though note that the above does not work for dti_tz.array

Problem description

Equality comparisons for datetimelike arrays/indexes are largely incorrect when comparing to equivalent categoricals. There is some tie in to #19513 but I think this specific component is pretty clear cut.

Expected Output

I'd expect all variants of == to in the examples above to return array([ True, True, True]), and all variants of != to return array([False, False, False]).

cc @jbrockmendel

@jschendel jschendel added Bug Datetime Datetime data dtype Period Period data type Categorical Categorical Data Type labels Jan 4, 2020
@jschendel jschendel added this to the Contributions Welcome milestone Jan 4, 2020
@jschendel
Copy link
Member Author

Worthwhile to note that equality comparisons for non-datetimelike vs. equivalent categoricals looks to be working (intervals being fixed in #30640).

Also looks like #30654 addresses the issue of PeriodIndex/PeriodArray raising and makes the behavior consistent with the DatetimeIndex/DatetimeArray examples.

@jbrockmendel
Copy link
Member

Does it work with the Categorical on the left? i.e. will it be sufficient to make the DTA/PA ops return NotImplemented?

@jschendel
Copy link
Member Author

Does it work with the Categorical on the left?

No, it works in a few additional cases that way but not for all cases, e.g.

In [1]: import pandas as pd; pd.__version__
Out[1]: '0.26.0.dev0+1637.g6d67cf9f0'

In [2]: dti_tz = pd.date_range("2020", periods=3, tz="UTC")

In [3]: pd.Categorical(dti_tz) == dti_tz
Out[3]: array([False, False, False])

In [4]: pd.Categorical(dti_tz) == dti_tz.array
Out[4]: array([False, False, False])

@tanmay1618
Copy link
Contributor

Working on this

@mroeschke
Copy link
Member

Looks like all these work as expected now. Could use tests

In [12]: In [2]: dti = pd.date_range("2020", periods=3)
    ...:    ...: dti_tz = pd.date_range("2020", periods=3, tz="UTC")
    ...:    ...: tdi = pd.timedelta_range("0 days", periods=3)
    ...:    ...: pi = pd.period_range("2020Q1", periods=3, freq="Q")

In [13]: pi == pd.Categorical(pi)
Out[13]: array([ True,  True,  True])

In [14]: dti == pd.Categorical(dti)
Out[14]: array([ True,  True,  True])

In [15]: dti_tz == pd.Categorical(dti_tz)
Out[15]: array([ True,  True,  True])

In [16]: tdi == pd.Categorical(tdi)
Out[16]: array([ True,  True,  True])

In [17]: pd.Categorical(dti) == dti.array
Out[17]: array([ True,  True,  True])

In [18]: pd.CategoricalIndex(pi) == pi.array
Out[18]: array([ True,  True,  True])

In [20]: pd.Categorical(dti_tz) == dti_tz.array
Out[20]: array([ True,  True,  True])

In [21]: pd.Categorical(dti_tz) == dti_tz
Out[21]: array([ True,  True,  True])

In [22]: pd.__version__
Out[22]: '1.1.0.dev0+1536.ga7fb88fd5'

@mroeschke mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed Bug Categorical Categorical Data Type Period Period data type Datetime Datetime data dtype labels May 11, 2020
@jbrockmendel
Copy link
Member

The equality cases are fixed by #34055, but the inequality cases are still outstanding

@mroeschke
Copy link
Member

Appears that the inequality cases are correctly returning an array of False?

In [2]: In [2]: dti = pd.date_range("2020", periods=3)
   ...:    ...: dti_tz = pd.date_range("2020", periods=3, tz="UTC")
   ...:    ...: tdi = pd.timedelta_range("0 days", periods=3)
   ...:    ...: pi = pd.period_range("2020Q1", periods=3, freq="Q")

In [3]: pi != pd.Categorical(pi)
Out[3]: array([False, False, False])

In [4]: dti != pd.Categorical(dti)
Out[4]: array([False, False, False])

In [5]: dti_tz != pd.Categorical(dti_tz)
Out[5]: array([False, False, False])

In [6]: tdi != pd.Categorical(tdi)
Out[6]: array([False, False, False])

In [7]: pd.Categorical(dti) != dti.array
Out[7]: array([False, False, False])

In [8]: pd.CategoricalIndex(pi) != pi.array
Out[8]: array([False, False, False])

In [9]: pd.Categorical(dti_tz) != dti_tz.array
Out[9]: array([False, False, False])

In [10]:  pd.Categorical(dti_tz) != dti_tz
Out[10]: array([False, False, False])

@jbrockmendel
Copy link
Member

i think it depends on if the Categorical is ordered. if not then these should raise

@ondrejzacha
Copy link

take

@ondrejzacha
Copy link

i think it depends on if the Categorical is ordered. if not then these should raise

@jbrockmendel What exact behaviour did you mean to check here? I believe other inequalities from the snippet of @mroeschke are already covered by #34055

@ondrejzacha ondrejzacha removed their assignment Jul 1, 2020
@jbrockmendel
Copy link
Member

i think it depends on if the Categorical is ordered. if not then these should raise

What exact behaviour did you mean to check here?

If a categorical is not ordered, then basically any inequality checks should raise TypeError.

@khintze1
Copy link

take

@jbrockmendel jbrockmendel added the Numeric Operations Arithmetic, Comparison, and Logical operations label Sep 21, 2020
@jreback jreback removed this from the Contributions Welcome milestone Jan 6, 2021
@jreback jreback added this to the 1.3 milestone Jan 6, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Needs Tests Unit test(s) needed to prevent regressions Numeric Operations Arithmetic, Comparison, and Logical operations
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants