BUG: Row-wise comparison between two series always evaluates to all `False` when one series contains `pd.NA` #45599

wl2522 · 2022-01-24T19:36:58Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

a = pd.Series([1, 2, 3])
b = pd.Series([1, 2, pd.NA])

print(a == b)

0    False
1    False
2    False
dtype: bool

print(a.eq(b))

0    False
1    False
2    False
dtype: bool

Issue Description

In my actual use case, I'm performing row-wise comparisons between an integer column and the same column shifted by various periods

column == column.shift(periods=i) for 1 <= i <= 6

to check if a previous row's value is the same as the current row's value.

Because of the behavior described in my example, these comparisons all incorrectly evaluate to columns filled with False values, even when some rows hold the same value in both columns.
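A minimal sketch of the shifted comparison described above, assuming the column can be stored with the nullable Int64 dtype (the report does not state the real column's dtype). The sample values and the names `column` and `matches` are illustrative only.

```python
import pandas as pd

# Hypothetical sample data; using the nullable "Int64" dtype is an assumption.
column = pd.Series([1, 1, 2, 2, 2, 3], dtype="Int64")

matches = {}
for i in range(1, 7):  # 1 <= i <= 6
    shifted = column.shift(periods=i)
    # The comparison yields a "boolean" Series with <NA> wherever the
    # shifted value is missing; fillna(False) treats those rows as "no match".
    matches[i] = (column == shifted).fillna(False)

print(matches[1])
```

Filling missing results with False is one possible choice here; keeping `<NA>` may be preferable if "unknown" should stay distinct from "different".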

Note: I did not try reproducing this bug with the main branch version of pandas but I scanned through the list of commit messages from commits pushed since pandas version 1.4.0 and did not notice any that sound like they would address this issue.

Expected Behavior

Since pandas.Series.eq performs an element-wise comparison between the two series, I would expect comparisons involving pd.NA to behave like those involving np.nan:

import numpy as np

c = pd.Series([1.0, 2.0, 3.0])
d = pd.Series([1.0, 2.0, np.nan])
print(c == d)

0     True
1     True
2    False
dtype: bool

print(c.eq(d))

0     True
1     True
2    False
dtype: bool

Installed Versions

INSTALLED VERSIONS
------------------
commit           : bb1f651536508cdfef8550f93ace7849b00046ee
python           : 3.8.12.final.0
python-bits      : 64
OS               : Linux
OS-release       : 5.4.0-1059-aws
Version          : #62~18.04.1-Ubuntu SMP Fri Oct 22 21:51:38 UTC 2021
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : C.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.4.0
numpy            : 1.21.2
pytz             : 2021.3
dateutil         : 2.8.2
pip              : 21.2.4
setuptools       : 58.0.4
Cython           : 0.29.25
pytest           : 6.2.5
hypothesis       : None
sphinx           : 4.2.0
blosc            : None
feather          : None
xlsxwriter       : 3.0.2
lxml.etree       : 4.7.1
html5lib         : 1.1
pymysql          : None
psycopg2         : None
jinja2           : 2.11.3
IPython          : 7.29.0
pandas_datareader: None
bs4              : 4.10.0
bottleneck       : 1.3.2
fastparquet      : None
fsspec           : 2022.01.0
gcsfs            : None
matplotlib       : 3.5.0
numba            : 0.51.2
numexpr          : 2.8.1
odfpy            : None
openpyxl         : 3.0.9
pandas_gbq       : None
pyarrow          : None
pyreadstat       : None
pyxlsb           : None
s3fs             : None
scipy            : 1.7.3
sqlalchemy       : 1.4.27
tables           : 3.6.1
tabulate         : 0.8.9
xarray           : None
xlrd             : 2.0.1
xlwt             : 1.3.0
zstandard        : None

phofl · 2022-01-25T21:19:35Z

The Series with the NA value has dtype object, hence this result is correct. When you specify Int64 dtype, you get the expected result:

a = pd.Series([1, 2, 3])
b = pd.Series([1, 2, pd.NA], dtype="Int64")
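For completeness, a short sketch of the comparison with the Int64 dtype specified. Note that with nullable dtypes the missing position compares as `<NA>` rather than False, and the result has the nullable "boolean" dtype; the name `result` is illustrative.

```python
import pandas as pd

a = pd.Series([1, 2, 3])
b = pd.Series([1, 2, pd.NA], dtype="Int64")

# Element-wise comparison against a nullable integer Series.
result = a == b
print(result)

# If a plain True/False answer is needed, missing results can be filled.
print(result.fillna(False))
```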

jbrockmendel · 2022-01-25T21:58:18Z

@phofl I don't think this is cut-and-dried; we'd expect the first two elements to be True. There are a couple of issues about pd.NA not behaving well inside object dtype Series/arrays.

wl2522 · 2022-01-25T22:52:51Z

thank you @phofl for pointing out the dtype difference between the two series!

as @jbrockmendel implied, it seems like there are at least two issues/behaviors happening here:

1. creating a series that mixes int with pd.NA without specifying dtype automatically creates an object series, which was reported in "BUG: mix of int and pd.NA defaults to object dtype" (#33662) and sounds like it may be changed in the future?
2. there's some underlying behavior beyond just the dtype difference that i'm not understanding because modifying my example so that it doesn't involve pd.NA gives the expected result:

a = pd.Series([1, 2, 3], dtype="Int64")
c = pd.Series([1, 2, 5], dtype=object)
print(a == c)

0     True
1     True
2    False
dtype: boolean

edit: replacing the 5 in the last row of series c with other types of data such as 'a' or np.nan also gives the expected result of True, True, False

jbrockmendel · 2022-01-25T23:58:37Z

> creating a series that mixes int with pd.NA without specifying dtype automatically creates an object series

That is expected for the foreseeable future.
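A quick way to check which dtype was inferred, and to opt in to the nullable integer dtype explicitly, might look like the sketch below. The variable names are illustrative, and the inferred dtype noted in the comment reflects the behavior described in this thread rather than a guarantee for every pandas version.

```python
import pandas as pd

# Without an explicit dtype, mixing ints with pd.NA is reported in this
# thread to produce an object-dtype Series.
b = pd.Series([1, 2, pd.NA])
print(b.dtype)

# Opting in to the nullable integer dtype explicitly.
b_nullable = b.astype("Int64")
print(b_nullable.dtype)
```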

> there's some underlying behavior beyond just the dtype difference that i'm not understanding because modifying my example so that it doesn't involve pd.NA gives the expected result:

I'm pretty sure this is driven by pd.NA's PITA behavior when inside an object-dtype array-like; xref #32931, #33066
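As background, the scalar semantics of pd.NA differ from np.nan in a way that makes generic element-wise code awkward. The sketch below only shows those scalar semantics; whether they are the exact mechanism behind the all-False result is not established in this thread, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# np.nan compares as an ordinary float: the result is a plain bool.
nan_result = np.nan == 1

# pd.NA propagates through comparisons: the result is pd.NA, not a bool.
na_result = pd.NA == 1

# pd.NA has no unambiguous truth value, so coercing it to bool raises.
try:
    bool(na_result)
    na_is_ambiguous = False
except TypeError:
    na_is_ambiguous = True

print(nan_result, na_result, na_is_ambiguous)
```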

wl2522 added the Bug and Needs Triage labels on Jan 24, 2022

phofl closed this as completed Jan 25, 2022

phofl added the ExtensionArray, Usage Question, and Numeric Operations labels and removed the Bug and Needs Triage labels on Jan 25, 2022
