Equality between DataFrames misbehaves if columns contain NaN #18455

Closed
toobaz opened this issue Nov 23, 2017 · 17 comments · Fixed by #18600
Labels
Compat (pandas objects compatibility with Numpy or Python functions), Dtype Conversions (unexpected or buggy dtype conversions)

Comments

@toobaz
Member

toobaz commented Nov 23, 2017

Code Sample, a copy-pastable example if possible

In [2]: s = pd.DataFrame(-1, index=[1, np.nan, 2,],
   ...:                  columns=[3, np.nan, 1])
   ...: 

In [3]: s + s # good
Out[3]: 
       3.0  NaN    1.0
 1.0    -2    -2    -2
NaN     -2    -2    -2
 2.0    -2    -2    -2

In [4]: s == s # bad
Out[4]: 
       3.0 NaN    1.0
 1.0  True  NaN  True
NaN   True  NaN  True
 2.0  True  NaN  True

Problem description

While it is true that np.nan != np.nan, pandas disregards this in indexes (indeed, s.loc[:, np.nan] works), so element-wise equality should be consistent with that behaviour.
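
A quick sketch of the inconsistency (a hypothetical session mirroring the frame defined above): NaN is a perfectly usable label for lookup, yet the element-wise comparison drops the values of the NaN-labelled column.

import numpy as np
import pandas as pd

s = pd.DataFrame(-1, index=[1, np.nan, 2], columns=[3, np.nan, 1])

# Label-based lookup treats NaN as an ordinary label and succeeds
s.loc[:, np.nan]

# ...but on affected versions the NaN-labelled column of `s == s`
# comes back as all-NaN instead of all-True
(s == s)[np.nan]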

Expected Output

In [4]: s == s
Out[4]: 
       3.0  NaN   1.0
 1.0  True  True  True
NaN   True  True  True
 2.0  True  True  True

Output of pd.show_versions()

INSTALLED VERSIONS

commit: b45325e
python: 3.5.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.0-3-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: it_IT.UTF-8
LOCALE: it_IT.UTF-8

pandas: 0.22.0.dev0+201.gb45325e28.dirty
pytest: 3.2.3
pip: 9.0.1
setuptools: 36.7.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.0dev
tables: 3.3.0
numexpr: 2.6.1
feather: 0.3.1
matplotlib: 2.0.0
openpyxl: None
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.6
lxml: None
bs4: 4.5.3
html5lib: 0.999999999
sqlalchemy: 1.0.15
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: 0.2.1

@toobaz
Member Author

toobaz commented Nov 23, 2017

By the way: it works with MultiIndex columns:

In [2]: s = pd.DataFrame(-1, index=[1, np.nan, 2,],
   ...:                  columns=pd.MultiIndex.from_product([[2, np.nan], [3, np.nan]]))
   ...:                  

In [3]: s == s
Out[3]: 
         2         NaN      
         3   NaN     3   NaN
 1.0  True  True  True  True
NaN   True  True  True  True
 2.0  True  True  True  True

@toobaz
Member Author

toobaz commented Nov 23, 2017

I spoke too soon:

In [3]: s == s.copy()
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-3-d24aa33f67ac> in <module>()
----> 1 s == s.copy()

/home/nobackup/repo/pandas/pandas/core/ops.py in f(self, other)
   1324     def f(self, other):
   1325         if isinstance(other, ABCDataFrame):  # Another DataFrame
-> 1326             return self._compare_frame(other, func, str_rep)
   1327         elif isinstance(other, ABCSeries):
   1328             return self._combine_series_infer(other, func, try_cast=False)

/home/nobackup/repo/pandas/pandas/core/frame.py in _compare_frame(self, other, func, str_rep, try_cast)
   4008                              'DataFrame objects')
   4009         return self._compare_frame_evaluate(other, func, str_rep,
-> 4010                                             try_cast=try_cast)
   4011 
   4012     def _flex_compare_frame(self, other, func, str_rep, level, try_cast=True):

/home/nobackup/repo/pandas/pandas/core/frame.py in _compare_frame_evaluate(self, other, func, str_rep, try_cast)
   3987                 return dict((col, func(a[col], b[col])) for col in a.columns)
   3988 
-> 3989             new_data = expressions.evaluate(_compare, str_rep, self, other)
   3990             return self._constructor(data=new_data, index=self.index,
   3991                                      columns=self.columns, copy=False)

/home/nobackup/repo/pandas/pandas/core/computation/expressions.py in evaluate(op, op_str, a, b, use_numexpr, **eval_kwargs)
    202     use_numexpr = use_numexpr and _bool_arith_check(op_str, a, b)
    203     if use_numexpr:
--> 204         return _evaluate(op, op_str, a, b, **eval_kwargs)
    205     return _evaluate_standard(op, op_str, a, b)
    206 

/home/nobackup/repo/pandas/pandas/core/computation/expressions.py in _evaluate_numexpr(op, op_str, a, b, truediv, reversed, **eval_kwargs)
    117 
    118     if result is None:
--> 119         result = _evaluate_standard(op, op_str, a, b)
    120 
    121     return result

/home/nobackup/repo/pandas/pandas/core/computation/expressions.py in _evaluate_standard(op, op_str, a, b, **eval_kwargs)
     62         _store_test_result(False)
     63     with np.errstate(all='ignore'):
---> 64         return op(a, b)
     65 
     66 

/home/nobackup/repo/pandas/pandas/core/frame.py in _compare(a, b)
   3985 
   3986             def _compare(a, b):
-> 3987                 return dict((col, func(a[col], b[col])) for col in a.columns)
   3988 
   3989             new_data = expressions.evaluate(_compare, str_rep, self, other)

/home/nobackup/repo/pandas/pandas/core/frame.py in <genexpr>(.0)
   3985 
   3986             def _compare(a, b):
-> 3987                 return dict((col, func(a[col], b[col])) for col in a.columns)
   3988 
   3989             new_data = expressions.evaluate(_compare, str_rep, self, other)

/home/nobackup/repo/pandas/pandas/core/frame.py in __getitem__(self, key)
   2130             return self._getitem_frame(key)
   2131         elif is_mi_columns:
-> 2132             return self._getitem_multilevel(key)
   2133         else:
   2134             return self._getitem_column(key)

/home/nobackup/repo/pandas/pandas/core/frame.py in _getitem_multilevel(self, key)
   2174 
   2175     def _getitem_multilevel(self, key):
-> 2176         loc = self.columns.get_loc(key)
   2177         if isinstance(loc, (slice, Series, np.ndarray, Index)):
   2178             new_columns = self.columns[loc]

/home/nobackup/repo/pandas/pandas/core/indexes/multi.py in get_loc(self, key, method)
   2119             key = _values_from_object(key)
   2120             key = tuple(map(_maybe_str_to_time_stamp, key, self.levels))
-> 2121             return self._engine.get_loc(key)
   2122 
   2123         # -- partial selection or non-unique index

/home/nobackup/repo/pandas/pandas/_libs/index.pyx in pandas._libs.index.MultiIndexObjectEngine.get_loc (pandas/_libs/index.c:14965)()
    616         return super(MultiIndexObjectEngine, self).get_indexer(values)
    617 
--> 618     cpdef get_loc(self, object val):
    619 
    620         # convert a MI to an ndarray

/home/nobackup/repo/pandas/pandas/_libs/index.pyx in pandas._libs.index.MultiIndexObjectEngine.get_loc (pandas/_libs/index.c:14886)()
    621         if hasattr(val, 'values'):
    622             val = val.values
--> 623         return super(MultiIndexObjectEngine, self).get_loc(val)
    624 
    625 

/home/nobackup/repo/pandas/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5832)()
    137             util.set_value_at(arr, loc, value)
    138 
--> 139     cpdef get_loc(self, object val):
    140         if is_definitely_invalid_key(val):
    141             raise TypeError("'{val}' is an invalid key".format(val=val))

/home/nobackup/repo/pandas/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5678)()
    159 
    160         try:
--> 161             return self.mapping.get_item(val)
    162         except (TypeError, ValueError):
    163             raise KeyError(val)

/home/nobackup/repo/pandas/pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:21018)()
   1263                                        sizeof(uint32_t)) # flags
   1264 
-> 1265     cpdef get_item(self, object val):
   1266         cdef khiter_t k
   1267         if val != val or val is None:

/home/nobackup/repo/pandas/pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20972)()
   1271             return self.table.vals[k]
   1272         else:
-> 1273             raise KeyError(val)
   1274 
   1275     cpdef set_item(self, object key, Py_ssize_t val):

KeyError: (2.0, nan)

@jreback
Contributor

jreback commented Nov 24, 2017

hmm this should work, we already use array_equivalent to compare (inside Index.equals).

In [1]: from pandas.core.dtypes.missing import array_equivalent

In [2]: i = pd.Index([1.0, np.nan, 3])

In [3]: i
Out[3]: Float64Index([1.0, nan, 3.0], dtype='float64')

In [4]: array_equivalent(i.values, i.values)
Out[4]: True
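
The same NaN-aware logic is what Index.equals relies on. As a minimal sanity check (not from the original thread), two indexes with NaN in the same position compare as equal at the index level:

import numpy as np
import pandas as pd

i = pd.Index([1.0, np.nan, 3])

# Index.equals goes through array_equivalent, which treats NaNs in
# matching positions as equal
i.equals(pd.Index([1.0, np.nan, 3]))  # True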

@jreback jreback added the Compat (pandas objects compatibility with Numpy or Python functions), Dtype Conversions (unexpected or buggy dtype conversions), and Difficulty Intermediate labels on Nov 24, 2017
@jreback jreback added this to the 0.22.0 milestone Nov 24, 2017
@toobaz
Member Author

toobaz commented Nov 24, 2017

More general question: is it a desired feature or a limitation that equality only works on objects whose index/columns have the same ordering? For example, compare Series(index=[1,2]) + Series(index=[2,1]) (works) with Series(index=[1,2]) == Series(index=[2,1]) (raises ValueError). The latter could in principle build an indexer, find out that the two indexes actually contain the same elements, and then compare the values. This clearly has a cost, but it could easily be avoided in those cases in which the indexes are in fact identical, so the change would not hurt performance for currently correct uses.
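
A minimal sketch of the asymmetry being described (behaviour as observed on the pandas versions discussed here):

import pandas as pd

s1 = pd.Series([10, 20], index=[1, 2])
s2 = pd.Series([30, 40], index=[2, 1])

# Arithmetic aligns on the labels: result has index [1, 2], values [50, 50]
s1 + s2

# Comparison does not align and raises instead
try:
    s1 == s2
except ValueError:
    pass  # "Can only compare identically-labeled Series objects"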

@jreback
Contributor

jreback commented Nov 24, 2017

see long discussion here: #1134

@toobaz
Member Author

toobaz commented Nov 24, 2017

see long discussion here: #1134

Interesting, but my understanding is that it does not consider the specific issue of having the same labels in a different order. I understand that the reason for not supporting comparison between different indexes is to avoid introducing NaNs (or dropping elements/rows). What I suggest instead is just to check whether the labels are equal after sorting.

@jreback
Contributor

jreback commented Nov 25, 2017

how could different orderings be considered equal?

@toobaz
Member Author

toobaz commented Nov 25, 2017

how could different orderings be considered equal?

My idea would be something like

if (self.index == oth.index).all():
    # compare self and oth, return result
if len(self.index.intersection(oth.index)) == len(self.index):
    # compare self and oth.loc[self.index], return result
raise ValueError
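
Spelled out a little more (a hypothetical helper, not actual pandas code), the proposal would look roughly like this:

import pandas as pd

def compare_reordered(left, right, op):
    """Hypothetical sketch: allow comparison when the two (unique) indexes
    contain the same labels, possibly in a different order."""
    if left.index.equals(right.index):
        # fast path: identical labels in identical order, no extra cost
        return op(left, right)
    if left.index.is_unique and len(left.index.intersection(right.index)) == len(left.index):
        # same set of labels on the left (possibly a superset on the right):
        # reindex `right` to the left's ordering, then compare
        return op(left, right.reindex(left.index))
    raise ValueError("Can only compare objects with compatible labels")

# usage sketch, e.g.: compare_reordered(s1, s2, operator.eq)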

@jreback
Contributor

jreback commented Nov 25, 2017

I am asking why you think it is a good idea to ignore ordering in an ordered array.

@toobaz
Member Author

toobaz commented Nov 25, 2017

Just because aligning is what pandas always does (with arithmetic and logic operations), and hence what users expect.

Or in other words: if our bools had NaNs, I'm pretty sure that equality would have behaved like the other operations from the beginning (allowing comparison of any two objects, aligning, keeping the left index, filling with NaNs). Now, I understand the desire not to have == return an object dtype. So what I suggest is just to fix those cases in which we can mimic the standard behavior while keeping the bool dtype: that is, the cases in which the set of labels on the left is (a subset of) the set of labels on the right. So:

  • what currently works would work the same
  • the cases I'm talking about would be supported at a computational cost which is already borne inside any standard operation (e.g. s1 + s2)
  • the cases which would still be unsupported would bear this cost too, but they raise an error anyway...

@shoyer
Member

shoyer commented Nov 25, 2017

Interesting, but my understanding is that it does not consider the specific issue of having the same labels but in a different order. I understand the reason not to support comparison between different indexes is to avoid NaNs (or dropping elements/rows). What I suggest instead is just to check if labels are equal after sorting.

I don't think this is a good idea. Most pandas operations already either (1) align arguments or (2) require identical labels. This would add a third type: (3) require same labels, in any order.

@toobaz
Member Author

toobaz commented Nov 25, 2017

I don't think this is a good idea. Most pandas operations already either (1) align arguments or (2) require identical labels. This would add a third type: (3) require same labels, in any order.

What kind of operations would be left in category (2)?

@shoyer
Member

shoyer commented Nov 25, 2017

.identical() and .equals() check identical labels, in order. I guess that's not quite an operation that requires identical labels, but it's a convenient way to check them.

@toobaz
Member Author

toobaz commented Nov 25, 2017

.identical() and .equals() check identical labels, in order.

Sure... but these are at the object level. In the same way, + (element-wise) aligns, but pd.concat (object-wise) doesn't (I mean: on the non-concatenated axis). The thing is: == is element-wise, and I have in mind the general rule "all element-wise operations always realign objects with any labels", which under my proposal would keep only one exception: "arguments to comparison operators/functions must include in their indexes all labels of self".

By the way, I'm not at all against having equality in your category (1) (dropping the exception I just described) and inserting NaNs... I didn't propose it only because the change would be bigger and casting bools to objects is sad (unless we use bool categoricals; not sure whether @jreback's idea was purely theoretical), but I would prefer it to the current state (and again, it would not have any impact on currently correct uses).

@toobaz
Member Author

toobaz commented Jan 16, 2018

Reminder: when this is fixed, remove workaround in test

@jorisvandenbossche
Member

By the way, I'm not at all against having equality in your category (1) (dropping the exception I just described), and inserting NaNs... I didn't propose it just because the change would be bigger and casting bools to objects is sad

I am not sure this was the reason. If comparison operations aligned, you would 1) align, introducing NaNs in the values, and 2) compare, and wherever there is a NaN you would just get False (just as you now get with already-aligned objects that contain NaNs).
So even if comparisons did alignment, you could still get a normally functioning boolean result.
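
For example (a minimal sketch, not from the thread), NaN compared to anything is simply False, so the result keeps the bool dtype:

import numpy as np
import pandas as pd

a = pd.Series([1.0, np.nan, 3.0])
b = pd.Series([1.0, np.nan, 4.0])

# NaN == NaN evaluates to False, so the result is [True, False, False]
# with dtype bool; no object dtype is needed
a == b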

I think one of the reasons not to let comparisons align was 1) to make Series behaviour consistent with DataFrame (but of course, we could also have changed the DataFrame behaviour to align as well) and 2) people liked the error as a sanity check (often, when doing a comparison, you want to use the result for boolean indexing, and silent alignment there might give unexpected results). One example use case that Wes gave: s1[1:] == s2[:1].
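
As a hedged illustration of that sanity check (behaviour as of the versions discussed here): the mis-aligned slices raise rather than silently aligning and feeding a surprising mask to boolean indexing.

import pandas as pd

s1 = pd.Series([1, 2, 3, 4])
s2 = pd.Series([1, 2, 3, 4])

# The slices carry different index labels, so the comparison raises a
# ValueError instead of quietly aligning them
try:
    s1[1:] == s2[:1]
except ValueError:
    pass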

@toobaz
Member Author

toobaz commented Jan 16, 2018

So even if comparisons do alignment you can still get a normal functioning boolean result.

Good point: comparison of NaNs is well defined.

make series behaviour consistent with dataframe (but of course, we could also have changed the dataframe behaviour to align as well)

Exactly

people liked the error as a sanity check

True. My idea of introducing NaNs would have provided this sanity check... but it's just too inconsistent. And while I would rather not have this sanity check, changing it now would be too disruptive.

I still think we could simply allow a different ordering not to matter for unique indexes containing the same elements.

toobaz added a commit to toobaz/pandas that referenced this issue Feb 4, 2018
toobaz added a commit to toobaz/pandas that referenced this issue Feb 5, 2018
toobaz added a commit to toobaz/pandas that referenced this issue Feb 5, 2018
toobaz added a commit to toobaz/pandas that referenced this issue Feb 7, 2018
toobaz added a commit to toobaz/pandas that referenced this issue Feb 7, 2018
toobaz added a commit to toobaz/pandas that referenced this issue Feb 12, 2018
toobaz added a commit to toobaz/pandas that referenced this issue Mar 31, 2018
toobaz added a commit to toobaz/pandas that referenced this issue Apr 1, 2018
toobaz added a commit to toobaz/pandas that referenced this issue Apr 1, 2018
toobaz added a commit that referenced this issue Apr 1, 2018
closes #18455
closes #19646

* TST: removed workaround for #18455

* TST: asv for DataFrame from dict with index/columns