Equality between DataFrames misbehaves if columns contain NaN #18455

Closed
toobaz opened this issue Nov 23, 2017 · 17 comments · Fixed by #18600
Labels
Compat (pandas objects compatibility with Numpy or Python functions), Dtype Conversions (unexpected or buggy dtype conversions)

Comments

@toobaz
Member

toobaz commented Nov 23, 2017

Code Sample, a copy-pastable example if possible

In [2]: s = pd.DataFrame(-1, index=[1, np.nan, 2,],
   ...:                  columns=[3, np.nan, 1])
   ...: 

In [3]: s + s # good
Out[3]: 
       3.0  NaN    1.0
 1.0    -2    -2    -2
NaN     -2    -2    -2
 2.0    -2    -2    -2

In [4]: s == s # bad
Out[4]: 
       3.0 NaN    1.0
 1.0  True  NaN  True
NaN   True  NaN  True
 2.0  True  NaN  True

Problem description

While it is true that np.nan != np.nan, pandas disregards this in indexes (indeed, s.loc[:, np.nan] works), so element-wise equality should be consistent with that behaviour.
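
A quick sketch of the inconsistency (a hypothetical session mirroring the frame defined above): NaN is a perfectly usable label for lookup, yet the element-wise comparison drops the values of the NaN-labelled column.

import numpy as np
import pandas as pd

s = pd.DataFrame(-1, index=[1, np.nan, 2], columns=[3, np.nan, 1])

# Label-based lookup treats NaN as an ordinary label and succeeds
s.loc[:, np.nan]

# ...but on affected versions the NaN-labelled column of `s == s`
# comes back as all-NaN instead of all-True
(s == s)[np.nan]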

Expected Output

In [4]: s == s
Out[4]: 
       3.0  NaN   1.0
 1.0  True  True  True
NaN   True  True  True
 2.0  True  True  True

Output of pd.show_versions()

INSTALLED VERSIONS

commit: b45325e
python: 3.5.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.0-3-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: it_IT.UTF-8
LOCALE: it_IT.UTF-8

pandas: 0.22.0.dev0+201.gb45325e28.dirty
pytest: 3.2.3
pip: 9.0.1
setuptools: 36.7.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.0dev
tables: 3.3.0
numexpr: 2.6.1
feather: 0.3.1
matplotlib: 2.0.0
openpyxl: None
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.6
lxml: None
bs4: 4.5.3
html5lib: 0.999999999
sqlalchemy: 1.0.15
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: 0.2.1

@toobaz
Member Author

toobaz commented Nov 23, 2017

By the way: it works with MultiIndex columns:

In [2]: s = pd.DataFrame(-1, index=[1, np.nan, 2,],
   ...:                  columns=pd.MultiIndex.from_product([[2, np.nan], [3, np.nan]]))
   ...:                  

In [3]: s == s
Out[3]: 
         2         NaN      
         3   NaN     3   NaN
 1.0  True  True  True  True
NaN   True  True  True  True
 2.0  True  True  True  True

@toobaz
Member Author

toobaz commented Nov 23, 2017

I spoke too soon:

In [3]: s == s.copy()
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-3-d24aa33f67ac> in <module>()
----> 1 s == s.copy()

/home/nobackup/repo/pandas/pandas/core/ops.py in f(self, other)
   1324     def f(self, other):
   1325         if isinstance(other, ABCDataFrame):  # Another DataFrame
-> 1326             return self._compare_frame(other, func, str_rep)
   1327         elif isinstance(other, ABCSeries):
   1328             return self._combine_series_infer(other, func, try_cast=False)

/home/nobackup/repo/pandas/pandas/core/frame.py in _compare_frame(self, other, func, str_rep, try_cast)
   4008                              'DataFrame objects')
   4009         return self._compare_frame_evaluate(other, func, str_rep,
-> 4010                                             try_cast=try_cast)
   4011 
   4012     def _flex_compare_frame(self, other, func, str_rep, level, try_cast=True):

/home/nobackup/repo/pandas/pandas/core/frame.py in _compare_frame_evaluate(self, other, func, str_rep, try_cast)
   3987                 return dict((col, func(a[col], b[col])) for col in a.columns)
   3988 
-> 3989             new_data = expressions.evaluate(_compare, str_rep, self, other)
   3990             return self._constructor(data=new_data, index=self.index,
   3991                                      columns=self.columns, copy=False)

/home/nobackup/repo/pandas/pandas/core/computation/expressions.py in evaluate(op, op_str, a, b, use_numexpr, **eval_kwargs)
    202     use_numexpr = use_numexpr and _bool_arith_check(op_str, a, b)
    203     if use_numexpr:
--> 204         return _evaluate(op, op_str, a, b, **eval_kwargs)
    205     return _evaluate_standard(op, op_str, a, b)
    206 

/home/nobackup/repo/pandas/pandas/core/computation/expressions.py in _evaluate_numexpr(op, op_str, a, b, truediv, reversed, **eval_kwargs)
    117 
    118     if result is None:
--> 119         result = _evaluate_standard(op, op_str, a, b)
    120 
    121     return result

/home/nobackup/repo/pandas/pandas/core/computation/expressions.py in _evaluate_standard(op, op_str, a, b, **eval_kwargs)
     62         _store_test_result(False)
     63     with np.errstate(all='ignore'):
---> 64         return op(a, b)
     65 
     66 

/home/nobackup/repo/pandas/pandas/core/frame.py in _compare(a, b)
   3985 
   3986             def _compare(a, b):
-> 3987                 return dict((col, func(a[col], b[col])) for col in a.columns)
   3988 
   3989             new_data = expressions.evaluate(_compare, str_rep, self, other)

/home/nobackup/repo/pandas/pandas/core/frame.py in <genexpr>(.0)
   3985 
   3986             def _compare(a, b):
-> 3987                 return dict((col, func(a[col], b[col])) for col in a.columns)
   3988 
   3989             new_data = expressions.evaluate(_compare, str_rep, self, other)

/home/nobackup/repo/pandas/pandas/core/frame.py in __getitem__(self, key)
   2130             return self._getitem_frame(key)
   2131         elif is_mi_columns:
-> 2132             return self._getitem_multilevel(key)
   2133         else:
   2134             return self._getitem_column(key)

/home/nobackup/repo/pandas/pandas/core/frame.py in _getitem_multilevel(self, key)
   2174 
   2175     def _getitem_multilevel(self, key):
-> 2176         loc = self.columns.get_loc(key)
   2177         if isinstance(loc, (slice, Series, np.ndarray, Index)):
   2178             new_columns = self.columns[loc]

/home/nobackup/repo/pandas/pandas/core/indexes/multi.py in get_loc(self, key, method)
   2119             key = _values_from_object(key)
   2120             key = tuple(map(_maybe_str_to_time_stamp, key, self.levels))
-> 2121             return self._engine.get_loc(key)
   2122 
   2123         # -- partial selection or non-unique index

/home/nobackup/repo/pandas/pandas/_libs/index.pyx in pandas._libs.index.MultiIndexObjectEngine.get_loc (pandas/_libs/index.c:14965)()
    616         return super(MultiIndexObjectEngine, self).get_indexer(values)
    617 
--> 618     cpdef get_loc(self, object val):
    619 
    620         # convert a MI to an ndarray

/home/nobackup/repo/pandas/pandas/_libs/index.pyx in pandas._libs.index.MultiIndexObjectEngine.get_loc (pandas/_libs/index.c:14886)()
    621         if hasattr(val, 'values'):
    622             val = val.values
--> 623         return super(MultiIndexObjectEngine, self).get_loc(val)
    624 
    625 

/home/nobackup/repo/pandas/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5832)()
    137             util.set_value_at(arr, loc, value)
    138 
--> 139     cpdef get_loc(self, object val):
    140         if is_definitely_invalid_key(val):
    141             raise TypeError("'{val}' is an invalid key".format(val=val))

/home/nobackup/repo/pandas/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5678)()
    159 
    160         try:
--> 161             return self.mapping.get_item(val)
    162         except (TypeError, ValueError):
    163             raise KeyError(val)

/home/nobackup/repo/pandas/pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:21018)()
   1263                                        sizeof(uint32_t)) # flags
   1264 
-> 1265     cpdef get_item(self, object val):
   1266         cdef khiter_t k
   1267         if val != val or val is None:

/home/nobackup/repo/pandas/pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20972)()
   1271             return self.table.vals[k]
   1272         else:
-> 1273             raise KeyError(val)
   1274 
   1275     cpdef set_item(self, object key, Py_ssize_t val):

KeyError: (2.0, nan)

@jreback
Contributor

jreback commented Nov 24, 2017

hmm this should work, we already use array_equivalent to compare (inside Index.equals).

In [1]: from pandas.core.dtypes.missing import array_equivalent

In [2]: i = pd.Index([1.0, np.nan, 3])

In [3]: i
Out[3]: Float64Index([1.0, nan, 3.0], dtype='float64')

In [4]: array_equivalent(i.values, i.values)
Out[4]: True
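
The same NaN-aware logic is what Index.equals relies on. As a minimal sanity check (not from the original thread), two indexes with NaN in the same position compare as equal at the index level:

import numpy as np
import pandas as pd

i = pd.Index([1.0, np.nan, 3])

# Index.equals goes through array_equivalent, which treats NaNs in
# matching positions as equal
i.equals(pd.Index([1.0, np.nan, 3]))  # True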

@jreback jreback added the Compat (pandas objects compatibility with Numpy or Python functions), Dtype Conversions (unexpected or buggy dtype conversions), and Difficulty Intermediate labels on Nov 24, 2017
@jreback jreback added this to the 0.22.0 milestone Nov 24, 2017
@toobaz
Member Author

toobaz commented Nov 24, 2017

More general question: is it a desired feature or a limitation that equality only works on objects whose index/columns have the same ordering? For example, compare Series(index=[1,2]) + Series(index=[2,1]) (works) with Series(index=[1,2]) == Series(index=[2,1]) (raises ValueError). The latter could in principle build an indexer, find out that the two indexes actually contain the same elements, and then compare the values. This clearly has a cost, but it could easily be avoided in those cases in which the indexes are in fact identical, so the change would not hurt performance for currently correct uses.
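
A minimal sketch of the asymmetry being described (behaviour as observed on the pandas versions discussed here):

import pandas as pd

s1 = pd.Series([10, 20], index=[1, 2])
s2 = pd.Series([30, 40], index=[2, 1])

# Arithmetic aligns on the labels: result has index [1, 2], values [50, 50]
s1 + s2

# Comparison does not align and raises instead
try:
    s1 == s2
except ValueError:
    pass  # "Can only compare identically-labeled Series objects"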

@jreback
Contributor

jreback commented Nov 24, 2017

see long discussion here: #1134

@toobaz
Member Author

toobaz commented Nov 24, 2017

see long discussion here: #1134

Interesting, but my understanding is that it does not consider the specific issue of having the same labels in a different order. I understand that the reason for not supporting comparison between different indexes is to avoid introducing NaNs (or dropping elements/rows). What I suggest instead is just to check whether the labels are equal after sorting.

@jreback
Contributor

jreback commented Nov 25, 2017

how could different orderings be considered equal?

@toobaz
Member Author

toobaz commented Nov 25, 2017

how could different orderings be considered equal?

My idea would be something like

if (self.index == oth.index).all():
    # compare self and oth, return result
if len(self.index.intersection(oth.index)) == len(self.index):
    # compare self and oth.loc[self.index], return result
raise ValueError
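
Spelled out a little more (a hypothetical helper, not actual pandas code), the proposal would look roughly like this:

import pandas as pd

def compare_reordered(left, right, op):
    """Hypothetical sketch: allow comparison when the two (unique) indexes
    contain the same labels, possibly in a different order."""
    if left.index.equals(right.index):
        # fast path: identical labels in identical order, no extra cost
        return op(left, right)
    if left.index.is_unique and len(left.index.intersection(right.index)) == len(left.index):
        # same set of labels on the left (possibly a superset on the right):
        # reindex `right` to the left's ordering, then compare
        return op(left, right.reindex(left.index))
    raise ValueError("Can only compare objects with compatible labels")

# usage sketch, e.g.: compare_reordered(s1, s2, operator.eq)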

@jreback
Contributor

jreback commented Nov 25, 2017

I am asking why you think it is a good idea to ignore ordering in an ordered array.

@toobaz
Member Author

toobaz commented Nov 25, 2017

Just because aligning is what pandas always does (with arithmetic and logic operations), and hence what users expect.

Or in other words: if our bools had NaNs, I'm pretty sure that equality would have behaved like the other operations from the beginning (allowing comparison of any two objects, aligning, keeping the left index, filling with NaNs). Now, I understand the desire not to have == return an object dtype. So what I suggest is just to fix those cases in which we can mimic the standard behavior while keeping the bool dtype: that is, the cases in which the set of labels on the left is (a subset of) the set of labels on the right. So:

  • what currently works would work the same
  • the cases I'm talking about would be supported at a computational cost which is already borne inside any standard operation (e.g. s1 + s2)
  • the cases which would still be unsupported would bear this cost too, but they raise an error anyway...

@shoyer
Member

shoyer commented Nov 25, 2017

Interesting, but my understanding is that it does not consider the specific issue of having the same labels but in a different order. I understand the reason not to support comparison between different indexes is to avoid NaNs (or dropping elements/rows). What I suggest instead is just to check if labels are equal after sorting.

I don't think this is a good idea. Most pandas operations already either (1) align arguments or (2) require identical labels. This would add a third type: (3) require same labels, in any order.

@toobaz
Member Author

toobaz commented Nov 25, 2017

I don't think this is a good idea. Most pandas operations already either (1) align arguments or (2) require identical labels. This would add a third type: (3) require same labels, in any order.

What kind of operations would be left in category (2)?

@shoyer
Member

shoyer commented Nov 25, 2017

.identical() and .equals() check identical labels, in order. I guess that's not quite an operation that requires identical labels, but it's a convenient way to check them.

@toobaz
Member Author

toobaz commented Nov 25, 2017

.identical() and .equals() check identical labels, in order.

Sure... but these are at the object level. In the same way, + (element-wise) aligns, but pd.concat (object-wise) doesn't (I mean: on the non-concatenated axis). The thing is: == is element-wise, and I have in mind the general rule "all element-wise operations always realign objects with any labels", which under my proposal would keep only one exception: "arguments to comparison operators/functions must include in their indexes all labels of self".

By the way, I'm not at all against having equality in your category (1) (dropping the exception I just described) and inserting NaNs... I didn't propose it only because the change would be bigger and casting bools to objects is sad (unless we use bool categoricals; not sure whether @jreback's idea was purely theoretical), but I would prefer it to the current state (and again, it would not have any impact on currently correct uses).

@toobaz
Member Author

toobaz commented Jan 16, 2018

Reminder: when this is fixed, remove workaround in test

@jorisvandenbossche
Member

By the way, I'm not at all against having equality in your category (1) (dropping the exception I just described), and inserting NaNs... I didn't propose it just because the change would be bigger and casting bools to objects is sad

I am not sure this was the reason. If comparison operations aligned, you would 1) align, introducing NaNs in the values, and 2) compare, and wherever there is a NaN you would just get False (just as you now get with already-aligned objects that contain NaNs).
So even if comparisons did alignment, you could still get a normally functioning boolean result.
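
For example (a minimal sketch, not from the thread), NaN compared to anything is simply False, so the result keeps the bool dtype:

import numpy as np
import pandas as pd

a = pd.Series([1.0, np.nan, 3.0])
b = pd.Series([1.0, np.nan, 4.0])

# NaN == NaN evaluates to False, so the result is [True, False, False]
# with dtype bool; no object dtype is needed
a == b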

I think one of the reasons not to let comparisons align was 1) to make Series behaviour consistent with DataFrame (but of course, we could also have changed the DataFrame behaviour to align as well) and 2) people liked the error as a sanity check (often, when doing a comparison, you want to use the result for boolean indexing, and silent alignment there might give unexpected results). One example use case that Wes gave: s1[1:] == s2[:1].
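
As a hedged illustration of that sanity check (behaviour as of the versions discussed here): the mis-aligned slices raise rather than silently aligning and feeding a surprising mask to boolean indexing.

import pandas as pd

s1 = pd.Series([1, 2, 3, 4])
s2 = pd.Series([1, 2, 3, 4])

# The slices carry different index labels, so the comparison raises a
# ValueError instead of quietly aligning them
try:
    s1[1:] == s2[:1]
except ValueError:
    pass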

@toobaz
Member Author

toobaz commented Jan 16, 2018

So even if comparisons do alignment you can still get a normal functioning boolean result.

Good point: comparison of NaNs is well defined.

make series behaviour consistent with dataframe (but of course, we could also have changed the dataframe behaviour to align as well)

Exactly

people liked the error as a sanity check

True. My idea of introducing NaNs would have provided this sanity check... but it's just too inconsistent. And while I would rather not have this sanity check, changing it now would be too disruptive.

I still think we could simply allow a different ordering not to matter for unique indexes containing the same elements.

toobaz added a commit to toobaz/pandas that referenced this issue Feb 4, 2018
toobaz added a commit to toobaz/pandas that referenced this issue Feb 5, 2018
toobaz added a commit to toobaz/pandas that referenced this issue Feb 5, 2018
toobaz added a commit to toobaz/pandas that referenced this issue Feb 7, 2018
toobaz added a commit to toobaz/pandas that referenced this issue Feb 7, 2018
toobaz added a commit to toobaz/pandas that referenced this issue Feb 12, 2018
toobaz added a commit to toobaz/pandas that referenced this issue Mar 31, 2018
toobaz added a commit to toobaz/pandas that referenced this issue Apr 1, 2018
toobaz added a commit to toobaz/pandas that referenced this issue Apr 1, 2018
toobaz added a commit that referenced this issue Apr 1, 2018
closes #18455
closes #19646

* TST: removed workaround for #18455

* TST: asv for DataFrame from dict with index/columns