Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Series.replace fails to replace value #32075

Closed
Tracked by #3
tsoernes opened this issue Feb 18, 2020 · 18 comments · Fixed by #51954
Closed
Tracked by #3

Series.replace fails to replace value #32075

tsoernes opened this issue Feb 18, 2020 · 18 comments · Fixed by #51954
Labels
good first issue NA - MaskedArrays Related to pd.NA and nullable extension arrays Needs Tests Unit test(s) needed to prevent regressions replace replace method

Comments

@tsoernes
Copy link

Code Sample, a copy-pastable example if possible

In [93]: ser.eq('nil').sum()
Out[93]: 1

In [94]: ser.replace('nil', pd.NA).eq('nil').sum()
/home/torstein/anaconda3/lib/python3.7/site-packages/pandas/core/missing.py:47: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  mask = arr == x
Out[94]: 1

In [95]: ser.loc[(ser == 'nil').fillna(False)] = pd.NA

In [104]: ser.eq('nil').sum()
Out[104]: 0

Output of pd.show_versions()

[paste the output of pd.show_versions() here below this line]

INSTALLED VERSIONS

commit : None
python : 3.7.5.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.18-100.fc30.x86_64
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : nb_NO.UTF-8
LOCALE : nb_NO.UTF-8

pandas : 1.0.1
numpy : 1.17.3
pytz : 2019.3
dateutil : 2.8.0
pip : 19.3.1
setuptools : 41.6.0.post20191030
Cython : 0.29.13
pytest : 5.2.2
hypothesis : None
sphinx : 2.2.1
blosc : None
feather : None
xlsxwriter : 1.2.2
lxml.etree : 4.4.1
html5lib : 1.0.1
pymysql : None
psycopg2 : 2.8.4 (dt dec pq3 ext lo64)
jinja2 : 2.10.3
IPython : 7.9.0
pandas_datareader: None
bs4 : 4.8.1
bottleneck : 1.2.1
fastparquet : None
gcsfs : None
lxml.etree : 4.4.1
matplotlib : 2.2.3
numexpr : 2.7.0
odfpy : None
openpyxl : 3.0.0
pandas_gbq : None
pyarrow : 0.15.1
pytables : None
pytest : 5.2.2
pyxlsb : None
s3fs : None
scipy : 1.3.1
sqlalchemy : 1.3.10
tables : 3.5.2
tabulate : 0.8.5
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.2
numba : 0.46.0

@MarcoGorelli
Copy link
Member

Thanks @tsoernes

Could you include a minimal reproducible example? I couldn't reproduce this

In [1]: import pandas as pd                                                                                                                                                                                      
In [2]: ser = pd.DataFrame({'a': ['nil']})                                                                                                                                                                       

In [3]: ser                                                                                                                                                                                                      
Out[3]: 
     a
0  nil

In [4]: ser.replace('nil', pd.NA)                                                                                                                                                                                
Out[4]: 
      a
0  <NA>

In [5]: ser.replace('nil', pd.NA).eq('nil').sum()                                                                                                                                                                
Out[5]: 
a    0
dtype: int64

@tsoernes
Copy link
Author

tsoernes commented Feb 18, 2020

@MarcoGorelli

In the following example, I extract out two rows from the dataframe and recreate it via a dict.
The first row has a 'nil' in the 'amount' column; the second row has no 'nil'.
If I only recreate the dataframe with the first row, then the problem does not show itself.
When going via JSON, the problem does not persist.

In [282]: di = df.loc[[26123, 26122]].to_dict()

In [283]: di
Out[283]: 
<redacted since problem isolated later in thread>

In [284]: df = pd.DataFrame.from_dict(di)

In [285]: df.eq('nil').sum()
/home/torstein/anaconda3/lib/python3.7/site-packages/pandas/core/ops/array_ops.py:253: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  res_values = method(rvalues)
Out[285]: 
id                  0
properties          0
url                 0
started_at          0
ended_at            0
investee_id         0
lead_investor_id    0
created_at          0
updated_at          0
round               0
amount              1
currency            0
dtype: int64

In [286]: df.replace('nil', pd.NA).eq('nil').sum()
/home/torstein/anaconda3/lib/python3.7/site-packages/pandas/core/missing.py:47: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  mask = arr == x
/home/torstein/anaconda3/lib/python3.7/site-packages/pandas/core/ops/array_ops.py:253: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  res_values = method(rvalues)
Out[289]: 
id                  0
properties          0
url                 0
started_at          0
ended_at            0
investee_id         0
lead_investor_id    0
created_at          0
updated_at          0
round               0
amount              1
currency            0
dtype: int64

@MarcoGorelli MarcoGorelli added the Needs Info Clarification about behavior needed to assess issue label Feb 18, 2020
@MarcoGorelli
Copy link
Member

MarcoGorelli commented Feb 18, 2020

@tsoernes OK, got it - it fails when there is another row with pd.NA:

>>> ser = pd.DataFrame({'a': ['nil', pd.NA]}) 
>>> ser.replace('nil', 'anything else')                                                                                                                                                                  
/home/SERILOCAL/m.gorelli/pandas/pandas/core/missing.py:48: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  mask = arr == x
      a
0   nil
1  <NA>

@MarcoGorelli MarcoGorelli added Bug NA - MaskedArrays Related to pd.NA and nullable extension arrays and removed Needs Info Clarification about behavior needed to assess issue labels Feb 18, 2020
@MarcoGorelli
Copy link
Member

MarcoGorelli commented Feb 18, 2020

Problem's related to this:

>>> np.array([['nil', pd.NA]]) == 'nil'
False
>>> np.array([['nil', 'anything else']]) == 'nil'
array([[ True, False]])

@AnnaDaglis
Copy link
Contributor

take

@AnnaDaglis
Copy link
Contributor

AnnaDaglis commented Mar 3, 2020

@MarcoGorelli
Is there any particular approach here you could suggest to fix the bug? There is no issue when using np.nan, not pd.NA.

>>> np.array([['nil', np.nan]]) == 'nil'
array([[ True, False]])

@MarcoGorelli
Copy link
Member

MarcoGorelli commented Mar 3, 2020 via email

@AnnaDaglis
Copy link
Contributor

AnnaDaglis commented Mar 5, 2020

The problem in this case lies in pandas.core.missing, mask_missing(arr, values_to_mask) function, the following clause:

if is_numeric_v_string_like(arr, x):
    # GH#29553 prevent numpy deprecation warnings
    mask = False
else:
    mask = arr == x

The problem arises here arr == x, which fails when arr contains pd.NA, as pointed out above.

@TomAugspurger @jorisvandenbossche
Do you have any suggestions how to tackle this bug? I noticed that handling pd.NA behaviour has been discussed in other issues, e.g. this one: #32265

@TomAugspurger
Copy link
Contributor

I'm not sure offhand, sorry. But it seems like NA values should be considered False there, so perhaps fill it with something other than the value to mask? Some of the methods in core.arrays.boolean may be helpful to look through.

@tsoernes
Copy link
Author

tsoernes commented Mar 9, 2020

Can we get this for 1.0.2 that would be awesome

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Mar 9, 2020

Note that this is working fine with string dtype:

In [4]: ser = pd.DataFrame({'a': ['nil', pd.NA]}, dtype="string") 

In [5]: ser.replace('nil', 'anything else')
Out[5]: 
               a
0  anything else
1           <NA>

So it is specifically for NA in object dtype (something we actually don't really test yet, I think).

The problem arises here arr == x, which fails when arr contains pd.NA, as pointed out above.

This might be related to the comparison deprecations in numpy (there are other cases where comparisons return a scalar True/False, instead of doing it element-wise), you also see the deprecation warning.
Now in general, numpy might not be able to handle this: the comparison with pd.NA returns pd.NA, so it cannot be put in a boolean ndarray (which numpy expects to do for a comparison, I suppose). So the deprecation might be correct that it will fail in the future.

If we want to handle pd.NA in object dtype better, we will need to start using masks as well, and not rely on numpy behaviour.

For example, also this is wrong:

In [9]: pd.Series([1, pd.NA], dtype=object) >= 1
Out[9]: 
0     True
1    False
dtype: bool

(we should probably open a separate issue about "pd.NA in object dtype", and the best way forward here generally)

@simonjayhawkins
Copy link
Member

(we should probably open a separate issue about "pd.NA in object dtype", and the best way forward here generally)

#32931

@AnnaDaglis AnnaDaglis removed their assignment Mar 27, 2020
@mroeschke mroeschke added the replace replace method label Apr 28, 2020
@tsibley
Copy link

tsibley commented Jul 30, 2020

@jorisvandenbossche

Note that this is working fine with string dtype:

As of 1.1.0 at least, it's not working fine:

>>> import pandas as pd
>>> pd.__version__
'1.1.0'

>>> s = pd.Series(["a", "", "b", pd.NA], dtype="string")
>>> s
0       a
1        
2       b
3    <NA>
dtype: string

>>> s.replace("", pd.NA)
.../lib/python3.6/site-packages/pandas/core/missing.py:49: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  mask = arr == x
0       a
1        
2       b
3    <NA>
dtype: object

@tsibley
Copy link

tsibley commented Jul 30, 2020

Note however, that the one-dict-arg form of replace does appear to work (!):

>>> s.replace({"": pd.NA})
0       a
1    <NA>
2       b
3    <NA>
dtype: object

but as described in #35268, it changes the dtype of the Series from whatever it was to object.

I'd (naively, perhaps) expect two-arg replace and one-dict-arg replace to both use the same codepath internally!

@tsibley
Copy link

tsibley commented Jul 30, 2020

But it seems like NA values should be considered False there, so perhaps fill it with something other than the value to mask?

This makes sense to me and matches my debugging. I'm testing this patch now:

diff --git a/pandas/core/missing.py b/pandas/core/missing.py
index 7802c5cbd..be1632315 100644
--- a/pandas/core/missing.py
+++ b/pandas/core/missing.py
@@ -46,7 +46,7 @@ def mask_missing(arr, values_to_mask):
                 # GH#29553 prevent numpy deprecation warnings
                 mask = False
             else:
-                mask = arr == x
+                mask = (arr == x).fillna(False)
 
             # if x is a string and arr is not, then we get False and we must
             # expand the mask to size arr.shape
@@ -57,7 +57,7 @@ def mask_missing(arr, values_to_mask):
                 # GH#29553 prevent numpy deprecation warnings
                 mask |= False
             else:
-                mask |= arr == x
+                mask |= (arr == x).fillna(False)
 
     if na_mask.any():
         if mask is None:

@tsibley
Copy link

tsibley commented Jul 31, 2020

The patch above doesn't work (unsurprisingly) because sometimes arr is a Pandas array and sometimes it's a NumPy array, but only Pandas arrays have a fillna method.

The patch below seems to work with some manual testing and the result of ./test_fast.sh is the same before/after applying it:

diff --git a/pandas/core/missing.py b/pandas/core/missing.py
index 7802c5cbd..f8280c0a2 100644
--- a/pandas/core/missing.py
+++ b/pandas/core/missing.py
@@ -46,7 +46,7 @@ def mask_missing(arr, values_to_mask):
                 # GH#29553 prevent numpy deprecation warnings
                 mask = False
             else:
-                mask = arr == x
+                mask = np.where(~isna(arr), arr, np.full_like(arr, np.nan)) == x
 
             # if x is a string and arr is not, then we get False and we must
             # expand the mask to size arr.shape
@@ -57,7 +57,7 @@ def mask_missing(arr, values_to_mask):
                 # GH#29553 prevent numpy deprecation warnings
                 mask |= False
             else:
-                mask |= arr == x
+                mask |= np.where(~isna(arr), arr, np.full_like(arr, np.nan)) == x
 
     if na_mask.any():
         if mask is None:

This replaces pd.NA values with NumPy NaN and NaT values, which do compare with == correctly and produce a boolean vector instead of a boolean scalar.

I want to write an actual set of tests cases for this bug for various dtypes though before submitting a PR. I believe the patch above may still fail for pd.Period types.

@MarcusJellinghaus
Copy link

MarcusJellinghaus commented Oct 3, 2020

Here is another example - at least I think it is the same bug:

import numpy as np
series1 = pd.Series([""]).replace("", np.nan) # works
print(series1.values)  # [nan]
series2 = pd.Series(["", pd.NA]).replace("", np.nan) # fails
print(series2.values)  # ['' <NA>]
series3 = pd.Series(["", pd.NA]).replace(pd.NA, np.nan).replace("", np.nan) # possible workaround
print(series3.values)  # [nan nan] 

Maybe this would helps to fix the bug.

It would be great if you could fix the bug.

@phofl
Copy link
Member

phofl commented Jan 18, 2023

This works now, may need tests

@phofl phofl removed the Bug label Jan 18, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue NA - MaskedArrays Related to pd.NA and nullable extension arrays Needs Tests Unit test(s) needed to prevent regressions replace replace method
Projects
None yet
Development

Successfully merging a pull request may close this issue.

10 participants