Series.replace fails to replace value #32075

tsoernes · 2020-02-18T12:21:26Z

Code Sample, a copy-pastable example if possible

In [93]: ser.eq('nil').sum()
Out[93]: 1

In [94]: ser.replace('nil', pd.NA).eq('nil').sum()
/home/torstein/anaconda3/lib/python3.7/site-packages/pandas/core/missing.py:47: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  mask = arr == x
Out[94]: 1

In [95]: ser.loc[(ser == 'nil').fillna(False)] = pd.NA

In [104]: ser.eq('nil').sum()
Out[104]: 0
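
The session above does not show how `ser` was built. Below is a minimal self-contained sketch of the mask-based workaround from In [95]. It assumes, as established later in the thread, an object-dtype Series that contains both 'nil' and pd.NA; the sample values are made up for illustration.

```python
import pandas as pd

# Hypothetical data: an object-dtype Series holding the sentinel string and a pd.NA.
ser = pd.Series(["nil", "100", pd.NA], dtype=object)

# Build a boolean mask; fillna(False) guards against missing values in the mask.
mask = (ser == "nil").fillna(False)

# Assign through the mask instead of calling replace().
ser.loc[mask] = pd.NA

remaining = int(ser.eq("nil").sum())
print(remaining)
```

Assigning through `.loc` avoids the internal comparison in replace, which is where the FutureWarning above is raised.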

Output of `pd.show_versions()`


INSTALLED VERSIONS

commit : None
python : 3.7.5.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.18-100.fc30.x86_64
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : nb_NO.UTF-8
LOCALE : nb_NO.UTF-8

pandas : 1.0.1
numpy : 1.17.3
pytz : 2019.3
dateutil : 2.8.0
pip : 19.3.1
setuptools : 41.6.0.post20191030
Cython : 0.29.13
pytest : 5.2.2
hypothesis : None
sphinx : 2.2.1
blosc : None
feather : None
xlsxwriter : 1.2.2
lxml.etree : 4.4.1
html5lib : 1.0.1
pymysql : None
psycopg2 : 2.8.4 (dt dec pq3 ext lo64)
jinja2 : 2.10.3
IPython : 7.9.0
pandas_datareader: None
bs4 : 4.8.1
bottleneck : 1.2.1
fastparquet : None
gcsfs : None
lxml.etree : 4.4.1
matplotlib : 2.2.3
numexpr : 2.7.0
odfpy : None
openpyxl : 3.0.0
pandas_gbq : None
pyarrow : 0.15.1
pytables : None
pytest : 5.2.2
pyxlsb : None
s3fs : None
scipy : 1.3.1
sqlalchemy : 1.3.10
tables : 3.5.2
tabulate : 0.8.5
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.2
numba : 0.46.0


MarcoGorelli · 2020-02-18T12:33:01Z

Thanks @tsoernes

Could you include a minimal reproducible example? I couldn't reproduce this

In [1]: import pandas as pd                                                                                                                                                                                      
In [2]: ser = pd.DataFrame({'a': ['nil']})                                                                                                                                                                       

In [3]: ser                                                                                                                                                                                                      
Out[3]: 
     a
0  nil

In [4]: ser.replace('nil', pd.NA)                                                                                                                                                                                
Out[4]: 
      a
0  <NA>

In [5]: ser.replace('nil', pd.NA).eq('nil').sum()                                                                                                                                                                
Out[5]: 
a    0
dtype: int64

tsoernes · 2020-02-18T12:59:07Z

@MarcoGorelli

In the following example, I extract out two rows from the dataframe and recreate it via a dict.
The first row has a 'nil' in the 'amount' column; the second row has no 'nil'.
If I only recreate the dataframe with the first row, then the problem does not show itself.
When going via JSON, the problem does not persist.

In [282]: di = df.loc[[26123, 26122]].to_dict()

In [283]: di
Out[283]: 
<redacted since problem isolated later in thread>

In [284]: df = pd.DataFrame.from_dict(di)

In [285]: df.eq('nil').sum()
/home/torstein/anaconda3/lib/python3.7/site-packages/pandas/core/ops/array_ops.py:253: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  res_values = method(rvalues)
Out[285]: 
id                  0
properties          0
url                 0
started_at          0
ended_at            0
investee_id         0
lead_investor_id    0
created_at          0
updated_at          0
round               0
amount              1
currency            0
dtype: int64

In [286]: df.replace('nil', pd.NA).eq('nil').sum()
/home/torstein/anaconda3/lib/python3.7/site-packages/pandas/core/missing.py:47: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  mask = arr == x
/home/torstein/anaconda3/lib/python3.7/site-packages/pandas/core/ops/array_ops.py:253: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  res_values = method(rvalues)
Out[289]: 
id                  0
properties          0
url                 0
started_at          0
ended_at            0
investee_id         0
lead_investor_id    0
created_at          0
updated_at          0
round               0
amount              1
currency            0
dtype: int64

MarcoGorelli · 2020-02-18T13:26:14Z

@tsoernes OK, got it - it fails when there is another row with pd.NA:

>>> ser = pd.DataFrame({'a': ['nil', pd.NA]}) 
>>> ser.replace('nil', 'anything else')                                                                                                                                                                  
/home/SERILOCAL/m.gorelli/pandas/pandas/core/missing.py:48: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  mask = arr == x
      a
0   nil
1  <NA>

MarcoGorelli · 2020-02-18T13:30:44Z

Problem's related to this:

>>> np.array([['nil', pd.NA]]) == 'nil'
False
>>> np.array([['nil', 'anything else']]) == 'nil'
array([[ True, False]])
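
A possible explanation for the scalar result, sketched with plain Python comparisons: comparing a string with pd.NA yields pd.NA rather than a bool, and pd.NA refuses to be coerced to a bool, so the per-element results cannot be stored in a boolean array.

```python
import pandas as pd

# Comparing a value with pd.NA propagates the missing value.
result = "nil" == pd.NA
print(result is pd.NA)

# pd.NA cannot be converted to True/False, which a boolean array would need.
try:
    bool(result)
    coerced = True
except TypeError:
    coerced = False
print(coerced)
```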

AnnaDaglis · 2020-02-26T17:13:10Z

take

AnnaDaglis · 2020-03-03T13:59:51Z

@MarcoGorelli
Is there any particular approach you could suggest to fix the bug? There is no issue when using np.nan instead of pd.NA.

>>> np.array([['nil', np.nan]]) == 'nil'
array([[ True, False]])
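
One caveat about this comparison, offered as a hedged aside: without an explicit dtype, NumPy coerces a list mixing a str and np.nan into a fixed-width string array, so the np.nan above becomes the text 'nan' and the comparison is string against string. Passing dtype=object keeps the float NaN and is closer to what an object column holds. The comparison still succeeds there because nan == 'nil' is an ordinary False.

```python
import numpy as np

# Without dtype=object the float is stringified.
as_str = np.array([["nil", np.nan]])
print(as_str.dtype.kind)        # 'U' means a unicode string array
print(as_str[0, 1] == "nan")    # the NaN is now the text "nan"

# With dtype=object the NaN stays a float and == still works elementwise.
as_obj = np.array([["nil", np.nan]], dtype=object)
mask = as_obj == "nil"
print(mask)
```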

MarcoGorelli · 2020-03-03T22:40:38Z

Hi Anna - gonna be honest, I don't know :) I had a quick go at this one but couldn't figure it out. If you have any specific questions I'd imagine you could tag one of the core team members (e.g. Tom Augspurger) and ask.


AnnaDaglis · 2020-03-05T10:05:54Z

The problem in this case lies in the mask_missing(arr, values_to_mask) function in pandas.core.missing, in the following clause:

if is_numeric_v_string_like(arr, x):
    # GH#29553 prevent numpy deprecation warnings
    mask = False
else:
    mask = arr == x

The problem arises at arr == x, which fails when arr contains pd.NA, as pointed out above.

@TomAugspurger @jorisvandenbossche
Do you have any suggestions on how to tackle this bug? I noticed that pd.NA handling has been discussed in other issues, e.g. this one: #32265

TomAugspurger · 2020-03-05T12:33:05Z

I'm not sure offhand, sorry. But it seems like NA values should be considered False there, so perhaps fill it with something other than the value to mask? Some of the methods in core.arrays.boolean may be helpful to look through.
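
A minimal sketch of that idea, assuming an object-dtype NumPy array: compute the missing-value positions first, compare only the non-missing elements, and leave the missing positions False. The helper name `eq_na_as_false` is hypothetical, and this is not presented as the pandas implementation.

```python
import numpy as np
import pandas as pd


def eq_na_as_false(arr, value):
    """Elementwise arr == value, with missing entries counted as False."""
    arr = np.asarray(arr, dtype=object)
    not_na = ~pd.isna(arr)                  # True where a real value is present
    mask = np.zeros(arr.shape, dtype=bool)  # missing positions stay False
    mask[not_na] = arr[not_na] == value     # compare only the real values
    return mask


mask = eq_na_as_false(np.array(["nil", pd.NA, "x"], dtype=object), "nil")
print(mask)
```

Because pd.NA never takes part in the comparison, the result is always a boolean array of the same shape as the input.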

tsoernes · 2020-03-09T10:49:30Z

Can we get this fixed for 1.0.2? That would be awesome.

jorisvandenbossche · 2020-03-09T19:27:28Z

Note that this is working fine with string dtype:

In [4]: ser = pd.DataFrame({'a': ['nil', pd.NA]}, dtype="string") 

In [5]: ser.replace('nil', 'anything else')
Out[5]: 
               a
0  anything else
1           <NA>

So it is specifically for NA in object dtype (something we actually don't really test yet, I think).

> The problem arises here arr == x, which fails when arr contains pd.NA, as pointed out above.

This might be related to the comparison deprecations in numpy (there are other cases where comparisons return a scalar True/False instead of comparing element-wise); you can also see the deprecation warning in the output.
Now in general, numpy might not be able to handle this: the comparison with pd.NA returns pd.NA, which cannot be stored in a boolean ndarray (which is what numpy expects to produce for a comparison, I suppose). So the deprecation warning might be correct that this will fail in the future.

If we want to handle pd.NA in object dtype better, we will need to start using masks as well, and not rely on numpy behaviour.

For example, this is also wrong:

In [9]: pd.Series([1, pd.NA], dtype=object) >= 1
Out[9]: 
0     True
1    False
dtype: bool

(we should probably open a separate issue about "pd.NA in object dtype", and the best way forward here generally)
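
As a rough illustration of the mask idea, the sketch below carries a separate missing-value mask next to the boolean data and wraps both in pandas' nullable BooleanArray, so the missing element stays <NA> instead of turning into False. The helper `compare_with_mask` is hypothetical, handles only 1-D input, and is not how pandas implements comparisons.

```python
import operator

import numpy as np
import pandas as pd


def compare_with_mask(values, other, op):
    """Apply op elementwise, propagating missing values through a mask."""
    arr = np.asarray(values, dtype=object)
    na_mask = pd.isna(arr)                        # True where the input is missing
    data = np.zeros(arr.shape, dtype=bool)
    data[~na_mask] = op(arr[~na_mask], other)     # compare only present values
    return pd.arrays.BooleanArray(data, na_mask)  # mask marks the <NA> slots


result = compare_with_mask([1, pd.NA], 1, operator.ge)
print(result)
```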

simonjayhawkins · 2020-03-23T15:24:02Z

> (we should probably open a separate issue about "pd.NA in object dtype", and the best way forward here generally)

#32931

tsibley · 2020-07-30T21:55:32Z

@jorisvandenbossche

> Note that this is working fine with string dtype:

As of 1.1.0 at least, it's not working fine:

>>> import pandas as pd
>>> pd.__version__
'1.1.0'

>>> s = pd.Series(["a", "", "b", pd.NA], dtype="string")
>>> s
0       a
1        
2       b
3    <NA>
dtype: string

>>> s.replace("", pd.NA)
.../lib/python3.6/site-packages/pandas/core/missing.py:49: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  mask = arr == x
0       a
1        
2       b
3    <NA>
dtype: object

tsibley · 2020-07-30T21:58:17Z

Note however, that the one-dict-arg form of replace does appear to work (!):

>>> s.replace({"": pd.NA})
0       a
1    <NA>
2       b
3    <NA>
dtype: object

but as described in #35268, it changes the dtype of the Series from whatever it was to object.

I'd (naively, perhaps) expect two-arg replace and one-dict-arg replace to both use the same codepath internally!

tsibley · 2020-07-30T23:19:40Z

> But it seems like NA values should be considered False there, so perhaps fill it with something other than the value to mask?

This makes sense to me and matches my debugging. I'm testing this patch now:

diff --git a/pandas/core/missing.py b/pandas/core/missing.py
index 7802c5cbd..be1632315 100644
--- a/pandas/core/missing.py
+++ b/pandas/core/missing.py
@@ -46,7 +46,7 @@ def mask_missing(arr, values_to_mask):
                 # GH#29553 prevent numpy deprecation warnings
                 mask = False
             else:
-                mask = arr == x
+                mask = (arr == x).fillna(False)
 
             # if x is a string and arr is not, then we get False and we must
             # expand the mask to size arr.shape
@@ -57,7 +57,7 @@ def mask_missing(arr, values_to_mask):
                 # GH#29553 prevent numpy deprecation warnings
                 mask |= False
             else:
-                mask |= arr == x
+                mask |= (arr == x).fillna(False)
 
     if na_mask.any():
         if mask is None:

tsibley · 2020-07-31T17:54:41Z

The patch above doesn't work (unsurprisingly) because sometimes arr is a Pandas array and sometimes it's a NumPy array, but only Pandas arrays have a fillna method.

The patch below seems to work with some manual testing and the result of ./test_fast.sh is the same before/after applying it:

diff --git a/pandas/core/missing.py b/pandas/core/missing.py
index 7802c5cbd..f8280c0a2 100644
--- a/pandas/core/missing.py
+++ b/pandas/core/missing.py
@@ -46,7 +46,7 @@ def mask_missing(arr, values_to_mask):
                 # GH#29553 prevent numpy deprecation warnings
                 mask = False
             else:
-                mask = arr == x
+                mask = np.where(~isna(arr), arr, np.full_like(arr, np.nan)) == x
 
             # if x is a string and arr is not, then we get False and we must
             # expand the mask to size arr.shape
@@ -57,7 +57,7 @@ def mask_missing(arr, values_to_mask):
                 # GH#29553 prevent numpy deprecation warnings
                 mask |= False
             else:
-                mask |= arr == x
+                mask |= np.where(~isna(arr), arr, np.full_like(arr, np.nan)) == x
 
     if na_mask.any():
         if mask is None:

This replaces pd.NA values with NumPy NaN and NaT values, which do compare with == correctly and produce a boolean vector instead of a boolean scalar.

I want to write an actual set of test cases for this bug for various dtypes, though, before submitting a PR. I believe the patch above may still fail for pd.Period types.
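
The substitution used in the second patch can be tried in isolation. The sketch below assumes an object-dtype NumPy array and uses made-up values; it only illustrates the np.where step, not the full mask_missing function.

```python
import numpy as np
import pandas as pd

arr = np.array(["", "a", pd.NA], dtype=object)
x = ""

# Swap every missing entry for np.nan, which compares as an ordinary False.
filled = np.where(~pd.isna(arr), arr, np.full_like(arr, np.nan))

mask = filled == x
print(mask)
```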

MarcusJellinghaus · 2020-10-03T16:24:59Z

Here is another example - at least I think it is the same bug:

import numpy as np
import pandas as pd
series1 = pd.Series([""]).replace("", np.nan) # works
print(series1.values)  # [nan]
series2 = pd.Series(["", pd.NA]).replace("", np.nan) # fails
print(series2.values)  # ['' <NA>]
series3 = pd.Series(["", pd.NA]).replace(pd.NA, np.nan).replace("", np.nan) # possible workaround
print(series3.values)  # [nan nan]

Maybe this helps to fix the bug.

It would be great if you could fix the bug.

phofl · 2023-01-18T22:16:05Z

This works now, may need tests
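
A regression test for the thread's minimal example might look like the sketch below. The test name is hypothetical and this is not claimed to be the test that was eventually merged; it assumes a pandas version in which the behaviour is fixed.

```python
import pandas as pd


def test_replace_string_in_object_series_with_na():
    # Object-dtype Series mixing the target string with pd.NA.
    ser = pd.Series(["nil", pd.NA], dtype=object)

    result = ser.replace("nil", "anything else")

    assert result.iloc[0] == "anything else"  # the string is replaced
    assert pd.isna(result.iloc[1])            # the missing value is untouched


test_replace_string_in_object_series_with_na()
```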

MarcoGorelli added the Needs Info Clarification about behavior needed to assess issue label Feb 18, 2020

MarcoGorelli added Bug NA - MaskedArrays Related to pd.NA and nullable extension arrays and removed Needs Info Clarification about behavior needed to assess issue labels Feb 18, 2020

github-actions bot assigned AnnaDaglis Feb 26, 2020

jorisvandenbossche mentioned this issue Mar 11, 2020

BUG: Replace in string series with NA #32621

Closed

jorisvandenbossche mentioned this issue Mar 20, 2020

DataFrame.replace fails to replace value when columns are specified and only non-replacement columns contain pd.NA #32838

Closed

simonjayhawkins mentioned this issue Mar 23, 2020

pd.NA in object dtype #32931

Open

AnnaDaglis removed their assignment Mar 27, 2020

mroeschke added the replace replace method label Apr 28, 2020

richierocks mentioned this issue Apr 27, 2022

BUG: Replacing categorical values with NA raises "boolean value of NA is ambiguous" error #46884

Closed

3 tasks

phofl removed the Bug label Jan 18, 2023

phofl added good first issue Needs Tests Unit test(s) needed to prevent regressions labels Jan 18, 2023

phofl mentioned this issue Jan 18, 2023

TST: Fixed issues that need tests noatamir/pyladies-berlin-sprints#3

Open

17 tasks

liang3zy22 mentioned this issue Mar 14, 2023

Add replace test for nil gh32075 #51954

Merged

3 tasks

mroeschke closed this as completed in #51954 Mar 14, 2023
