DataFrame.update crashes with overwrite=False when NaT present #16713

Closed
olizhu opened this issue Jun 16, 2017 · 9 comments · Fixed by #49395
Labels: Bug, Datetime, Missing-data

Comments

@olizhu

olizhu commented Jun 16, 2017

Code Sample

from pandas import DataFrame, to_datetime

df1 = DataFrame({'A': [1, None], 'B': [to_datetime('abc', errors='coerce'), to_datetime('2016-01-01')]})
df2 = DataFrame({'A': [2, 3]})
df1.update(df2, overwrite=False)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-5-a766b5317aac> in <module>()
      1 df1 = DataFrame({'A': [1,None], 'B':[to_datetime('abc', errors='coerce'),to_datetime('2016-01-01')]})
      2 df2 = DataFrame({'A': [2,3]})
----> 3 df1.update(df2, overwrite=False)

~/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/frame.py in update(self, other, join, overwrite, filter_func, raise_conflict)
   3897
   3898             self[col] = expressions.where(mask, this, that,
-> 3899                                           raise_on_error=True)
   3900
   3901     # ----------------------------------------------------------------------

~/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/computation/expressions.py in where(cond, a, b, raise_on_error, use_numexpr)
    229
    230     if use_numexpr:
--> 231         return _where(cond, a, b, raise_on_error=raise_on_error)
    232     return _where_standard(cond, a, b, raise_on_error=raise_on_error)
    233

~/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/computation/expressions.py in _where_numexpr(cond, a, b, raise_on_error)
    152
    153     if result is None:
--> 154         result = _where_standard(cond, a, b, raise_on_error)
    155
    156     return result

~/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/computation/expressions.py in _where_standard(cond, a, b, raise_on_error)
    127 def _where_standard(cond, a, b, raise_on_error=True):
    128     return np.where(_values_from_object(cond), _values_from_object(a),
--> 129                     _values_from_object(b))
    130
    131

TypeError: invalid type promotion

Problem description

This is similar to issue #15593, which was fixed in pandas 0.20.2: NaT values anywhere in the DataFrame still cause update(..., overwrite=False) to throw TypeError: invalid type promotion.

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Darwin
OS-release: 16.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.2
pytest: 2.9.2
pip: 9.0.1
setuptools: 36.0.1
Cython: 0.24
numpy: 1.13.0
scipy: 0.17.1
xarray: None
IPython: 6.1.0
sphinx: 1.4.1
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.1.0
tables: 3.4.2
numexpr: 2.6.2
feather: 0.3.1
matplotlib: 1.5.1
openpyxl: 2.4.0
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.2
lxml: 3.6.0
bs4: 4.5.1
html5lib: 0.999999999
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

@olizhu olizhu changed the title DataFrame.update crashes with overwrite=True when NaT present DataFrame.update crashes with overwrite=False when NaT present Jun 16, 2017
@olizhu
Author

olizhu commented Jun 16, 2017

I've tested some more and it seems that the error occurs whenever there is a null value in a column containing datetimes, so replacing NaT with NaN produces the same error.
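For example, constructing the frame with np.nan instead of NaT still produces a datetime64 column (the NaN is coerced to NaT), and the call fails the same way (observed on pandas 0.20.2):

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [1, None],
                    'B': [np.nan, pd.Timestamp('2016-01-01')]})  # np.nan becomes NaT in a datetime64 column
df2 = pd.DataFrame({'A': [2, 3]})
df1.update(df2, overwrite=False)  # TypeError: invalid type promotion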

@TomAugspurger TomAugspurger added Bug Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Datetime Datetime data dtype labels Jun 16, 2017
@TomAugspurger TomAugspurger added this to the Next Major Release milestone Jun 16, 2017
@TomAugspurger
Contributor

So when we reindex df2 like df1, we end up with different dtypes:

In [22]: df2.reindex_like(df1).dtypes
Out[22]:
A      int64
B    float64
dtype: object

I wonder if we could add a parameter to reindex_like for controlling the dtypes of columns that are created. I could see that being broadly useful.
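For reference, the failure at the bottom of the traceback reduces to mixing the datetime64 column with the all-NaN float64 column that reindex_like manufactures; numpy cannot promote those two dtypes inside np.where (a minimal sketch; the exact error message depends on the numpy version):

import numpy as np

mask = np.array([False, True])                                    # notna(df1['B'])
this = np.array(['NaT', '2016-01-01'], dtype='datetime64[ns]')    # df1['B'] values
that = np.array([np.nan, np.nan])                                 # float64 'B' created by reindex_like
np.where(mask, this, that)   # raises TypeError: invalid type promotion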

@sboltz

sboltz commented Oct 26, 2020

I just encountered this issue and was wondering if there were any updates or workarounds for it? Thanks.
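One possible workaround in the meantime (a sketch, not an official recommendation): since overwrite=False only fills values that are NA in the caller, the same effect can be achieved for the shared columns with fillna, which never touches the datetime column that triggers the crash.

import pandas as pd
from pandas import DataFrame, to_datetime

df1 = DataFrame({'A': [1, None],
                 'B': [to_datetime('abc', errors='coerce'), to_datetime('2016-01-01')]})
df2 = DataFrame({'A': [2, 3]})

# Align on df1's index (what update's default join='left' does), then fill NA
# values column by column, restricted to columns present in both frames.
other = df2.reindex(index=df1.index)
for col in df1.columns.intersection(other.columns):
    df1[col] = df1[col].fillna(other[col])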

@IanFLee

IanFLee commented Nov 30, 2020

I'm also hitting this issue in sklearn.preprocessing with StandardScaler(). It definitely seems to be a datetime issue, so I've just dropped that column for the time being, but eventually I'll need it back, so fingers crossed.

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@wxing11
Contributor

wxing11 commented Oct 26, 2022

take

@wxing11
Contributor

wxing11 commented Oct 30, 2022

Hi, new contributor here so please correct me if I'm wrong!

This seems to be caused by situations where the DataFrame to be updated has a datetime column with NaT values and the input DataFrame has either

  1. A matching column by index but of a type that isn't Datetime/Object. I assume an error here is expected.
  2. No matching column by index, so the call to reindex_like in the update function creates a column that isn't of type Datetime/Object. (The example case above)

Since in the second case the created column contains only NA values, would it be reasonable to solve this by adding a check so that any column consisting entirely of NA values is skipped during the update?

I created a PR with an implementation of this, as well as a couple of new test cases, including the one introduced above.
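To illustrate the check on the example from this issue (a sketch of the idea, not the code in the PR): after reindex_like, the manufactured column 'B' is entirely NA while the genuinely shared column 'A' is not, so an isna().all() test distinguishes the two.

import pandas as pd
from pandas import DataFrame, to_datetime

df1 = DataFrame({'A': [1, None],
                 'B': [to_datetime('abc', errors='coerce'), to_datetime('2016-01-01')]})
df2 = DataFrame({'A': [2, 3]})

aligned = df2.reindex_like(df1)
print(aligned['B'].isna().all())  # True  -> the proposed check would skip 'B'
print(aligned['A'].isna().all())  # False -> 'A' is updated as before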

@MarcoGorelli
Member

I wonder if we could add a parameter to reindex_like for controlling the dtypes of columns that are created

How would this work? Would the dtype be taken from the other DataFrame in reindex? Because if so, one issue would be with null columns getting converted to bool:

In [1]: pd.Series([np.nan, np.nan]).astype(bool)
Out[1]: 
0    True
1    True
dtype: bool

Alternatively, there could be an option to exclude null columns from the result of reindex_like, but then that would still require an update to

pandas/pandas/core/frame.py

Lines 8196 to 8198 in 2d126dd

this = self[col]._values
that = other[col]._values

to skip over columns which aren't in both this and that

At the moment, I'm struggling to see a simpler solution than the one proposed in #49395
cc @mroeschke (as you'd commented on the PR)

@mroeschke
Member

Maybe a full reindex_like is not needed, as only the shared columns should be updated?
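On the example from this issue, the two alignment strategies differ as follows (a sketch to illustrate the point):

import pandas as pd
from pandas import DataFrame, to_datetime

df1 = DataFrame({'A': [1, None],
                 'B': [to_datetime('abc', errors='coerce'), to_datetime('2016-01-01')]})
df2 = DataFrame({'A': [2, 3]})

df2.reindex_like(df1).dtypes         # A: int64, B: float64 -> an all-NaN float 'B' is manufactured
df2.reindex(index=df1.index).dtypes  # A: int64             -> 'B' never enters the update loop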

@wxing11
Contributor

wxing11 commented Nov 3, 2022

I pushed a new commit to my PR that only reindexes rows and then skips non-matching columns. Does that match what you were suggesting?
