DataFrame.update crashes with overwrite=False when NaT present #16713

Closed
olizhu opened this issue Jun 16, 2017 · 9 comments · Fixed by #49395
Labels: Bug, Datetime, Missing-data

Comments

@olizhu

olizhu commented Jun 16, 2017

Code Sample

from pandas import DataFrame, to_datetime

df1 = DataFrame({'A': [1, None], 'B': [to_datetime('abc', errors='coerce'), to_datetime('2016-01-01')]})
df2 = DataFrame({'A': [2, 3]})
df1.update(df2, overwrite=False)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-5-a766b5317aac> in <module>()
      1 df1 = DataFrame({'A': [1,None], 'B':[to_datetime('abc', errors='coerce'),to_datetime('2016-01-01')]})
      2 df2 = DataFrame({'A': [2,3]})
----> 3 df1.update(df2, overwrite=False)

~/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/frame.py in update(self, other, join, overwrite, filter_func, raise_conflict)
   3897
   3898             self[col] = expressions.where(mask, this, that,
-> 3899                                           raise_on_error=True)
   3900
   3901     # ----------------------------------------------------------------------

~/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/computation/expressions.py in where(cond, a, b, raise_on_error, use_numexpr)
    229
    230     if use_numexpr:
--> 231         return _where(cond, a, b, raise_on_error=raise_on_error)
    232     return _where_standard(cond, a, b, raise_on_error=raise_on_error)
    233

~/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/computation/expressions.py in _where_numexpr(cond, a, b, raise_on_error)
    152
    153     if result is None:
--> 154         result = _where_standard(cond, a, b, raise_on_error)
    155
    156     return result

~/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/computation/expressions.py in _where_standard(cond, a, b, raise_on_error)
    127 def _where_standard(cond, a, b, raise_on_error=True):
    128     return np.where(_values_from_object(cond), _values_from_object(a),
--> 129                     _values_from_object(b))
    130
    131

TypeError: invalid type promotion

Problem description

This is similar to issue #15593, which was fixed in pandas 0.20.2: NaT values anywhere in the DataFrame still cause update(..., overwrite=False) to throw TypeError: invalid type promotion.

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Darwin
OS-release: 16.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.2
pytest: 2.9.2
pip: 9.0.1
setuptools: 36.0.1
Cython: 0.24
numpy: 1.13.0
scipy: 0.17.1
xarray: None
IPython: 6.1.0
sphinx: 1.4.1
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.1.0
tables: 3.4.2
numexpr: 2.6.2
feather: 0.3.1
matplotlib: 1.5.1
openpyxl: 2.4.0
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.2
lxml: 3.6.0
bs4: 4.5.1
html5lib: 0.999999999
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

@olizhu olizhu changed the title DataFrame.update crashes with overwrite=True when NaT present DataFrame.update crashes with overwrite=False when NaT present Jun 16, 2017
@olizhu
Author

olizhu commented Jun 16, 2017

I've tested some more and it seems that the error occurs whenever there is a null value in a column containing datetimes, so replacing NaT with NaN produces the same error.
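For example, constructing the frame with np.nan instead of NaT still produces a datetime64 column (the NaN is coerced to NaT), and the call fails the same way (observed on pandas 0.20.2):

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [1, None],
                    'B': [np.nan, pd.Timestamp('2016-01-01')]})  # np.nan becomes NaT in a datetime64 column
df2 = pd.DataFrame({'A': [2, 3]})
df1.update(df2, overwrite=False)  # TypeError: invalid type promotion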

@TomAugspurger TomAugspurger added Bug Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Datetime Datetime data dtype labels Jun 16, 2017
@TomAugspurger TomAugspurger added this to the Next Major Release milestone Jun 16, 2017
@TomAugspurger
Contributor

So when we reindex df2 like df1, we end up with different dtypes:

In [22]: df2.reindex_like(df1).dtypes
Out[22]:
A      int64
B    float64
dtype: object

I wonder if we could add a parameter to reindex_like for controlling the dtypes of columns that are created. I could see that being broadly useful.
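For reference, the failure at the bottom of the traceback reduces to mixing the datetime64 column with the all-NaN float64 column that reindex_like manufactures; numpy cannot promote those two dtypes inside np.where (a minimal sketch; the exact error message depends on the numpy version):

import numpy as np

mask = np.array([False, True])                                    # notna(df1['B'])
this = np.array(['NaT', '2016-01-01'], dtype='datetime64[ns]')    # df1['B'] values
that = np.array([np.nan, np.nan])                                 # float64 'B' created by reindex_like
np.where(mask, this, that)   # raises TypeError: invalid type promotion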

@sboltz

sboltz commented Oct 26, 2020

I just encountered this issue and was wondering if there were any updates or workarounds for it? Thanks.
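One possible workaround in the meantime (a sketch, not an official recommendation): since overwrite=False only fills values that are NA in the caller, the same effect can be achieved for the shared columns with fillna, which never touches the datetime column that triggers the crash.

import pandas as pd
from pandas import DataFrame, to_datetime

df1 = DataFrame({'A': [1, None],
                 'B': [to_datetime('abc', errors='coerce'), to_datetime('2016-01-01')]})
df2 = DataFrame({'A': [2, 3]})

# Align on df1's index (what update's default join='left' does), then fill NA
# values column by column, restricted to columns present in both frames.
other = df2.reindex(index=df1.index)
for col in df1.columns.intersection(other.columns):
    df1[col] = df1[col].fillna(other[col])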

@IanFLee

IanFLee commented Nov 30, 2020

I'm also hitting this issue in sklearn.preprocessing with StandardScaler(). It definitely seems to be a datetime issue, so I've just dropped that column for the time being, but eventually I'll need it back, so fingers crossed.

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@wxing11
Contributor

wxing11 commented Oct 26, 2022

take

@wxing11
Contributor

wxing11 commented Oct 30, 2022

Hi, new contributor here so please correct me if I'm wrong!

This seems to be caused by situations where the DataFrame to be updated has a datetime column with NaT values and the input DataFrame has either

  1. A matching column by index but of a type that isn't Datetime/Object. I assume an error here is expected.
  2. No matching column by index, so the call to reindex_like in the update function creates a column that isn't of type Datetime/Object. (The example case above)

Since in the second case the created column contains only NA values, would it be reasonable to solve this by adding a check so that any column consisting entirely of NA values is skipped during the update?

I created a PR with an implementation of this, as well as a couple of new test cases, including the one introduced above.
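To illustrate the check on the example from this issue (a sketch of the idea, not the code in the PR): after reindex_like, the manufactured column 'B' is entirely NA while the genuinely shared column 'A' is not, so an isna().all() test distinguishes the two.

import pandas as pd
from pandas import DataFrame, to_datetime

df1 = DataFrame({'A': [1, None],
                 'B': [to_datetime('abc', errors='coerce'), to_datetime('2016-01-01')]})
df2 = DataFrame({'A': [2, 3]})

aligned = df2.reindex_like(df1)
print(aligned['B'].isna().all())  # True  -> the proposed check would skip 'B'
print(aligned['A'].isna().all())  # False -> 'A' is updated as before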

@MarcoGorelli
Member

I wonder if we could add a parameter to reindex_like for controlling the dtypes of columns that are created

How would this work? Would the dtype be taken from the other DataFrame in reindex? Because if so, one issue would be with null columns getting converted to bool:

In [1]: pd.Series([np.nan, np.nan]).astype(bool)
Out[1]: 
0    True
1    True
dtype: bool

Alternatively, there could be an option to exclude null columns from the result of reindex_like, but then that would still require an update to

pandas/pandas/core/frame.py

Lines 8196 to 8198 in 2d126dd

this = self[col]._values
that = other[col]._values

to skip over columns which aren't in both this and that

At the moment, I'm struggling to see a simpler solution than the one proposed in #49395
cc @mroeschke (as you'd commented on the PR)

@mroeschke
Member

Maybe a full reindex_like is not needed, as only the shared columns should be updated?
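On the example from this issue, the two alignment strategies differ as follows (a sketch to illustrate the point):

import pandas as pd
from pandas import DataFrame, to_datetime

df1 = DataFrame({'A': [1, None],
                 'B': [to_datetime('abc', errors='coerce'), to_datetime('2016-01-01')]})
df2 = DataFrame({'A': [2, 3]})

df2.reindex_like(df1).dtypes         # A: int64, B: float64 -> an all-NaN float 'B' is manufactured
df2.reindex(index=df1.index).dtypes  # A: int64             -> 'B' never enters the update loop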

@wxing11
Contributor

wxing11 commented Nov 3, 2022

I pushed a new commit to my PR that only reindexes rows and then skips non-matching columns. Does that match what you were suggesting?
