BUG: _convert_and_box_cache raise ValueError: Tz-aware datetime.datetime cannot be converted to datetime64 unless utc=True #26097

Merged
merged 31 commits into from
Jul 3, 2019

Conversation

Contributor

@anmyachev anmyachev commented Apr 15, 2019

  • closes #N/A
  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

The bug was found here: #26043 (comment)
Reproduction:

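import pandas as pd

N = 10
half = N // 2
# Two timestamps that differ only in their UTC offset
ts_string_1 = 'March 1, 2018 12:00:00+0400'
ts_string_2 = 'March 1, 2018 12:00:00+0500'
diff_offset = [ts_string_1] * half + [ts_string_2] * half
# With cache=True this raised the ValueError from the title before the fix
pd.to_datetime(diff_offset, cache=True)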

@pep8speaks

pep8speaks commented Apr 15, 2019

Hello @anmyachev! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-07-03 11:44:41 UTC

@jreback
Contributor

jreback commented Apr 15, 2019

always tests first!

Contributor

@jreback jreback left a comment

needs tests and a whatsnew note

@codecov

codecov bot commented Apr 15, 2019

Codecov Report

Merging #26097 into master will decrease coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #26097      +/-   ##
==========================================
- Coverage   91.96%   91.95%   -0.01%     
==========================================
  Files         175      175              
  Lines       52412    52414       +2     
==========================================
- Hits        48199    48197       -2     
- Misses       4213     4217       +4
Flag Coverage Δ
#multiple 90.51% <100%> (ø) ⬆️
#single 40.73% <0%> (-0.14%) ⬇️
Impacted Files Coverage Δ
pandas/core/tools/datetimes.py 84.68% <100%> (+0.08%) ⬆️
pandas/io/gbq.py 75% <0%> (-12.5%) ⬇️
pandas/core/frame.py 96.9% <0%> (-0.12%) ⬇️

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 23966ed...b725de6. Read the comment docs.

@codecov

codecov bot commented Apr 15, 2019

Codecov Report

Merging #26097 into master will decrease coverage by 0.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #26097      +/-   ##
==========================================
- Coverage   91.87%   91.86%   -0.02%     
==========================================
  Files         180      180              
  Lines       50927    50714     -213     
==========================================
- Hits        46790    46589     -201     
+ Misses       4137     4125      -12
Flag Coverage Δ
#multiple 90.46% <100%> (-0.11%) ⬇️
#single 41.13% <73.68%> (-0.96%) ⬇️
Impacted Files Coverage Δ
pandas/core/index.py 100% <100%> (ø) ⬆️
pandas/core/tools/datetimes.py 85.22% <100%> (+0.16%) ⬆️
pandas/tseries/converter.py 0% <0%> (-100%) ⬇️
pandas/io/gbq.py 88.88% <0%> (-11.12%) ⬇️
pandas/plotting/_misc.py 59.49% <0%> (-5.38%) ⬇️
pandas/plotting/_matplotlib/converter.py 58.43% <0%> (-5.24%) ⬇️
pandas/io/excel/_openpyxl.py 84.71% <0%> (-3.23%) ⬇️
pandas/core/config_init.py 92.91% <0%> (-3.17%) ⬇️
pandas/io/formats/printing.py 85.56% <0%> (-1.65%) ⬇️
pandas/core/internals/managers.py 95.21% <0%> (-0.95%) ⬇️
... and 75 more

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7ec7c9e...1cc469e. Read the comment docs.

@vnlitvinov
Contributor

@mroeschke we've replaced the band-aid fix with one that applies the same boxing logic whether caching is on or off.

@jreback we've added a test that checks this case for both cache=True and cache=False (and that fails without our fix).

A whatsnew entry will be added a bit later.

@mroeschke
Member

Could you describe your fix? Was it not boxing in the correct location or was the error parameter inconsistent?

@anmyachev anmyachev force-pushed the fix_bug_convert_and_box_cache branch from d665747 to 7900437 Compare April 17, 2019 12:35
@vnlitvinov
Contributor

Could you describe your fix?

There was different boxing logic for cache=True and cache=False:

  • With cache=False, the result was boxed to DatetimeIndex whenever that was possible (regardless of the value of errors); otherwise it was boxed to Index.
  • With cache=True, the result was boxed to DatetimeIndex unless errors was 'ignore' (regardless of whether boxing to DatetimeIndex was possible).

So the fix is to align the boxing logic for cache=True and cache=False by using the same boxing method.
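
A minimal sketch of what a single shared boxing step could look like, assuming the converted values arrive as a NumPy array. The helper name box_as_indexlike_sketch and the sample arrays are hypothetical, and this is not the code merged in the PR; it only illustrates choosing the index type from the converted data rather than from the value of errors:

```python
import numpy as np
import pandas as pd


def box_as_indexlike_sketch(values, name=None):
    """Box to DatetimeIndex when the data is datetime64, otherwise to Index."""
    values = np.asarray(values)
    if np.issubdtype(values.dtype, np.datetime64):
        return pd.DatetimeIndex(values, name=name)
    # Anything that could not be converted to datetime64 stays in a generic Index.
    return pd.Index(values, name=name)


# Hypothetical inputs: one fully converted array, one left as Python objects.
converted = np.array(["2018-03-01T12:00"], dtype="datetime64[ns]")
not_converted = np.array(["March 1, 2018", 42], dtype=object)

boxed_dt = box_as_indexlike_sketch(converted, name="when")
boxed_obj = box_as_indexlike_sketch(not_converted, name="when")
```

If both the cached and the uncached path call one helper like this, the type of the result depends only on the converted data and not on which path produced it.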

@gfyoung gfyoung added the Datetime Datetime data dtype label Apr 18, 2019
pandas/tests/indexes/datetimes/test_tools.py
errors, convertor):
arg = [convertor(date1), convertor(date2)] * 5 + suffix

def _get_answer(cache):
Contributor

you should not catch this error, rather for errors='raise' you use an assert_raises

Contributor

@vnlitvinov vnlitvinov Apr 19, 2019

The thing is, errors='raise' does not guarantee that it raises: if the to_datetime input is valid, it won't raise.
What I wanted to verify with this test is that to_datetime() behaviour is identical regardless of whether cache=True or cache=False is passed.

For this particular test scenario I don't even care if, say, errors='ignore' raises a ValueError, as long as it is raised consistently with and without cache. It is up to other tests to verify that the errors parameter does what it should; this test only checks for consistency across cache settings.

P.S. Renamed this test to test_to_datetime_cache_invariance to better underline its purpose.
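
A rough sketch of the invariance idea described above, using only the standard library. The names call_outcome, toy_to_datetime and assert_cache_invariant are hypothetical, and toy_to_datetime is a stand-in converter with an optional cache, not pd.to_datetime itself; the point is only that "same" means either equal results or the same exception:

```python
from datetime import datetime


def call_outcome(func, *args, **kwargs):
    """Return ("ok", result) or ("raised", exception type name, message)."""
    try:
        return ("ok", func(*args, **kwargs))
    except ValueError as exc:
        return ("raised", type(exc).__name__, str(exc))


def toy_to_datetime(values, cache):
    """Hypothetical stand-in for a converter with an optional cache."""
    fmt = "%Y-%m-%d %H:%M"
    if not cache:
        return [datetime.strptime(v, fmt) for v in values]
    # Cached path: parse each unique string once, then map the results back.
    parsed = {v: datetime.strptime(v, fmt) for v in set(values)}
    return [parsed[v] for v in values]


def assert_cache_invariant(values):
    """Fail if the cached and uncached paths disagree in result or exception."""
    with_cache = call_outcome(toy_to_datetime, values, cache=True)
    without_cache = call_outcome(toy_to_datetime, values, cache=False)
    assert with_cache == without_cache, (with_cache, without_cache)


# Valid input: both paths return the same list.
assert_cache_invariant(["2018-03-01 12:00", "2018-03-01 12:00", "2018-03-02 08:30"])
# Invalid input: both paths raise the same ValueError.
assert_cache_invariant(["2018-03-01 12:00", "not a date"])
```

Such a check deliberately says nothing about whether raising is correct for a given input; it only fails when the two paths diverge.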

Contributor

I don't want a try/except in a test; that just obscures what the test is checking; as I said, if you expect it to raise then make it explicit.

Contributor

I tried to rework the test to get rid of try..except, but I feel that it's close to impossible unless I hardcode some of the current pandas behaviour in the test itself.

My current intention is this: pd.to_datetime() must behave the same with cache=True and cache=False, where "same" means "both return pd.Index" or "both raise the same exception".

So far I'm struggling to remove try..except while still covering this intention, because pd.to_datetime can raise an exception not only when errors="raise" (and it may not raise at all if hypothesis randomly generates equal timezones).

Example:

from datetime import datetime as dt

import pandas as pd
import pytz

pd.to_datetime([dt(2000, 1, 1), dt(2000, 1, 1, tzinfo=pytz.utc)], errors='coerce')

results in:

Traceback (most recent call last):
  File "somewhere\pandas\core\arrays\datetimes.py", line 1850, in objects_to_datetime64ns
    values, tz_parsed = conversion.datetime_to_datetime64(data)
  File "pandas\_libs\tslibs\conversion.pyx", line 185, in pandas._libs.tslibs.conversion.datetime_to_datetime64
    raise ValueError('Cannot mix tz-aware with '
ValueError: Cannot mix tz-aware with tz-naive values

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "somewhere\pandas\util\_decorators.py", line 188, in wrapper
    return func(*args, **kwargs)
  File "somewhere\pandas\core\tools\datetimes.py", line 642, in to_datetime
    result = convert_listlike(arg, box, format)
  File "somewhere\pandas\core\tools\datetimes.py", line 333, in _convert_listlike_datetimes
    allow_object=True)
  File "somewhere\pandas\core\arrays\datetimes.py", line 1855, in objects_to_datetime64ns
    raise e
  File "somewhere\pandas\core\arrays\datetimes.py", line 1846, in objects_to_datetime64ns
    require_iso8601=require_iso8601
  File "pandas\_libs\tslib.pyx", line 460, in pandas._libs.tslib.array_to_datetime
    cpdef array_to_datetime(ndarray[object] values, str errors='raise',
  File "pandas\_libs\tslib.pyx", line 537, in pandas._libs.tslib.array_to_datetime
    raise ValueError('Tz-aware datetime.datetime '
ValueError: Tz-aware datetime.datetime cannot be converted to datetime64 unless utc=True

Contributor

as I said if you expect it to raise then make it explicit.

All that I said above is to illustrate that I actually cannot expect it to raise or not to raise, as right now the behaviour of pd.to_datetime() with regard to exceptions can be described as "chaotic good" :)

Contributor

move the error='raise' case to another test then. we never do this in a test

Contributor

But what about exceptions that might be raised when errors != "raise"? I feel this is overextending what the test is checking: once I describe what it must raise or that it mustn't raise, I start checking the behaviour of pd.to_datetime() for certain parameters, while the only check I wanted is that it behaves the same way with cache on/off.

Contributor

my point is that this code is very hard to grok when errors=‘raise’

I agree that both cache=True or False must raise in this case, but I want these in a separate test where you assert that they raise

the issue is that catching a ValueError in a test can accidentally catch other things and has multiple times in the past in some test cases which turned out badly

hence I don’t allow this ever
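
A small sketch of the alternative being asked for: the raising case lives in its own test and asserts the expected error explicitly, so an unrelated exception cannot be swallowed by a broad except. The stdlib unittest module and datetime.strptime are used here only as dependency-free stand-ins (an assumption of this sketch); in a pytest suite the same assertion would be written with pytest.raises(ValueError, match=...).

```python
import unittest
from datetime import datetime


class TestExplicitRaise(unittest.TestCase):
    def test_valid_input_returns_value(self):
        # No try/except: an unexpected exception fails the test loudly.
        result = datetime.strptime("2018-03-01", "%Y-%m-%d")
        self.assertEqual(result, datetime(2018, 3, 1))

    def test_invalid_input_raises(self):
        # The expected error is asserted explicitly, including part of its message.
        with self.assertRaisesRegex(ValueError, "does not match format"):
            datetime.strptime("not a date", "%Y-%m-%d")


# Run the two tests programmatically so the sketch is self-contained.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestExplicitRaise)
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
```

Matching on the message narrows the assertion further, which addresses the concern that a bare ValueError check can pass for the wrong reason.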

Contributor

Again, I get what you want, but I seem to fail to convey my own point.

What you describe is testing that to_datetime() behaves correctly when given this or that combination of parameters. Those tests are needed, they're worthwhile, etc. But they're not what I am adding here. What I want to test is something like "when given random parameters, to_datetime() must show identical results for cache=True and cache=False", so that turning caching on or off is merely a question of whether we have enough memory to speed things up (not a question of having a behaviour change)!

This check also verifies that potential bugs in the logic, where it raises even though it shouldn't or doesn't raise when it should, are treated equally with and without cache. Finding those bugs is not the purpose of this particular test!
This includes e.g. cases where to_datetime raises an exception when errors="coerce", and I don't think I should be trying to fix all the errors in this function within this PR.

All in all, in this test I'm not at all interested in whether to_datetime does what it should. It might be shooting a rocket to the Moon for all I care, but it must be shooting the same rocket to the same Moon both with cache on and off!
Or, more precisely here, it might be raising some random ValueError, but, unless it raises different ValueErrors with cache on and off, I consider it fine. Again, checking to_datetime correctness is not the point of this particular test.

Contributor

@vnlitvinov vnlitvinov Apr 29, 2019

All that said, would it be easier to understand if I extracted this _get_answer function into a module-level helper instead of a nested function (with a proper docstring to boot)? It would probably be named something like _to_datetime_noexcept.

@vnlitvinov vnlitvinov force-pushed the fix_bug_convert_and_box_cache branch from cd1701e to 6b5227e Compare April 19, 2019 13:48
@vnlitvinov
Contributor

Note to reviewers: I've rebased onto master to get rid of test failures that were fixed in master after this PR was initially opened.

@jreback jreback added this to the 0.25.0 milestone Apr 20, 2019
@vnlitvinov vnlitvinov force-pushed the fix_bug_convert_and_box_cache branch 2 times, most recently from 7dfa8d4 to 26a2594 Compare April 22, 2019 12:47
@vnlitvinov
Contributor

Current failures seem to have been introduced by rebasing on master; see #26187 and #26189.

@vnlitvinov
Contributor

Another example of to_datetime being chaotic:

>>> pd.to_datetime(['20100101 0100+1', '20100101 0000+0'],errors='raise',cache=True)
Index([2010-01-01 01:00:00+01:00, 2010-01-01 00:00:00+00:00], dtype='object')
>>> pd.to_datetime(['20100101 0100+1', '20100101 0000+0'],errors='raise',cache=False)
Index([2010-01-01 01:00:00+01:00, 2010-01-01 00:00:00+00:00], dtype='object')
>>> pd.to_datetime(['20100101 0100+1', '20100101 0100+1', '20100101 0000+0'],errors='raise',cache=True)
Traceback (most recent call last):
  File "C:\Vass\Work\cuDF\pandas-dev\pandas\core\arrays\datetimes.py", line 1860, in objects_to_datetime64ns
    values, tz_parsed = conversion.datetime_to_datetime64(data)
  File "pandas\_libs\tslibs\conversion.pyx", line 185, in pandas._libs.tslibs.conversion.datetime_to_datetime64
    raise ValueError('Array must be all same time zone')
ValueError: Array must be all same time zone
>>> pd.to_datetime(['20100101 0100+1', '20100101 0100+1', '20100101 0000+0'],errors='raise',cache=False)
Index([2010-01-01 01:00:00+01:00, 2010-01-01 01:00:00+01:00,
       2010-01-01 00:00:00+00:00],
      dtype='object')

I'm starting to feel that this discussion (and the test I've added) is unearthing something much bigger than the error initially reported.

@jreback
Contributor

jreback commented May 12, 2019

can you merge master and update

@anmyachev anmyachev force-pushed the fix_bug_convert_and_box_cache branch from d1078e6 to 1cc469e Compare July 2, 2019 11:09
@anmyachev
Contributor Author

@jreback ping on green

Contributor

@jreback jreback left a comment

lgtm ex a question

return Index(result, name=name)
else:
return DatetimeIndex(result, name=name)
return _box_as_indexlike(result, tz=None, name=name)
Contributor

@anmyachev do we have a test like this?

you are passing tz in the call below

@jreback
Contributor

jreback commented Jul 3, 2019

lgtm. ping on green.

@anmyachev
Contributor Author

@jreback ping on green

@jreback jreback merged commit 9bee334 into pandas-dev:master Jul 3, 2019
@jreback
Contributor

jreback commented Jul 3, 2019

thanks @anmyachev
