-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TST: add test coverage for maybe_promote #23982
Conversation
Hello @h-vetinari! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2019-10-11 13:29:07 UTC |
pandas/conftest.py
Outdated
|
||
|
||
@pytest.fixture(params=TIME_DTYPES) | ||
def time_dtype(request): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not a good fixture name, nor correct as these are not times but datetimes and timedeltas which are very separate
what was the regression? |
pandas/core/dtypes/cast.py
Outdated
@@ -264,9 +289,17 @@ def maybe_promote(dtype, fill_value=np.nan): | |||
|
|||
# returns tuple of (dtype, fill_value) | |||
if issubclass(dtype.type, np.datetime64): | |||
fill_value = tslibs.Timestamp(fill_value).value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixes a regression that was silently introduced in #23796 due to lack of tests
what was the regression?
The try/catch around these were removed due to ecfe824#r235838683, but the Timestamp/Timedelta constructor choke when fed with strings/objects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what did this break from a user POV? if it didn't then don't revert this, it should fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't say it broke anything from a user perspective, just that #23796 broke something that wasn't tested (namely, trying to upcast from datetime/timedelta
to string/object
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so these should probably be using: is_datetime64_dtype(dtype)
and is_timedelta64_dtype(dtype)
as more consistent with the rest of the codebase.
I am also not sure whether much of this routine (maybe_promote) is actually used in a meaningful way. Hence I don't think these checks are needed currently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am also not sure whether much of this routine (maybe_promote) is actually used in a meaningful way. Hence I don't think these checks are needed currently.
There's a list of places it appear in in #23833, but one of the key points is in maybe_upcast_putmask
, which I want to use as the backend for .update
in #23192, see #23606, #22358 etc.
Codecov Report
@@ Coverage Diff @@
## master #23982 +/- ##
==========================================
- Coverage 93.06% 91.74% -1.32%
==========================================
Files 192 173 -19
Lines 49507 52856 +3349
==========================================
+ Hits 46075 48495 +2420
- Misses 3432 4361 +929
Continue to review full report at Codecov.
|
pandas/conftest.py
Outdated
|
||
|
||
@pytest.fixture(params=DATETIME_DTYPES + TIMEDELTA_DTYPES) | ||
def datetimedelta_dtype(request): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no
these never can be in the same fixture
They share almost the same code path, and have the same tests. If I don't have a joint fixture, the two tests would be copied essentially verbatim.
Also, they're already together e.g. in the `any_numpy_dtype` fixture (and also similar to how signed and unsigned are together in `any_integer_dtype`).
But if having two copies each of the two tests (two for datetime, two for timedelta) is your preference - no problem.
|
@jreback |
I realised that the current state of the PR was actually not that consistent, in that the statement of the docstring was not accurately reflected in what was being tested (that's because most of the current Since would not have been a very consistent state of affairs, I'm now aiming the tests exactly at the desired state of affairs, rather than having a few less xfails. I also realised that datetime and timedelta should not be compatible with each other (i.e. always upcast to object), and adapted that as well. |
we’re are way over for 0.24 changes will be merged if they can be accommodated |
Milestone says until end of December: https://github.com/pandas-dev/pandas/milestone/55 What I meant is: Please take a look as soon as you can. |
@h-vetinari i keep moving the milestone. What this means is I am not adding anything to the milestone until / until its exactly ready to merge. |
Fair enough. In any case, if this PR is merged, the path to #23192 is pretty clear/quick. |
pandas/core/dtypes/cast.py
Outdated
@@ -264,9 +289,17 @@ def maybe_promote(dtype, fill_value=np.nan): | |||
|
|||
# returns tuple of (dtype, fill_value) | |||
if issubclass(dtype.type, np.datetime64): | |||
fill_value = tslibs.Timestamp(fill_value).value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what did this break from a user POV? if it didn't then don't revert this, it should fail.
pandas/tests/dtypes/test_cast.py
Outdated
fill_value = fill_value[0] | ||
elif box_dtype == object: | ||
fill_value = fill_value.astype(box_dtype) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is too big of an add here, make a new file. you move the existing tests to
pandas/tests/dtypes/cast/test_cast.py then add this. but needs to be split up this is too massive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The existing tests are already in pandas/tests/dtypes/cast/test_cast.py
. Do you want me to make a new file pandas/tests/dtypes/cast/test_promote.py
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
existing are in pandas/tests/dtypes/test_cast.py, so make a subdir 'cast' and move the original, then add new in test_promote in that subdir
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for reviewing
pandas/core/dtypes/cast.py
Outdated
@@ -264,9 +289,17 @@ def maybe_promote(dtype, fill_value=np.nan): | |||
|
|||
# returns tuple of (dtype, fill_value) | |||
if issubclass(dtype.type, np.datetime64): | |||
fill_value = tslibs.Timestamp(fill_value).value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't say it broke anything from a user perspective, just that #23796 broke something that wasn't tested (namely, trying to upcast from datetime/timedelta
to string/object
pandas/tests/dtypes/test_cast.py
Outdated
fill_value = fill_value[0] | ||
elif box_dtype == object: | ||
fill_value = fill_value.astype(box_dtype) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The existing tests are already in pandas/tests/dtypes/cast/test_cast.py
. Do you want me to make a new file pandas/tests/dtypes/cast/test_promote.py
?
Shifted the modules to a new subfolder as desired |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
adding 700 xfailed testsis not great here. simply comment them out for now. i know you actually want to promote based on the dtype fo the fill value, but to be honest numpy already does this and i don't think we actually need to do it. I would rather show where this is not working correctly first.
I actually want to completely remove this routine (or rather make it useful), so its possible all of these tests are needed. but would do some more investigating on the use (or lack thereof) of this routine first.
pandas/core/dtypes/cast.py
Outdated
@@ -264,9 +289,17 @@ def maybe_promote(dtype, fill_value=np.nan): | |||
|
|||
# returns tuple of (dtype, fill_value) | |||
if issubclass(dtype.type, np.datetime64): | |||
fill_value = tslibs.Timestamp(fill_value).value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so these should probably be using: is_datetime64_dtype(dtype)
and is_timedelta64_dtype(dtype)
as more consistent with the rest of the codebase.
I am also not sure whether much of this routine (maybe_promote) is actually used in a meaningful way. Hence I don't think these checks are needed currently.
@@ -0,0 +1,732 @@ | |||
# -*- coding: utf-8 -*- | |||
|
|||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
needs updating
or (expected_na_value is np.nan and result_na_value is np.nan)) | ||
|
||
|
||
def test_maybe_promote_none(any_numpy_dtype): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be the same as the above case, we always promote to np.nan
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this is a left-over from the currently very inconsistent behaviour of not sometimes returning the value of fill_value
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
adding 700 xfailed tests is not great here. simply comment them out for now.
You always ask for focused PRs and "tests first", so I'm doing that here...
i know you actually want to promote based on the dtype fo the fill value, but to be honest numpy already does this and i don't think we actually need to do it.
Numpy is fraught with silent under- & overflow issues in the ints, and raising errors when safe casting is not possible. That's what I'm trying to solve here (again, I'm actually aiming at fixing maybe_upcast_putmask
, but this method is used as an auxiliary function, and IMO worth fixing).
I would rather show where this is not working correctly first.
There's a long list of examples in the OP and #23833.
I actually want to completely remove this routine (or rather make it useful), so its possible all of these tests are needed. but would do some more investigating on the use (or lack thereof) of this routine first.
This is used is a few crucial places, so not sure how easy it would be to get rid of.
pandas/core/dtypes/cast.py
Outdated
@@ -264,9 +289,17 @@ def maybe_promote(dtype, fill_value=np.nan): | |||
|
|||
# returns tuple of (dtype, fill_value) | |||
if issubclass(dtype.type, np.datetime64): | |||
fill_value = tslibs.Timestamp(fill_value).value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am also not sure whether much of this routine (maybe_promote) is actually used in a meaningful way. Hence I don't think these checks are needed currently.
There's a list of places it appear in in #23833, but one of the key points is in maybe_upcast_putmask
, which I want to use as the backend for .update
in #23192, see #23606, #22358 etc.
or (expected_na_value is np.nan and result_na_value is np.nan)) | ||
|
||
|
||
def test_maybe_promote_none(any_numpy_dtype): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this is a left-over from the currently very inconsistent behaviour of not sometimes returning the value of fill_value
.
@jreback |
The doc job had a 500 error - could someone restart it please? (@gfyoung ?) |
@jreback |
@h-vetinari this has too many xfailed tests. I am not convinced we actualy need this many tests on this routine. if you want to purse this, ok, but pls comment out 90% of them, this just causes the logs to be polluted. |
nice, not much left, thanks |
Much closer, that's true! Thanks a lot for your work on this. There's however still a couple of places where the parametrization of |
Yah. My attempts to get the box=True cases working have been much less successful than the other cases, so current thought is to try to rule out their necessity (pretty sure we discussed this elsewhere). Aside from rebasing this and the other one, if you'd like to stake out small areas to fix bit by bit, I'll try to keep focused on non-overlapping areas (currently only open PR on this is #28833) |
We discussed this here. I agree that it's tempting to want to rip out the array path, but then I'd say we also need a plan how this can be supported for cases where this is or will be needed (like some of the
Currently I'm a bit swamped, so not sure when I'll get to it. Hopefully on the weekend. |
@jbrockmendel |
@h-vetinari just merged a few things, can you rebase once again, otherwise lgtm. |
@jreback |
haha thank @jbrockmendel |
Indeed. :) |
alright @h-vetinari time to figure out the boxed cases. want to propose a gameplan? |
The first and most important decision is if we want to support the array case with I think it's beneficial to keep the logic in one place (or at least fewer places: there's similar things happening with If we want to keep the array-case, then my first suggestion would be something like I did in #25425 - split the cases into If the decision is to not support the array case, we'd just have to rip out or xfail some |
This seems like a good time to refactor out something like Another bite-size PR that would be welcome would be adding a good docstring. Do you have fixes in mind for the remaining xfailed cases? bite-sized PRs for those would be welcome. AFAICT the ndarray use cases of maybe_promote are a) pathological object-dtype cases where an ndarray is being treated like a scalar, and b) future use cases you describe. Am I missing any ways a user could get there at the moment? |
git diff upstream/master -u -- "*.py" | flake8 --diff
The errors I found for
core.dtypes.cast.maybe_promote
in #23833 were just scratching the surface. I spent about a week on this, because I had to figure out the right behaviour and how to test it simultaneously. I also spent quite some time in making the xfails as sharp as possible, so that they can almost verbatim be turned into counter-examples.This surely needs a good review, especially for the datetimelike stuff, where I'm not 100% sure whether the interaction with float/integers should cast to object, but it would be my gut feeling.
In any case, this uncovered lots of errors, of which I've added some samples below the fold.
The larger points are the following (after which the tests have been modelled):
maybe_promote(dtype, fill_value)
fundamentally asks whether it's necessary to upcastdtype
to accommodate a scalarfill_value
.int8 -> int16
instead ofint8 -> int64
. (currently broken asint + int -> float
, always)fill_value
, as a sort of vectorized operation for the scalar case (this is used like that in several places, e.g.maybe_upcast_putmask
)maybe_promote
should IMO not simply absorb the dtype offill_value
(otherwise, might as well call italways_promote
...), but should be equally prudent whether actual upcasting is necessary.infer_dtype
(at least in the case of object-typedfill_value
)int -> int
, for example.return_na_value
-kwarg to determine whether a missing value marker should be returned at all (many places in the code don't care about the second output, see BUG/Internals: maybe_promote #23833).all_numpy_dtypes
that got introduced in TST/CLN: series.duplicated; parametrisation; fix warning #21899, but it's a lie - it doesn't cover (nearly) all numpy dtypes. It's currently only used in one test, and so I've adapted it here to be able to actually test all numpy dtypes. Nevertheless, I'll split off this part in a separate PR (xref coming).Here's an incomplete extract of the many things that are going wrong currently: