REGR: to_timedelta precision issues with floating data #25651
pandas/core/arrays/timedeltas.py
Outdated
data = (coeff * data).astype(np.int64).view('timedelta64[ns]')
data[mask] = iNaT
# object_to_td64ns has custom logic for float -> int conversion
# to avoid precision issues
does this have the same perf?
No, it is slower. But it is more correct (it is written specifically to handle this case), and it is what was used before 0.24.0 anyway.
I assume we might be able to port the similar logic here to be more performant (to not work element by element, knowing we only have floats), but I would personally leave that for 0.25.0.
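For illustration, the kind of float handling discussed here can be sketched in pure Python. This is not the pandas implementation: the function names and the fixed seconds-to-nanoseconds multiplier are assumptions made for the example. The idea is to convert the whole and fractional parts separately, rounding the fraction to the digits that nanosecond resolution can carry before scaling it.

```python
NS_PER_SECOND = 10**9  # assumed multiplier for unit='s'
DIGITS = 9             # assumed number of meaningful fractional digits for seconds


def naive_seconds_to_ns(value: float) -> int:
    """Multiply first, then truncate: float noise can leak into the result."""
    return int(value * NS_PER_SECOND)


def split_seconds_to_ns(value: float) -> int:
    """Convert whole and fractional parts separately (hypothetical helper)."""
    base = int(value)            # whole seconds, truncated toward zero
    frac = value - base          # fractional seconds, may carry float noise
    frac = round(frac, DIGITS)   # drop noise below nanosecond resolution
    return base * NS_PER_SECOND + int(frac * NS_PER_SECOND)


if __name__ == "__main__":
    # Print both conversions side by side for a few sample inputs.
    for seconds in (1.5, 0.25, 2.3, 1e-6):
        print(seconds, naive_seconds_to_ns(seconds), split_seconds_to_ns(seconds))
```

Whether the two functions disagree depends on the input. The split version does more work per element, which is consistent with the slowdown mentioned in this thread.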
well the prior fix was for performance and this is a very narrow minor case
so you are causing a rather large perf regression by changing this
pls show asv before / after
Converting floats to timedelta is a narrow use case anyway. If you care about performance but do not care about precision, you can convert to integers yourself.
Codecov Report
@@ Coverage Diff @@
## master #25651 +/- ##
==========================================
- Coverage 91.26% 91.26% -0.01%
==========================================
Files 173 173
Lines 52968 52965 -3
==========================================
- Hits 48340 48337 -3
Misses 4628 4628
Codecov Report
@@ Coverage Diff @@
## master #25651 +/- ##
==========================================
+ Coverage 91.29% 91.29% +<.01%
==========================================
Files 173 173
Lines 52961 52965 +4
==========================================
+ Hits 48349 48354 +5
+ Misses 4612 4611 -1
So this has a performance impact for sure, and apparently quite a big one (around 300-400x).
Compared to 0.23.x and before, there is no performance regression. But let me look if the logic is easy to port to float branch of
We don't have any benchmarks passing floating data to |
e479c04 to 39b15aa
@jreback in the last commit (39b15aa), you can see what it would look like when using the same float-handling logic from timedeltas.pyx in This makes the conversion much faster compared to the old (pre 0.24.0) path, and it should give the same result:
On master, the The only thing I didn't do yet is a proper way for writing |
pandas/_libs/tslibs/timedeltas.pyx
Outdated
@@ -246,6 +246,47 @@ def array_to_timedelta64(object[:] values, unit='ns', errors='raise'):
return iresult.base # .base to access underlying np.ndarray

def precision_from_unit(object unit):
do this instead
make a new function that is cpdef that calls cast_from_unit, then you don't remove the inline nature from cast_from_unit and you can call it from python w/o re-writing.
So the problem is that the current cast_from_unit does more, I only need the first part of it (that defines m and p, the rest of the function calculates something with m and p and returns that result).
I tried putting the common logic in a separate function, but it starts to become a bit more ugly to write it in such a way that it can still be used in the cdef inline one.
I added a somewhat convoluted way to share the if/else part, but I don't see another easy way to share that part, except by not having cast_from_unit be a cdef function (but that has other implications elsewhere), or otherwise just duplicating the code
you need to use cpdef.
Can you cpdef inline?
no. but inline for this prob doesn't make much difference. it's the cast_from_unit where it actually matters a lot.
How do you do an except? -1 for a cpdef that returns a tuple?
actually, you are only calling this once? then making it a def is just fine (and error checking is easy)
In the new code, I am calling this only once. But the existing cast_from_unit is used in several places, and is also included in timedeltas.pxd as it is used in other places in tslibs, so it needs to stay a cdef
Updated to use a cpdef as asked. It's only used in places where a similar cpdef parse_timedelta_unit is also used, so the python interaction is indeed probably not a problem.
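The refactoring discussed in this thread can be sketched in plain Python. The real code is Cython, where the helper would be a cpdef and cast_from_unit a cdef inline. The unit table, the error handling, and the function bodies below are assumptions for illustration, not the code from this PR. One function returns the multiplier m and the precision p, and the scalar cast reuses it instead of repeating the if/else chain.

```python
from typing import Tuple


def precision_from_unit(unit: str) -> Tuple[int, int]:
    """Return (m, p): nanoseconds per unit and fractional digits to keep.

    The table is an assumed subset of units, for illustration only.
    """
    table = {
        "D": (86_400 * 10**9, 9),
        "h": (3_600 * 10**9, 9),
        "m": (60 * 10**9, 9),
        "s": (10**9, 9),
        "ms": (10**6, 6),
        "us": (10**3, 3),
        "ns": (1, 0),
    }
    try:
        return table[unit]
    except KeyError:
        raise ValueError(f"unsupported unit: {unit!r}") from None


def cast_from_unit(value: float, unit: str) -> int:
    """Convert a float in `unit` to integer nanoseconds (sketch)."""
    m, p = precision_from_unit(unit)  # shared lookup, no duplicated if/else
    base = int(value)                 # whole units, truncated toward zero
    frac = value - base               # fractional part of a unit
    if p:
        frac = round(frac, p)         # drop float noise below 1 ns
    return base * m + int(frac * m)
```

A caller that needs only m and p, such as a vectorized array path, can call precision_from_unit directly and apply the same base/frac arithmetic to the whole array.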
pandas/_libs/tslibs/timedeltas.pyx
Outdated
int64_t m
int p

cdef inline int _precision_from_unit(object unit,
why are you going thru all of this trouble? use cpdef
doc/source/whatsnew/v0.24.2.rst
Outdated
@@ -31,6 +31,7 @@ Fixed Regressions
- Fixed regression in ``IntervalDtype`` construction where passing an incorrect string with 'Interval' as a prefix could result in a ``RecursionError``. (:issue:`25338`)
- Fixed regression in creating a period-dtype array from a read-only NumPy array of period objects. (:issue:`25403`)
- Fixed regression in :class:`Categorical`, where constructing it from a categorical ``Series`` and an explicit ``categories=`` that differed from that in the ``Series`` created an invalid object which could trigger segfaults. (:issue:`25318`)
- Fixed regression in :func:`to_timedelta` loosing precision when converting floating data to ``Timedelta`` data (:issue:`25077`).
loosing -> losing
looks good. can u just run some quick timings to see if anything else is affected by this - it’s a heavily used routine
I did some quick timings of |
ok then - lgtm thanks
I also just ran the timedelta benchmarks for this branch compared to master, and it says nothing changed significantly (although I don't know if we are benchmarking a critical path that relies heavily on the changed function)
Closes #25077