REGR: to_timedelta precision issues with floating data #25651

jorisvandenbossche · 2019-03-11T09:37:20Z

jreback · 2019-03-11T10:01:13Z

pandas/core/arrays/timedeltas.py

-        data = (coeff * data).astype(np.int64).view('timedelta64[ns]')
-        data[mask] = iNaT
+        # object_to_td64ns has custom logic for float -> int conversion
+        # to avoid precision issues


does this have the same perf?

No, it is slower. But it is more correct (it is written specifically to handle this case), and it is what was used before 0.24.0 anyway.

I assume we could port similar logic here to make it more performant (not working element by element, since we know we only have floats), but I would personally leave that for 0.25.0.

well the prior fix was for performance and this is a very narrow minor case

so you are causing a rather large perf regression by changing this

pls show asv before / after

Converting floats to timedelta is a narrow use case anyway. If you care about performance but do not care about precision, you can convert to integers yourself.
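As an illustration of the "convert to integers yourself" route, here is a minimal NumPy-only sketch that mirrors the line removed in the diff above. It assumes the input is in seconds and contains no NaN values; the variable names are illustrative, and the result is only as precise as the float multiplication allows.

```python
import numpy as np

# Float seconds; values chosen to be exactly representable in binary.
seconds = np.array([1.0, 2.5, 0.25])

# Scale to nanoseconds in floating point, then truncate to int64.
# This is fast (fully vectorized) but inherits any float rounding error.
ns = (seconds * 1_000_000_000).astype(np.int64)

# Reinterpret the int64 buffer as timedelta64[ns] without copying.
fast = ns.view("timedelta64[ns]")
print(fast)
```

An integer array like `ns` can also be handed to `to_timedelta` directly, which skips the float branch discussed in this thread.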

codecov · 2019-03-11T10:13:56Z

Codecov Report

Merging #25651 into master will decrease coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #25651      +/-   ##
==========================================
- Coverage   91.26%   91.26%   -0.01%     
==========================================
  Files         173      173              
  Lines       52968    52965       -3     
==========================================
- Hits        48340    48337       -3     
  Misses       4628     4628

Flag	Coverage Δ
#multiple	`89.83% <100%> (-0.01%)`	⬇️
#single	`41.7% <100%> (-0.01%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/arrays/timedeltas.py	`88.13% <100%> (-0.07%)`	⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f886139...4e36a9b. Read the comment docs.

codecov · 2019-03-11T10:13:56Z

Codecov Report

Merging #25651 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #25651      +/-   ##
==========================================
+ Coverage   91.29%   91.29%   +<.01%     
==========================================
  Files         173      173              
  Lines       52961    52965       +4     
==========================================
+ Hits        48349    48354       +5     
+ Misses       4612     4611       -1

Flag	Coverage Δ
#multiple	`89.87% <100%> (ø)`	⬆️
#single	`41.73% <83.33%> (-0.01%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/arrays/timedeltas.py	`88.29% <100%> (+0.09%)`	⬆️
pandas/util/testing.py	`89.08% <0%> (+0.09%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 21769e9...053df8d. Read the comment docs.

jorisvandenbossche · 2019-03-11T11:15:05Z

So this has a performance impact for sure, and a quite big one apparently (around 300-400 x).

so you are causing a rather large perf regression by changing this

Compared to 0.23.x and before, there is no performance regression. But let me check whether the logic is easy to port to the float branch of sequence_to_td64ns.
(And I doubt this change was originally made with performance in mind; in any case, that is not mentioned in PR #23539.)

pls show asv before / after

We don't have any benchmarks passing floating data to to_timedelta / TimedeltaIndex

jorisvandenbossche · 2019-03-11T12:57:05Z

@jreback in the last commit (39b15aa), you can see what it would look like when using the same float-handling logic from timedeltas.pyx in sequence_to_td64ns.

This makes the conversion much faster than the old (pre-0.24.0) path, while it should give the same result:

In [1]: arr = np.arange(0, 1, 1e-6)

In [9]: %timeit pd.core.arrays.timedeltas.sequence_to_td64ns(arr, unit='s')[0] 
21.8 ms ± 200 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [10]: %timeit pd.core.arrays.timedeltas.objects_to_td64ns(arr, unit='s')  <-- basically what was being used < 0.24
2.84 s ± 35.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

On master, sequence_to_td64ns takes only around 8 ms, so this PR would still be a 2-3x slowdown compared to master, but it would still give around a 100x speedup compared to < 0.24.0.
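The ratios quoted above follow directly from the timings in the session. A quick check, using only the three numbers reported in this comment (the variable names are illustrative):

```python
# Timings reported in the comment, all in milliseconds.
new_path_ms = 21.8      # sequence_to_td64ns on this PR
old_path_ms = 2840.0    # objects_to_td64ns, roughly the < 0.24 path
master_ms = 8.0         # sequence_to_td64ns on master (approximate)

slowdown_vs_master = new_path_ms / master_ms
speedup_vs_old = old_path_ms / new_path_ms

print(f"slowdown vs master: {slowdown_vs_master:.1f}x")
print(f"speedup vs < 0.24:  {speedup_vs_old:.0f}x")
```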

The only thing I haven't done yet is find a proper way to write precision_from_unit. Currently, it is basically a copy-paste of the cdef inline function cast_from_unit below it. But that function is not usable from Python, and making it non-inline might have performance implications.
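To make the float-handling idea concrete, here is a NumPy sketch of splitting each value into an integer part and a fractional part before scaling, so that the large integer part is never pushed through a float multiplication. It is an illustration under assumptions, not the code from the commit: `floats_to_td64ns` is a hypothetical name, `m` is assumed to be the number of nanoseconds per unit, `p` the number of decimal digits kept in the fraction, and NaN is replaced by 0 before casting purely to avoid cast warnings.

```python
import numpy as np


def floats_to_td64ns(data, m, p):
    """Convert float durations to timedelta64[ns] (illustrative sketch).

    m: nanoseconds per input unit (assumed), p: decimals kept in the fraction.
    """
    data = np.asarray(data, dtype=np.float64)
    mask = np.isnan(data)

    # Replace NaN by 0 so the integer casts below are well defined.
    filled = np.where(mask, 0.0, data)

    # Integer part (truncated toward zero) and the remaining fraction.
    base = filled.astype(np.int64)
    frac = filled - base
    if p:
        frac = np.round(frac, p)

    # Scale the integer part exactly in int64; only the small fraction
    # goes through a float multiplication.
    ns = base * m + (frac * m).astype(np.int64)

    result = ns.view("timedelta64[ns]")
    result[mask] = np.timedelta64("NaT")  # restore missing values
    return result


out = floats_to_td64ns([1.5, 2.25, float("nan")], m=10**9, p=9)
print(out)
```

Everything here is vectorized, which is consistent with the point made above about not working element by element.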

jreback · 2019-03-11T13:00:13Z

pandas/_libs/tslibs/timedeltas.pyx

@@ -246,6 +246,47 @@ def array_to_timedelta64(object[:] values, unit='ns', errors='raise'):
    return iresult.base  # .base to access underlying np.ndarray


+def precision_from_unit(object unit):


do this instead

make a new function that is cpdef that calls cast_from_unit, then you don't remove the inline nature from cast_from_unit and you can call it from python w/o re-writing.

So the problem is that the current cast_from_unit does more, I only need the first part of it (that defines m and p, the rest of the function calculates something with m and p and returns that result).

I tried putting the common logic in a separate function, but it starts to become a bit more ugly to write it in such a way that it can still be used in the cdef inline one.

I added a somewhat convoluted way to share the if/else part, but I don't see another easy way to share it, except by not having cast_from_unit be a cdef function (but that has other implications elsewhere), or otherwise by just duplicating the code.

you need to use cpdef.

Can you cpdef inline?

no. but inline for this prob doesn't make much difference. its the cast_from_unit where it actually matters a lot.

How do you do an except? -1 for a cpdef that returns a tuple?

actually. you are only calling this once? then making it a def is just fine (and error checking is easy)

In the new code, I am calling this only once. But the existing cast_from_unit is used in several places, and is also included in timedeltas.pxd as it is used in other places in tslibs, so it needs to stay a cdef.

Updated to use a cpdef as asked. It's only used in places where a similar cpdef parse_timedelta_unit is also used, so the python interaction is indeed probably not a problem.
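For readers without the Cython source at hand, the "first part" of cast_from_unit discussed in this thread (the part that defines m and p) can be pictured as a lookup from a unit string to a pair. The following is a pure-Python sketch only: the name `precision_from_unit_sketch`, the abbreviated unit table, and the specific values are assumptions made for illustration (m as nanoseconds per unit, p as decimal digits to keep), not the function added in this PR.

```python
# Hypothetical stand-in: unit -> (m, p), where m is assumed to be the
# number of nanoseconds in one unit and p the number of decimal digits
# that remain meaningful at nanosecond resolution.
_UNIT_TABLE = {
    "D": (86_400 * 10**9, 9),
    "h": (3_600 * 10**9, 9),
    "m": (60 * 10**9, 9),
    "s": (10**9, 9),
    "ms": (10**6, 6),
    "us": (10**3, 3),
    "ns": (1, 0),
}


def precision_from_unit_sketch(unit):
    """Return (m, p) for a unit string, raising ValueError if unknown."""
    try:
        return _UNIT_TABLE[unit]
    except KeyError:
        raise ValueError(f"cannot cast unit {unit!r}") from None


m, p = precision_from_unit_sketch("s")
print(m, p)
```

Because a plain Python function raises exceptions normally, the question of how to declare an error return value for a tuple result does not arise in this form; that concern is specific to the Cython declaration.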

jreback · 2019-03-11T14:49:57Z

pandas/_libs/tslibs/timedeltas.pyx

-        int64_t m
-        int p
-
+cdef inline int _precision_from_unit(object unit,


why are you going thru all of this trouble? use cpdef

TomAugspurger · 2019-03-11T14:11:06Z

doc/source/whatsnew/v0.24.2.rst

@@ -31,6 +31,7 @@ Fixed Regressions
 - Fixed regression in ``IntervalDtype`` construction where passing an incorrect string with 'Interval' as a prefix could result in a ``RecursionError``. (:issue:`25338`)
 - Fixed regression in creating a period-dtype array from a read-only NumPy array of period objects. (:issue:`25403`)
 - Fixed regression in :class:`Categorical`, where constructing it from a categorical ``Series`` and an explicit ``categories=`` that differed from that in the ``Series`` created an invalid object which could trigger segfaults. (:issue:`25318`)
+- Fixed regression in :func:`to_timedelta` loosing precision when converting floating data to ``Timedelta`` data (:issue:`25077`).


loosing -> losing

jorisvandenbossche · 2019-03-11T17:29:28Z

cc @jbrockmendel

jreback

looks good. can u just run some quick timings to see if anything else is affected by this - it’s a heavily used routine

jorisvandenbossche · 2019-03-11T23:22:39Z

can u just run some quick timings to see if anything else is affected by this - it’s a heavily used routine

I did some quick timings of pd._libs.tslibs.timedeltas.array_to_timedelta64(arr, unit='s') (with an all float array, the extreme case, which in practice will not go through this path any more) and of pd.Timedelta("4 days 3 hours 2 minutes 1 second"), and for those, I don't see a significant difference.

jreback · 2019-03-11T23:25:42Z

ok then - lgtm thanks

jorisvandenbossche · 2019-03-12T12:56:18Z

I also just ran the timedelta benchmarks for this branch compared to master, and it says nothing changed significantly (although I don't know if we are benchmarking a critical path that relies heavily on the changed function)

REGR: to_timedelta precision issues with floating data

4e36a9b

jorisvandenbossche added Regression Functionality that used to work in a prior pandas version Timedelta Timedelta data type labels Mar 11, 2019

jorisvandenbossche added this to the 0.24.2 milestone Mar 11, 2019

jreback requested changes Mar 11, 2019

jorisvandenbossche added 2 commits March 11, 2019 13:15

Merge remote-tracking branch 'upstream/master' into to_timedelta-float

fb67db4

POC using proper logic

39b15aa

jorisvandenbossche force-pushed the to_timedelta-float branch from e479c04 to 39b15aa Compare March 11, 2019 12:50

jreback reviewed Mar 11, 2019

jorisvandenbossche added 2 commits March 11, 2019 14:32

one possible way to share implementation

338a652

clean-up error handling

5cc3c39

jreback requested changes Mar 11, 2019

TomAugspurger reviewed Mar 11, 2019

jorisvandenbossche added 3 commits March 11, 2019 23:48

make it a cpdef

943888b

typo whatsnew

74c3e32

Merge remote-tracking branch 'upstream/master' into to_timedelta-float

053df8d

jreback reviewed Mar 11, 2019

jreback approved these changes Mar 12, 2019

jreback merged commit bace4d0 into pandas-dev:master Mar 12, 2019

meeseeksmachine mentioned this pull request Mar 12, 2019

Backport PR #25651 on branch 0.24.x (REGR: to_timedelta precision issues with floating data) #25687

Merged

meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Mar 12, 2019

Backport PR pandas-dev#25651: REGR: to_timedelta precision issues wit…

dc66c9b

…h floating data

jorisvandenbossche deleted the to_timedelta-float branch March 12, 2019 12:53

jreback pushed a commit that referenced this pull request Mar 12, 2019

Backport PR #25651: REGR: to_timedelta precision issues with floating…

c53c9d1

… data (#25687)
