Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

REGR: to_timedelta precision issues with floating data #25651

Merged
merged 8 commits into from
Mar 12, 2019

Conversation

jorisvandenbossche
Copy link
Member

Closes #25077

@jorisvandenbossche jorisvandenbossche added Regression Functionality that used to work in a prior pandas version Timedelta Timedelta data type labels Mar 11, 2019
@jorisvandenbossche jorisvandenbossche added this to the 0.24.2 milestone Mar 11, 2019
data = (coeff * data).astype(np.int64).view('timedelta64[ns]')
data[mask] = iNaT
# object_to_td64ns has custom logic for float -> int conversion
# to avoid precision issues
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this have the same perf?

Copy link
Member Author

@jorisvandenbossche jorisvandenbossche Mar 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it is slower. But it is more correct (it is written specifically to handle this case), and it is what was used before 0.24.0 anyway.

I assume we might be able to port the similar logic here to be more performant (to not work element by element, knowing we only have floats), but I would personally leave that for 0.25.0.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well the prior fix was for performance and this is a very narrow minor case

so you are causing a rather large perf regression by changing this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls shown asv before / after

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Converting floats to timedelta is a narrow use case anyway. If you care about performance but not care about precision, you can convert to integers yourself.

@codecov
Copy link

codecov bot commented Mar 11, 2019

Codecov Report

Merging #25651 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #25651      +/-   ##
==========================================
- Coverage   91.26%   91.26%   -0.01%     
==========================================
  Files         173      173              
  Lines       52968    52965       -3     
==========================================
- Hits        48340    48337       -3     
  Misses       4628     4628
Flag Coverage Δ
#multiple 89.83% <100%> (-0.01%) ⬇️
#single 41.7% <100%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/arrays/timedeltas.py 88.13% <100%> (-0.07%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f886139...4e36a9b. Read the comment docs.

@codecov
Copy link

codecov bot commented Mar 11, 2019

Codecov Report

Merging #25651 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #25651      +/-   ##
==========================================
+ Coverage   91.29%   91.29%   +<.01%     
==========================================
  Files         173      173              
  Lines       52961    52965       +4     
==========================================
+ Hits        48349    48354       +5     
+ Misses       4612     4611       -1
Flag Coverage Δ
#multiple 89.87% <100%> (ø) ⬆️
#single 41.73% <83.33%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/arrays/timedeltas.py 88.29% <100%> (+0.09%) ⬆️
pandas/util/testing.py 89.08% <0%> (+0.09%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 21769e9...053df8d. Read the comment docs.

@jorisvandenbossche
Copy link
Member Author

So this has a performance impact for sure, and a quite big one apparently (around 300-400 x).

so you are causing a rather large perf regression by changing this

Compared to 0.23.x and before, there is no performance regression. But let me look if the logic is easy to port to float branch of sequence_to_td64ns.
(and I doubt this change was originally done with performance in mind, in any case that is not mentioned in the PR #23539)

pls shown asv before / after

We don't have any benchmarks passing floating data to to_timedelta / TimedeltaIndex

@jorisvandenbossche
Copy link
Member Author

@jreback in the last commit (39b15aa), you can see what it would look like when using the same float-handling logic from timedeltas.pyx in sequence_to_td64ns.

This makes that the conversion is much faster compared to the old (pre 0.24.0) path, but should have the same result:

In [1]: arr = np.arange(0, 1, 1e-6)

In [9]: %timeit pd.core.arrays.timedeltas.sequence_to_td64ns(arr, unit='s')[0] 
21.8 ms ± 200 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [10]: %timeit pd.core.arrays.timedeltas.objects_to_td64ns(arr, unit='s')  <-- basically what was being used < 0.24
2.84 s ± 35.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

On master, the sequence_to_td64ns takes only around 8 ms, so this PR would still be a 2-3 times slowdown compared to master, but still give around 100 x speedup compared to < 0.24.0.

The only thing I didn't do yet is a proper way for writing precision_from_unit. Currently, it is basically a copy paste from the cdef inline function cast_from_unit below it. But this is not usable from python, and making it non-inline might have performance implications.

@@ -246,6 +246,47 @@ def array_to_timedelta64(object[:] values, unit='ns', errors='raise'):
return iresult.base # .base to access underlying np.ndarray


def precision_from_unit(object unit):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do this instead

make a new function that is cpdef that calls cast_from_unit, then you don't remove the inline nature from cast_from_unit and you can call it from python w/o re-writing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the problem is that the current cast_from_unit does more, I only need the first part of it (that defines m and p, the rest of the function calculates something with m and p and returns that result).

I tried putting the common logic in a separate function, but it starts to become a bit more ugly to write it in such a way that it can still be used in the cdef inline one.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a bit convoluted way to share the if/else part, but I don't see an easy other way to share that part, except by not having cast_from_unit a cdef function (but that has other implications elsewhere), or otherwise just duplicating the code

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you need to use cpdef.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you cpdef inline?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no. but inline for this prob doesn't make much difference. its the cast_from_unit where it actually matters a lot.

Copy link
Member Author

@jorisvandenbossche jorisvandenbossche Mar 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do you do an except? -1 for a cpdef that returns a tuple?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually. you are only calling this once? then make it a def is just fine (and error checkng is easy)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the new code, I am calling this only once. But the existing cast_from_unit is used in several places, and is also included in timedeltas.pxd as it used in other places in tslibs, so needs to stay a cdef

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to use a cpdef as asked. It's only used in places where a similar cpdef parse_timedelta_unit is also used, so the python interaction is indeed probably not a problem.

int64_t m
int p

cdef inline int _precision_from_unit(object unit,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are you going thru all of this trouble? use cpdef

@@ -31,6 +31,7 @@ Fixed Regressions
- Fixed regression in ``IntervalDtype`` construction where passing an incorrect string with 'Interval' as a prefix could result in a ``RecursionError``. (:issue:`25338`)
- Fixed regression in creating a period-dtype array from a read-only NumPy array of period objects. (:issue:`25403`)
- Fixed regression in :class:`Categorical`, where constructing it from a categorical ``Series`` and an explicit ``categories=`` that differed from that in the ``Series`` created an invalid object which could trigger segfaults. (:issue:`25318`)
- Fixed regression in :func:`to_timedelta` loosing precision when converting floating data to ``Timedelta`` data (:issue:`25077`).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

loosing -> losing

@jorisvandenbossche
Copy link
Member Author

cc @jbrockmendel

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good. can u just run some quick timings to see if anything else is affected by this - it’s a heavily used routine

@jorisvandenbossche
Copy link
Member Author

can u just run some quick timings to see if anything else is affected by this - it’s a heavily used routine

I did some quick timings of pd._libs.tslibs.timedeltas.array_to_timedelta64(arr, unit='s') (with an all float array, the extreme case, which in practice will not go through this path any more) and of pd.Timedelta("4 days 3 hours 2 minutes 1 second"), and for those, I don't see a significant difference.

@jreback
Copy link
Contributor

jreback commented Mar 11, 2019

ok then - lgtm thanks

@jreback jreback merged commit bace4d0 into pandas-dev:master Mar 12, 2019
meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Mar 12, 2019
@jorisvandenbossche jorisvandenbossche deleted the to_timedelta-float branch March 12, 2019 12:53
@jorisvandenbossche
Copy link
Member Author

I also just ran the timedelta benchmarks for this branch compared to master, and it says nothing changed significantly (although I don't know if we are benchmarking a critical path that relies heavily on the changed function)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Regression Functionality that used to work in a prior pandas version Timedelta Timedelta data type
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants