Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement business_start/end cases for shift_months #18489

Merged
merged 5 commits into from
Nov 27, 2017

Conversation

jbrockmendel
Copy link
Member

Unifies apply_index implementations for MonthEnd/MonthBegin, plus extends them to BMonthEnd and BMonthBegin.

Unifies onOffset implementations for QuarterEnd/BQuarterEnd, plus extends them to QuarterBegin/BQuarterBegin.

Implements a cdef version of monthrange.

@jbrockmendel
Copy link
Member Author

(will run asv shortly)


from numpy cimport int64_t, int32_t


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we call this calendar.pxd ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should avoid overlap with stdlib names. The thought behind ccalendar is by analogy with chardet --> cchardet

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't think that's actually a problem, I guess ccalendar is ok.

@cython.boundscheck(False)
@cython.cdivision
cdef int dayofweek(int y, int m, int d) nogil:
"""Sakamoto's method, from wikipedia"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add the link

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doc-string

@cython.wraparound(False)
@cython.boundscheck(False)
cpdef monthrange(int64_t year, Py_ssize_t month):
cdef:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doc-string



cdef int is_leapyear(int64_t year) nogil:
"""Returns 1 if the given year is a leap year, 0 otherwise."""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doc-string

from numpy cimport int64_t, int32_t


cpdef monthrange(int64_t year, Py_ssize_t month)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so monthrange is actually used in a few places. changing to use this one? (similar with other methods you defined), e.g. its defined in tslib.pyx now (as well as normalize_date)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is. I figured I'd way to update the other usages until after you pointed it out, keep the diff smaller for the first round.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no I'd rather see the far reaching changes (if any) now. Ideally in a PR you change a small number of things, but clean them up everywhere (e.g. replace usages of monthrange). sure its not always possible, but then its not 'half-done'

# ----------------------------------------------------------------------
# Constants

# Slightly more performant cython lookups than a 2D table
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then add that the first 12 are non-leap years and second are.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do.

@jreback jreback added Clean Frequency DateOffsets labels Nov 25, 2017
@jbrockmendel
Copy link
Member Author

OK, I've addressed all the comments locally, am running benchmarks (several times) and fixed the travis flake problem. Doing some extra cleanup and parametrization of the tests to make sure the affected offset methods are covered, will update later.

@@ -929,8 +929,9 @@ def name(self):
if self.isAnchored:
return self.rule_code
else:
month = liboffsets._int_to_month[self.n]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prob should de-privatize these in offsets (_int_to_month)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure

@jbrockmendel
Copy link
Member Author

OK, just confirmed: the existing tests do not hit the new apply_index methods, and the speedup is pretty enormous.

@jreback
Copy link
Contributor

jreback commented Nov 26, 2017

what is the perf issue here?

@jbrockmendel
Copy link
Member Author

what is the perf issue here?

I'm still running asvs to try to pin it down. AFAICT something in ccalendar must be less performant than the version in src/datetime

taskset 4 asv continuous -f 1.1 -E virtualenv master HEAD -b offset -b timeseries -b period -b timedelta
[...]
       before           after         ratio
     [be66ef83]       [b244bab9]
+      2.50±0.3ms       4.81±0.4ms     1.93  period.Algorithms.time_drop_duplicates('series')
+     2.81±0.02ms      5.27±0.07ms     1.88  period.Algorithms.time_value_counts('series')
+        398±20ms          460±1ms     1.16  offset.SemiMonthOffset.time_begin_apply_index
+         401±1ms          461±2ms     1.15  offset.SemiMonthOffset.time_end_apply_index
+         406±6ms          458±2ms     1.13  offset.SemiMonthOffset.time_begin_incr_rng
-      22.8±0.5μs      20.4±0.06μs     0.89  timeseries.DatetimeAccessor.time_dt_accessor
-         126±6μs        112±0.3μs     0.89  timeseries.DatetimeIndex.time_unique
-           1.56s            1.38s     0.89  offset.ApplyIndex.time_apply_series(<BusinessYearBegin: month=1>)
-        68.0±3μs       60.1±0.2μs     0.88  period.PeriodUnaryMethods.time_now('min')
-      79.1±0.1μs       68.6±0.2μs     0.87  period.PeriodProperties.time_start_time('M')
-        25.7±2ms         22.0±0ms     0.86  timeseries.DatetimeIndex.time_to_pydatetime
-         180±5ns        154±0.6ns     0.85  timedelta.TimedeltaProperties.time_timedelta_nanoseconds
-           1.61s            1.37s     0.85  offset.ApplyIndex.time_apply_series(<BusinessYearEnd: month=12>)
-     11.3±0.07μs      9.55±0.08μs     0.85  offset.YearBegin.time_timeseries_year_apply
-        24.8±2μs       20.8±0.1μs     0.84  offset.SemiMonthOffset.time_begin_decr
-           25.2s            21.1s     0.84  gil.nogil_datetime_fields.time_period_to_datetime
-      26.9±0.2μs      22.1±0.09μs     0.82  period.PeriodUnaryMethods.time_now('M')
-           1.74s            1.42s     0.82  offset.ApplyIndex.time_apply_series(<BusinessQuarterEnd: startingMonth=3>)
-        1.04±0μs          848±2ns     0.81  period.PeriodProperties.time_minute('min')
-           9.62s            7.76s     0.81  gil.nogil_datetime_fields.time_datetime_to_period
-      9.17±0.3ms      7.31±0.09ms     0.80  timeseries.DatetimeIndex.time_dti_tz_factorize
-        15.3±1ms      12.2±0.06ms     0.80  timeseries.DatetimeIndex.time_infer_freq_none
-           1.46s       14.5±0.3ms     0.01  offset.ApplyIndex.time_apply_series(<BusinessMonthEnd>)
-           1.48s      13.8±0.08ms     0.01  offset.ApplyIndex.time_apply_series(<BusinessMonthBegin>)
-           1.50s      11.9±0.09ms     0.01  offset.ApplyIndex.time_apply_index(<BusinessMonthBegin>)
-           1.54s       12.3±0.2ms     0.01  offset.ApplyIndex.time_apply_index(<BusinessMonthEnd>)

b244bab9 is after updating all the cimports as suggested, except for one in fields that I reverted to try track down the regression.

@codecov
Copy link

codecov bot commented Nov 26, 2017

Codecov Report

Merging #18489 into master will decrease coverage by 0.01%.
The diff coverage is 95.23%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #18489      +/-   ##
==========================================
- Coverage   91.33%   91.32%   -0.02%     
==========================================
  Files         163      163              
  Lines       49801    49773      -28     
==========================================
- Hits        45487    45454      -33     
- Misses       4314     4319       +5
Flag Coverage Δ
#multiple 89.12% <95.23%> (-0.02%) ⬇️
#single 40.72% <28.57%> (-0.05%) ⬇️
Impacted Files Coverage Δ
pandas/tseries/offsets.py 96.94% <95.23%> (-0.09%) ⬇️
pandas/core/indexes/interval.py 92.69% <0%> (-0.53%) ⬇️
pandas/core/indexes/datetimes.py 95.52% <0%> (-0.2%) ⬇️
pandas/core/generic.py 95.73% <0%> (-0.06%) ⬇️
pandas/core/indexes/base.py 96.4% <0%> (-0.02%) ⬇️
pandas/core/frame.py 97.8% <0%> (-0.01%) ⬇️
pandas/core/strings.py 98.46% <0%> (-0.01%) ⬇️
pandas/core/internals.py 94.47% <0%> (-0.01%) ⬇️
pandas/core/panel.py 97.14% <0%> (+0.28%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d101064...d3ff628. Read the comment docs.

@codecov
Copy link

codecov bot commented Nov 26, 2017

Codecov Report

Merging #18489 into master will decrease coverage by 50.56%.
The diff coverage is 28.57%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master   #18489       +/-   ##
===========================================
- Coverage   91.33%   40.77%   -50.57%     
===========================================
  Files         163      163               
  Lines       49801    49796        -5     
===========================================
- Hits        45487    20304    -25183     
- Misses       4314    29492    +25178
Flag Coverage Δ
#multiple ?
#single 40.77% <28.57%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/tseries/offsets.py 44.04% <28.57%> (-52.99%) ⬇️
pandas/tools/hashing.py 0% <0%> (-100%) ⬇️
pandas/io/sas/sas_constants.py 0% <0%> (-100%) ⬇️
pandas/io/formats/style.py 0% <0%> (-100%) ⬇️
pandas/parser.py 0% <0%> (-100%) ⬇️
pandas/lib.py 0% <0%> (-100%) ⬇️
pandas/io/json/json.py 0% <0%> (-100%) ⬇️
pandas/types/concat.py 0% <0%> (-100%) ⬇️
pandas/io/sas/sas7bdat.py 0% <0%> (-91.08%) ⬇️
pandas/io/sas/sas_xport.py 0% <0%> (-90.28%) ⬇️
... and 112 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d101064...c503c53. Read the comment docs.

@jbrockmendel
Copy link
Member Author

Separating the ccalendar stuff from the new apply_index and onOffset methods. The newly-pushed commit should be an unambiguous win.

asv continuous -f 1.1 -E virtualenv master HEAD -b offset -b period -b timedelta -b timeseries
[...]
     [d1010643]       [d3ff6286]
+     27.5±0.09μs         36.5±2μs     1.33  offset.CBDayHolidays.time_custom_bday_cal_decr
+         499±1ms         612±20ms     1.23  offset.ApplyIndex.time_apply_series(<QuarterEnd: startingMonth=3>)
+     18.5±0.04μs       22.2±0.8μs     1.20  offset.Day.time_timeseries_day_apply
+         467±2ms          554±2ms     1.19  offset.SemiMonthOffset.time_end_apply_index
+      22.1±0.1μs       25.0±0.2μs     1.13  offset.SemiMonthOffset.time_end_decr_n
+      24.3±0.1μs       27.5±0.1μs     1.13  offset.CBDayHolidays.time_custom_bday_cal_incr_n
+         911±4ns      1.02±0.06μs     1.12  period.PeriodProperties.time_is_leap_year('M')
+     21.0±0.09μs         23.4±1μs     1.11  period.PeriodUnaryMethods.time_now('M')
+       107±0.2ns         118±10ns     1.10  timedelta.TimedeltaProperties.time_timedelta_seconds
-           1.73s            1.53s     0.89  timeseries.Iteration.time_iter_periodindex
-     3.93±0.04ms      3.47±0.02ms     0.88  period.PeriodIndexConstructor.time_from_pydatetime('D')
-     7.60±0.07μs      6.60±0.02μs     0.87  timeseries.DatetimeIndex.time_timestamp_tzinfo_cons
-           1.62s            1.41s     0.87  offset.ApplyIndex.time_apply_series(<BusinessYearEnd: month=12>)
-         530±9μs        459±0.6μs     0.87  period.Algorithms.time_value_counts('index')
-         536±7ms          458±3ms     0.85  offset.SemiMonthOffset.time_begin_apply_index
-      1.03±0.1μs          866±7ns     0.84  period.PeriodProperties.time_day('min')
-        133±10μs       111±0.05μs     0.83  timeseries.DatetimeIndex.time_unique
-     1.05±0.04μs          873±1ns     0.83  period.PeriodProperties.time_dayofyear('min')
-      20.1±0.3μs      16.4±0.09μs     0.82  timeseries.AsOf.time_asof_single_early
-      4.39±0.1ms      3.31±0.01ms     0.75  timeseries.ToDatetime.time_cache_true_with_unique_seconds_and_unit
-           24.0s            14.2s     0.59  gil.nogil_datetime_fields.time_datetime_to_period
-           1.46s       20.8±0.8ms     0.01  offset.ApplyIndex.time_apply_index(<BusinessMonthEnd>)
-           1.55s       20.9±0.1ms     0.01  offset.ApplyIndex.time_apply_series(<BusinessMonthBegin>)
-           1.61s       19.6±0.4ms     0.01  offset.ApplyIndex.time_apply_index(<BusinessMonthBegin>)
-           1.79s       21.6±0.2ms     0.01  offset.ApplyIndex.time_apply_series(<BusinessMonthEnd>)
-         668±2μs      6.93±0.02μs     0.01  offset.OnOffset.time_on_offset(<BusinessQuarterBegin: startingMonth=3>)
-         628±2μs      6.10±0.04μs     0.01  offset.OnOffset.time_on_offset(<QuarterBegin: startingMonth=3>)

Running again. The 100x improvements are real. The slowdowns I expect to be noise.

@jreback
Copy link
Contributor

jreback commented Nov 26, 2017

what path was BusinessMonthEnd for example taking before that made it a lot slower? all in python?

@jbrockmendel
Copy link
Member Author

The base class applyindex raises, at which point it falls back to point wise addition

@jreback jreback added the Performance Memory or execution speed performance label Nov 27, 2017
@jreback jreback added this to the 0.22.0 milestone Nov 27, 2017
@jreback
Copy link
Contributor

jreback commented Nov 27, 2017

this needs a note for performance in whats for 0.22.0. ping when pushed as this is green already.

@jreback
Copy link
Contributor

jreback commented Nov 27, 2017

add an additional line (even though you have one for #18218). in general any PR that is perf related should get an entry (or you can include that PR number on an existing entry). cleaning/reorg PR's that don't touch the user don't need ones.

@jbrockmendel
Copy link
Member Author

ping

@jreback
Copy link
Contributor

jreback commented Nov 27, 2017

thanks!

@jreback jreback merged commit f745e52 into pandas-dev:master Nov 27, 2017
@jbrockmendel jbrockmendel deleted the tslibs-offsets6 branch December 8, 2017 19:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Clean Frequency DateOffsets Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants