-
-
Notifications
You must be signed in to change notification settings - Fork 18.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CLN: ASV groupby benchmarks #18611
CLN: ASV groupby benchmarks #18611
Conversation
Codecov Report
@@ Coverage Diff @@
## master #18611 +/- ##
==========================================
+ Coverage 91.44% 91.45% +<.01%
==========================================
Files 157 157
Lines 51449 51449
==========================================
+ Hits 47048 47051 +3
+ Misses 4401 4398 -3
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #18611 +/- ##
==========================================
- Coverage 91.6% 91.58% -0.02%
==========================================
Files 153 153
Lines 51253 51253
==========================================
- Hits 46950 46941 -9
- Misses 4303 4312 +9
Continue to review full report at Codecov.
|
asv_bench/benchmarks/groupby.py
Outdated
|
||
param_names = ['df'] | ||
size = 10**6 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why wouldn't you setup_cache here?
d935a37
to
63122dc
Compare
Good catch @jreback. I was able to use |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a bunch of comments, thanks a lot for this cleaning work!
'object_large': Series( | ||
tm.makeStringIndex(10000).take( | ||
np.random.randint(0, 10000, size=size)))} | ||
return data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This still runs it for each benchmark in the class, instead of previously only once (I think)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think setup_cache
will still only run once:
If the setup is especially expensive, the setup_cache method may be used instead, which only performs the setup calculation once and then caches the result to disk. It is run only once also for repeated benchmarks and profiling, unlike setup
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, the wording is not very clear. I interpreted it to not run between repetitions of the same benchmark, not for multiple benchmarks in the same class
asv_bench/benchmarks/groupby.py
Outdated
|
||
|
||
#---------------------------------------------------------------------- | ||
# single key, long, integer key | ||
class Incidies(object): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also not really clear what this is testing or why it is named like this?
asv_bench/benchmarks/groupby.py
Outdated
|
||
def time_sum(self): | ||
self.df.groupby(self.labels).sum() | ||
def time_datetime_indicies(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo in indicies -> indices
asv_bench/benchmarks/groupby.py
Outdated
def time_sum(self): | ||
self.df.groupby(self.labels).sum() | ||
def time_datetime_indicies(self): | ||
self.ts.groupby([self.year, self.month, self.day]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
before this was doing len(..)
of the object I think? That is an important difference :-) (as the above is lazy, doesn't do any grouping yet)
asv_bench/benchmarks/groupby.py
Outdated
'ints': np.random.randint(0, 1000, size=n), | ||
'obj': obj, | ||
'offsets': offsets}) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can also use setup_cache here?
asv_bench/benchmarks/groupby.py
Outdated
def time_groupby_multi_different_numpy_functions(self): | ||
self.df.groupby(['key1', 'key2']).agg({'value1': np.mean, 'value2': np.var, 'value3': np.sum}) | ||
def time_agg_builtins_multi_col(self, df): | ||
df.groupby(['jim', 'joe']).agg([sum, min, max]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Integrate this one with the previous AggMultiColFuncs
one?
asv_bench/benchmarks/groupby.py
Outdated
# GH 14338 | ||
goal_time = 0.2 | ||
params = [period_range, date_range, partial(date_range, tz='US/Central')] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you use strings here and then make a dict string -> function in setup?
(now you don't get nice param value names in the output)
asv_bench/benchmarks/groupby.py
Outdated
# Series.value_counts | ||
|
||
class series_value_counts(object): | ||
class PivotTable(object): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would move this to reshape.py
def time_cumsum(self): | ||
self.df.groupby('key').cumsum() | ||
|
||
def time_shift(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where are those moved?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are essentially identical to the benchmarks in GroupByMethods
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
63122dc
to
b81baf5
Compare
Thanks for the review @jorisvandenbossche, addressed your comments. |
Thanks! |
Ran flake8 and replaced star imports
Moved
series_value_counts
toseries_methods.py
Used
params
withparam_names
andsetup_cache
where possible.