PERF: reducing dtype checking overhead in groupby #44738
Conversation
yeah looks nice, i agree that trying to refactor `is_*` to make them more strict is a good idea
The `dtype.kind` checks i definitely like. How big a difference does the caching make?
```python
try:
    return arr_or_dtype.kind in "uifcb"
except AttributeError:
    pass
```
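As a standalone sketch of the fastpath above (the `is_numeric_dtype_fast` name is hypothetical, and here the non-dtype fallback simply returns `False` where the real function would continue with the generic check): numpy dtype objects expose `.kind`, and `"uifcb"` covers the unsigned int, signed int, float, complex, and bool kinds.

```python
import numpy as np

def is_numeric_dtype_fast(arr_or_dtype) -> bool:
    # Fastpath: dtype objects have a .kind attribute, so a membership
    # test in "uifcb" is much cheaper than generic type inference.
    try:
        return arr_or_dtype.kind in "uifcb"
    except AttributeError:
        # Not a dtype object (e.g. a string or Python type); the real
        # implementation would fall back to the slower generic path here.
        return False

print(is_numeric_dtype_fast(np.dtype("int64")))           # numeric kind "i"
print(is_numeric_dtype_fast(np.dtype("datetime64[ns]")))  # kind "M", not numeric
```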
comment to the effect of "fastpath/not a dtype object"?
will this change how is_numeric_dtype treats EA dtypes?
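As a quick illustration relevant to the EA question (an observation, not code from this PR): extension dtypes also define `.kind`, and they do not all report the same value, so a kind-based check does distinguish between them.

```python
import pandas as pd

# Masked integer EA dtypes report a numeric kind, mirroring numpy:
print(pd.Int64Dtype().kind)        # "i"
# Categorical falls back to the generic object kind:
print(pd.CategoricalDtype().kind)  # "O"
```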
```diff
-is_datetimelike = needs_i8_conversion(dtype)
+# is_datetimelike = needs_i8_conversion(dtype)
+is_datetimelike = dtype_kind in ["m", "M"] or (
+    dtype_kind == "O" and dtype.type is Period
+)
```
at this point we have an ndarray, so PeriodDtype shouldn't be possible i think?
@jorisvandenbossche can you respond to a few small comments here? this should be pretty easy to merge and will be a nice perf bump
```diff
@@ -490,21 +495,26 @@ def _call_cython_op(
     orig_values = values

     dtype = values.dtype
     is_numeric = is_numeric_dtype(dtype)
+    dtype_kind = dtype.kind
```
for my own edification, how big a difference does this make? i.e. should i get into the habit of doing this?
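A rough way to answer that yourself is a micro-benchmark (a sketch only; these are not measured numbers from this PR): compare re-reading `dtype.kind` on every iteration against reading it once and reusing the cached value.

```python
import timeit
import numpy as np

dtype = np.dtype("float64")

# Attribute lookup repeated in the hot path:
t_attr = timeit.timeit(lambda: dtype.kind in "uifcb", number=100_000)

# Attribute read once up front, then only a string membership test:
kind = dtype.kind
t_cached = timeit.timeit(lambda: kind in "uifcb", number=100_000)

print(f"repeated lookup: {t_attr:.4f}s, cached kind: {t_cached:.4f}s")
```

The absolute difference per call is tiny, which is why it only shows up in per-column loops over wide frames like the benchmark in this PR.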
prob could work if you can rebase @jorisvandenbossche
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.
this would be nice to do
closing as stale. but feel free to reopen when ready.
This is not to be merged as is, but rather to illustrate how the generic dtype checking methods we have cause some overhead in certain operations (or at least in the benchmark cases we have), and how this is also relatively easy to solve.

Using the `GroupManyLabels.time_sum` case (a wide dataframe):

So a combination of 1) more specialized dtype checks and 2) caching some repeated dtype+op-dependent helpers quickly gives a decent speedup (and there is some more room for improvement with similar changes).

So at some point I think we should look a bit more into our `is_..` checks and have e.g. specialized versions that assume you already have a dtype (I know we discussed this before, not directly finding a relevant issue), and use those throughout the codebase where we know we have a dtype object (might be a good issue for someone starting to dive into the code base).