Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF: reducing dtype checking overhead in groupby #44738

Conversation

jorisvandenbossche
Copy link
Member

This is not to be merged as is, but rather to illustrate how the generic dtype checking methods we have cause some overhead in certain operations (or at least in the benchmark cases we have), and how this is also relatively easy to solve.

Using the GroupManyLabels.time_sum case (a wide dataframe):

ncols = 1000
N = 1000
data = np.random.randn(N, ncols)
labels = np.random.randint(0, 100, size=N)
df = pd.DataFrame(data)
df_am = df._as_manager('array').copy()

In [3]: %timeit df_am.groupby(labels).sum()
38.2 ms ± 861 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)  # <-- master
26.6 ms ± 486 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)  # <-- PR

So a combination of 1) more specialized dtype checks and 2) caching some repeated dtype+op-dependent helpers quickly gives a decent speedup (and there is some more room for improvement with similar changes).

So at some point I think we should look a bit more into our is_.. checks and have eg specialized versions that eg assume you already have a dtype (I know we discussed this before, not directly finding a relevant issue), and use those throughout the codebase where we know we have a dtype object (might be a good issue for someone starting to dive into the code base).

@jorisvandenbossche jorisvandenbossche added the Performance Memory or execution speed performance label Dec 3, 2021
@jreback
Copy link
Contributor

jreback commented Dec 3, 2021

yeah looks nice, i agree should try to refactor is_* to make them more strict is a good idea

@jbrockmendel
Copy link
Member

The dtype.kind checks i definitely like. How big a difference does the caching make?

try:
return arr_or_dtype.kind in "uifcb"
except AttributeError:
pass
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment to the effect of "fastpath/not a dtype object"?

will this change how is_numeric_dtype treats EA dtypes?

is_datetimelike = needs_i8_conversion(dtype)
# is_datetimelike = needs_i8_conversion(dtype)
is_datetimelike = dtype_kind in ["m", "M"] or (
dtype_kind == "O" and dtype.type is Period
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

at this point we have an ndarray, so PeriodDtype shouldn't be possible i think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorisvandenbossche can you respond to a few small comments here? this should be pretty easy to merge and will be a nice perf bump

@@ -490,21 +495,26 @@ def _call_cython_op(
orig_values = values

dtype = values.dtype
is_numeric = is_numeric_dtype(dtype)
dtype_kind = dtype.kind
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for my own edification, how big a difference does this make? i.e. should i get into the habit of doing this?

@jreback
Copy link
Contributor

jreback commented Jan 16, 2022

prob could work if you can rebase @jorisvandenbossche

@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Feb 17, 2022
@jreback
Copy link
Contributor

jreback commented Feb 17, 2022

this would be nice to do

@simonjayhawkins
Copy link
Member

this would be nice to do

closing as stale. but feel free to reopen when ready.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Performance Memory or execution speed performance Stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants