PERF: reducing dtype checking overhead in groupby #44738

jorisvandenbossche · 2021-12-03T17:17:34Z

This is not to be merged as is, but rather to illustrate how the generic dtype checking methods we have cause some overhead in certain operations (or at least in the benchmark cases we have), and how this is also relatively easy to solve.

Using the GroupManyLabels.time_sum case (a wide dataframe):

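import numpy as np
import pandas as pd

ncols = 1000
N = 1000
data = np.random.randn(N, ncols)
labels = np.random.randint(0, 100, size=N)
df = pd.DataFrame(data)
# _as_manager is a private method; it converts the frame to the ArrayManager layout
df_am = df._as_manager('array').copy()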

In [3]: %timeit df_am.groupby(labels).sum()
38.2 ms ± 861 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)  # <-- master
26.6 ms ± 486 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)  # <-- PR

So a combination of 1) more specialized dtype checks and 2) caching some repeated dtype+op-dependent helpers quickly gives a decent speedup (about 30% in this case, from 38.2 ms to 26.6 ms), and there is some more room for improvement with similar changes.
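The thread does not show the caching itself, so the following is only a minimal sketch of the idea, assuming a dtype+op-dependent helper whose answer depends on nothing but the dtype kind and the operation name. The names `op_is_allowed`, `count_allowed` and the rule table are hypothetical and do not come from the PR or from pandas.

```python
from functools import lru_cache

import numpy as np

# Illustrative rule table only; these are NOT pandas' actual rules.
_DISALLOWED_BY_KIND = {
    "M": frozenset({"sum", "prod", "cumsum", "cumprod"}),
    "m": frozenset({"prod", "cumprod"}),
}


@lru_cache(maxsize=None)
def op_is_allowed(kind: str, how: str) -> bool:
    """Hypothetical helper: is reduction `how` allowed for this dtype kind?"""
    return how not in _DISALLOWED_BY_KIND.get(kind, frozenset())


def count_allowed(dtypes, how: str) -> int:
    """Check every column dtype; repeated (kind, how) pairs hit the cache."""
    return sum(op_is_allowed(dt.kind, how) for dt in dtypes)


# A wide frame like the benchmark case: 1000 float64 columns.
dtypes = [np.dtype("float64")] * 1000
n_ok = count_allowed(dtypes, "sum")
info = op_is_allowed.cache_info()
print(n_ok, info.hits, info.misses)
```

With a wide, homogeneous frame the helper body runs once and every later column is a cache lookup, which is the situation the benchmark above exercises.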

So at some point I think we should look more closely at our is_* checks and add specialized versions that, for example, assume you already have a dtype (I know we discussed this before, but I can't find the relevant issue right now), and use those throughout the codebase wherever we know we have a dtype object. This might be a good issue for someone starting to dive into the code base.
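As a rough sketch of what such a specialized check could look like, assuming the caller guarantees a NumPy dtype object: the helper name `is_numeric_np_dtype` is hypothetical, and only `pandas.api.types.is_numeric_dtype` is an existing pandas function.

```python
import numpy as np
from pandas.api.types import is_numeric_dtype


def is_numeric_np_dtype(dtype: np.dtype) -> bool:
    """Hypothetical strict check: `dtype` must already be a numpy dtype.

    Kinds: u=unsigned int, i=signed int, f=float, c=complex, b=bool.
    """
    return dtype.kind in "uifcb"


# For plain numpy dtypes the strict check agrees with the generic one.
samples = [
    np.dtype("int64"),
    np.dtype("uint8"),
    np.dtype("float32"),
    np.dtype("complex128"),
    np.dtype("bool"),
    np.dtype("datetime64[ns]"),
    np.dtype("timedelta64[ns]"),
    np.dtype("object"),
    np.dtype("U5"),
]
for dt in samples:
    assert is_numeric_np_dtype(dt) == is_numeric_dtype(dt), dt
```

The strict version skips the input normalization that the generic function has to do for arrays, strings and types, which is where the saving would come from.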

jreback · 2021-12-03T18:15:19Z

yeah looks nice. i agree that refactoring is_* to make them more strict is a good idea

jbrockmendel · 2021-12-03T18:39:26Z

The dtype.kind checks i definitely like. How big a difference does the caching make?

jbrockmendel · 2021-12-04T00:32:04Z

pandas/core/dtypes/common.py

+    try:
+        return arr_or_dtype.kind in "uifcb"
+    except AttributeError:
+        pass


comment to the effect of "fastpath/not a dtype object"?

will this change how is_numeric_dtype treats EA dtypes?
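To make the question concrete, here is a minimal sketch of the fastpath pattern from the diff, wrapped in a hypothetical function `is_numeric_fastpath` that falls back to the generic check. Note that pandas extension dtypes also expose `.kind`, so they are answered by the fastpath rather than the fallback. For the built-in dtypes below the two paths agree. An extension dtype that is numeric but keeps the default `kind` of "O" could be treated differently, which is stated here as an assumption and not verified against the PR.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype


def is_numeric_fastpath(arr_or_dtype) -> bool:
    """Hypothetical wrapper around the try/except fastpath in the diff."""
    try:
        # fastpath: any object exposing `.kind` (numpy or extension dtype)
        return arr_or_dtype.kind in "uifcb"
    except AttributeError:
        # not a dtype object (array, string alias, type): use the generic path
        pass
    return is_numeric_dtype(arr_or_dtype)


checks = [
    np.dtype("float64"),      # numpy dtype, fastpath
    pd.Int64Dtype(),          # extension dtype with kind "i", fastpath
    pd.CategoricalDtype(),    # extension dtype with kind "O", fastpath
    "int64",                  # string alias, fallback
    np.array([1.0, 2.0]),     # ndarray has no `.kind`, fallback
]
for obj in checks:
    assert is_numeric_fastpath(obj) == is_numeric_dtype(obj), obj
```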

jbrockmendel · 2021-12-04T00:34:31Z

pandas/core/groupby/ops.py

-        is_datetimelike = needs_i8_conversion(dtype)
+        # is_datetimelike = needs_i8_conversion(dtype)
+        is_datetimelike = dtype_kind in ["m", "M"] or (
+            dtype_kind == "O" and dtype.type is Period


at this point we have an ndarray, so PeriodDtype shouldn't be possible i think?
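A small sketch of the point being made, assuming the values are always a NumPy ndarray at this stage: the datetimelike check then reduces to the dtype kind, because an object-dtype ndarray reports `np.object_` as its `type` regardless of what it holds. The function name `is_datetimelike_ndarray` is hypothetical.

```python
import numpy as np


def is_datetimelike_ndarray(values: np.ndarray) -> bool:
    """Hypothetical check: "M" is datetime64, "m" is timedelta64."""
    return values.dtype.kind in "mM"


dt_values = np.array(["2021-12-03"], dtype="datetime64[ns]")
td_values = np.array([1, 2], dtype="timedelta64[ns]")
float_values = np.array([1.0, 2.0])
object_values = np.array(["a", None], dtype=object)

print(is_datetimelike_ndarray(dt_values))      # datetime64 -> kind "M"
print(is_datetimelike_ndarray(td_values))      # timedelta64 -> kind "m"
print(is_datetimelike_ndarray(float_values))   # float64 -> kind "f"

# An object ndarray always has np.object_ as its dtype.type,
# so a `dtype.type is Period` branch cannot match a numpy dtype.
print(object_values.dtype.kind, object_values.dtype.type is np.object_)
```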

jbrockmendel · 2021-12-13

@jorisvandenbossche can you respond to a few small comments here? this should be pretty easy to merge and will be a nice perf bump

jbrockmendel · 2021-12-04T00:35:23Z

pandas/core/groupby/ops.py

@@ -490,21 +495,26 @@ def _call_cython_op(
        orig_values = values

        dtype = values.dtype
-        is_numeric = is_numeric_dtype(dtype)
+        dtype_kind = dtype.kind


for my own edification, how big a difference does this make? i.e. should i get into the habit of doing this?
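One way to answer this locally is a micro-benchmark along these lines. It is only a sketch: the numbers depend on the machine and the pandas version, so none are quoted here, and both statements are wrapped in a lambda so they carry the same call overhead.

```python
import timeit

import numpy as np
from pandas.api.types import is_numeric_dtype

dtype = np.dtype("float64")
n = 100_000

# Generic check: has to handle arrays, strings, types and dtypes.
t_generic = timeit.timeit(lambda: is_numeric_dtype(dtype), number=n)

# Specialized check: valid only when we already hold a dtype object.
t_kind = timeit.timeit(lambda: dtype.kind in "uifcb", number=n)

print(f"is_numeric_dtype(dtype): {t_generic / n * 1e9:.0f} ns per call")
print(f"dtype.kind in 'uifcb':   {t_kind / n * 1e9:.0f} ns per call")
```

Per call the difference is small in absolute terms. It matters in the benchmark above because the check runs once per column of a 1000-column frame.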

jreback · 2022-01-16T17:39:53Z

prob could work if you can rebase @jorisvandenbossche

github-actions · 2022-02-17T00:03:49Z

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

jreback · 2022-02-17T16:01:33Z

this would be nice to do

simonjayhawkins · 2023-02-22T13:02:17Z

this would be nice to do

closing as stale. but feel free to reopen when ready.

PERF: reducing dtype checking overhead in groupby

846c16a

jorisvandenbossche added the Performance label (memory or execution speed performance) Dec 3, 2021

jbrockmendel reviewed Dec 4, 2021


jorisvandenbossche mentioned this pull request Dec 7, 2021

Refactor - ArrayManager overview issue #39146


github-actions bot added the Stale label Feb 17, 2022

simonjayhawkins closed this Feb 22, 2023
