[ArrayManager] GroupBy cython aggregations (no fallback) #39885

jorisvandenbossche · 2021-02-18T14:35:58Z

This implements one aspect of groupby: basic cython-based aggregations. It does not yet cover general apply, the python fallback, or other methods; only the basic aggregations that take the _cython_agg_general path.

Similarly to how we currently have a _cython_agg_blocks, this PR adds an equivalent _cython_agg_arrays which calls the cython operation on each column instead of each block.
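To make the block-wise versus column-wise distinction concrete, here is a minimal NumPy-only sketch. It is not the pandas implementation: `group_sum_1d`, `agg_blockwise` and `agg_columnwise` are hypothetical names, and `np.add.at` stands in for the real cython kernel.

```python
import numpy as np


def group_sum_1d(values, codes, ngroups):
    """Per-group sum of a 1D array (stand-in for a cython group kernel)."""
    out = np.zeros(ngroups, dtype=values.dtype)
    np.add.at(out, codes, values)  # unbuffered add, handles repeated codes
    return out


def agg_blockwise(block, codes, ngroups):
    """Aggregate one 2D block of shape (n_columns, n_rows) in a single call."""
    out = np.zeros((ngroups, block.shape[0]), dtype=block.dtype)
    np.add.at(out, codes, block.T)  # rows of block.T are the original rows
    return out.T  # back to (n_columns, n_groups)


def agg_columnwise(arrays, codes, ngroups):
    """Aggregate a list of 1D column arrays, one kernel call per column."""
    return [group_sum_1d(arr, codes, ngroups) for arr in arrays]


codes = np.array([0, 1, 0, 1])  # group label of each row
columns = [np.array([1.0, 2.0, 3.0, 4.0]), np.array([10.0, 20.0, 30.0, 40.0])]

per_column = agg_columnwise(columns, codes, ngroups=2)
per_block = agg_blockwise(np.stack(columns), codes, ngroups=2)
```

Both paths produce the same per-group values; the difference is only whether the kernel is invoked once per 2D block or once per 1D column.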

pandas/core/groupby/generic.py

jbrockmendel · 2021-02-18T15:39:42Z

pandas/core/groupby/generic.py

+    def _wrap_agged_arrays(self, arrays: List[ArrayLike], columns: Index) -> DataFrame:
+        if not self.as_index:
+            index = np.arange(arrays[0].shape[0])
+            mgr = ArrayManager(arrays, axes=[index, columns])


if/when BlockManager is gone, then axes=[index, columns] makes sense, but until then switching to axes=[columns, index] to match BlockManager constructor will facilitate code sharing

pandas/core/groupby/generic.py

jorisvandenbossche · 2021-02-23T14:50:41Z

rather than adding new things & constructing the AM / BM here, can you do a precursor PR that pushes this to an internals routine, maybe internals/blockwise? We do not want to keep adding code in here (and in fact want to remove the block manager / internals code).

It might be less easy to fully get internals out of groupby, as we still use it in several places. But focusing on this specific usage: in the same line as we have BlockManager.reduce for DataFrame reductions, I could look into adding a BlockManager.grouped_reduce (or similar name) for the grouped reductions.

jorisvandenbossche · 2021-02-23T15:50:47Z

One possible experiment to move some internals code out of groupby: #39997

jorisvandenbossche · 2021-02-24T19:17:04Z

Updated this now that #39997 is merged. Could now easily reuse the same blk_func and cast_agg_result, so resulting diff looks better now.

jorisvandenbossche · 2021-02-24T19:21:33Z

pandas/core/internals/managers.py

@@ -235,16 +235,19 @@ def shape(self) -> Shape:
    def ndim(self) -> int:
        return len(self.axes)

-    def set_axis(self, axis: int, new_labels: Index) -> None:
+    def set_axis(
+        self, axis: int, new_labels: Index, verify_integrity: bool = True


I added a verify_integrity keyword here, and put the length verification behind if verify_integrity. The reason I needed this is that in groupby, we are sometimes setting an Index with a different length than the original one.
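As a rough illustration of the keyword described above, the following toy class puts the length check behind `verify_integrity`. `ToyManager` is a hypothetical stand-in, not the real BlockManager or ArrayManager, and the error message wording is an assumption.

```python
class ToyManager:
    """Hypothetical stand-in for a manager that only tracks its axes."""

    def __init__(self, axes):
        self.axes = list(axes)

    def set_axis(self, axis, new_labels, verify_integrity=True):
        if verify_integrity:
            old_len = len(self.axes[axis])
            new_len = len(new_labels)
            if new_len != old_len:
                raise ValueError(
                    f"Length mismatch: expected {old_len} elements, "
                    f"new values have {new_len} elements"
                )
        self.axes[axis] = new_labels


mgr = ToyManager(axes=[["a", "b"], [0, 1, 2, 3]])

# A grouped reduction yields one row per group, so the new index is shorter
# than the original; the caller opts out of the length check explicitly.
mgr.set_axis(1, ["g1", "g2"], verify_integrity=False)
```

Keeping the check on by default means existing callers are unaffected, and only the groupby path has to pass `verify_integrity=False`.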

jbrockmendel · 2021-02-24T20:27:56Z

> so resulting diff looks better now.

much nicer. perf impact?

jorisvandenbossche · 2021-02-24T20:30:00Z

You mean performance impact of the latest change? Or in general BlockManager vs ArrayManager?

jbrockmendel · 2021-02-24T20:34:43Z

> You mean performance impact of the latest change? Or in general BlockManager vs ArrayManager?

i mean perf impact of the PR on the non-AM paths

jorisvandenbossche · 2021-02-24T20:37:55Z

Ah. Do you see a change that you think is potentially performance sensitive? For BlockManager paths, it's mostly some renaming of functions/arguments, an additional check for isinstance(obj, ArrayManager) and a replacement of mgr.axes[1] = .. with mgr.set_axis(1, ..)

jbrockmendel · 2021-02-24T21:01:51Z

> Do you see a change that you think is potentially performance sensitive?

i dont see anything obvious, but ive often been surprised by results in this part of the code

jorisvandenbossche · 2021-02-24T21:52:44Z

$ asv continuous -f 1.1 upstream/master HEAD
...
BENCHMARKS NOT SIGNIFICANTLY CHANGED.

jreback

lgtm. @jbrockmendel if any comments.

jreback · 2021-02-25T00:48:37Z

also pls merge master

jbrockmendel · 2021-02-25T02:14:47Z

pandas/core/internals/array_manager.py

@@ -330,7 +356,7 @@ def apply_with_block(self: T, f, align_keys=None, **kwargs) -> T:
            if hasattr(arr, "tz") and arr.tz is None:  # type: ignore[union-attr]
                # DatetimeArray needs to be converted to ndarray for DatetimeBlock
                arr = arr._data  # type: ignore[union-attr]
-            elif arr.dtype.kind == "m":
+            elif arr.dtype.kind == "m" and not isinstance(arr, np.ndarray):


could combine this with previous check as

if arr.dtype.kind in ["m", "M"] and not isinstance(arr, np.ndarray):
    arr = arr._data

That would be nice, but the problem is that we still need to keep DatetimeArray intact for DatetimeTZBlock. So we would still need the if hasattr(arr, "tz") and arr.tz is None check as well, in which case it doesn't necessarily become more readable to combine both checks.

Edit: the diff would be:

-            if hasattr(arr, "tz") and arr.tz is None:  # type: ignore[union-attr]
-                # DatetimeArray needs to be converted to ndarray for DatetimeBlock
-                arr = arr._data  # type: ignore[union-attr]
-            elif arr.dtype.kind == "m" and not isinstance(arr, np.ndarray):
-                # TimedeltaArray needs to be converted to ndarray for TimedeltaBlock
+            if (
+                arr.dtype.kind == "m"
+                and not isinstance(arr, np.ndarray)
+                and getattr(arr, "tz", None) is None
+            ):
+                # DatetimeArray/TimedeltaArray needs to be converted to ndarray
+                # for DatetimeBlock/TimedeltaBlock (except DatetimeArray with tz,
+                # which needs to be preserved for DatetimeTZBlock)
                 arr = arr._data  # type: ignore[union-attr]

instead of and getattr(arr, "tz", None) is None how about isinstance(arr.dtype, np.dtype). either way works i guess

That still gives the same length of the if check as in my diff example above, which I don't find an improvement in readability

yah the only possible difference is for mypy
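For readers following this thread, here is a small sketch of the `isinstance(arr.dtype, np.dtype)` variant of the check. `maybe_unwrap_datetimelike` is a hypothetical helper, not code from the PR, and it uses the public `np.asarray` rather than the private `arr._data` attribute the PR relies on.

```python
import numpy as np
import pandas as pd


def maybe_unwrap_datetimelike(arr):
    """Return a plain ndarray for tz-naive datetime/timedelta arrays.

    Hypothetical helper for illustration only; not part of pandas.
    """
    if isinstance(arr, np.ndarray):
        return arr  # already an ndarray, nothing to unwrap
    # tz-naive DatetimeArray and TimedeltaArray carry a NumPy dtype, while a
    # tz-aware DatetimeArray carries a DatetimeTZDtype, so it is left intact.
    if isinstance(arr.dtype, np.dtype) and arr.dtype.kind in "mM":
        return np.asarray(arr)
    return arr


naive = pd.date_range("2021-01-01", periods=3).array
aware = pd.date_range("2021-01-01", periods=3, tz="UTC").array
deltas = pd.to_timedelta([1, 2, 3], unit="s").array

unwrapped_naive = maybe_unwrap_datetimelike(naive)
unwrapped_aware = maybe_unwrap_datetimelike(aware)
unwrapped_deltas = maybe_unwrap_datetimelike(deltas)
```

The dtype check covers the tz case implicitly, which is why it can replace the explicit `getattr(arr, "tz", None) is None` condition.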

jbrockmendel · 2021-02-25T02:15:23Z

pandas/tests/groupby/aggregate/test_cython.py

+def test_cythonized_aggers(op_name, using_array_manager):
+    if using_array_manager and op_name in {"count", "sem"}:
+        # TODO(ArrayManager) groupby count/sem
+        pytest.skip("ArrayManager groupby count/sem not yet implemented")


can the add_marker pattern be used here?

We use the add_marker pattern for xfail (because just raising pytest.xfail wouldn't result in a strict xfail, unlike the skip here), so for skip there is no advantage using it AFAIK.

Now, that said, I should maybe actually start using xfail instead of skip for the "skip_array_manager_not_yet_implemented", so it's easier to notice if certain tests can be unskipped when more features get implemented.

So will already start using the xfail as you suggest here.

Actually, in the meantime I fixed count, so the skip/xfail could be removed altogether.

(but a sign that using xfail instead of skip is actually a good idea ;))

sounds good
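A hedged sketch of the skip versus add_marker distinction discussed in this thread: the marker is attached through `request.node.add_marker`, so `strict=True` applies and the test reports a failure once the feature starts working. The fixture body, the `NOT_YET_IMPLEMENTED` set and the test body are placeholders, not the actual pandas test.

```python
import pytest

# Hypothetical set of ops still missing for ArrayManager.
NOT_YET_IMPLEMENTED = {"sem"}


def array_manager_xfail(op_name, using_array_manager):
    """Return a strict xfail mark for unimplemented ops, else None."""
    if using_array_manager and op_name in NOT_YET_IMPLEMENTED:
        return pytest.mark.xfail(
            reason=f"ArrayManager groupby {op_name} not yet implemented",
            strict=True,
        )
    return None


@pytest.fixture
def using_array_manager():
    # Placeholder: the real fixture reads a pandas option.
    return True


@pytest.mark.parametrize("op_name", ["sum", "sem"])
def test_cythonized_aggers_sketch(op_name, using_array_manager, request):
    mark = array_manager_xfail(op_name, using_array_manager)
    if mark is not None:
        # Unlike calling pytest.skip() or pytest.xfail() imperatively, the
        # test body still runs, so an unexpected pass is reported.
        request.node.add_marker(mark)
    if op_name in NOT_YET_IMPLEMENTED:
        raise NotImplementedError(op_name)  # stand-in for the missing feature
```

With a plain skip, the test would keep being skipped silently even after `sem` is implemented; with the strict xfail it would start failing as XPASS(strict), which is the signal that the marker can be removed.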

jorisvandenbossche · 2021-02-25T13:29:00Z

Merging this so I can do follow-up PRs. I think I answered the remaining open comments, but otherwise I can address those further in the follow-up PRs.

jbrockmendel · 2021-02-25T17:06:47Z

> Merging this so I can do follow-up PRs

i know timezones are a hassle, but please try to avoid the temptation to do this

jorisvandenbossche added 3 commits February 18, 2021 15:25

[ArrayManager] GroupBy cython aggregations (no fallback)

df70d2d

Merge remote-tracking branch 'upstream/master' into am-groupby-basic-agg

9cbbf97

style

692175e

jorisvandenbossche added the Groupby and Internals labels Feb 18, 2021

jorisvandenbossche added this to the 1.3 milestone Feb 18, 2021

jorisvandenbossche requested a review from jbrockmendel February 18, 2021 14:35

jbrockmendel reviewed Feb 18, 2021

jreback requested changes Feb 19, 2021

jorisvandenbossche mentioned this pull request Feb 23, 2021

REF: move Block construction in groupby aggregation to internals #39997

Merged

jorisvandenbossche added 5 commits February 24, 2021 15:26

Merge remote-tracking branch 'upstream/master' into am-groupby-basic-agg

e8e108b

Merge remote-tracking branch 'upstream/master' into am-groupby-basic-agg

a5fb361

common _cython_agg_manager

a7bf71e

clean-up test

8c1b8a2

clean-up setting of index axis

06b6f3f

fix BM.arrays for use in tests

244152b

typing

32bf7d1

jreback approved these changes Feb 25, 2021

jbrockmendel reviewed Feb 25, 2021

jorisvandenbossche added 3 commits February 25, 2021 08:24

use add_marker

b44804e

remove xfail marker - count is actually implemented now

50fb97f

Merge remote-tracking branch 'upstream/master' into am-groupby-basic-agg

1d63f72

jorisvandenbossche merged commit 30021ac into pandas-dev:master Feb 25, 2021

jorisvandenbossche deleted the am-groupby-basic-agg branch February 25, 2021 13:36

This was referenced Feb 25, 2021

[ArrayManager] Groupby cython aggregation - python pyfallback #40047

Merged

[ArrayManager] Remaining GroupBy tests (fix count, pass on libreduction for now) #40050

Merged

jorisvandenbossche mentioned this pull request Mar 2, 2021

Refactor - ArrayManager overview issue #39146

Closed

11 tasks
