Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ArrayManager] GroupBy cython aggregations (no fallback) #39885

Merged

Conversation

jorisvandenbossche
Copy link
Member

xref #39146

This implements one aspect of groupby: basic cython-based aggregations (so not yet general apply, python fallback, or other methods, etc, only the basic aggregations that take the _cython_agg_general path).

Similarly to how we currently have a _cython_agg_blocks, this PR adds an equivalent _cython_agg_arrays which calls the cython operation on each column instead of each block.

@jorisvandenbossche jorisvandenbossche added Groupby Internals Related to non-user accessible pandas implementation labels Feb 18, 2021
@jorisvandenbossche jorisvandenbossche added this to the 1.3 milestone Feb 18, 2021
def _wrap_agged_arrays(self, arrays: List[ArrayLike], columns: Index) -> DataFrame:
if not self.as_index:
index = np.arange(arrays[0].shape[0])
mgr = ArrayManager(arrays, axes=[index, columns])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if/when BlockManager is gone, then axes=[index, columns] makes sense, but until then switching to axes=[columns, index] to match BlockManager constructor will facilitate code sharing

pandas/core/groupby/generic.py Outdated Show resolved Hide resolved
pandas/core/groupby/generic.py Outdated Show resolved Hide resolved
@jorisvandenbossche
Copy link
Member Author

rather than adding new things & constructing the AM / BM here. can you do a pre-cursor PR that pushes this to an internals routine, maybe internals/blockwise. We do not want to keep adding code in here (and in fact want to remove the block manager / internals code).

It might be less easy to fully get internals out of groupby, as we still use it in several places. But focusing on this specific usage: in the same line as we have BlockManager.reduce for DataFrame reductions, I could look into adding a BlockManager.grouped_reduce (or similar name) for the grouped reductions.

@jorisvandenbossche
Copy link
Member Author

One possible experiment to move some internals code out of groupby: #39997

@jorisvandenbossche
Copy link
Member Author

Updated this now #39997 is merged. Could now easily reuse the same blk_func and cast_agg_result, so resulting diff looks better now.

@@ -235,16 +235,19 @@ def shape(self) -> Shape:
def ndim(self) -> int:
return len(self.axes)

def set_axis(self, axis: int, new_labels: Index) -> None:
def set_axis(
self, axis: int, new_labels: Index, verify_integrity: bool = True
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a verify_integrity keyword here, and put the length verification behind if verify_integrity. Reason I needed this is because in groupby, we are sometimes setting an Index with a different length as the original one.

@jbrockmendel
Copy link
Member

so resulting diff looks better now.

much nicer. perf impact?

@jorisvandenbossche
Copy link
Member Author

You mean performance impact of the latest change? Or in general BlockManager vs ArrayManager?

@jbrockmendel
Copy link
Member

You mean performance impact of the latest change? Or in general BlockManager vs ArrayManager?

i mean perf impact of the PR on the non-AM paths

@jorisvandenbossche
Copy link
Member Author

Ah. Do you see a change that you think is potentially performance sensitive? For BlockManager paths, it's mostly some renaming of functions/arguments, an additional check for isinstance(obj, ArrayManager) and a replacement of mgr.axes[1] = .. with mgr.set_axis(1, ..)

@jbrockmendel
Copy link
Member

Do you see a change that you think is potentially performance sensitive?

i dont see anything obvious, but ive often been surprised by results in this part of the code

@jorisvandenbossche
Copy link
Member Author

$ asv continuous -f 1.1 upstream/master HEAD
...
BENCHMARKS NOT SIGNIFICANTLY CHANGED.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. @jbrockmendel if any comments.

@jreback
Copy link
Contributor

jreback commented Feb 25, 2021

also pls merge master

@@ -330,7 +356,7 @@ def apply_with_block(self: T, f, align_keys=None, **kwargs) -> T:
if hasattr(arr, "tz") and arr.tz is None: # type: ignore[union-attr]
# DatetimeArray needs to be converted to ndarray for DatetimeBlock
arr = arr._data # type: ignore[union-attr]
elif arr.dtype.kind == "m":
elif arr.dtype.kind == "m" and not isinstance(arr, np.ndarray):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could combine this with previous check as

if arr.dtype.kind in ["m", "M"] and not isinstance(arr, np.ndarray):
    arr = arr._data

Copy link
Member Author

@jorisvandenbossche jorisvandenbossche Feb 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be nice, but the problem is that we still need to keep DatetimeArray intact for DatetimeTZBlock. So we would still need the if hasattr(arr, "tz") and arr.tz is None check as well, in which case it doesn't necessarily become more readable to combine both checks.

Edit: the diff would be:

-            if hasattr(arr, "tz") and arr.tz is None:  # type: ignore[union-attr]
-                # DatetimeArray needs to be converted to ndarray for DatetimeBlock
-                arr = arr._data  # type: ignore[union-attr]
-            elif arr.dtype.kind == "m" and not isinstance(arr, np.ndarray):
-                # TimedeltaArray needs to be converted to ndarray for TimedeltaBlock
+            if (
+                arr.dtype.kind == "m"
+                and not isinstance(arr, np.ndarray)
+                and getattr(arr, "tz", None) is None
+            ):
+                # DatetimeArray/TimedeltaArray needs to be converted to ndarray
+                # for DatetimeBlock/TimedeltaBlock (except DatetimeArray with tz,
+                # which needs to be preserved for DatetimeTZBlock)
                 arr = arr._data  # type: ignore[union-attr]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of and getattr(arr, "tz", None) is None how about isinstance(arr.dtype, np.dtype). either way works i guess

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That still gives the same length of the if check as in my diff example above, which I don't find an improvement in readability

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yah the only possible difference is for mypy

def test_cythonized_aggers(op_name, using_array_manager):
if using_array_manager and op_name in {"count", "sem"}:
# TODO(ArrayManager) groupby count/sem
pytest.skip("ArrayManager groupby count/sem not yet implemented")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can the add_marker pattern be used here?

Copy link
Member Author

@jorisvandenbossche jorisvandenbossche Feb 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use the add_marker pattern for xfail (because just raising pytest.xfail wouldn't result in a strict xfail, unlike the skip here), so for skip there is no advantage using it AFAIK.

Now, that said, I should maybe actually start using xfail instead of skip for the "skip_array_manager_not_yet_implemented", so it's easier to notice if certain tests can be unskipped when more features get implemented.

So will already start using the xfail as you suggest here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, in the meantime I fixed count, so the skip/xfail could be removed altogether.

(but a sign that using xfail instead of skip is actually a good idea ;))

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good

@jorisvandenbossche
Copy link
Member Author

Merging this so I can do follow-up PRs. I think I answered to the remaining open comments, but otherwise I can further address those in the follow-up PRs.

@jbrockmendel
Copy link
Member

Merging this so I can do follow-up PRs

i know timezones are a hassle, but please try to avoid the temptation to do this

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Groupby Internals Related to non-user accessible pandas implementation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants