-
-
Notifications
You must be signed in to change notification settings - Fork 18.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ArrayManager] GroupBy cython aggregations (no fallback) #39885
[ArrayManager] GroupBy cython aggregations (no fallback) #39885
Conversation
pandas/core/groupby/generic.py
Outdated
def _wrap_agged_arrays(self, arrays: List[ArrayLike], columns: Index) -> DataFrame: | ||
if not self.as_index: | ||
index = np.arange(arrays[0].shape[0]) | ||
mgr = ArrayManager(arrays, axes=[index, columns]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if/when BlockManager is gone, then axes=[index, columns]
makes sense, but until then switching to axes=[columns, index]
to match BlockManager constructor will facilitate code sharing
It might be less easy to fully get internals out of groupby, as we still use it in several places. But focusing on this specific usage: in the same line as we have |
One possible experiment to move some internals code out of groupby: #39997 |
Updated this now #39997 is merged. Could now easily reuse the same |
@@ -235,16 +235,19 @@ def shape(self) -> Shape: | |||
def ndim(self) -> int: | |||
return len(self.axes) | |||
|
|||
def set_axis(self, axis: int, new_labels: Index) -> None: | |||
def set_axis( | |||
self, axis: int, new_labels: Index, verify_integrity: bool = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a verify_integrity
keyword here, and put the length verification behind if verify_integrity
. Reason I needed this is because in groupby, we are sometimes setting an Index with a different length as the original one.
much nicer. perf impact? |
You mean performance impact of the latest change? Or in general BlockManager vs ArrayManager? |
i mean perf impact of the PR on the non-AM paths |
Ah. Do you see a change that you think is potentially performance sensitive? For BlockManager paths, it's mostly some renaming of functions/arguments, an additional check for |
i dont see anything obvious, but ive often been surprised by results in this part of the code |
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. @jbrockmendel if any comments.
also pls merge master |
@@ -330,7 +356,7 @@ def apply_with_block(self: T, f, align_keys=None, **kwargs) -> T: | |||
if hasattr(arr, "tz") and arr.tz is None: # type: ignore[union-attr] | |||
# DatetimeArray needs to be converted to ndarray for DatetimeBlock | |||
arr = arr._data # type: ignore[union-attr] | |||
elif arr.dtype.kind == "m": | |||
elif arr.dtype.kind == "m" and not isinstance(arr, np.ndarray): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could combine this with previous check as
if arr.dtype.kind in ["m", "M"] and not isinstance(arr, np.ndarray):
arr = arr._data
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would be nice, but the problem is that we still need to keep DatetimeArray intact for DatetimeTZBlock. So we would still need the if hasattr(arr, "tz") and arr.tz is None
check as well, in which case it doesn't necessarily become more readable to combine both checks.
Edit: the diff would be:
- if hasattr(arr, "tz") and arr.tz is None: # type: ignore[union-attr]
- # DatetimeArray needs to be converted to ndarray for DatetimeBlock
- arr = arr._data # type: ignore[union-attr]
- elif arr.dtype.kind == "m" and not isinstance(arr, np.ndarray):
- # TimedeltaArray needs to be converted to ndarray for TimedeltaBlock
+ if (
+ arr.dtype.kind == "m"
+ and not isinstance(arr, np.ndarray)
+ and getattr(arr, "tz", None) is None
+ ):
+ # DatetimeArray/TimedeltaArray needs to be converted to ndarray
+ # for DatetimeBlock/TimedeltaBlock (except DatetimeArray with tz,
+ # which needs to be preserved for DatetimeTZBlock)
arr = arr._data # type: ignore[union-attr]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of and getattr(arr, "tz", None) is None
how about isinstance(arr.dtype, np.dtype)
. either way works i guess
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That still gives the same length of the if
check as in my diff example above, which I don't find an improvement in readability
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yah the only possible difference is for mypy
def test_cythonized_aggers(op_name, using_array_manager): | ||
if using_array_manager and op_name in {"count", "sem"}: | ||
# TODO(ArrayManager) groupby count/sem | ||
pytest.skip("ArrayManager groupby count/sem not yet implemented") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can the add_marker pattern be used here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We use the add_marker
pattern for xfail (because just raising pytest.xfail wouldn't result in a strict xfail, unlike the skip here), so for skip there is no advantage using it AFAIK.
Now, that said, I should maybe actually start using xfail instead of skip for the "skip_array_manager_not_yet_implemented", so it's easier to notice if certain tests can be unskipped when more features get implemented.
So will already start using the xfail as you suggest here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, in the meantime I fixed count
, so the skip/xfail could be removed altogether.
(but a sign that using xfail instead of skip is actually a good idea ;))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds good
Merging this so I can do follow-up PRs. I think I answered to the remaining open comments, but otherwise I can further address those in the follow-up PRs. |
i know timezones are a hassle, but please try to avoid the temptation to do this |
xref #39146
This implements one aspect of groupby: basic cython-based aggregations (so not yet general apply, python fallback, or other methods, etc, only the basic aggregations that take the
_cython_agg_general
path).Similarly to how we currently have a
_cython_agg_blocks
, this PR adds an equivalent_cython_agg_arrays
which calls the cython operation on each column instead of each block.