-
-
Notifications
You must be signed in to change notification settings - Fork 18.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ArrayManager] Test DataFrame reductions + implement ignore_failures #39719
[ArrayManager] Test DataFrame reductions + implement ignore_failures #39719
Conversation
# TODO NaT doesn't preserve dtype, so we need to ensure to create | ||
# a timedelta result array if original was timedelta | ||
# what if datetime results in timedelta? (eg std) | ||
if res is NaT and is_timedelta64_ns_dtype(arr.dtype): | ||
result_arrays.append(np.array(["NaT"], dtype="timedelta64[ns]")) | ||
else: | ||
result_arrays.append(sanitize_array([res], None)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an ugly special case ... But as far as I can see it is the consequence of storing Datetime/TimedeltaArray as the 1D array for those dtypes in ArrayManager instead of the numpy ndarray version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
res = arr.reshape(-1, 1)._reduce(op, axis=0, ...)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, yes. That would need the Manager to get the actual op name, and not the function object.
Now, just passing the op name in addition is not difficult of course.
But I could also do a precursor to actually move creating the function from the op name (and defining blk_func
) into the Manager as well. It's moving more things into the internals, but I think that's actually a cleaner separation of concern (you could say that the DataFrame shouldn't need to decide which function object the Manager needs to use, it just needs to specify the op name. This actually already happens for EAs, those just ignore the function object)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
res is NaT and is_timedelta64_ns_dtype(arr.dtype)
I think this will be OK, but what about if arr.dtype
is dt64 and the reduction is std
, then we still want td64nat, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I know, that case is not covered (see my comment "what if datetime results in timedelta? (eg std)"). I can make the todo more explicit.
(note that this is a bug that already exists on master for ArrayManager as well, it's not caused by the changes in this PR; it's just not yet addressed by this PR)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this comes down to a bit of philosophy. do we put the messy code for both AM & BM in the same place and de-duplicate it, or just override it for now. Generally I like the first strategy. But reduce is a method that almost by definition is handled completely differently so ok with 2nd in this case.
lgtm. if you can merge master to make sure up to date and merge on greenish |
thanks @jorisvandenbossche |
xref #39146
This adds support for ignoring failures in
ArrayManager.reduce
.