Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ArrayManager] Test DataFrame reductions + implement ignore_failures #39719

Merged
merged 9 commits into from
Feb 25, 2021

Conversation

jorisvandenbossche
Copy link
Member

xref #39146

This adds support for ignoring failures in ArrayManager.reduce.

@jorisvandenbossche jorisvandenbossche added Internals Related to non-user accessible pandas implementation Reduction Operations sum, mean, min, max, etc. labels Feb 10, 2021
@jorisvandenbossche jorisvandenbossche added this to the 1.3 milestone Feb 10, 2021
Comment on lines +205 to +211
# TODO NaT doesn't preserve dtype, so we need to ensure to create
# a timedelta result array if original was timedelta
# what if datetime results in timedelta? (eg std)
if res is NaT and is_timedelta64_ns_dtype(arr.dtype):
result_arrays.append(np.array(["NaT"], dtype="timedelta64[ns]"))
else:
result_arrays.append(sanitize_array([res], None))
Copy link
Member Author

@jorisvandenbossche jorisvandenbossche Feb 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an ugly special case ... But as far as I can see it is the consequence of storing Datetime/TimedeltaArray as the 1D array for those dtypes in ArrayManager instead of the numpy ndarray version.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

res = arr.reshape(-1, 1)._reduce(op, axis=0, ...)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, yes. That would need the Manager to get the actual op name, and not the function object.

Now, just passing the op name in addition is not difficult of course.
But I could also do a precursor to actually move creating the function from the op name (and defining blk_func) into the Manager as well. It's moving more things into the internals, but I think that's actually a cleaner separation of concern (you could say that the DataFrame shouldn't need to decide which function object the Manager needs to use, it just needs to specify the op name. This actually already happens for EAs, those just ignore the function object)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

res is NaT and is_timedelta64_ns_dtype(arr.dtype)

I think this will be OK, but what about if arr.dtype is dt64 and the reduction is std, then we still want td64nat, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I know, that case is not covered (see my comment "what if datetime results in timedelta? (eg std)"). I can make the todo more explicit.

(note that this is a bug that already exists on master for ArrayManager as well, it's not caused by the changes in this PR; it's just not yet addressed by this PR)

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this comes down to a bit of philosophy. do we put the messy code for both AM & BM in the same place and de-duplicate it, or just override it for now. Generally I like the first strategy. But reduce is a method that almost by definition is handled completely differently so ok with 2nd in this case.

pandas/core/internals/array_manager.py Show resolved Hide resolved
pandas/core/internals/array_manager.py Outdated Show resolved Hide resolved
@jreback
Copy link
Contributor

jreback commented Feb 24, 2021

lgtm. if you can merge master to make sure up to date and merge on greenish

@jreback jreback merged commit 36ff425 into pandas-dev:master Feb 25, 2021
@jreback
Copy link
Contributor

jreback commented Feb 25, 2021

thanks @jorisvandenbossche

@jorisvandenbossche jorisvandenbossche deleted the am-frame-reduce branch February 25, 2021 08:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Internals Related to non-user accessible pandas implementation Reduction Operations sum, mean, min, max, etc.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants