Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CoW: Set copy=False in internal usages of Series/DataFrame constructors #51834

Merged

Conversation

phofl
Copy link
Member

@phofl phofl commented Mar 8, 2023

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

cc @jbrockmendel This keeps copy=False internally where necessary, to avoid unnecessary copies as a side-effect of #51731 (by default copying numpy arrays in the DataFrame constructor)

@phofl phofl added this to the 2.0 milestone Mar 8, 2023
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Commented in a few cases where I was not sure about being sure to already have a copy.

@@ -3604,7 +3609,11 @@ def transpose(self, *args, copy: bool = False) -> DataFrame:
if copy:
new_arr = new_arr.copy()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we can pass copy=False below (which I think is correct, since in this code path we know that we have multiple blocks, and so self.values will always be a copy), then this copy if copy=True should also not be necessary?

I would also add a short comment mentioning why we can specify copy=False

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I thought so as well, couldn't figure out why we would do an additional copy here, but wasn't sure if I am missing something.

Added a comment

@@ -3830,7 +3839,7 @@ def _getitem_multilevel(self, key):
else:
new_values = self.values[:, loc]
result = self._constructor(
new_values, index=self.index, columns=result_columns
new_values, index=self.index, columns=result_columns, copy=False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we sure here? loc can be a slice, in which case self.values[:, loc] can be a view?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah you are correct, this was wrong before as well. Will open a separate pr.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we generally want a view?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but we should track references when we get a view. Which we did not do before

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was handled by #51944

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep

@@ -777,6 +777,7 @@ def swapaxes(
return self._constructor(
new_values,
*new_axes,
copy=False,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be passed here because in case of CoW, the single block case is already handled above? So if we end up here and use CoW, we always have multiple blocks and thus already a copy?

I would maybe add a brief comment mentioning that, maybe like

# ...
copy = False

to have the comment together with the value, and then passing copy to the constructor.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep exactly. Added a comment

)
return self._constructor(
self._values[indexer], index=new_index, copy=False
).__finalize__(self)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if that is the case here, but in general get_loc_level can return a slice, and so self._values[indexer] can be a view?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we can get here with a slice, but will check

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you check this one?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't construct a case where we would end up with a slice here. Will re-check but would open a separate pr anyway

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, looking at the code, it seems that when passing a tuple we typically convert the slice to a boolean ndarray.
(we could still add a if isinstance(indexer, slice): add_reference to be sure, but that would probably be not covered by our tests)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah took another look, If you have a MultiIndex that has tuples in the first level then you can get a slice indexer here... Weird corner case 😃

I'll open another pr for this case

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

new_ser = self._constructor(new_values, index=new_index, name=self.name)
new_ser = self._constructor(
new_values, index=new_index, name=self.name, copy=False
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here: are we sure this is a copy?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No this can be a view. Will open a separate pr since this does not work right now either

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you already have another PR for this?

Copy link
Member Author

@phofl phofl Mar 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@phofl
Copy link
Member Author

phofl commented Mar 16, 2023

So all non-working cases have been merged, I'd merge this tomorrow if green. Afterwards we can merge the copy True pr, this would cause failures now if merged before this one

@jorisvandenbossche jorisvandenbossche merged commit c98b7c8 into pandas-dev:main Mar 16, 2023
@lumberbot-app
Copy link

lumberbot-app bot commented Mar 16, 2023

Owee, I'm MrMeeseeks, Look at me.

There seem to be a conflict, please backport manually. Here are approximate instructions:

  1. Checkout backport branch and update it.
git checkout 2.0.x
git pull
  1. Cherry pick the first parent branch of the this PR on top of the older branch:
git cherry-pick -x -m1 c98b7c84bdb3ba94d7cf482802f15fe313c0f5c7
  1. You will likely have some merge/cherry-pick conflict here, fix them and commit:
git commit -am 'Backport PR #51834: CoW: Set copy=False in internal usages of Series/DataFrame constructors'
  1. Push to a named branch:
git push YOURFORK 2.0.x:auto-backport-of-pr-51834-on-2.0.x
  1. Create a PR against branch 2.0.x, I would have named this PR:

"Backport PR #51834 on branch 2.0.x (CoW: Set copy=False in internal usages of Series/DataFrame constructors)"

And apply the correct labels and milestones.

Congratulations — you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon!

Remember to remove the Still Needs Manual Backport label once the PR gets merged.

If these instructions are inaccurate, feel free to suggest an improvement.

@jorisvandenbossche
Copy link
Member

Manual backport -> #52012

@phofl phofl deleted the cow_copy_false_in_constructor branch March 16, 2023 09:16
phofl added a commit that referenced this pull request Mar 16, 2023
…DataFrame constructors (#52012)

CoW: Set copy=False in internal usages of Series/DataFrame constructors (#51834)

(cherry picked from commit c98b7c8)

Co-authored-by: Patrick Hoefler <61934744+phofl@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants