Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CoW: Add lazy copy mechanism to DataFrame constructor for dict of Index #52947

Merged

Conversation

phofl
Copy link
Member

@phofl phofl commented Apr 26, 2023

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

xref #48998

@phofl phofl requested a review from jorisvandenbossche May 5, 2023 10:13
Comment on lines 463 to 468
arrays = [arr if not isinstance(arr, Index) else arr._data for arr in arrays]
if not using_copy_on_write():
arrays = [
arr if not isinstance(arr, Index) else arr._data for arr in arrays
]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we actually need this in the default case of not CoW? This basically just unpacks Index objects, but also eg Series doesn't get unpacked, and this is handled in arrays_to_mgr ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you check this comment?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets see if ci passes, wasn't sure so wanted to avoid changing it

Comment on lines +477 to +478
else x.copy(deep=True)
if isinstance(x, Index)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In principle this copy is only needed when the dtype is an ExtensionDtype?

And we might actually have the same issue right now with passing a Series here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is currently a "bug" for when passing a dict of Series with EA dtype. Because we are passing a dict, it should result in a copy, but it doesn't when it's an EA dtype and does when default numpy dtype:

In [17]: ser = pd.Series([1, 2, 3], dtype="Int64")

In [18]: df = pd.DataFrame({"col": ser})

In [19]: df.iloc[0, 0] = 10

In [20]: ser
Out[20]: 
0    10
1     2
2     3
dtype: Int64

In [21]: ser = pd.Series([1, 2, 3], dtype="int64")

In [22]: df = pd.DataFrame({"col": ser})

In [23]: df.iloc[0, 0] = 10

In [24]: ser
Out[24]: 
0    1
1    2
2    3
dtype: int64

And also when creating it from an array instead of Series, it is copied correctly:

In [25]: arr = pd.array([1, 2, 3], dtype="Int64")

In [26]: df = pd.DataFrame({"col": arr})

In [27]: df.iloc[0, 0] = 10

In [28]: arr
Out[28]: 
<IntegerArray>
[1, 2, 3]
Length: 3, dtype: Int64

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should change the check for ExtensionArray to check for the dtype being ExtensionDtype? (or move the copy deeper down into arrays_to_mgr after we know the data have been sanitized / unboxed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would like to do this independently of the CoW change, can do as a follow up or pre-cursor. What do you prefer?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either way is fine for me

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wild o as follow up then

@github-actions
Copy link
Contributor

github-actions bot commented Jun 8, 2023

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Jun 8, 2023
@phofl phofl requested a review from jorisvandenbossche June 8, 2023 01:53
@jorisvandenbossche
Copy link
Member

This needs a merge of main now I merged the other PR

phofl added 2 commits June 19, 2023 22:19
…x_to_frame

# Conflicts:
#	doc/source/whatsnew/v2.1.0.rst
@phofl
Copy link
Member Author

phofl commented Jun 19, 2023

updated

@jorisvandenbossche
Copy link
Member

Thanks!

@jorisvandenbossche jorisvandenbossche merged commit bc4f149 into pandas-dev:main Jun 20, 2023
@phofl phofl deleted the cow_avoid_copy_index_to_frame branch June 20, 2023 07:48
@phofl
Copy link
Member Author

phofl commented Jun 20, 2023

will put up a follow up for the copy thing

canthonyscott pushed a commit to canthonyscott/pandas-anthony that referenced this pull request Jun 23, 2023
Daquisu pushed a commit to Daquisu/pandas that referenced this pull request Jul 8, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants