-
-
Notifications
You must be signed in to change notification settings - Fork 18.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CoW: Add lazy copy mechanism to DataFrame constructor for dict of Index #52947
Changes from 3 commits
2c323a9
fb54b78
d837e55
adc7d38
0d603af
48e58b6
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -15,6 +15,8 @@ | |
import numpy as np | ||
from numpy import ma | ||
|
||
from pandas._config import using_copy_on_write | ||
|
||
from pandas._libs import lib | ||
|
||
from pandas.core.dtypes.astype import astype_is_view | ||
|
@@ -460,13 +462,23 @@ def dict_to_mgr( | |
keys = list(data.keys()) | ||
columns = Index(keys) if keys else default_index(0) | ||
arrays = [com.maybe_iterable_to_list(data[k]) for k in keys] | ||
arrays = [arr if not isinstance(arr, Index) else arr._data for arr in arrays] | ||
if not using_copy_on_write(): | ||
arrays = [ | ||
arr if not isinstance(arr, Index) else arr._data for arr in arrays | ||
] | ||
|
||
if copy: | ||
if typ == "block": | ||
# We only need to copy arrays that will not get consolidated, i.e. | ||
# only EA arrays | ||
arrays = [x.copy() if isinstance(x, ExtensionArray) else x for x in arrays] | ||
arrays = [ | ||
x.copy() | ||
if isinstance(x, ExtensionArray) | ||
else x.copy(deep=True) | ||
if isinstance(x, Index) | ||
Comment on lines
+471
to
+472
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. In principle this copy is only needed when the dtype is an ExtensionDtype? And we might actually have the same issue right now with passing a Series here? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah, this is currently a "bug" for when passing a dict of Series with EA dtype. Because we are passing a dict, it should result in a copy, but it doesn't when it's an EA dtype and does when default numpy dtype:
And also when creating it from an array instead of Series, it is copied correctly:
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Maybe we should change the check for ExtensionArray to check for the dtype being ExtensionDtype? (or move the copy deeper down into There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Would like to do this independently of the CoW change, can do as a follow up or pre-cursor. What do you prefer? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Either way is fine for me There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. wild o as follow up then |
||
else x | ||
for x in arrays | ||
] | ||
else: | ||
# dtype check to exclude e.g. range objects, scalars | ||
arrays = [x.copy() if hasattr(x, "dtype") else x for x in arrays] | ||
|
@@ -573,10 +585,10 @@ def _homogenize( | |
refs: list[Any] = [] | ||
|
||
for val in data: | ||
if isinstance(val, ABCSeries): | ||
if isinstance(val, (ABCSeries, Index)): | ||
if dtype is not None: | ||
val = val.astype(dtype, copy=False) | ||
if val.index is not index: | ||
if isinstance(val, ABCSeries) and val.index is not index: | ||
# Forces alignment. No need to copy data since we | ||
# are putting it into an ndarray later | ||
val = val.reindex(index, copy=False) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we actually need this in the default case of not CoW? This basically just unpacks Index objects, but also eg Series doesn't get unpacked, and this is handled in
arrays_to_mgr
?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you check this comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lets see if ci passes, wasn't sure so wanted to avoid changing it