Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API / CoW: Copy NumPy arrays by default in DataFrame constructor #51731

Merged
merged 29 commits into from
Mar 17, 2023
Merged
Show file tree
Hide file tree
Changes from 6 commits
Commits
Show all changes
29 commits
Select commit Hold shift + click to select a range
563257e
Implement copy by default for numpy array
phofl Mar 1, 2023
f3161a3
Fix tests
phofl Mar 1, 2023
3a95311
Simplify
phofl Mar 2, 2023
17cf5ae
Fix
phofl Mar 2, 2023
07aa26d
Remove
phofl Mar 2, 2023
8e84d85
Merge branch 'main' into cow_copy_numpy_array
phofl Mar 2, 2023
f3ccf0f
Add comment
phofl Mar 3, 2023
5cdc6ad
Fix
phofl Mar 3, 2023
3e384ea
Change logic
phofl Mar 3, 2023
49ee53f
Merge remote-tracking branch 'upstream/main' into cow_copy_numpy_array
phofl Mar 3, 2023
fcc7be2
Fix tests
phofl Mar 3, 2023
9223836
Fix test
phofl Mar 3, 2023
a474bf5
Fix test
phofl Mar 3, 2023
d5a0268
Merge branch 'main' into cow_copy_numpy_array
phofl Mar 4, 2023
0be7fc6
Merge remote-tracking branch 'upstream/main' into cow_copy_numpy_array
phofl Mar 7, 2023
293f8a5
Remove order changes
phofl Mar 14, 2023
3376d06
Merge remote-tracking branch 'upstream/main' into cow_copy_numpy_array
phofl Mar 14, 2023
265d9e3
Update
phofl Mar 14, 2023
9ac1bae
Address review
phofl Mar 14, 2023
db92ce4
Add test
phofl Mar 14, 2023
be9cb04
Fix test
phofl Mar 15, 2023
4bf3ee8
Merge branch 'main' into cow_copy_numpy_array
phofl Mar 15, 2023
e2eceec
Fix array manager
phofl Mar 15, 2023
65965c6
Remove elif
phofl Mar 15, 2023
ecb756c
Update pandas/tests/copy_view/test_constructors.py
phofl Mar 15, 2023
8e837d9
Merge remote-tracking branch 'upstream/main' into cow_copy_numpy_array
phofl Mar 15, 2023
177fbbc
Add whatsnew
phofl Mar 15, 2023
2bbff3b
Merge branch 'main' into cow_copy_numpy_array
phofl Mar 16, 2023
5bef4ba
Add note
phofl Mar 16, 2023
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 4 additions & 1 deletion pandas/conftest.py
Original file line number Diff line number Diff line change
Expand Up @@ -522,7 +522,10 @@ def multiindex_dataframe_random_data(
"""DataFrame with 2 level MultiIndex with random data"""
index = lexsorted_two_level_string_multiindex
return DataFrame(
np.random.randn(10, 3), index=index, columns=Index(["A", "B", "C"], name="exp")
np.random.randn(10, 3),
index=index,
columns=Index(["A", "B", "C"], name="exp"),
copy=False,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the False "needed" here (did it otherwise give failures), or just for efficiency since this is an example case where we know the array is not owned by anyone else?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copying here causes one test to fail, which is very weird(the failure). Haven't looked closer yet, but the test is useless as soon as your read_only pr is merged.

Want to understand what's off there nevertheless though

)


Expand Down
7 changes: 6 additions & 1 deletion pandas/core/construction.py
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,8 @@
import numpy as np
from numpy import ma

from pandas._config import using_copy_on_write

from pandas._libs import lib
from pandas._libs.tslibs.period import Period
from pandas._typing import (
Expand Down Expand Up @@ -762,6 +764,9 @@ def _try_cast(

subarr = maybe_cast_to_integer_array(arr, dtype)
else:
subarr = np.array(arr, dtype=dtype, copy=copy)
if using_copy_on_write():
subarr = np.array(arr, dtype=dtype, copy=copy, order="F")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the order="F" specifically needed for CoW? (and can you add a comment about it)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is about #50756

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I understand that it's related to that, but I don't understand why we would only do it for CoW? The default is not to copy arrays (without CoW) at the moment, which creates this inefficient layout; but so if the user manually specifies copy=True in the constructor, why not directly copy it to the desired layout in all cases without the if/else check for CoW?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah did not think about this, will add

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on @jbrockmendel s comment above I think we should leave it out for now

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on @jbrockmendel s comment above I think we should leave it out for now

Leave it out in general (from this PR), or you mean not do it for the default mode for now?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely default mode and maybe also split the copy and the layout change into two PRs?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reiterating preference for this to be separate. The two of you are much more familiar with the CoW logic than most of the rest of us; i get antsy when i see using_copy_on_write checks appearing in new places

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed all order relevant changes, is more straightforward now

else:
subarr = np.array(arr, dtype=dtype, copy=copy)

return subarr
12 changes: 10 additions & 2 deletions pandas/core/dtypes/cast.py
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,8 @@

import numpy as np

from pandas._config import using_copy_on_write

from pandas._libs import lib
from pandas._libs.missing import (
NA,
Expand Down Expand Up @@ -1613,11 +1615,17 @@ def maybe_cast_to_integer_array(arr: list | np.ndarray, dtype: np.dtype) -> np.n
"NumPy will stop allowing conversion of out-of-bound Python int",
DeprecationWarning,
)
casted = np.array(arr, dtype=dtype, copy=False)
if using_copy_on_write():
casted = np.array(arr, dtype=dtype, copy=False, order="F")
else:
casted = np.array(arr, dtype=dtype, copy=False)
else:
with warnings.catch_warnings():
warnings.filterwarnings("ignore", category=RuntimeWarning)
casted = arr.astype(dtype, copy=False)
if using_copy_on_write():
casted = arr.astype(dtype, copy=False, order="F")
else:
casted = arr.astype(dtype, copy=False)
except OverflowError as err:
raise OverflowError(
"The elements provided in the data cannot all be "
Expand Down
2 changes: 2 additions & 0 deletions pandas/core/frame.py
Original file line number Diff line number Diff line change
Expand Up @@ -685,6 +685,8 @@ def __init__(
# INFO(ArrayManager) by default copy the 2D input array to get
# contiguous 1D arrays
copy = True
elif using_copy_on_write() and isinstance(data, np.ndarray):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential alternative is to express this in the inverse: .. and not isinstance(data, (Series, DataFrame)).

That would ensure that also other "array-likes" that would coerce zero-copy into a numpy array but are not exactly np.ndarray would get copied by default (not fully sure how our constructor currently handles such array like objects, though).
I think in the end only for pandas objects we can keep track of references, so only in that case the default should be a shallow copy?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's try and see what breaks

copy = True
else:
copy = False

Expand Down
11 changes: 11 additions & 0 deletions pandas/core/internals/construction.py
Original file line number Diff line number Diff line change
Expand Up @@ -14,6 +14,8 @@
import numpy as np
from numpy import ma

from pandas._config import using_copy_on_write

from pandas._libs import lib
from pandas._typing import (
ArrayLike,
Expand Down Expand Up @@ -289,6 +291,15 @@ def ndarray_to_mgr(
if values.ndim == 1:
values = values.reshape(-1, 1)

elif (
using_copy_on_write()
and isinstance(values, np.ndarray)
and (dtype is None or is_dtype_equal(values.dtype, dtype))
and copy_on_sanitize
):
values = np.array(values, order="F", copy=copy_on_sanitize)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the order="F" thing something we should be doing in general in cases with copy=True?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are operating on the transposed array when copying a DataFrame object, so not needed in that case

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

xref #44871 took a look at preserving order when doing copy. there are some tradeoffs here

values = _ensure_2d(values)

elif isinstance(values, (np.ndarray, ExtensionArray, ABCSeries, Index)):
# drop subclass info
values = np.array(values, copy=copy_on_sanitize)
Expand Down
5 changes: 4 additions & 1 deletion pandas/tests/frame/methods/test_fillna.py
Original file line number Diff line number Diff line change
Expand Up @@ -57,7 +57,10 @@ def test_fillna_on_column_view(self, using_copy_on_write):

# i.e. we didn't create a new 49-column block
assert len(df._mgr.arrays) == 1
assert np.shares_memory(df.values, arr)
if using_copy_on_write:
assert not np.shares_memory(df.values, arr)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or pass copy=False to the constructor instead?

Because now the check above about assert np.isnan(arr[:, 0]).all() is kind of useless because arr was copied and so of course will not be modified.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, since this is inplace=True and there are no other references to df, shouldn't arr be modified also with CoW before this PR?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, we are doing df[0] which is a chained assignment case. Updated the test to set copy=False

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes, missed the [0] ..

else:
assert np.shares_memory(df.values, arr)

def test_fillna_datetime(self, datetime_frame):
tf = datetime_frame
Expand Down
10 changes: 7 additions & 3 deletions pandas/tests/frame/methods/test_to_numpy.py
Original file line number Diff line number Diff line change
Expand Up @@ -23,11 +23,15 @@ def test_to_numpy_dtype(self):
tm.assert_numpy_array_equal(result, expected)

@td.skip_array_manager_invalid_test
def test_to_numpy_copy(self):
def test_to_numpy_copy(self, using_copy_on_write):
arr = np.random.randn(4, 3)
df = DataFrame(arr)
assert df.values.base is arr
assert df.to_numpy(copy=False).base is arr
if using_copy_on_write:
assert df.values.base is not arr
assert df.to_numpy(copy=False).base is not arr
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add something like df.to_numpy(copy=False).base is df.values.base (because I think this was part of the intention of the test to verify that to_numpy(copy=False) didn't make a copy, and not so much that DataFrame(arr) doesn't make a copy)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

else:
assert df.values.base is arr
assert df.to_numpy(copy=False).base is arr
assert df.to_numpy(copy=True).base is not arr

def test_to_numpy_mixed_dtype_to_str(self):
Expand Down
16 changes: 12 additions & 4 deletions pandas/tests/frame/test_constructors.py
Original file line number Diff line number Diff line change
Expand Up @@ -313,11 +313,14 @@ def test_1d_object_array_does_not_copy(self):
assert np.shares_memory(df.values, arr)

@td.skip_array_manager_invalid_test
def test_2d_object_array_does_not_copy(self):
def test_2d_object_array_does_not_copy(self, using_copy_on_write):
# https://github.com/pandas-dev/pandas/issues/39272
arr = np.array([["a", "b"], ["c", "d"]], dtype="object")
df = DataFrame(arr)
assert np.shares_memory(df.values, arr)
if using_copy_on_write:
assert not np.shares_memory(df.values, arr)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

else:
assert np.shares_memory(df.values, arr)

def test_constructor_dtype_list_data(self):
df = DataFrame([[1, "2"], [None, "a"]], dtype=object)
Expand Down Expand Up @@ -2107,13 +2110,18 @@ def test_constructor_frame_shallow_copy(self, float_frame):
cop.index = np.arange(len(cop))
tm.assert_frame_equal(float_frame, orig)

def test_constructor_ndarray_copy(self, float_frame, using_array_manager):
def test_constructor_ndarray_copy(
self, float_frame, using_array_manager, using_copy_on_write
):
if not using_array_manager:
arr = float_frame.values.copy()
df = DataFrame(arr)

arr[5] = 5
assert (df.values[5] == 5).all()
if using_copy_on_write:
assert not (df.values[5] == 5).all()
else:
assert (df.values[5] == 5).all()

df = DataFrame(arr, copy=True)
arr[6] = 6
Expand Down
9 changes: 7 additions & 2 deletions pandas/tests/indexing/multiindex/test_setitem.py
Original file line number Diff line number Diff line change
Expand Up @@ -482,12 +482,17 @@ def test_setitem_new_column_all_na(self):

@td.skip_array_manager_invalid_test # df["foo"] select multiple columns -> .values
# is not a view
def test_frame_setitem_view_direct(multiindex_dataframe_random_data):
def test_frame_setitem_view_direct(
multiindex_dataframe_random_data, using_copy_on_write
):
# this works because we are modifying the underlying array
# really a no-no
df = multiindex_dataframe_random_data.T
df["foo"].values[:] = 0
assert (df["foo"].values == 0).all()
if using_copy_on_write:
assert not (df["foo"].values == 0).all()
else:
assert (df["foo"].values == 0).all()


def test_frame_setitem_copy_raises(
Expand Down