API / CoW: constructing DataFrame from DataFrame/BlockManager creates lazy copy #51239

phofl · 2023-02-08T15:57:33Z

closes #xxxx (Replace xxxx with the GitHub issue number)
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

Checking if everything passes

jorisvandenbossche · 2023-02-08T16:10:45Z

This replaces #50499?

phofl · 2023-02-08T16:11:40Z

If CI passes everywhere, i.e. I did not miss anything, yes. I wanted to keep the other PR open until a run is through here.

phofl · 2023-02-08T18:02:05Z

Looks good I think

phofl · 2023-02-08T18:02:26Z

doc/source/whatsnew/v2.0.0.rst

@@ -238,6 +238,9 @@ Copy-on-Write improvements
  a modification to the data happens) when constructing a Series from an existing
  Series with the default of ``copy=False`` (:issue:`50471`)

+- The :class:`DataFrame` constructor now will keep track of references when called
+  with another :class:`DataFrame` or ``BlockManager``.


Not sure if mentioning BlockManager here is worth it?

No, I wouldn't mention that (normal users should never pass that)

Thx, removed

phofl · 2023-02-09T11:08:35Z

I think this is ready as well except the doc comment

jorisvandenbossche · 2023-02-10T16:19:08Z

No, I wouldn't mention that (normal users should never pass that)

jorisvandenbossche · 2023-02-10T16:25:00Z

pandas/core/frame.py

@@ -654,6 +654,8 @@ def __init__(
            data = data._mgr

        if isinstance(data, (BlockManager, ArrayManager)):
+            if using_copy_on_write():
+                data = data.copy(deep=False)


The previous PR only did this for DataFrame, not for BlockManager.

We have many places where we have the pattern of new_data = self._mgr.<something>; self._constructor(new_data). In those cases, I think in theory the manager method should already have taken care of the references, and so an additional shallow copy is not needed.

But it should also be harmless (except for a bit of overhead), since those intermediate managers / blocks will go out of scope?

I wanted to do this also for a manager, but this caused all sorts of problems because we kept the Manager alive.

Yeah this is a safety net for something like

    new_data = self._mgr
    if something:  # False
        ...
    return self._constructor(new_data)

Yes, if it is a true intermediate manager it goes out of scope immediately. But if we forgot to perform a shallow copy for some reason, this catches it.
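To make the safety-net argument concrete, here is a toy model in plain Python. It is only a sketch of the idea, not pandas internals: `ToyBlock`, `ToyManager`, `ToyFrame`, `copy_shallow` and the `safety_net` flag are hypothetical names invented for illustration. Blocks that view the same values share a weak reference set, and a write copies the values first if any other block is still alive.

```python
import weakref


class ToyBlock:
    """Hypothetical stand-in for a block: values plus a shared reference tracker."""

    def __init__(self, values, refs=None):
        self.values = values
        # Blocks viewing the same values share one WeakSet, so a block drops
        # out of the count once it has been garbage collected.
        self.refs = refs if refs is not None else weakref.WeakSet()
        self.refs.add(self)

    def has_other_refs(self):
        return len(self.refs) > 1


class ToyManager:
    """Hypothetical stand-in for a manager holding a list of blocks."""

    def __init__(self, blocks):
        self.blocks = blocks

    def copy_shallow(self):
        # New block objects that share both the values and the tracker.
        return ToyManager([ToyBlock(b.values, b.refs) for b in self.blocks])

    def setitem(self, block_idx, pos, value):
        block = self.blocks[block_idx]
        if block.has_other_refs():
            # Copy-on-Write: detach from the shared values before writing.
            block = ToyBlock(list(block.values))
            self.blocks[block_idx] = block
        block.values[pos] = value


class ToyFrame:
    def __init__(self, mgr, safety_net=True):
        # Mirrors the idea in the diff: shallow-copy an incoming manager so
        # the new frame never holds the caller's manager object itself.
        self.mgr = mgr.copy_shallow() if safety_net else mgr


parent = ToyFrame(ToyManager([ToyBlock([1, 2, 3])]))

# A code path that forgot its own shallow copy and passes the manager on as is.
unsafe = ToyFrame(parent.mgr, safety_net=False)
unsafe.mgr.setitem(0, 0, 100)
leaked = parent.mgr.blocks[0].values[0]  # the parent sees this write

# The same call with the safety net: the write triggers a copy first.
safe = ToyFrame(parent.mgr)
safe.mgr.setitem(0, 0, -1)
protected = parent.mgr.blocks[0].values[0]  # the parent is not touched again
```

In this model a true intermediate manager costs only the extra shallow copy, while a manager that is accidentally shared with a live frame can no longer be modified through the new frame.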

One more thing here, do you know what is the overhead of this shallow copy? Because our "fastpath" for DataFrame creation goes through here

Small test (not using this branch):

In [97]: df = pd.DataFrame({'a': [1, 2, 3], 'b': [.1, .2, .3]})

In [98]: mgr = df._mgr

In [100]: %timeit mgr.copy(deep=False)
4.1 µs ± 47.4 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [102]: %timeit pd.DataFrame(mgr)
1.47 µs ± 35.7 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

So in relative terms, it's not an insignificant change. Not fully sure how important this is in real-world cases, though.

Yeah it's not totally cheap, it's from 1 to 3 on my machine.

We could keep it out of here if you prefer and investigate other methods of keeping the reference here?

jorisvandenbossche · 2023-02-10T16:28:31Z

(I changed ENH -> API in the title: with CoW enabled, this is actually a behaviour change (copy=False is the default), not just avoiding a copy where we currently copy.)

lithomas1 · 2023-02-15T21:23:16Z

doc/source/whatsnew/v2.0.0.rst

@@ -243,6 +243,9 @@ Copy-on-Write improvements
  a modification to the data happens) when constructing a Series from an existing
  Series with the default of ``copy=False`` (:issue:`50471`)

+- The :class:`DataFrame` constructor now will keep track of references when called


IIRC, we don't tell users about how CoW is implemented under the hood.

Maybe we can say that the constructor will obey CoW rules when called with a DataFrame.

Yeah this makes more sense, will change

I would match this sentence with the bullet point above, since it is basically the same enhancement but for DataFrame instead of Series
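For readers who want to see what "obeys CoW rules" means from the user side, a minimal sketch follows. It assumes pandas 2.0 or later and turns on the `mode.copy_on_write` option explicitly; the variable names are illustrative only. The `try`/`except` covers the assumption that a future pandas release, where Copy-on-Write is always on, might no longer register the option.

```python
import warnings

import pandas as pd

# Enable Copy-on-Write. Newer pandas versions may warn that the option is
# deprecated, and this sketch assumes it could be removed entirely one day.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    try:
        pd.set_option("mode.copy_on_write", True)
    except KeyError:
        pass  # option not registered; assume CoW is always enabled

df = pd.DataFrame({"a": [1, 2, 3], "b": [0.1, 0.2, 0.3]})

# copy=False is the default when the input is a DataFrame, so no data is
# copied here; the new object only shares the parent's data lazily.
df2 = pd.DataFrame(df)

# Writing to the new object triggers the copy, so the parent stays intact.
df2.iloc[0, 0] = 100

parent_value = int(df.iloc[0, 0])
child_value = int(df2.iloc[0, 0])
print(parent_value, child_value)
```

Without Copy-on-Write enabled on pandas 2.x, the same write can propagate to `df`, which is why this is a behaviour change and not only a performance improvement.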

phofl · 2023-02-20T21:54:09Z

@jorisvandenbossche are you ok here?

phofl · 2023-02-26T17:43:04Z

Merging, we can still address the mgr case before 2.0 is out if necessary

… from DataFrame/BlockManager creates lazy copy) (#51650) Backport PR #51239: API / CoW: constructing DataFrame from DataFrame/BlockManager creates lazy copy Co-authored-by: Patrick Hoefler <61934744+phofl@users.noreply.github.com>

jorisvandenbossche · 2023-03-01T09:22:32Z

Yeah, for the manager case, it might be good to check some benchmarks for this. Although in practice I suppose it will only show up when creating many small DataFrames in some operation, so not sure what a typical use case would be where this would be a significant issue.

ENH: Keep track of references in DataFrame constructor for manager of df

692af75

phofl marked this pull request as draft February 8, 2023 15:57

mroeschke added the Copy / view semantics label Feb 8, 2023

phofl mentioned this pull request Feb 8, 2023

API / CoW: constructing DataFrame from DataFrame creates lazy copy #50499

Closed

5 tasks

phofl marked this pull request as ready for review February 8, 2023 18:01

phofl commented Feb 8, 2023

Merge branch 'main' into cons_df

9d8902f

jorisvandenbossche mentioned this pull request Feb 10, 2023

Copy-on-Write (PDEP-7) follow-up overview issue #48998

Open

38 tasks

jorisvandenbossche reviewed Feb 10, 2023

jorisvandenbossche changed the title ~~ENH: Keep track of references in DataFrame constructor for manager of df~~ API / CoW: constructing DataFrame from DataFrame/BlockManager creates lazy copy Feb 10, 2023

phofl and others added 2 commits February 10, 2023 17:30

Update v2.0.0.rst

b1f20aa

Merge remote-tracking branch 'upstream/main' into cons_df

9ce8b4a

# Conflicts:
#   doc/source/whatsnew/v2.0.0.rst
#   pandas/core/reshape/concat.py
#   pandas/tests/copy_view/test_constructors.py

lithomas1 reviewed Feb 15, 2023

phofl and others added 3 commits February 15, 2023 22:27

Adjust whatsnew

dd40d8c

Adjust whatsnew

e090f8a

Merge branch 'main' into cons_df

382ac97

phofl added this to the 2.0 milestone Feb 16, 2023

Merge branch 'main' into cons_df

f7b38e0

phofl merged commit 9203f9e into pandas-dev:main Feb 26, 2023

phofl deleted the cons_df branch February 26, 2023 17:43

meeseeksmachine mentioned this pull request Feb 26, 2023

Backport PR #51239 on branch 2.0.x (API / CoW: constructing DataFrame from DataFrame/BlockManager creates lazy copy) #51650

Merged

meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Feb 26, 2023

Backport PR pandas-dev#51239: API / CoW: constructing DataFrame from …

3b5af41

…DataFrame/BlockManager creates lazy copy

jorisvandenbossche mentioned this pull request May 29, 2023

REF: implement NDFrame._from_mgr #52132

Merged

5 tasks
