Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Copy-on-Write (PDEP-7) follow-up overview issue #48998

Open
21 of 38 tasks
jorisvandenbossche opened this issue Oct 7, 2022 · 3 comments
Open
21 of 38 tasks

Copy-on-Write (PDEP-7) follow-up overview issue #48998

jorisvandenbossche opened this issue Oct 7, 2022 · 3 comments
Labels
Copy / view semantics Master Tracker High level tracker for similar issues

Comments

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Oct 7, 2022

PDEP-7: https://pandas.pydata.org/pdeps/0007-copy-on-write.html

An initial implementation was merged in #46958 (with the proposal described in more detail in https://docs.google.com/document/d/1ZCQ9mx3LBMy-nhwRl33_jgcvWo9IWdEfxDNQ2thyTb0/edit / discussed in #36195).

In #36195 (comment) I mentioned some next steps that are still needed; moving this to a new issue.

Implementation

Complete the API surface:

Improve the performance

  • Optimize setitem operations to prevent copies of whole blocks (eg splitting the block could help keeping a view for all other columns, and we only take a copy for the columns that are modified) where splitting the block could keep a view for all other columns, and
  • Check overall performance impact (eg run asv with / without CoW enabled by default and see the difference)

Provide upgrade path:

  • Add a warning mode that gives deprecation warnings for all cases where the current behaviour would change (initially also behind an option): CoW warning mode for cases that will change behaviour #56019
    • We can also update the message of the existing SettingWithCopyWarnings to point users towards enabling CoW as a way to get rid of the warnings
    • Add a general FutureWarning "on first use that would change" that is only raised a single time

Documentation / feedback

Aside from finalizing the implementation, we also need to start documenting this, and it will be super useful to have people give this a try, run their code or test suites with it, etc, so we can iron out bugs / missing warnings / or discover unexpected consequences that need to be addressed/discussed.

  • Document this new feature (how it works, how you can test it)
  • We can still add a note to the 1.5 whatsnew linking to those docs
  • Write a set of blogposts on the topic
  • Gather feedback from users / downstream packages
  • Update existing documentation:
  • Write an upgrade guide

Some remaining aspects of the API to figure out:

@jorisvandenbossche jorisvandenbossche added Master Tracker High level tracker for similar issues Copy / view semantics labels Oct 7, 2022
@bashtage
Copy link
Contributor

bashtage commented Apr 21, 2023

  • Add the same warning/error for inplace methods that are called in a chained context (eg df['a'].fillna(.., inplace=True)

This one just caught me out in statsmodels. Seems like it is hard to get high-performance in place filling on a column-by-column basis with CoW. Is this correct?

The code that cause the issue was

self.data[col].fillna(imp, inplace=True)

which is pretty standard IME.

I replaced it with

self.data[col] = self.data[col].fillna(imp)

Is there a better way when using CoW?

@bashtage
Copy link
Contributor

Ok, so a better way is self.data.fillna(imp_values, inplace=True) where imp_values is dict[col, fill_val]

@jorisvandenbossche
Copy link
Member Author

Yes, that's a good question. In general, the answer is always to do it through a method on the DataFrame directly, and indeed in this case DataFrame.fillna allows this with a dict, as you mention.
When you want to do this for a single columns, going through the DataFrame method is a bit less obvious I think, but so maybe we could hint on doing that if we add a warning for this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Copy / view semantics Master Tracker High level tracker for similar issues
Projects
None yet
Development

No branches or pull requests

3 participants
@jorisvandenbossche @bashtage and others