Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

POC/REF: remove axes from Managers #48126

Closed
wants to merge 4 commits into from

Conversation

jbrockmendel
Copy link
Member

cc @jorisvandenbossche we've discussed the idea of refactoring the Manager classes to not have the axes. This is a (not-remotely-working) attempt at de-coupling the NDFrame axes from Manager axes. Some questions before I spend much more time on this:

  1. Am I remembering correctly that you are in principle in favor of this refactor?
  2. Thoughts on how to handle the pyarrow usages?
  3. Thoughts on cleaner ways to implement this in managably-sized chunks?

@jbrockmendel jbrockmendel marked this pull request as draft August 17, 2022 16:57
@pep8speaks
Copy link

Hello @jbrockmendel! Thanks for opening this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 589:89: E501 line too long (98 > 88 characters)
Line 630:48: E261 at least two spaces before inline comment
Line 630:48: E262 inline comment should start with '# '
Line 632:49: E261 at least two spaces before inline comment
Line 632:49: E262 inline comment should start with '# '
Line 802:89: E501 line too long (90 > 88 characters)
Line 3696:89: E501 line too long (108 > 88 characters)
Line 4211:89: E501 line too long (91 > 88 characters)

Line 1537:89: E501 line too long (95 > 88 characters)
Line 3910:9: E265 block comment should start with '# '
Line 3910:89: E501 line too long (92 > 88 characters)
Line 3911:89: E501 line too long (96 > 88 characters)
Line 3912:89: E501 line too long (89 > 88 characters)
Line 5076:23: E261 at least two spaces before inline comment
Line 5076:23: E262 inline comment should start with '# '
Line 6307:89: E501 line too long (95 > 88 characters)

Line 3827:89: E501 line too long (91 > 88 characters)

Line 495:89: E501 line too long (91 > 88 characters)

Line 624:89: E501 line too long (119 > 88 characters)

Line 356:47: E261 at least two spaces before inline comment
Line 356:47: E262 inline comment should start with '# '

@jbrockmendel
Copy link
Member Author

@jorisvandenbossche gentle ping, just want to make sure we're on the same page w/r/t the general goal. No need to look at the diff yet.

@mroeschke mroeschke added Refactor Internal refactoring of code Internals Related to non-user accessible pandas implementation labels Aug 22, 2022
@jbrockmendel
Copy link
Member Author

@jorisvandenbossche gentle ping

1 similar comment
@jbrockmendel
Copy link
Member Author

@jorisvandenbossche gentle ping

@jbrockmendel
Copy link
Member Author

@mroeschke @jreback i'd like to get some sign-off on the concept here before i put more time into it; not time-sensitive.

@mroeschke
Copy link
Member

What would be the main benefits of removing axes from Managers? I'm guessing this was necessary when pandas supported Panel so it would be a nice simplification?

@jbrockmendel
Copy link
Member Author

What would be the main benefits of removing axes from Managers?

The original motivation was the ongoing goal of simplifying the internals code. Without the axes, we can think of Managers are positional-only, which is a much cleaner abstraction than we have ATM.

This came back onto my radar screen recently while troubleshooting modin perf. The thought there is that with axes removed, modin/dask/etc could save some overhead by serializing/deserializing the Manager instead of the DataFrame.

@mroeschke
Copy link
Member

save some overhead by serializing/deserializing the Manager instead of the DataFrame

Would that imply that the Manager should become public?

@jbrockmendel
Copy link
Member Author

Would that imply that the Manager should become public?

I think it would go in the pseudo-public API in core.internals.api

@jorisvandenbossche
Copy link
Member

I don't remember the exact details of the discussion we had about this, so it would indeed be good to list some reasoning and advantages/disadvantages (but I am certainly not opposed to exploring this).

The thought there is that with axes removed, modin/dask/etc could save some overhead by serializing/deserializing the Manager instead of the DataFrame.

Can you explain how that helps performance? I would assume you still need to serialize the index objects separately as well? (or is the idea that, if you have many dataframe partitions that share the same columns, this can be serialized only once? But how would modin make use of this without relying to much on internals? Would modin stores Managers as partitions instead of DataFrames? )

Without the axes, we can think of Managers are positional-only, which is a much cleaner abstraction than we have ATM.

In practice, Managers are already kind of positional-only, I think? (in the sense that in indexing methods, the Manager methods already receive only positional arguments, it never does Index lookups?)
Now, of course, exactly because it doesn't do Index lookups (but only subsets them and passes them through), that also makes it less important that the indices are stored on the Manager.

Thoughts on how to handle the pyarrow usages?

Can you expand this question?

Thoughts on cleaner ways to implement this in managably-sized chunks?

Can you give a rough idea of what part of the required changes is already in here, and what more would be needed?

@jbrockmendel
Copy link
Member Author

Can you explain how that helps [modin/dask] performance?

The idea- which is purely speculative at this point- is that the axes are really only needed by the parent process, while the child processes only need the data. I doubt it makes a huge difference.

Now, of course, exactly because it doesn't do Index lookups (but only subsets them and passes them through), that also makes it less important that the indices are stored on the Manager.

Right. Nothing about the Managers logic needs the axes. It is a clearer abstraction without them.

Thoughts on how to handle the pyarrow usages?
Can you expand this question?

In pyarrow.lib._PandasAPIShim.data_frame pyarrow passes a Manager to DataFrame.__init__ without passing index/columns, since they are already present inside the Manager. We would need a transition path away from this usage.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Internals Related to non-user accessible pandas implementation Refactor Internal refactoring of code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants