POC/REF: remove axes from Managers #48126
Hello @jbrockmendel! Thanks for opening this PR. We checked the lines you've touched for PEP 8 issues, and found:
@jorisvandenbossche gentle ping, just want to make sure we're on the same page w/r/t the general goal. No need to look at the diff yet.
@jorisvandenbossche gentle ping
@mroeschke @jreback i'd like to get some sign-off on the concept here before i put more time into it; not time-sensitive.
What would be the main benefits of removing
The original motivation was the ongoing goal of simplifying the internals code. Without the axes, we can think of Managers as positional-only, which is a much cleaner abstraction than we have ATM. This came back onto my radar recently while troubleshooting modin perf. The thought there is that with axes removed, modin/dask/etc could save some overhead by serializing/deserializing the Manager instead of the DataFrame.
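To make the "positional-only" idea concrete, here is a minimal toy sketch. It is not pandas' actual internals: `ToyManager` and `ToyFrame` are hypothetical names, and plain lists stand in for blocks. The point is only that the label-to-position translation can live entirely on the frame, so the manager never sees a label.

```python
from dataclasses import dataclass


@dataclass
class ToyManager:
    """Hypothetical stand-in for a Manager: column data only, no labels."""

    columns_data: list  # one list per column, addressed by position

    def take_rows(self, positions: list) -> "ToyManager":
        # Purely positional: no label lookups happen here.
        return ToyManager([[col[i] for i in positions] for col in self.columns_data])

    def iget(self, col_pos: int) -> list:
        # Fetch a column by its integer position.
        return self.columns_data[col_pos]


@dataclass
class ToyFrame:
    """Hypothetical stand-in for a DataFrame: owns the axes, delegates data."""

    index: list
    columns: list
    mgr: ToyManager

    def loc_rows(self, labels: list) -> "ToyFrame":
        # Label -> position translation lives on the frame, not the manager.
        positions = [self.index.index(label) for label in labels]
        new_index = [self.index[i] for i in positions]
        return ToyFrame(new_index, self.columns, self.mgr.take_rows(positions))

    def column(self, name) -> list:
        return self.mgr.iget(self.columns.index(name))


frame = ToyFrame(
    index=["a", "b", "c"],
    columns=["x", "y"],
    mgr=ToyManager([[1, 2, 3], [10, 20, 30]]),
)
subset = frame.loc_rows(["c", "a"])
```

In this sketch `subset.mgr` holds only reordered values; the reordered labels are tracked by `subset.index` alone.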
Would that imply that the Manager should become public?
I think it would go in the pseudo-public API in core.internals.api
I don't remember the exact details of the discussion we had about this, so it would indeed be good to list some reasoning and advantages/disadvantages (but I am certainly not opposed to exploring this).
Can you explain how that helps performance? I would assume you still need to serialize the index objects separately as well? (Or is the idea that, if you have many dataframe partitions that share the same columns, these can be serialized only once? But how would modin make use of this without relying too much on internals? Would modin store Managers as partitions instead of DataFrames?)
In practice, Managers are already kind of positional-only, I think? (In the sense that in indexing methods, the Manager methods already receive only positional arguments and never do Index lookups?)
Can you expand this question?
Can you give a rough idea of what part of the required changes is already in here, and what more would be needed?
The idea, which is purely speculative at this point, is that the axes are really only needed by the parent process, while the child processes only need the data. I doubt it makes a huge difference.
Right. Nothing about the Managers logic needs the axes. It is a clearer abstraction without them.
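As a rough illustration of the parent/child split described above, the sketch below keeps the axes in the "parent" and ships only axis-free column data to a simulated worker via `pickle`. Everything here is an assumption made for illustration: the `worker` function, the list-of-lists layout, and the partitioning scheme are hypothetical, and this says nothing about how modin or dask actually partition or serialize data.

```python
import pickle

# Hypothetical layout: the parent keeps the axes, workers see only positional data.
index = ["a", "b", "c", "d"]
columns = ["x", "y"]
data = [[1, 2, 3, 4], [10, 20, 30, 40]]  # one list per column


def worker(payload: bytes) -> bytes:
    """Simulated child process: receives and returns axis-free column data."""
    cols = pickle.loads(payload)
    doubled = [[value * 2 for value in col] for col in cols]
    return pickle.dumps(doubled)


# Split the rows into two partitions by position; no labels are ever sent.
partitions = [[col[:2] for col in data], [col[2:] for col in data]]
results = [pickle.loads(worker(pickle.dumps(part))) for part in partitions]

# The parent stitches the partitions back together and reattaches its own axes.
merged = [sum((part[j] for part in results), []) for j in range(len(columns))]
table = {name: dict(zip(index, col)) for name, col in zip(columns, merged)}
```

Whether skipping the axes saves a meaningful amount of serialization work would need measuring; the sketch only shows that the worker side can operate without them.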
In |
a017402 to e8ce5e8
cc @jorisvandenbossche we've discussed the idea of refactoring the Manager classes to not have the axes. This is a (not-remotely-working) attempt at de-coupling the NDFrame axes from Manager axes. Some questions before I spend much more time on this: