-
-
Notifications
You must be signed in to change notification settings - Fork 18.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
REF: implement NDFrame._from_mgr #52132
Conversation
Compatibility wise, this can break existing subclass implementations since |
Sounds good! |
Hmm this is straightforward to do for _sliced_from_mgr and _expanddim_from_mgr, but making _from_mgr itself breaks as long as _from_mgr is a classmethod instead of a method. abstraction-wise it makes more sense as a classmethod, but i can try making it a method and see if that breaks anything else |
How long would you want to keep that for before making a "real" _from_mgr method? im hoping the answer isn't "until 3.0", since we need a real version to exist before we can deprecate passing Manager to the constructors |
Maybe yes? The problem here is that this is not easy to communicate via a deprecation (how do subclasses get notified that they need to change this, apart from expecting them to read release notes? Is there a way we can somehow signal this?)
Would it be possible to somehow do this in parallel? If we have a way to signal to the constructor whether to raise the warning or not (eg, when called from via |
Good question. My working assumption is that informing geopandas gets us 90+% of the way there. That last few % may need to be the "release notes" approach. I guess we could implement a
The natural way to signal the constructor would be a new keyword. I guess if we make it super-duper explicit that the keyword is not there to stay and users/libraries should never ever use it... Any thoughts on the classmethod bit? Making _from_mgr a regular method mostly works, but makes it so we can't call it from other classmethods, in particular from_records. |
Would something like this work, if we define our own
That ensures that internally we don't call
Are you sure that subclasses that return a class from
I didn't fully follow your explanation above. What is the reason that it breaks things if it is a class method? |
That looks plausible. Will need to check how it affects performance.
The most important/relevant place is in DataFrame.from_records (a classmethod) which currently ends with
Good point. |
Given that |
Fair enough. Looking at the suggested _constructor implementation I want to double-check. This PR currently changes a bunch of uses of |
Yes, otherwise that would be a breaking change for subclasses that rely on |
@jorisvandenbossche thanks for the reminder to take another look at #52132 (comment) (BTW that reminder was in the form of a comment on the dev meeting notes. I hit a check-mark there thinking that was akin to a thumbs-up but it made the note disappear. Just in case there was any ambiguity: the intent was to convey the thumbs-up)
This is clever, and I think would work. I'm wary bc 1) part of the point of this is performance, which this would hurt, even if just a little, and 2) that conflicts with deprecating non-class _constructor.
This is definitely possible. Such subclasses would need to override _from_mgr.
I think that topic got successfully cleared up, pls let me know if im wrong on this point? |
+1 let's just do this already tbh we are making mountains of molehills for subclasses - yea we have some support but so what let's just do this already we need velocity not endless discussions about every point |
@jbrockmendel thanks for giving this another look! The google doc comment resolving was correctly interpreted
Regarding performance: it's true it requires an extra check of the input type, but that's something that is currently needed in
Yes, I assume so as well. I think the nice thing about my proposal above is that this gives subclasses the time to notice that they need to do this, without directly breaking them. One annoyance with this approach is that we would still be using
Just to be sure: the current idea is to make those from_mgr methods normal methods then, and not class methods? (as in the current diff) |
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this. |
Another option, instead of short-term having a customized @classmethod
def _from_mgr(cls, mgr: Manager, axes: list[Index]) -> Self:
obj = cls.__new__(cls)
NDFrame.__init__(obj, mgr)
return obj
def _constructor_from_mgr(self, mgr):
if self._constructor is DataFrame._constructor:
# we are pandas.DataFrame (or a subclass that doesn't override _constructor)
return self._from_mgr(mgr)
else:
return self._constructor(mgr) This would also allow already starting to actually deprecate The two methods could be merged into one (just |
The question is also, for all possible options, what would a typical future implementation of a subclass' Maybe in many cases, you might not actually want to customize something (if you rely on class MySubclassedDataFrame(pd.DataFrame):
...
def _constructor_from_mgr(self, mgr):
# this will call the class method defined in the parent pandas.DataFrame
return self._from_mgr(mgr) Does that look correct? (in the future, overriding it that way would be redundant I assume, but it would enable the subclass to make the transition / support multiple pandas versions without getting warnings) Or you could add a check in your |
I'm leaning towards a path similar to what @jorisvandenbossche suggested here. |
Thanks for the update!
Coming back to this: would it also be sufficient to just have (this assumes that |
# we are pandas.DataFrame (or a subclass that doesn't override _constructor) | ||
return self._from_mgr(mgr, axes=axes) | ||
else: | ||
assert axes is mgr.axes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this assert needed here? (it's also not done in _from_mgr
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not strictly. ATM _from_mgr is documented as requiring them to match, so this assertion seemed like an easy way of making it required along this path too. could remove the assertion and document the requirement in the docstring
pandas/core/frame.py
Outdated
# with self._constructor_sliced._from_mgr(...) | ||
# once downstream packages (geopandas) have had a chance to implement | ||
# their own overrides. | ||
return self._constructor_sliced(mgr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be something like Series._from_mgr(mgr)
or self._constructor_sliced._from_mgr(mgr)
?
Because now this is calling the same as the fall-back in _constructor_sliced_from_mgr
, and thus the if
block there is not doing anything different.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. i think this is leftover from a previous round of editing. will update.
in the event that axes are refactored out of the Manager objects. | ||
""" | ||
obj = cls.__new__(cls) | ||
NDFrame.__init__(obj, mgr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We will need to discuss if this should make a shallow copy of the input mgr for CoW (cc @phofl). If we keep this consistent with the current behaviour of DataFrame(mgr)
, we should add this:
if using_copy_on_write():
mgr = mgr.copy(deep=False)
We had some discussion on the PR that introduced this to what extent this is actually needed: #51239 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this should only be used internally I think we should be safe without the extra copy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this should only be used internally I think we should be safe without the extra copy.
We could also internally have cases that require it. Essentially, whenever we would do something that boils down to return self._constructor_from_mgr(self.mgr)
(i.e. a code path that simplifies to that) in some method, this is required.
That should probably be considered a bug in our implementation, but so in #51239 (comment) we went for the safe route for now.
If we want to change that, maybe better to do that in a separate PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we want to change that, maybe better to do that in a separate PR?
Yah I'd like to punt on this for the time being.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we are in a state now that we should be able to rely on doing this on the manager level and not here. I think I am ok with not doing a shallow copy here
I think that would work. In general kwargs is a pattern I try to avoid, but I don't care enough to make it a sticking point if you feel strongly about it. |
I don't feel super strong about it, but I mostly don't like seeing |
That's reasonable. So both the |
This turns out to be something of a hassle. Any chance we can all be OK with the current implementation? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm.
Personally I also prefer not using kwargs, I'd rather have one specific keyword, even if it's not doing anything.
I'd be ok with merging this
thx @jbrockmendel |
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.cc @phofl @jorisvandenbossche we've discussed this recently-ish but I've lost track of where.
There is a pretty nice performance improvement in ops with small sizes:
Making axes required despite the fact that we don't use them ATM bc that opens up the option of refactoring axes out of the Managers xref #48126.