ROADMAP: add consistent missing values for all dtypes to the roadmap #35208
Conversation
minor nit but not a huge blocker for me
doc/source/development/roadmap.rst (outdated)

data or are cast to float.

Long term, we want to introduce consistent missing value handling across the
different data types: all data types should support missing values and with the
Not sure I agree that all data types should support missing values. Non-nullable types could be beneficial.
I agree that there's value in pandas ensuring that a column cannot contain NAs. I'm not sure where best to put that invariant: the dtype or the array / column.
But, to sidestep this issue, perhaps something like "pandas should provide consistent missing value handling for all the different kinds of data", i.e. we give you the ability, without saying that every dtype has to be nullable.
I seem to remember that you, Tom and I had a similar discussion in another issue (but I can't directly find it).
I agree that the concept of non-nullability is useful/interesting, but as Tom also mentions: non-nullability shouldn't necessarily be a property of the data type itself. Just as Tom is working on the "allow_duplicate_labels" flag, we could have "nullable=True/False" flags per column.
Because for non-nullability to be a property of the dtype itself, we should ask ourselves: do we have an example of a data type (that we want to include in pandas) for which it would never be useful to be nullable?
Or to phrase the text in a different way: if a data type supports missing values, it should follow consistent semantics.
Note that right now, pandas doesn't really know the concept of non-nullable. Yes, we have some dtypes that don't support NAs (like integer dtype), but whenever some operation introduces a missing value, we simply upcast to a dtype that can store the missing value (so in practice this means changing to float for integer). A proper concept of a nullable=False flag wouldn't necessarily work like this, I think.
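The upcasting behaviour described here is easy to demonstrate (a sketch; `reindex` is only one of several operations that can introduce missing values):

```python
import pandas as pd

# An integer column cannot hold NaN, so introducing a missing value
# silently changes the dtype to float64.
s = pd.Series([1, 2, 3])
assert s.dtype == "int64"

upcast = s.reindex([0, 1, 2, 3])   # label 3 does not exist -> NaN inserted
assert upcast.dtype == "float64"

# The nullable Int64 dtype keeps its type and stores pd.NA instead.
n = pd.Series([1, 2, 3], dtype="Int64")
kept = n.reindex([0, 1, 2, 3])
assert kept.dtype == "Int64"
assert kept[3] is pd.NA
```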
Yah, I considered commenting earlier about the issues raised in earlier threads, but then noticed that the all-1D thing is already in the roadmap, so the degree to which the roadmap describes consensus is limited.
@jbrockmendel honestly, that is exactly what I don't want from this PR. As I mentioned in the top post: yes, there are still technical discussions to have (how to reconcile pd.NaT / pd.NA for datetimelike, should pd.NA be typed or not, ..?). But for me those are technical discussions subordinate to the general idea (which is not to say they are unimportant, to be clear, but they follow from the general goal). So if you have fundamental problems with the proposal, please raise them.
I have, in #28095, and AFAICT nothing has changed since then. Asking people to re-state the issues in multiple issues/PRs/mailing list threads and pretending there is consensus when people get tired of doing so is inappropriate.
If you want this document to represent items on which there is a consensus, then it is inappropriate to make a PR to add items which you know full well do not have consensus.
May I suggest a rewrite of two paragraphs to reflect the lack of consensus. Instead of
we say the following, which leaves some of the decision open:
From my reading of the past discussion, we agree that
A PR is a proposal, to open a discussion, an attempt to build a consensus. So personally I think it is appropriate to make a PR like this. Because a PR is meant to be discussed, reviewed and edited, etc, to finally have an adapted version merged, or to have it rejected. The PR was meant to inquire about a possible consensus. I'm sorry if that wasn't clear from the top post. But it would also be nice not to assume bad intentions on my part.
Brock, to yourself it might be clear what your opinion is, but honestly, it is not fully clear to me. It's always tempting to think that others will understand what you really were thinking based on what you wrote (I do that constantly myself as well, and have to remind myself to question whether I actually made myself clear), but:
So reading again through #28095 and the mailing list discussion (I should have done this more carefully before posting this), my understanding of your position is:
(Note, this is my interpretation of someone's thoughts, so it can easily be wrong. If so, please also take the time to go through the issues and summarize correctly.)

Given the above summary, it was my hope that we could already find agreement on some common ground: consistent missing value semantics (i.e. propagating in comparisons, Kleene logic, etc, which, to my understanding up to now, you haven't been objecting to). Maybe the misunderstanding is that I see those open questions as "secondary" to the general principle, while you find that they are fundamental to solve before we can agree on the goal of consistent missing value behaviour?
On other PRs you often ask for the discussion to occur in an Issue instead.
This is mostly accurate. Small comments on individual bullet points:
Yes. "where np.nan has incorrect type information" is, to me, the most compelling use case for pd.NA.
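The type-information point can be illustrated with a small sketch (not from the thread itself): `np.nan` is always a Python float, even when it marks a missing string, while `pd.NA` carries no dtype of its own.

```python
import numpy as np
import pandas as pd

# np.nan is a float, regardless of what kind of data it sits in
assert isinstance(np.nan, float)

obj = pd.Series(["a", np.nan])                # object dtype
assert isinstance(obj[1], float)              # the "missing string" is a float!

txt = pd.Series(["a", None], dtype="string")  # nullable string dtype
assert txt[1] is pd.NA                        # dtype-less missing value singleton
```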
I have no strong opinion on this. I have a few weak opinions:
Correct. Other commenters in #28095 expressed a similar opinion regarding numeric dtypes.
Re-reading that thread, I still think this makes sense:
I think that corresponds to "1b", yes. The "if/when" conditional there matters, as I'm not 100% sure we should do this. I do see the upside to consistently propagating NAs, but it is also very convenient to have …
IMO, I think this is the strongest argument against adopting it.

I think that your remaining points can all be solved with some effort. I think the only (or primary) architectural issue is around whether …
@TomAugspurger AFAICT Joris's previous point was that this isn't suggesting removing pd.NaT in favor of pd.NA, but changing pd.NaT comparison logic to match pd.NA's. If this is correct, then I don't understand what your comment is getting at.
The text proposed in this PR indeed leaves that question open (the exact missing sentinel used for certain dtypes, e.g. pd.NaT vs pd.NA). But, if we choose to change the behaviour of pd.NaT to match pd.NA, then pd.NaT basically is a kind of "typed NA", which is what Tom is mentioning. But datetime/timedelta are not the only dtypes where type stability issues come up (although they are the most prominent there), so if we go down the path of typed NAs we probably want to do that for other dtypes as well.
Well, we don't have prior art for making a significant change to the roadmap, so this is a first ;)
@jbrockmendel please open issues for the items you mention (and which do not yet have an issue)
So I think that's the crux of the proposal: I think we should do this (work towards having consistent behaviour for missing values, such as in propagation or boolean operators), and propose to add the "principle" about this (not the technical detail on how to achieve it) to the roadmap. Note that …
Agreed that this is the crux. Could we get a round of comments (or +/-1s) on this specific point? Do we as a project think that having consistent missing data handling across all data types is a worthy goal? I'm +1.
At this level of abstraction, absolutely +1. @jorisvandenbossche what would you expect to get with …
+1 from me. |
Can I get some concrete feedback on the actual text then? It was meant from the start to be about the general principle of consistent missing data handling (and not the technical details on how to implement it), but given the discussion above, at least the perception of it was different. But I suppose the focus of the text in the diff is maybe too much on the scalar value (NA, NaN, NaT, ..) and not on the behaviour. I pushed a small update in an attempt to make it more general (using parts of the suggestions from @Dr-Irv above, thanks Irv!).
doc/source/development/roadmap.rst (outdated)

To this end, a new experimental ``pd.NA`` scalar to be used as missing value
indicator has already been added in pandas 1.0 (and used in the experimental
nullable dtypes). Further work is needed to integrate this with other data
To this end, a new experimental ``pd.NA`` scalar that can be used as missing
Small correction above. Change "that can be used as missing" to "that can be used as a missing"
Indeed, thanks
@jorisvandenbossche here's a suggested rewrite, if you want to take pieces from it:

Currently, pandas handles missing data differently for different data types. We
use different sentinel values to indicate that a value is missing (``np.nan`` for
floating-point data; ``np.nan`` or ``None`` for object-dtype data -- typically
strings or booleans -- with some missing values; ``pd.NaT`` for datetimelike
data; and ``pd.NA`` for the nullable integer, boolean, and string data types).
These different missing values have different behaviors in user-facing
operations. Notably, ``NaN`` and ``NaT`` have the sometimes surprising behavior
of always comparing false in comparison operations.
We would like to implement consistent missing data handling for all data types.
This includes consistent behavior in all operations (indexing, arithmetic
operations, comparisons, etc.). We want to eventually make the new semantics the
default.
This has been discussed at
`github #28095 <https://github.com/pandas-dev/pandas/issues/28095>`__ (and
linked issues), and described in more detail in this
`design doc <https://hackmd.io/@jorisvandenbossche/Sk0wMeAmB>`__.
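The "always comparing false" behaviour mentioned in the suggested text can be seen directly (a sketch of current behaviour):

```python
import numpy as np
import pandas as pd

# NaN and NaT never compare equal, not even to themselves
assert not (np.nan == np.nan)
assert not (pd.NaT == pd.NaT)

# pd.NA instead propagates: comparing with an unknown value
# yields an unknown result
assert (pd.NA == 1) is pd.NA
assert (pd.NA > 0) is pd.NA
```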
Thanks Tom! That's helpful, I used part of it, but kept a more explicit notion of the fact that the nullable data types introduced different semantics. Please take a look.
These different missing values have different behaviors in user-facing
operations. Specifically, we introduced different semantics for the nullable
data types for certain operations (e.g. propagating in comparison operations
instead of comparing as False).
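The difference in semantics described in this quoted text shows up at the array level as well; a sketch, assuming the current nullable Int64 behaviour:

```python
import numpy as np
import pandas as pd

# Default float dtype: the missing value compares as False
res = pd.Series([1.0, np.nan]) == 1.0
assert res.dtype == bool
assert not res[1]                      # NaN -> False

# Nullable Int64 dtype: the comparison result propagates the NA
res_na = pd.Series([1, None], dtype="Int64") == 1
assert res_na.dtype == "boolean"
assert res_na[1] is pd.NA              # NA -> NA
```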
Comparison operations are the only ones that come to mind. Are there other examples I'm missing, or is it just that in principle there could be others?
There is also Kleene logic in logical operations, the boolean behaviour of the scalar value, and the behaviour with missing values in boolean indexing.
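The Kleene-logic and scalar-boolean behaviours mentioned here can be illustrated with `pd.NA` directly (a sketch):

```python
import pandas as pd

# Kleene (three-valued) logic: the result is only NA when it
# genuinely depends on the unknown value
assert (True | pd.NA) is True      # True regardless of what NA is
assert (False | pd.NA) is pd.NA    # depends on the unknown value
assert (False & pd.NA) is False    # False regardless of what NA is
assert (True & pd.NA) is pd.NA     # depends on the unknown value

# and the scalar refuses to be interpreted as a plain bool
try:
    bool(pd.NA)
    raised = False
except TypeError:
    raised = True
assert raised
```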
@pandas-dev/pandas-core friendly ping
Thanks @jorisvandenbossche, can always update if needed.
@jreback I think it would have been good to first let the people with comments react again before merging. So @jbrockmendel @WillAyd (and others of course as well), I am still interested whether you are OK with the current text.
@jorisvandenbossche the commentary is already 10 days old.
Fine by me
The more detailed motivation is described in https://hackmd.io/@jorisvandenbossche/Sk0wMeAmB, and many aspects have already been discussed in #28095 and linked issues.
(and there are probably still other details / practical aspects that can be further discussed in dedicated issues)
Last year when I made the pd.NA proposal (which resulted in using that for the nullable integer, boolean and string dtypes), and which described it as "can be used consistently across all data types", the implicit / aspirational end goal for me was always to actually have this for all dtypes (and, at some point, as the default).

I tried to discuss this goal more explicitly on the mailing list earlier this year (in the thread about pandas 2.0: https://mail.python.org/pipermail/pandas-dev/2020-February/001180.html). But we never really "officially" adopted this as a goal / roadmap item, or discussed doing so.
Proposing to add a section about it to the roadmap is an attempt to do this (as that is actually the process our roadmap describes for adding items).
The aforementioned mailing list thread mostly resulted in a discussion about how to integrate the new semantics in the datetime-like dtypes (pd.NaT vs pd.NA, keep pd.NaT but change its behaviour, etc). This is still a technical discussion we need to resolve, but note that I kept the text on that in the PR somewhat vague on purpose for this reason: "..: all data types should support missing values and with the same behaviour"
And the general disclaimer that is also in our roadmap: An item being on the roadmap does not mean that it will necessarily happen, even with unlimited funding. During the implementation period we may discover issues preventing the adoption of the feature.
cc @pandas-dev/pandas-core @pandas-dev/pandas-triage