
ROADMAP: add consistent missing values for all dtypes to the roadmap #35208

Merged · jreback merged 3 commits into pandas-dev:master from jorisvandenbossche:roadmap-missing-data on Aug 20, 2020

Conversation

@jorisvandenbossche (Member) commented on Jul 10, 2020

The more detailed motivation is described in https://hackmd.io/@jorisvandenbossche/Sk0wMeAmB, and many aspects have already been discussed in #28095 and linked issues.
(and there are probably still other details / practical aspects that can be further discussed in dedicated issues)

Last year, when I made the pd.NA proposal (which resulted in using it for the nullable integer, boolean and string dtypes) and described it as "can be used consistently across all data types", the implicit / aspirational end goal for me was always to actually have this for all dtypes (and, at some point, as the default).
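As a quick illustration, the experimental nullable dtypes already share ``pd.NA`` as their missing value indicator (documented pandas >= 1.0 behaviour):

```python
import pandas as pd

# One missing value indicator across the nullable dtypes:
pd.array([1, None], dtype="Int64")       # [1, <NA>]
pd.array([True, None], dtype="boolean")  # [True, <NA>]
pd.array(["a", None], dtype="string")    # ["a", <NA>]
```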
I tried to discuss this goal more explicitly on the mailing list earlier this year (in the thread about pandas 2.0: https://mail.python.org/pipermail/pandas-dev/2020-February/001180.html). But we never really "officially" adopted this as a goal / roadmap item, or discussed doing so.
Proposing to add a section about it to the roadmap is an attempt to do this (as that is the process our roadmap itself describes).

The aforementioned mailing list thread mostly resulted in a discussion about how to integrate the new semantics in the datetime-like dtypes (pd.NaT vs pd.NA, keeping pd.NaT but changing its behaviour, etc). This is a technical discussion we still need to resolve, but note that I kept the text on that in the PR somewhat vague on purpose for this reason: "..: all data types should support missing values and with the same behaviour"

And the general disclaimer that is also in our roadmap: An item being on the roadmap does not mean that it will necessarily happen, even with unlimited funding. During the implementation period we may discover issues preventing the adoption of the feature.

cc @pandas-dev/pandas-core @pandas-dev/pandas-triage

@jorisvandenbossche added the "Roadmap" label (a proposal for the roadmap) on Jul 10, 2020
@WillAyd (Member) left a comment:

minor nit but not a huge blocker for me

> data or are cast to float.
>
> Long term, we want to introduce consistent missing value handling across the
> different data types: all data types should support missing values and with the
@WillAyd commented on this diff:

Not sure I agree that all data types should support missing values. Non-nullable types could be beneficial

@TomAugspurger (Contributor) replied on Jul 13, 2020:

I agree that there's value in pandas ensuring that a column cannot contain NAs. I'm not sure where best to put that invariant: the dtype or the array / column.

But, to sidestep this issue, perhaps something like "pandas should provide consistent missing value handling for all the different kinds of data", i.e. we give you the ability, without saying that every dtype has to be nullable.

@jorisvandenbossche (Member, Author) replied:

I seem to remember that you, Tom, and I had a similar discussion in another issue (but I can't directly find it).

I agree that the concept of non-nullability is useful/interesting, but as Tom also mentions: non-nullability shouldn't necessarily be a property of the data type itself. Similar to the "allow_duplicate_labels" flag Tom is working on, we could have "nullable=True/False" flags per column.
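A hypothetical sketch of what such a flag could look like (the ``nullable`` flag below does not exist in pandas; it only mirrors the shape of the ``allow_duplicate_labels`` idea):

```python
import pandas as pd

s = pd.Series([1, 2, 3], dtype="Int64")

# Hypothetical API, shown for illustration only:
# s = s.set_flags(nullable=False)   # analogous to allow_duplicate_labels
# s.iloc[0] = pd.NA                 # would then raise instead of storing NA
```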

Because for the dtype itself to be non-nullable, we should ask ourselves: do we have an example of a data type (that we want to include in pandas) for which it would never be useful to be nullable?

Or to phrase the text in a different way: if a data type supports missing values, it should follow consistent semantics.


Note that right now, pandas doesn't really have a concept of non-nullable. Yes, we have some dtypes that don't support NAs (like the integer dtype), but whenever some operation introduces a missing value, we simply upcast to a dtype that can store it (in practice, integer becomes float). A proper nullable=False flag wouldn't necessarily work like this, I think.
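For example, the silent upcast looks like this (standard pandas behaviour today):

```python
import pandas as pd

s = pd.Series([1, 2, 3])      # dtype: int64
s2 = s.reindex([0, 1, 2, 3])  # the new label has no value
print(s2.dtype)               # float64 -- silently upcast so NaN can be stored
```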

@jbrockmendel (Member) commented:

> minor nit but not a huge blocker for me

Yah, I considered commenting earlier about the issues raised in earlier threads, but then noticed that the all-1D thing is already in the roadmap, so the degree to which the roadmap describes consensus is limited.

@jorisvandenbossche (Member, Author) commented:

> so the degree to which the roadmap describes consensus is limited.

@jbrockmendel honestly, that is exactly what I don't want from this PR.
IMO it would be beneficial for pandas to be able to say: as a project, we collectively think that this proposal is what we would like to see happen (with the typical disclaimers of: given the resources to do it, given that the technical details work out / are resolved, etc), and thus actively welcome contributions in that direction.

As I mentioned in the top post: yes, there are still technical discussions to have (how to reconcile pd.NaT / pd.NA for datetimelike dtypes, should pd.NA be typed or not, ..?). But for me those are technical discussions subordinate to the general idea (which is not to say they are unimportant, to be clear, but they follow from the general goal).

So if you have fundamental problems with the proposal, please raise them.

@jbrockmendel (Member) commented:

> So if you have fundamental problems with the proposal, please raise them.

I have, in #28095, and AFAICT nothing has changed since then. Asking people to re-state the issues in multiple issues/PRs/mailing list threads and pretending there is consensus when people get tired of doing so is inappropriate.

> honestly, that is exactly what I don't want from this PR.

If you want this document to represent items on which there is a consensus, then it is inappropriate to make a PR to add items which you know full-well do not have consensus.

@Dr-Irv (Contributor) commented on Jul 14, 2020

May I suggest a rewrite of two paragraphs to reflect the lack of consensus? Instead of

> Long term, we want to introduce consistent missing value handling across the different data types: all data types should support missing values and with the same behaviour.
>
> To this end, a new experimental pd.NA scalar to be used as missing value indicator has already been added in pandas 1.0 (and used in the experimental nullable dtypes). Further work is needed to integrate this with other data types, and to provide a path forward to make this the default in a future version of pandas.

we say the following, which leaves some of the decision open:

> Long term, we want to introduce consistent missing value handling across the different data types: all data types should support missing values and with similar behaviours, dependent on the data type.
>
> To this end, a new experimental pd.NA scalar that can be used as a missing value indicator has already been added in pandas 1.0 (and used in the experimental nullable dtypes). Further work and research is needed to determine whether and how to integrate this with other data types, and to determine a path forward for whether this should be the default for all dtypes in a future version of pandas.

From my reading of the past discussion, we agree that pd.NA is a good thing for the Float, Integer, and String dtypes (the nullable dtypes). It is unclear whether pd.NA should be used for datetime-related dtypes. I hope that the rewrite above makes it clear that a decision hasn't been made.

@jorisvandenbossche (Member, Author) commented:

> If you want this document to represent items on which there is a consensus, then it is inappropriate to make a PR to add items which you know full-well do not have consensus.

A PR is a proposal, to open a discussion, an attempt to build a consensus. So personally I think it is appropriate to make a PR like this, because a PR is meant to be discussed, reviewed, and edited, so that an adapted version is finally merged, or the PR is rejected. The PR was meant to inquire about a possible consensus.

I'm sorry if that wasn't clear from the top post. But it would also be nice not to assume such bad intentions on my part.

> > So if you have fundamental problems with the proposal, please raise them.
>
> I have, in #28095, and AFAICT nothing has changed since then. Asking people to re-state the issues in multiple issues/PRs/mailing list threads and pretending there is consensus when people get tired of doing so is inappropriate.

Brock, to yourself it might be clear what your opinion is, but honestly, it is not fully clear to me. It's always tempting to think that others will understand what you were really thinking based on what you wrote (I do that constantly myself as well, and have to remind myself to question whether I actually made myself clear), but:

  • We have had long discussions with a lot of arguments. It is easy to lose track of certain elements you have said in such long discussions. I think there is no harm in trying, from time to time, to summarize (and thus repeat) things.
  • Arguments don't always map directly to opinions, and it can be tricky to correctly infer someone's opinion ("should we do it or not?") from technical arguments. For example, you explicitly stated that you agree with some elements and that you don't like some other elements, but does that mean you disagree with the overall idea or not? (one does not need to like every detail to be OK with a full proposal)

So reading again through #28095 and the mailing list discussion (I should have done this more carefully before posting this), my understanding of your position is:

  • You supported introducing pd.NA for cases where np.nan has incorrect type information, i.e. the nullable int, string and bool data types (not really sure about float)
  • I have the feeling you agree on the new semantics for missing data that we introduced for the nullable dtypes (propagating missing values in comparisons, Kleene for logical ops)
  • You have strong concerns about type stability in arithmetic operations (e.g. not being able to tell the result type from <typed scalar> + pd.NA). This is especially the case for datetime-like dtypes, and therefore you don't like using pd.NA for those dtypes (as the existing pd.NaT is more informative).
    (I know you think this "horse has already been beaten to death" and discussed ad nauseam, but I think we will need to pick it up again at some point, because we need to make a decision on this for the current pd.NA as well)
  • In the pandas 2.0 mailing list thread, there was discussion about changing the behavior of pd.NaT to follow the new missing semantics (bullet point 2), as a way to follow consistent behaviour in datetime-like dtypes while keeping pd.NaT.
    Tom summarized some options regarding this at the bottom of this post; you didn't respond explicitly about this, but I assume you would be in favor of 1b (change behaviour of pd.NaT).

(note, this is my interpretation of someone's thoughts, so can easily be wrong. If so, please also take the time to go through the issues and summarize correctly)

Given the above summary, it was my hope that we could already find agreement on some common ground: consistent missing value semantics (i.e. propagating in comparisons, Kleene logic, etc, which, to my understanding up to now, you haven't been objecting to).
And knowing very well your concerns, I explicitly didn't say that this needs to be pd.NA (the text says: "data types should support missing values and with the same behaviour"), and I explicitly acknowledged in the top post the open questions about using pd.NA vs changing behaviour of pd.NaT vs .. to achieve this.

Maybe the misunderstanding is that I see those open questions "secondary" to the general principle, while you find that those are fundamental to be solved before we can agree on the goal of consistent missing value behaviour?

@jbrockmendel (Member) commented:

> A PR is a proposal, to open a discussion, an attempt to build a consensus.

On other PRs you often ask for the discussion to occur in an Issue instead.


> So reading again through #28095 and the mailing list discussion (I should have done this more carefully before posting this), my understanding of your position is: [...]

This is mostly accurate. Small comments on individual bullet points:

>   • You supported introducing pd.NA for cases where np.nan has incorrect type information, i.e. the nullable int, string and bool data types (not really sure about float)

Yes. "where np.nan has incorrect type information" is, to me, the most compelling use case for pd.NA.

>   • I have the feeling you agree on the new semantics for missing data that we introduced for the nullable dtypes (propagating missing values in comparisons, Kleene for logical ops)

I have no strong opinion on this. I have a few weak opinions:

  • I'd like to see pd.NA firmed up before being more ambitious in its usage, e.g. API/BUG: pd.NA == pd.NaT returns False #34104
  • IIRC I get a bunch of test failures if I put pd.NA as the name for a Series or Index created in pd._testing. I'd like to get that firmed up [note to self: open an issue about this]
  • ATM we check for pd.NA in tslib.array_to_datetime but not in any of the constructors/conversion functions in tslibs. I'd like a) for it to be recognized symmetrically everywhere and b) to minimize messing with the _libs dependency structure.

>   • You have strong concerns about type stability in arithmetic operations (e.g. not being able to tell the result type from <typed scalar> + pd.NA). This is especially the case for datetime-like dtypes, and therefore you don't like using pd.NA for those dtypes (as the existing pd.NaT is more informative).

Correct. Other commenters in #28095 expressed a similar opinion regarding numeric dtypes.
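As a concrete illustration of the type-stability point (using only current, documented pd.NA / pd.NaT behaviour):

```python
import pandas as pd

# pd.NA is untyped: nothing in the result says which dtype is missing.
pd.NA + 1       # <NA>  -- a missing int? float? something else?
pd.NA * 2.5     # <NA>

# pd.NaT is a typed sentinel: the result is still clearly datetimelike.
pd.NaT + pd.Timedelta("1 day")                      # NaT
pd.Timestamp("2020-01-01") + pd.Timedelta("1 day")  # Timestamp('2020-01-02')
```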

>   • In the pandas 2.0 mailing list thread, there was discussion about changing the behavior of pd.NaT to follow the new missing semantics (bullet point 2), as a way to follow consistent behaviour in datetime-like dtypes while keeping pd.NaT.
>     Tom summarized some options regarding this at the bottom of this post; you didn't respond explicitly about this, but I assume you would be in favor of 1b (change behaviour of pd.NaT).

Re-reading that thread, I still think this makes sense:

> > You have no problem with changing the behavior of NaT, or changing to
> use pd.NA?
>
> If/when we get to a point where we propagate NAs in all other comparisons,
> I would have no problem with editing `NaT.__richcmp__` to match that
> convention.

I think that corresponds to "1b", yes.

The "if/when" conditional there matters, as I'm not 100% sure we should do this. I do see the upside to consistently propagating NAs, but it is also very convenient to have (ser1 == ser2).values match ser1.values == ser2.values, which this will break.

@TomAugspurger (Contributor) commented:

> > You have strong concerns about type stability in arithmetic operations (e.g. not being able to tell the result type from <typed scalar> + pd.NA). This is especially the case for datetime-like dtypes, and therefore you don't like using pd.NA for those dtypes (as the existing pd.NaT is more informative).
>
> Correct. Other commenters in #28095 expressed a similar opinion regarding numeric dtypes.

IMO, this is the strongest argument against adopting pd.NA. But I think it's relatively straightforward to solve with a typed NA, and I don't think it will be much more difficult to implement.

I think that your remaining points can all be solved with some effort. I think the only (or primary) architectural issue is around whether NA can be typed or not.

@jbrockmendel (Member) commented:

@TomAugspurger AFAICT Joris's previous point was that this isn't suggesting removing pd.NaT in favor of pd.NA, but changing pd.NaT comparison logic to match pd.NA's. If this is correct, then I don't understand what your comment is getting at.

@jorisvandenbossche (Member, Author) commented:

The text proposed in this PR indeed leaves that question open (the exact missing sentinel used for certain dtypes, e.g. pd.NaT vs pd.NA). But if we choose to change the behaviour of pd.NaT to match pd.NA, then pd.NaT basically becomes a kind of "typed NA", which is what Tom is mentioning.

But datetime/timedelta are not the only dtypes where type stability issues come up (although they are the most prominent there), so if we go down the path of typed NAs we probably want to do that for other dtypes as well.

@jorisvandenbossche (Member, Author) commented:

> On other PRs you often ask for the discussion to occur in an Issue instead.

Well, we don't have prior art for making a significant change to the roadmap, so this is a first ;)

> I have no strong opinion on this. I have a few weak opinions:

@jbrockmendel please open issues for the items you mention (and which do not yet have an issue)

The "if/when" conditional there matters, as I'm not 100% sure we should do this. I do see the upside to consistently propagating NAs, but it is also very convenient to have (ser1 == ser2).values match ser1.values == ser2.values, which this will break.

So I think that's the crux of the proposal: I think we should do this (work towards having consistent behaviour for missing values, like in propagation or boolean operators), and propose to add the "principle" about this (not the technical detail on how to achieve it) to the roadmap.

Note that (ser1 == ser2).values and ser1.values == ser2.values will actually match for the new dtypes, where .values is a pandas array and not a numpy array. But of course, you can still give the same example with to_numpy(), as the essence is that we lose consistency with the operation on the converted numpy arrays, which is certainly a considerable change (but in the end comes down to: numpy has no "proper" support for missing values).
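A small sketch of that divergence with the existing nullable dtypes (the ``na_value`` argument to ``to_numpy()`` is standard pandas API):

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1, None], dtype="Int64")
s2 = pd.Series([1, None], dtype="Int64")

s1 == s2
# 0    True
# 1    <NA>    <- the missing value propagates under the new semantics
# dtype: boolean

a1 = s1.to_numpy(dtype="float64", na_value=np.nan)
a2 = s2.to_numpy(dtype="float64", na_value=np.nan)
a1 == a2
# array([ True, False])    <- NaN never compares equal in NumPy
```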

@TomAugspurger (Contributor) commented:

> So I think that's the crux of the proposal: I think we should do this (work towards having consistent behaviour for missing values, like in propagation or boolean operators), and propose to add the "principle" about this (not the technical detail on how to achieve it) to the roadmap.

Agreed that this is the crux. Could we get a round of comments (or +/-1s) on this specific point? Do we as a project think that having consistent missing data handling across all data types is a worthy goal?

I'm +1.

@jbrockmendel (Member) commented:

> Do we as a project think that having consistent missing data handling across all data types is a worthy goal?

At this level of abstraction, absolutely +1.

@jorisvandenbossche what would you expect to get with pd.NaT == df.values? I'm really hoping we have a better solution than object-dtype ndarray.

@jreback added this to the 1.2 milestone on Aug 5, 2020
@jreback (Contributor) commented on Aug 5, 2020

+1 from me.

@jorisvandenbossche (Member, Author) commented:

Can I get some concrete feedback on the actual text then?

It was meant from the start to be about the general principle of consistent missing data handling (and not the technical details of how to implement it), but given the discussion above, at least the perception of it was different. I suppose the focus of the text in the diff was maybe too much on the scalar value (NA, NaN, NaT, ..) and not on the behaviour. I pushed a small update in an attempt to make it more general (using parts of the suggestions from @Dr-Irv above, thanks Irv!).

The diff under discussion:

- To this end, a new experimental ``pd.NA`` scalar to be used as missing value
- indicator has already been added in pandas 1.0 (and used in the experimental
- nullable dtypes). Further work is needed to integrate this with other data
+ To this end, a new experimental ``pd.NA`` scalar that can be used as missing
@Dr-Irv (Contributor) commented on this diff:

Small correction above. Change "that can be used as missing" to "that can be used as a missing"

@jorisvandenbossche (Member, Author) replied:

Indeed, thanks

@TomAugspurger (Contributor) commented:

@jorisvandenbossche here's a suggested rewrite, if you want to take pieces from it:

Currently, pandas handles missing data differently for different data types. We
use different types to indicate that a value is missing (``np.nan`` for
floating-point data; ``np.nan`` or ``None`` for object-dtype data -- typically
strings or booleans -- with some missing values; ``pd.NaT`` for datetimelike
data; and ``pd.NA`` for the nullable integer, boolean, and string data types).

These different missing values have different behaviors in user-facing
operations. Notably, ``NaN`` and ``NaT`` have the sometimes surprising behavior
of always comparing false in comparison operations.

We would like to implement consistent missing data handling for all data types.
This includes consistent behavior in all operations (indexing, arithmetic
operations, comparisons, etc.). We want to eventually make the new semantics the
default.

This has been discussed at
`github #28095 <https://github.com/pandas-dev/pandas/issues/28095>`__ (and
linked issues), and described in more detail in this
`design doc <https://hackmd.io/@jorisvandenbossche/Sk0wMeAmB>`__. 
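For reference, the "always comparing false" behaviour mentioned above looks like this (current, documented behaviour):

```python
import numpy as np
import pandas as pd

np.nan == np.nan   # False -- IEEE 754: NaN never equals anything
pd.NaT == pd.NaT   # False -- NaT follows the same convention
pd.NA == pd.NA     # <NA>  -- pd.NA instead propagates the "unknown"
```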

@jorisvandenbossche (Member, Author) commented:

Thanks Tom! That's helpful, I used part of it, but kept a more explicit notion of the fact that the nullable data types introduced different semantics. Please take a look.

> These different missing values have different behaviors in user-facing
> operations. Specifically, we introduced different semantics for the nullable
> data types for certain operations (e.g. propagating in comparison operations
> instead of comparing as False).
@jbrockmendel (Member) commented on this diff:

Comparison operations are the only ones that come to mind. Are there other examples I'm missing, or is it just that in principle there could be others?

@jorisvandenbossche (Member, Author) replied:

There is also Kleene logic in logical operations, the boolean behaviour of the scalar value, and the behaviour of missing values in boolean indexing.
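A short illustration of those behaviours (documented semantics of ``pd.NA`` and the nullable boolean dtype):

```python
import pandas as pd

# Kleene (three-valued) logic: the result is only NA when it
# actually depends on the missing value.
True | pd.NA    # True  -- True OR anything is True
False | pd.NA   # <NA>  -- depends on the unknown operand
True & pd.NA    # <NA>
False & pd.NA   # False -- False AND anything is False

# The scalar is deliberately ambiguous in a boolean context:
bool(pd.NA)     # raises TypeError
```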

@jorisvandenbossche (Member, Author) commented:

@pandas-dev/pandas-core friendly ping

@jreback merged commit 2778c3e into pandas-dev:master on Aug 20, 2020
@jreback (Contributor) commented on Aug 20, 2020:
thanks @jorisvandenbossche, can always update if needed

@jorisvandenbossche deleted the roadmap-missing-data branch on August 20, 2020 at 14:54
@jorisvandenbossche (Member, Author) commented:

@jreback I think it would have been good to first let the people with comments react again before merging.
(we should maybe discuss what merge policy we should use for roadmap changes)

So @jbrockmendel @WillAyd (and others of course as well), I am still interested in whether you are OK with the current text.

@jreback (Contributor) commented on Aug 20, 2020:

@jorisvandenbossche the commentary is already 10 days old.

@jbrockmendel (Member) commented:

Fine by me
