
ROADMAP: add consistent missing values for all dtypes to the roadmap #35208

Merged · jreback merged 3 commits into pandas-dev:master from jorisvandenbossche:roadmap-missing-data on Aug 20, 2020

Conversation

@jorisvandenbossche (Member) commented on Jul 10, 2020

The more detailed motivation is described in https://hackmd.io/@jorisvandenbossche/Sk0wMeAmB, and many aspects have already been discussed in #28095 and linked issues.
(and there are probably still other details / practical aspects that can be further discussed in dedicated issues)

Last year, when I made the pd.NA proposal (which resulted in using it for the nullable integer, boolean and string dtypes) and described it as "can be used consistently across all data types", the implicit / aspirational end goal for me was always to actually have this for all dtypes (and, at some point, as the default).
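As a quick illustration, the experimental nullable dtypes already share ``pd.NA`` as their missing value indicator (documented pandas >= 1.0 behaviour):

```python
import pandas as pd

# One missing value indicator across the nullable dtypes:
pd.array([1, None], dtype="Int64")       # [1, <NA>]
pd.array([True, None], dtype="boolean")  # [True, <NA>]
pd.array(["a", None], dtype="string")    # ["a", <NA>]
```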
I tried to discuss this goal more explicitly on the mailing list earlier this year (in the thread about pandas 2.0: https://mail.python.org/pipermail/pandas-dev/2020-February/001180.html). But we never really "officially" adopted this as a goal / roadmap item, or discussed doing so.
Proposing to add a section about it to the roadmap is an attempt to do this (as that is the process our roadmap itself describes).

The aforementioned mailing list thread mostly resulted in a discussion about how to integrate the new semantics in the datetime-like dtypes (pd.NaT vs pd.NA, keeping pd.NaT but changing its behaviour, etc). This is a technical discussion we still need to resolve, but note that I kept the text on that in the PR somewhat vague on purpose for this reason: "..: all data types should support missing values and with the same behaviour"

And the general disclaimer that is also in our roadmap: An item being on the roadmap does not mean that it will necessarily happen, even with unlimited funding. During the implementation period we may discover issues preventing the adoption of the feature.

cc @pandas-dev/pandas-core @pandas-dev/pandas-triage

@jorisvandenbossche added the "Roadmap" label (a proposal for the roadmap) on Jul 10, 2020
@WillAyd (Member) left a comment:

minor nit but not a huge blocker for me

> data or are cast to float.
>
> Long term, we want to introduce consistent missing value handling across the
> different data types: all data types should support missing values and with the
@WillAyd commented on this diff:

Not sure I agree that all data types should support missing values. Non-nullable types could be beneficial

@TomAugspurger (Contributor) replied on Jul 13, 2020:

I agree that there's value in pandas ensuring that a column cannot contain NAs. I'm not sure where best to put that invariant: the dtype or the array / column.

But, to sidestep this issue, perhaps something like "pandas should provide consistent missing value handling for all the different kinds of data", i.e. we give you the ability, without saying that every dtype has to be nullable.

@jorisvandenbossche (Member, Author) replied:

I seem to remember that you, Tom, and I had a similar discussion in another issue (but I can't directly find it).

I agree that the concept of non-nullability is useful/interesting, but as Tom also mentions: non-nullability shouldn't necessarily be a property of the data type itself. Similar to the "allow_duplicate_labels" flag Tom is working on, we could have "nullable=True/False" flags per column.
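A hypothetical sketch of what such a flag could look like (the ``nullable`` flag below does not exist in pandas; it only mirrors the shape of the ``allow_duplicate_labels`` idea):

```python
import pandas as pd

s = pd.Series([1, 2, 3], dtype="Int64")

# Hypothetical API, shown for illustration only:
# s = s.set_flags(nullable=False)   # analogous to allow_duplicate_labels
# s.iloc[0] = pd.NA                 # would then raise instead of storing NA
```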

Because for the dtype itself to be non-nullable, we should ask ourselves: do we have an example of a data type (that we want to include in pandas) for which it would never be useful to be nullable?

Or to phrase the text in a different way: if a data type supports missing values, it should follow consistent semantics.


Note that right now, pandas doesn't really have a concept of non-nullable. Yes, we have some dtypes that don't support NAs (like the integer dtype), but whenever some operation introduces a missing value, we simply upcast to a dtype that can store it (in practice, integer becomes float). A proper nullable=False flag wouldn't necessarily work like this, I think.
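For example, the silent upcast looks like this (standard pandas behaviour today):

```python
import pandas as pd

s = pd.Series([1, 2, 3])      # dtype: int64
s2 = s.reindex([0, 1, 2, 3])  # the new label has no value
print(s2.dtype)               # float64 -- silently upcast so NaN can be stored
```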

@jbrockmendel (Member) commented:

> minor nit but not a huge blocker for me

Yah, I considered commenting earlier about the issues raised in earlier threads, but then noticed that the all-1D thing is already in the roadmap, so the degree to which the roadmap describes consensus is limited.

@jorisvandenbossche (Member, Author) commented:

> so the degree to which the roadmap describes consensus is limited.

@jbrockmendel honestly, that is exactly what I don't want from this PR.
IMO it would be beneficial for pandas to be able to say: as a project, we collectively think that this proposal is what we would like to see happen (with the typical disclaimers of: given the resources to do it, given that the technical details work out / are resolved, etc), and thus actively welcome contributions in that direction.

As I mentioned in the top post: yes, there are still technical discussions to have (how to reconcile pd.NaT / pd.NA for datetimelike dtypes, should pd.NA be typed or not, ..?). But for me those are technical discussions subordinate to the general idea (which is not to say they are unimportant, to be clear, but they follow from the general goal).

So if you have fundamental problems with the proposal, please raise them.

@jbrockmendel (Member) commented:

> So if you have fundamental problems with the proposal, please raise them.

I have, in #28095, and AFAICT nothing has changed since then. Asking people to re-state the issues in multiple issues/PRs/mailing list threads and pretending there is consensus when people get tired of doing so is inappropriate.

> honestly, that is exactly what I don't want from this PR.

If you want this document to represent items on which there is a consensus, then it is inappropriate to make a PR to add items which you know full-well do not have consensus.

@Dr-Irv (Contributor) commented on Jul 14, 2020

May I suggest a rewrite of two paragraphs to reflect the lack of consensus? Instead of

> Long term, we want to introduce consistent missing value handling across the different data types: all data types should support missing values and with the same behaviour.
>
> To this end, a new experimental pd.NA scalar to be used as missing value indicator has already been added in pandas 1.0 (and used in the experimental nullable dtypes). Further work is needed to integrate this with other data types, and to provide a path forward to make this the default in a future version of pandas.

we say the following, which leaves some of the decision open:

> Long term, we want to introduce consistent missing value handling across the different data types: all data types should support missing values and with similar behaviours, dependent on the data type.
>
> To this end, a new experimental pd.NA scalar that can be used as a missing value indicator has already been added in pandas 1.0 (and used in the experimental nullable dtypes). Further work and research is needed to determine whether and how to integrate this with other data types, and to determine a path forward for whether this should be the default for all dtypes in a future version of pandas.

From my reading of the past discussion, we agree that pd.NA is a good thing for the Float, Integer, and String dtypes (the nullable dtypes). It is unclear whether pd.NA should be used for datetime-related dtypes. I hope that the rewrite above makes it clear that a decision hasn't been made.

@jorisvandenbossche (Member, Author) commented:

> If you want this document to represent items on which there is a consensus, then it is inappropriate to make a PR to add items which you know full-well do not have consensus.

A PR is a proposal, to open a discussion, an attempt to build a consensus. So personally I think it is appropriate to make a PR like this, because a PR is meant to be discussed, reviewed, and edited, so that an adapted version is finally merged, or the PR is rejected. The PR was meant to inquire about a possible consensus.

I'm sorry if that wasn't clear from the top post. But it would also be nice not to assume such bad intentions on my part.

> > So if you have fundamental problems with the proposal, please raise them.
>
> I have, in #28095, and AFAICT nothing has changed since then. Asking people to re-state the issues in multiple issues/PRs/mailing list threads and pretending there is consensus when people get tired of doing so is inappropriate.

Brock, to yourself it might be clear what your opinion is, but honestly, it is not fully clear to me. It's always tempting to think that others will understand what you were really thinking based on what you wrote (I do that constantly myself as well, and have to remind myself to question whether I actually made myself clear), but:

  • We have had long discussions with a lot of arguments. It is easy to lose track of certain elements you have said in such long discussions. I think there is no harm in trying, from time to time, to summarize (and thus repeat) things.
  • Arguments don't always map directly to opinions, and it can be tricky to correctly infer someone's opinion ("should we do it or not?") from technical arguments. For example, you explicitly stated that you agree with some elements and that you don't like some other elements, but does that mean you disagree with the overall idea or not? (one does not need to like every detail to be OK with a full proposal)

So reading again through #28095 and the mailing list discussion (I should have done this more carefully before posting this), my understanding of your position is:

  • You supported introducing pd.NA for cases where np.nan has incorrect type information, i.e. the nullable int, string and bool data types (not really sure about float)
  • I have the feeling you agree on the new semantics for missing data that we introduced for the nullable dtypes (propagating missing values in comparisons, Kleene for logical ops)
  • You have strong concerns about type stability in arithmetic operations (e.g. not being able to tell the result type from <typed scalar> + pd.NA). This is especially the case for datetime-like dtypes, and therefore you don't like using pd.NA for those dtypes (as the existing pd.NaT is more informative).
    (I know you think this "horse has already been beaten to death" and discussed ad nauseam, but I think we will need to pick it up again at some point, because we need to make a decision on this for the current pd.NA as well)
  • In the pandas 2.0 mailing list thread, there was discussion about changing the behavior of pd.NaT to follow the new missing semantics (bullet point 2), as a way to follow consistent behaviour in datetime-like dtypes while keeping pd.NaT.
    Tom summarized some options regarding this at the bottom of this post; you didn't respond explicitly about this, but I assume you would be in favor of 1b (change behaviour of pd.NaT).

(note, this is my interpretation of someone's thoughts, so can easily be wrong. If so, please also take the time to go through the issues and summarize correctly)

Given the above summary, it was my hope that we could already find agreement on some common ground: consistent missing value semantics (i.e. propagating in comparisons, Kleene logic, etc, which, to my understanding up to now, you haven't been objecting to).
And knowing very well your concerns, I explicitly didn't say that this needs to be pd.NA (the text says: "data types should support missing values and with the same behaviour"), and I explicitly acknowledged in the top post the open questions about using pd.NA vs changing behaviour of pd.NaT vs .. to achieve this.

Maybe the misunderstanding is that I see those open questions "secondary" to the general principle, while you find that those are fundamental to be solved before we can agree on the goal of consistent missing value behaviour?

@jbrockmendel (Member) commented:

> A PR is a proposal, to open a discussion, an attempt to build a consensus.

On other PRs you often ask for the discussion to occur in an Issue instead.


> So reading again through #28095 and the mailing list discussion (I should have done this more carefully before posting this), my understanding of your position is: [...]

This is mostly accurate. Small comments on individual bullet points:

>   • You supported introducing pd.NA for cases where np.nan has incorrect type information, i.e. the nullable int, string and bool data types (not really sure about float)

Yes. "where np.nan has incorrect type information" is, to me, the most compelling use case for pd.NA.

>   • I have the feeling you agree on the new semantics for missing data that we introduced for the nullable dtypes (propagating missing values in comparisons, Kleene for logical ops)

I have no strong opinion on this. I have a few weak opinions:

  • I'd like to see pd.NA firmed up before being more ambitious in its usage, e.g. API/BUG: pd.NA == pd.NaT returns False #34104
  • IIRC I get a bunch of test failures if I put pd.NA as the name for a Series or Index created in pd._testing. I'd like to get that firmed up [note to self: open an issue about this]
  • ATM we check for pd.NA in tslib.array_to_datetime but not in any of the constructors/conversion functions in tslibs. I'd like a) for it to be recognized symmetrically everywhere and b) to minimize messing with the _libs dependency structure.

>   • You have strong concerns about type stability in arithmetic operations (e.g. not being able to tell the result type from <typed scalar> + pd.NA). This is especially the case for datetime-like dtypes, and therefore you don't like using pd.NA for those dtypes (as the existing pd.NaT is more informative).

Correct. Other commenters in #28095 expressed a similar opinion regarding numeric dtypes.
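As a concrete illustration of the type-stability point (using only current, documented pd.NA / pd.NaT behaviour):

```python
import pandas as pd

# pd.NA is untyped: nothing in the result says which dtype is missing.
pd.NA + 1       # <NA>  -- a missing int? float? something else?
pd.NA * 2.5     # <NA>

# pd.NaT is a typed sentinel: the result is still clearly datetimelike.
pd.NaT + pd.Timedelta("1 day")                      # NaT
pd.Timestamp("2020-01-01") + pd.Timedelta("1 day")  # Timestamp('2020-01-02')
```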

>   • In the pandas 2.0 mailing list thread, there was discussion about changing the behavior of pd.NaT to follow the new missing semantics (bullet point 2), as a way to follow consistent behaviour in datetime-like dtypes while keeping pd.NaT.
>     Tom summarized some options regarding this at the bottom of this post; you didn't respond explicitly about this, but I assume you would be in favor of 1b (change behaviour of pd.NaT).

Re-reading that thread, I still think this makes sense:

> > You have no problem with changing the behavior of NaT, or changing to
> use pd.NA?
>
> If/when we get to a point where we propagate NAs in all other comparisons,
> I would have no problem with editing `NaT.__richcmp__` to match that
> convention.

I think that corresponds to "1b", yes.

The "if/when" conditional there matters, as I'm not 100% sure we should do this. I do see the upside to consistently propagating NAs, but it is also very convenient to have (ser1 == ser2).values match ser1.values == ser2.values, which this will break.

@TomAugspurger (Contributor) commented:

> > You have strong concerns about type stability in arithmetic operations (e.g. not being able to tell the result type from <typed scalar> + pd.NA). This is especially the case for datetime-like dtypes, and therefore you don't like using pd.NA for those dtypes (as the existing pd.NaT is more informative).
>
> Correct. Other commenters in #28095 expressed a similar opinion regarding numeric dtypes.

IMO, this is the strongest argument against adopting pd.NA. But I think it's relatively straightforward to solve with a typed NA, and I don't think it will be much more difficult to implement.

I think that your remaining points can all be solved with some effort. I think the only (or primary) architectural issue is around whether NA can be typed or not.

@jbrockmendel (Member) commented:

@TomAugspurger AFAICT Joris's previous point was that this isn't suggesting removing pd.NaT in favor of pd.NA, but changing pd.NaT comparison logic to match pd.NA's. If this is correct, then I don't understand what your comment is getting at.

@jorisvandenbossche (Member, Author) commented:

The text proposed in this PR indeed leaves that question open (the exact missing sentinel used for certain dtypes, e.g. pd.NaT vs pd.NA). But if we choose to change the behaviour of pd.NaT to match pd.NA, then pd.NaT basically becomes a kind of "typed NA", which is what Tom is mentioning.

But datetime/timedelta are not the only dtypes where type stability issues come up (although they are the most prominent there), so if we go down the path of typed NAs we probably want to do that for other dtypes as well.

@jorisvandenbossche (Member, Author) commented:

> On other PRs you often ask for the discussion to occur in an Issue instead.

Well, we don't have prior art for making a significant change to the roadmap, so this is a first ;)

> I have no strong opinion on this. I have a few weak opinions:

@jbrockmendel please open issues for the items you mention (and which do not yet have an issue)

The "if/when" conditional there matters, as I'm not 100% sure we should do this. I do see the upside to consistently propagating NAs, but it is also very convenient to have (ser1 == ser2).values match ser1.values == ser2.values, which this will break.

So I think that's the crux of the proposal: I think we should do this (work towards having consistent behaviour for missing values, like in propagation or boolean operators), and propose to add the "principle" about this (not the technical detail on how to achieve it) to the roadmap.

Note that (ser1 == ser2).values and ser1.values == ser2.values will actually match for the new dtypes, where .values is a pandas array and not a numpy array. But of course, you can still give the same example with to_numpy(), as the essence is that we lose consistency with the operation on the converted numpy arrays, which is certainly a considerable change (but in the end comes down to: numpy has no "proper" support for missing values).
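A small sketch of that divergence with the existing nullable dtypes (the ``na_value`` argument to ``to_numpy()`` is standard pandas API):

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1, None], dtype="Int64")
s2 = pd.Series([1, None], dtype="Int64")

s1 == s2
# 0    True
# 1    <NA>    <- the missing value propagates under the new semantics
# dtype: boolean

a1 = s1.to_numpy(dtype="float64", na_value=np.nan)
a2 = s2.to_numpy(dtype="float64", na_value=np.nan)
a1 == a2
# array([ True, False])    <- NaN never compares equal in NumPy
```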

@TomAugspurger (Contributor) commented:

> So I think that's the crux of the proposal: I think we should do this (work towards having consistent behaviour for missing values, like in propagation or boolean operators), and propose to add the "principle" about this (not the technical detail on how to achieve it) to the roadmap.

Agreed that this is the crux. Could we get a round of comments (or +/-1s) on this specific point? Do we as a project think that having consistent missing data handling across all data types is a worthy goal?

I'm +1.

@jbrockmendel (Member) commented:

> Do we as a project think that having consistent missing data handling across all data types is a worthy goal?

At this level of abstraction, absolutely +1.

@jorisvandenbossche what would you expect to get with pd.NaT == df.values? I'm really hoping we have a better solution than object-dtype ndarray.

@jreback added this to the 1.2 milestone on Aug 5, 2020
@jreback (Contributor) commented on Aug 5, 2020

+1 from me.

@jorisvandenbossche (Member, Author) commented:

Can I get some concrete feedback on the actual text then?

It was meant from the start to be about the general principle of consistent missing data handling (and not the technical details of how to implement it), but given the discussion above, at least the perception of it was different. I suppose the focus of the text in the diff was maybe too much on the scalar value (NA, NaN, NaT, ..) and not on the behaviour. I pushed a small update in an attempt to make it more general (using parts of the suggestions from @Dr-Irv above, thanks Irv!).

The diff under discussion:

- To this end, a new experimental ``pd.NA`` scalar to be used as missing value
- indicator has already been added in pandas 1.0 (and used in the experimental
- nullable dtypes). Further work is needed to integrate this with other data
+ To this end, a new experimental ``pd.NA`` scalar that can be used as missing
@Dr-Irv (Contributor) commented on this diff:

Small correction above. Change "that can be used as missing" to "that can be used as a missing"

@jorisvandenbossche (Member, Author) replied:

Indeed, thanks

@TomAugspurger (Contributor) commented:

@jorisvandenbossche here's a suggested rewrite, if you want to take pieces from it:

Currently, pandas handles missing data differently for different data types. We
use different types to indicate that a value is missing (``np.nan`` for
floating-point data; ``np.nan`` or ``None`` for object-dtype data -- typically
strings or booleans -- with some missing values; ``pd.NaT`` for datetimelike
data; and ``pd.NA`` for the nullable integer, boolean, and string data types).

These different missing values have different behaviors in user-facing
operations. Notably, ``NaN`` and ``NaT`` have the sometimes surprising behavior
of always comparing false in comparison operations.

We would like to implement consistent missing data handling for all data types.
This includes consistent behavior in all operations (indexing, arithmetic
operations, comparisons, etc.). We want to eventually make the new semantics the
default.

This has been discussed at
`github #28095 <https://github.com/pandas-dev/pandas/issues/28095>`__ (and
linked issues), and described in more detail in this
`design doc <https://hackmd.io/@jorisvandenbossche/Sk0wMeAmB>`__. 
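For reference, the "always comparing false" behaviour mentioned above looks like this (current, documented behaviour):

```python
import numpy as np
import pandas as pd

np.nan == np.nan   # False -- IEEE 754: NaN never equals anything
pd.NaT == pd.NaT   # False -- NaT follows the same convention
pd.NA == pd.NA     # <NA>  -- pd.NA instead propagates the "unknown"
```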

@jorisvandenbossche (Member, Author) commented:

Thanks Tom! That's helpful, I used part of it, but kept a more explicit notion of the fact that the nullable data types introduced different semantics. Please take a look.

> These different missing values have different behaviors in user-facing
> operations. Specifically, we introduced different semantics for the nullable
> data types for certain operations (e.g. propagating in comparison operations
> instead of comparing as False).
@jbrockmendel (Member) commented on this diff:

Comparison operations are the only ones that come to mind. Are there other examples I'm missing, or is it just that in principle there could be others?

@jorisvandenbossche (Member, Author) replied:

There is also Kleene logic in logical operations, the boolean behaviour of the scalar value, and the behaviour of missing values in boolean indexing.
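A short illustration of those behaviours (documented semantics of ``pd.NA`` and the nullable boolean dtype):

```python
import pandas as pd

# Kleene (three-valued) logic: the result is only NA when it
# actually depends on the missing value.
True | pd.NA    # True  -- True OR anything is True
False | pd.NA   # <NA>  -- depends on the unknown operand
True & pd.NA    # <NA>
False & pd.NA   # False -- False AND anything is False

# The scalar is deliberately ambiguous in a boolean context:
bool(pd.NA)     # raises TypeError
```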

@jorisvandenbossche (Member, Author) commented:

@pandas-dev/pandas-core friendly ping

@jreback merged commit 2778c3e into pandas-dev:master on Aug 20, 2020
@jreback (Contributor) commented on Aug 20, 2020:
thanks @jorisvandenbossche, can always update if needed

@jorisvandenbossche deleted the roadmap-missing-data branch on August 20, 2020 at 14:54
@jorisvandenbossche (Member, Author) commented:

@jreback I think it would have been good to first let the people with comments react again before merging.
(we should maybe discuss what merge policy we should use for roadmap changes)

So @jbrockmendel @WillAyd (and others of course as well), I am still interested in whether you are OK with the current text.

@jreback (Contributor) commented on Aug 20, 2020:

@jorisvandenbossche the commentary is already 10 days old.

@jbrockmendel (Member) commented:

Fine by me
