PDEP-16: Consistent missing value handling #58988

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from
Draft
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
107 changes: 107 additions & 0 deletions web/pandas/pdeps/0016-consistent-missing-values.md
@@ -0,0 +1,107 @@
# PDEP-16: Consistent missing value handling (with a single NA scalar)

- Created: March 2024
- Status: Under discussion
- Discussion: [#32265](https://github.com/pandas-dev/pandas/issues/32265)
- Authors: [Patrick Hoefler](https://github.com/phofl),
  [Joris Van den Bossche](https://github.com/jorisvandenbossche)
- Revision: 1

## Abstract

...

## Background

Currently, pandas handles missing data differently for different data types. We
use different sentinel values to indicate that a value is missing: ``np.nan`` for
floating-point data, ``np.nan`` or ``None`` for object-dtype data -- typically
strings or booleans -- with missing values, and ``pd.NaT`` for datetimelike
Member: Maybe after PDEP-14 gets accepted we should revise this to say that "str" uses np.nan and "string" uses pd.NA. Boolean data types do not directly support missing values, so they are often cast to object as a result.

data. Some other data types, such as integer and bool, cannot store missing data
or are cast to float or object dtype. In addition, pandas 1.0 introduced a new
missing value sentinel, ``pd.NA``, which is being used for the experimental
nullable integer, float, boolean, and string data types, and more recently also
Member: Nullable integer data types came out in 0.24.0 back in January 2019. I think after all that time, and having survived two major releases (soon 3) without any substantial change, we should stop referring to these as experimental.

for the pyarrow-backed data types.
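
To make the current situation concrete, here is a minimal sketch (assuming
pandas 2.x behaviour; exact reprs can vary slightly by version) of how the
missing value indicator differs per dtype:

```python
import pandas as pd

# Default (NumPy-backed) dtypes use a different sentinel per dtype:
print(repr(pd.Series([1.5, None])[1]))                         # nan (float64)
print(repr(pd.Series(["a", None])[1]))                         # None (object)
print(repr(pd.Series([pd.Timestamp("2024-01-01"), None])[1]))  # NaT (datetime64[ns])

# The nullable extension dtypes all use pd.NA instead:
print(repr(pd.Series([1, None], dtype="Int64")[1]))            # <NA>
print(repr(pd.Series(["a", None], dtype="string")[1]))         # <NA>
```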

These different missing values also have different behaviors in user-facing
operations. Specifically, we introduced different semantics for the nullable
data types for certain operations (e.g. missing values propagate in comparison
operations instead of comparing as False).
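
For example, the difference in comparison semantics can be illustrated with the
dtypes that exist today (a small sketch assuming current pandas behaviour):

```python
import numpy as np
import pandas as pd

s_numpy = pd.Series([1.0, np.nan])                 # NumPy-backed float64
s_nullable = pd.Series([1, pd.NA], dtype="Int64")  # nullable extension dtype

# NaN compares as False, so the missing entry silently becomes False:
print((s_numpy == 1).tolist())     # [True, False]

# pd.NA propagates, so the result stays "unknown":
print((s_nullable == 1).tolist())  # [True, <NA>]
```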

The nullable extension dtypes and the `pd.NA` scalar were originally designed to
solve these problems and to provide consistent missing value behavior between
different dtypes. Historically, those have been implemented as 1D arrays, which
hinders usage of those dtypes in certain scenarios that rely on the 2D block
structure of the pandas internals for fast operations (``axis=1`` operations,
transposing, etc.).

Long term, we want to introduce consistent missing data handling for all data
types. This includes consistent behavior in all operations (indexing, arithmetic
operations, comparisons, etc.) and using a missing value scalar that behaves
consistently.

## Proposal

This proposal aims to unify the missing value handling across all dtypes. This
proposal is not meant to address implementation details, but rather to provide a
high-level way forward.

1. All data types support missing values and use `pd.NA` exclusively as the
user-facing missing value indicator.

2. All data types implement consistent missing value "semantics" corresponding
to the current nullable dtypes using `pd.NA` (i.e. regarding behaviour in
comparisons, see below for details).

3. As a consequence, pandas will move to nullable extension arrays by default
Member: I've lost track of whether we already implemented this or not, but why do we want to do this generically? It seems like this is provided for compatibility with NumPy for the primitive types, but as we expand to non-NumPy types, is it really necessary?

Member Author: I don't get the question here; can you try to clarify? What exactly are we doing "generically"? Or what do you understand we are providing for compatibility with NumPy?

Member: The 2D block structure is for compatibility with NumPy, right? So I'm curious what the perceived advantage would be of having a 2D extension array of Arrow strings, lists, etc.

Member Author: Ah, to be clear, your comment is not about this first sentence then, only about the last sentence below? ("To preserve the default 2D block structure of the DataFrame internals, the ExtensionArray interface will be extended to support 2D arrays.")

My assumption is that we will only make use of 2D ExtensionArrays for data types where the default dtype is currently also 2D (i.e. mostly the numerical ones, maybe the datetimes, but not, for example, the Arrow dtypes).

The reason this is included (although it might sound like an implementation detail, and the PDEP otherwise says to leave out implementation details) is that the choice to go all-in on extension arrays and dtypes (which is the consequence of using "nullable extension arrays by default for all dtypes") would, if we stick to the current 1D EAs, be an implicit choice to drop 2D blocks. That is a change we should not make implicitly, but very explicitly. This PDEP makes it explicit by mentioning it, while choosing not to change the 2D block ability of pandas in this PDEP.

Member: My understanding from previous discussion was that all EAs would support 2D, which would let us rip out a bunch of complexity in the code. In the case of our ArrowEA, I expect we would back it by a pa.Table instead of a pa.ChunkedArray (but that detail is definitely out of scope for the PDEP). Most other cases are ndarray-backed, so easy to do.

Member: What are the 2D operations we are trying to optimize for? Especially since Arrow uses bitmasks, and we presumably move towards that approach, handling that in a 2D fashion seems rather challenging.

for all data types, instead of using the NumPy dtypes that are currently the
default. To preserve the default 2D block structure of the DataFrame internals,
the ExtensionArray interface will be extended to support 2D arrays.

4. For backwards compatibility, existing missing value indicators like `NaN` and
`NaT` will be interpreted as `pd.NA` when introduced in user input, IO or
through operations (to ensure they keep being considered as missing).
Specifically for floating dtypes, in practice this means a float column can
Member: What prevents us from allowing both NaN and NA for floats? I understand that historically the major use case when assigning np.nan has probably been to introduce a missing value, and that should map to pd.NA with this PDEP, but what prevents an operation like 0/0 from still introducing NaN?

Member: tbh I'd also appreciate it if we could argue about this one a bit longer.

I'm glad that this PR says

> Potentially distinguishing NA and NaN is left for a separate discussion.

because I think it's fairly important.

Surely there's perf overhead to having to check whether values are NaN, as opposed to just preserving validity masks?

I'm all for aiming for nullable dtypes by default, and am hoping that this is done properly (as in, giving up on the historical hack of treating NaN as a missing value indicator): add a few nan_as_null arguments in a few places, and enjoy consistency with Arrow-based tools.

Member @WillAyd (Jun 13, 2024): We discussed this on the team call yesterday, so I'm trying to paraphrase, but @jorisvandenbossche and @Dr-Irv please correct me.

An assignment like ser.iloc[0] = np.nan could mean either "assign a missing value" or "assign the NaN sentinel". The latter is more technically correct, but owing to our history it probably means the former in most cases. So continuing with that assumption in the near term reduces the amount of churn for end users.

Longer term it would for sure be nice to get to a place where something like that actually assigns NaN and a user would have to use pd.NA exclusively for missing values, but we probably need a release or two of the proposal in its current state before we can disentangle that specifically for the float data type. We might even need to add new methods like is_nan instead of isna for that purpose.

Given those considerations are specific to one data type, we might be better off living with that issue through the execution of this PDEP and leaving it to a subsequent PDEP that clarifies that behavior just for float data types down the road. So we certainly won't be at the ideal state here, but at least we'll be moving in the right direction.

Contributor: Good summary.

for now only contain NA values. Potentially distinguishing NA and NaN is left
for a separate discussion.
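
The nullable dtypes that already exist behave this way for input data today; a
minimal sketch of what point 4 generalises (assuming the current conversion
behaviour of the nullable float dtype, where NaN in input data is treated as
missing):

```python
import numpy as np
import pandas as pd

# np.nan in the input is stored as a missing value for the nullable float
# dtype, so it keeps being considered as missing:
s = pd.Series([1.5, np.nan, None], dtype="Float64")
print(s.isna().tolist())  # [False, True, True]
print(repr(s[1]))         # <NA>
```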

This will ensure that all dtypes have consistent missing value handling and there
is no need to upcast if a missing value is inserted into integers or booleans. Those
nullability semantics will be mostly consistent with how PyArrow treats nulls and thus
make switching between both sets of dtypes easier. Additionally, it allows using
other Arrow dtypes by default that follow the same semantics (bytes, nested dtypes, ...).
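
As an illustration of the upcasting this avoids (pandas 2.x behaviour of the
current default dtypes, contrasted with the existing nullable dtypes):

```python
import pandas as pd

# Default NumPy-backed dtypes cannot hold a missing value, so they upcast:
print(pd.Series([1, 2, None]).dtype)         # float64 (integers upcast to float)
print(pd.Series([True, False, None]).dtype)  # object  (booleans fall back to object)

# The nullable dtypes keep their dtype and store pd.NA instead:
print(pd.Series([1, 2, None], dtype="Int64").dtype)           # Int64
print(pd.Series([True, False, None], dtype="boolean").dtype)  # boolean
```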

In practice, this means solidifying the integer, float, boolean and string
nullable data types that already exist, and implementing (variants of) the
categorical, datetimelike and interval data types using `pd.NA`. The proposal
leaves the exact implementation details out of scope (e.g. whether to use a mask
or a sentinel (where the best strategy might vary by data type depending on
existing code), whether to use byte masks vs bitmaps, or whether to use PyArrow
under the hood like the string dtype).
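
As a rough sketch of the difference between the two masking layouts mentioned
above (this is not pandas internals, just an illustration of a byte mask versus
an Arrow-style validity bitmap):

```python
import numpy as np

values = np.array([10, 0, 30, 40], dtype="int64")  # payload; the 0 is an arbitrary placeholder
byte_mask = np.array([False, True, False, False])  # one byte per value, True = missing

# Arrow-style validity bitmap: one bit per value, 1 = valid, packed 8 per byte (LSB first)
validity_bitmap = np.packbits(~byte_mask, bitorder="little")

print(byte_mask.nbytes, validity_bitmap.nbytes)  # 4 bytes vs 1 byte for four values
```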

This PDEP also does not define the exact API for dtype constructors or
propose a new consistent interface; this is left for a separate discussion
(PDEP-13).

### The `NA` scalar

...

### Missing value semantics


...

## Backward compatibility

...

## Timeline

...

### PDEP History

- March 2024: Initial draft

Note: There is a very long discussion in [GH-32265](https://github.com/pandas-dev/pandas/issues/32265)
that concerns this topic.