# PDEP-16: Consistent missing value handling (with a single NA scalar)

- Created: March 2024
- Status: Under discussion
- Discussion: [#32265](https://github.com/pandas-dev/pandas/issues/32265)
- Authors: [Patrick Hoefler](https://github.com/phofl),
  [Joris Van den Bossche](https://github.com/jorisvandenbossche)
- Revision: 1

## Abstract

...

## Background
Currently, pandas handles missing data differently for different data types. We
use different sentinels to indicate that a value is missing: ``np.nan`` for
floating-point data, ``np.nan`` or ``None`` for object-dtype data (typically
strings or booleans), and ``pd.NaT`` for datetimelike data. Some other data
types, such as integer and bool, cannot store missing data at all and are cast
to float or object dtype when a missing value is introduced. In addition,
pandas 1.0 introduced a new missing value sentinel, ``pd.NA``, which is used
for the experimental nullable integer, float, boolean, and string data types,
and more recently also for the pyarrow-backed data types.

> **Review comment:** Nullable integer data types came out in 0.24.0 back in
> January 2019. I think after all that time, and having survived two major
> releases (soon three) without any substantial change, we should stop
> referring to these as experimental.
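As a quick illustration of the status quo, the sketch below shows the current
NumPy-backed defaults (outputs noted in the comments):

```python
import numpy as np
import pandas as pd

# A missing value forces integer data to float64, with np.nan as the sentinel.
print(pd.Series([1, 2, None]).dtype)  # float64

# Boolean data with a missing value falls back to object dtype.
print(pd.Series([True, False, None]).dtype)  # object

# Datetimelike data uses the dedicated pd.NaT sentinel instead.
print(pd.Series(pd.to_datetime(["2024-01-01", None])))  # 2024-01-01, NaT

# Object-dtype data (e.g. strings) can hold None or np.nan interchangeably.
print(pd.Series(["a", None, np.nan], dtype=object))
```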
These different missing values also have different behaviors in user-facing
operations. Specifically, the nullable data types introduced different
semantics for certain operations (e.g. missing values propagate in comparison
operations instead of comparing as False).
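For example, comparing a column that contains a missing value against a scalar
behaves differently under the two regimes:

```python
import numpy as np
import pandas as pd

# NumPy-backed float: NaN simply compares as False.
print(pd.Series([1.0, np.nan]) == 1.0)
# 0     True
# 1    False

# Nullable Float64: the missing value propagates through the comparison.
print(pd.Series([1.0, None], dtype="Float64") == 1.0)
# 0    True
# 1    <NA>
```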
The nullable extension dtypes and the ``pd.NA`` scalar were originally designed
to solve these problems and to provide consistent missing value behavior
between different dtypes. Historically, however, they have been implemented as
1D arrays, which hinders their use in scenarios that rely on the 2D block
structure of the pandas internals for fast operations (``axis=1`` operations,
transposing, etc.).
Long term, we want to introduce consistent missing data handling for all data
types. This includes consistent behavior in all operations (indexing,
arithmetic operations, comparisons, etc.) and a missing value scalar that
behaves consistently.
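The ``pd.NA`` scalar already behaves this way today, independent of which dtype
it originates from:

```python
import pandas as pd

# pd.NA propagates through arithmetic and comparisons alike ...
print(pd.NA + 1)       # <NA>
print(pd.NA == pd.NA)  # <NA>

# ... and it refuses to be coerced to a bool, instead of silently
# evaluating one way or the other in an if-branch.
try:
    bool(pd.NA)
except TypeError as exc:
    print(exc)  # boolean value of NA is ambiguous
```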
## Proposal

This proposal aims to unify the missing value handling across all dtypes. It is
not meant to address implementation details, but rather to provide a high-level
way forward.
1. All data types support missing values and use ``pd.NA`` exclusively as the
   user-facing missing value indicator.

2. All data types implement consistent missing value "semantics" corresponding
   to the current nullable dtypes using ``pd.NA`` (i.e. regarding behavior in
   comparisons; see below for details).
3. As a consequence, pandas will move to nullable extension arrays by default
   for all data types, instead of using the NumPy dtypes that are currently the
   default. To preserve the default 2D block structure of the DataFrame
   internals, the ExtensionArray interface will be extended to support 2D
   arrays.

> **Review comment:** I've lost track of whether we already implemented this or
> not, but why do we want to do this generically? It seems like this is
> provided for compatibility with NumPy for the primitive types, but as we
> expand to non-NumPy types, is it really necessary?
>
> **Reply:** I don't get the question here; can you try to clarify? What
> exactly are we doing "generically"? Or what do you understand that we are
> providing for compatibility with NumPy?
>
> **Reply:** The 2D block structure is for compatibility with NumPy, right? So
> I'm curious what the perceived advantage would be of having a 2D extension
> array of Arrow strings, lists, etc.
>
> **Reply:** Ah, to be clear, your comment is not about the first sentence
> then, only about the last sentence ("To preserve the default 2D block
> structure of the DataFrame internals, the ExtensionArray interface will be
> extended to support 2D arrays")? My assumption is that we will only make use
> of 2D ExtensionArrays for data types where the default dtype now is also 2D
> (i.e. mostly the numerical ones, maybe the datetimes, but not, for example,
> the Arrow dtypes). The reason this is included (although it might sound like
> an implementation detail, and the PDEP otherwise says to leave out
> implementation details) is that going all-in on extension arrays and dtypes
> (the consequence of using "nullable extension arrays by default for all
> dtypes") would, if we stick to the current 1D EAs, be an implicit choice to
> drop 2D blocks. That is a change we should not make implicitly, but very
> explicitly. So this PDEP makes it explicit by mentioning it, while choosing
> not to change the 2D block ability of pandas in this PDEP.
>
> **Reply:** My understanding from previous discussion was that all EAs would
> support 2D, which would let us rip out a bunch of complexity in the code. In
> the case of our ArrowEA, I expect we would back it by a pa.Table instead of a
> pa.ChunkedArray (but that detail is definitely out of scope for the PDEP).
> Most other cases are ndarray-backed, so easy to do.
>
> **Reply:** What are the 2D operations we are trying to optimize for? I think
> especially since Arrow uses bitmasks and we presumably move towards that
> approach, handling that in a 2D fashion seems rather challenging.
4. For backwards compatibility, existing missing value indicators like ``NaN``
   and ``NaT`` will be interpreted as ``pd.NA`` when introduced through user
   input, IO, or operations (to ensure they keep being considered missing).
   Specifically for floating dtypes, in practice this means a float column can
   for now only contain NA values. Potentially distinguishing NA and NaN is
   left for a separate discussion.

> **Review comment:** What prevents us from allowing both NaN and NA for
> floats? I understand that the major use case historically when assigning […]
>
> **Reply:** tbh I'd also appreciate it if we could argue about this one a bit
> longer. Glad that this PR says […], because I think it's fairly important:
> surely there's perf overhead to having to check whether values are NaN, as
> opposed to just preserving validity masks? I'm all for aiming for nullable
> dtypes by default, and am hoping that this is done properly (as in, giving up
> on the historical hack of treating NaN as a missing value indicator). Add a
> few […]
>
> **Reply:** We discussed this on the team call yesterday, so trying to
> paraphrase, but @jorisvandenbossche and @Dr-Irv please correct me. An
> assignment like […]. Longer term it would for sure be nice to get to a place
> where something like that actually assigns NaN and a user would have to use
> pd.NA exclusively for missing values, but we probably need a release or two
> of the proposal in its current state before we can disentangle that
> specifically for the float data type. We might even need to add new methods
> like […]. Given those considerations are specific to one data type, we might
> be better off living with that issue through execution of this PDEP and
> leaving it to a subsequent PDEP that clarifies that behavior just for float
> data types down the road. So we certainly won't be at the ideal state here,
> but at least we are moving in the right direction.
>
> **Reply:** Good summary.
This will ensure that all dtypes have consistent missing value handling and
that there is no need to upcast if a missing value is inserted into integer or
boolean columns. Those nullability semantics will be mostly consistent with how
PyArrow treats nulls, and thus make switching between both sets of dtypes
easier. Additionally, this allows other Arrow dtypes that use the same
semantics (bytes, nested dtypes, ...) to be used by default.
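The existing nullable dtypes already preview this end state; a small sketch:

```python
import numpy as np
import pandas as pd

# With the nullable dtypes, NaN and None in the input are already
# interpreted as NA on the way in ...
print(pd.array([1.0, np.nan, None], dtype="Float64"))
# [1.0, <NA>, <NA>]

# ... and inserting a missing value no longer upcasts integers or booleans.
s = pd.Series([1, 2, 3], dtype="Int64")
s[0] = pd.NA
print(s.dtype)  # Int64
```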
In practice, this means solidifying the existing nullable integer, float,
boolean, and string data types, and implementing (variants of) the categorical,
datetimelike, and interval data types using ``pd.NA``. The proposal leaves the
exact implementation details out of scope: for example, whether to use a mask
or a sentinel (where the best strategy might vary by data type depending on
existing code), whether to use byte masks vs bitmaps, or whether to use PyArrow
under the hood as the string dtype does.
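For context, this is roughly what the mask-based strategy of today's masked
arrays looks like (``_data`` and ``_mask`` are private internals, shown here
only for illustration):

```python
import pandas as pd

arr = pd.array([1, 2, None], dtype="Int64")

# The masked layout keeps a plain NumPy buffer for the values plus a
# separate byte mask, where True marks a missing position.
print(arr._data.dtype)  # int64
print(arr._mask)        # [False False  True]
```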
This PDEP also does not define the exact API for dtype constructors or propose
a new consistent interface; this is left for a separate discussion (PDEP-13).
### The `NA` scalar

...

### Missing value semantics

...

## Backward compatibility

...

## Timeline

...

### PDEP History

- March 2024: Initial draft
Note: There is a very long discussion in
[GH-32265](https://github.com/pandas-dev/pandas/issues/32265) that concerns
this topic.

> **Review comment:** Maybe after PDEP-14 gets accepted we should revise this
> to say that "str" uses np.nan and "string" uses pd.NA. Boolean data types do
> not directly support missing values, so are often cast to object as a result.