ENH: add NA scalar for missing value indicator, use in StringArray. #29597

jorisvandenbossche · 2019-11-13T14:12:26Z

This PR adds a pd.NA singleton with the behaviour discussed in the issues above. For now, it is only used in StringArray.

jorisvandenbossche · 2019-11-13T14:15:57Z

pandas/core/na_scalar.py

+
+def _create_binary_propagating_op(name):
+    def method(self, other):
+        if isinstance(other, numbers.Number) or other is NA or isinstance(other, str):


The question is what types of objects we should recognize here.

I think we want to recognize scalar values that can be stored in dataframes. But of course we don't have control over what kind of scalars people put in a dataframe.
For now I did numbers + strings (numbers also covers numpy scalars). Our own scalars (Timedelta, Timestamp, etc.) could also be added, or they could handle that on their side.
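As a rough illustration of the dispatch being discussed, here is a minimal pure-Python sketch. `_NAType`, `NA`, and the list of dunder names are stand-ins written for this example and are not the code in this PR; only the idea of returning NA for recognized scalars and `NotImplemented` otherwise is taken from the diff above.

```python
import numbers


class _NAType:
    """Hypothetical stand-in for the NA singleton (illustration only)."""

    def __repr__(self):
        return "NA"

    def __bool__(self):
        raise TypeError("boolean value of NA is ambiguous")


NA = _NAType()


def _create_binary_propagating_op(name):
    def method(self, other):
        # Recognized scalars propagate NA; everything else is deferred
        # to the other operand by returning NotImplemented.
        if other is NA or isinstance(other, (str, numbers.Number)):
            return NA
        return NotImplemented

    method.__name__ = name
    return method


# Assumed subset of operators, chosen only to keep the sketch short.
for _name in ("__add__", "__radd__", "__sub__", "__rsub__",
              "__mul__", "__rmul__", "__eq__", "__lt__", "__gt__"):
    setattr(_NAType, _name, _create_binary_propagating_op(_name))

print(NA + 1, 1 + NA, NA == "a")
```

With this shape, an unrecognized operand such as a list ends in a `TypeError` raised by Python itself, because both sides decline the operation.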

you might want to look at how NaT handles these, in particular for arraylike others

Do we want it to be recognized by arrays? (I somewhat purposefully left those out, but didn't think it fully through)

For our own internal arrays (eg IntegerArray), that can be handled on the array level? (so returning NotImplemented here)

For numpy arrays the question is what to do with it, as you cannot represent NAs in a numpy array (except in an object array). So not handling it might be fine?

> For numpy arrays the question is what to do with it, as you cannot represent NAs in a numpy array (except in an object array). So not handling it might be fine?

Actually, by not handling it and deferring to numpy, this already happens (-> converting to object):

In [61]: np.array([1, 2, 3]) + pd.NA
Out[61]: array([NA, NA, NA], dtype=object)

In [62]: pd.NA + np.array([1, 2, 3])
Out[62]: array([NA, NA, NA], dtype=object)

That's fine for now, but longer-term this will need to be optimized.

> but longer-term this will need to be optimized

Can you explain what you mean by this?

pandas/core/na_scalar.py

jorisvandenbossche · 2019-11-13T14:19:05Z

pandas/tests/scalar/test_na_scalar.py

+        if isinstance(other, np.int64):
+            # for numpy scalars we get a deprecation warning and False as result
+            # for equality or error for larger/lesser than
+            continue


We don't have full control over numpy scalars, so when they are the left operand we get different behaviour:

In [27]: np.int64(1) == pd.NA
/home/joris/miniconda3/envs/dev/bin/ipython:1: DeprecationWarning: elementwise comparison failed; this will raise an error in the future.
  #!/home/joris/miniconda3/envs/dev/bin/python
Out[27]: False

In [28]: pd.NA == np.int64(1)
Out[28]: NA

In [29]: np.int64(1) < pd.NA
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-29-87134fac2734> in <module>
----> 1 np.int64(1) < pd.NA

~/scipy/pandas/pandas/core/na_scalar.py in __bool__(self)
     37
     38     def __bool__(self):
---> 39         raise TypeError("boolean value of NA is ambiguous")
     40
     41     def __hash__(self):

TypeError: boolean value of NA is ambiguous

In [30]: pd.NA > np.int64(1)
Out[30]: NA

(for the first case, not sure what the behaviour will be once the change in numpy is done)
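The asymmetry comes from Python's operator dispatch: the left operand is asked first, and the right operand is only consulted if the left one returns `NotImplemented`. The sketch below shows that mechanism in isolation. `EagerScalar` is a hypothetical class invented for this example; it is not numpy's implementation, it only mimics a left operand that always answers the comparison itself.

```python
class _NAType:
    """Hypothetical stand-in for the NA singleton (illustration only)."""

    def __repr__(self):
        return "NA"

    def __eq__(self, other):
        return self  # comparisons with NA propagate NA

    def __hash__(self):
        return id(self)

    def __bool__(self):
        raise TypeError("boolean value of NA is ambiguous")


NA = _NAType()


class EagerScalar:
    """Hypothetical left operand that never defers to the right operand."""

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        if isinstance(other, EagerScalar):
            return self.value == other.value
        return False  # answers directly instead of returning NotImplemented

    def __hash__(self):
        return hash(self.value)


left_first = EagerScalar(1) == NA   # EagerScalar.__eq__ decides the result
right_first = NA == EagerScalar(1)  # _NAType.__eq__ decides the result
print(left_first, right_first)
```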

pandas/core/na_scalar.py

pandas/tests/scalar/test_na_scalar.py

TomAugspurger · 2019-11-13T16:14:10Z

Any chance you could change StringDtype.na_value to use this, and see what breaks?

There are a few places where we use np.nan when we should be referring to StringDtype.na_value instead.

I suspect that _libs.lib.Validator.is_valid_null or util.is_nan will need to be updated to recognize pd.NA as NA. So not a big deal if it's a bunch of work to prototype.

pandas/core/na_scalar.py

jbrockmendel · 2019-11-13T19:12:38Z

pandas/core/na_scalar.py

+        elif other is False or other is NA:
+            return NA
+        else:
+            return NotImplemented


other NA objects? NaT? NaN? None?

For logical ops, I am not sure it should handle those, as logical ops typically involve booleans (and NaT/NaN/None cannot be stored in a boolean array)

E.g. for float it also raises right now:

In [69]: pd.NA | 1.5
...
TypeError: unsupported operand type(s) for |: 'NAType' and 'float'

In [70]: pd.NA | np.nan
...
TypeError: unsupported operand type(s) for |: 'NAType' and 'float'

(for comparison ops, though, we should probably add those)
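The logical-op rule in this thread is three-valued (Kleene) logic restricted to True, False, and NA. The sketch below spells it out as plain functions. The names `kleene_or`, `kleene_and`, and `_check` are made up for this example, and rejecting every other operand with a `TypeError` is an assumption that mirrors the float behaviour shown above.

```python
class _NAType:
    """Hypothetical stand-in for the NA singleton (illustration only)."""

    def __repr__(self):
        return "NA"

    def __bool__(self):
        raise TypeError("boolean value of NA is ambiguous")


NA = _NAType()


def _check(value):
    # Only True, False and NA take part in logical ops in this sketch.
    if not (value is True or value is False or value is NA):
        raise TypeError(f"unsupported operand for logical op: {value!r}")


def kleene_or(left, right):
    _check(left)
    _check(right)
    if left is True or right is True:
        return True  # True decides the result regardless of NA
    if left is NA or right is NA:
        return NA    # the unknown operand could be True or False
    return False


def kleene_and(left, right):
    _check(left)
    _check(right)
    if left is False or right is False:
        return False  # False decides the result regardless of NA
    if left is NA or right is NA:
        return NA
    return True


print(kleene_or(NA, True), kleene_or(NA, False))
print(kleene_and(NA, False), kleene_and(NA, True))
```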

jbrockmendel · 2019-11-13T19:14:13Z

core.missing and _libs.missing will need to recognize this

TomAugspurger · 2019-11-13T21:23:00Z

Playing with using this in StringArray, will push something (here or a PR against this) in a bit.

Anecdotally, having __bool__ raise is kinda annoying since it breaks reductions like all.

(Pdb) pp left_value
<StringArray>
[NA, NA]
Length: 2, dtype: string
(Pdb) np.all(left_value == right_value)
*** TypeError: boolean value of NA is ambiguous

But maybe that's a good thing. I'd like to develop some more experience here.

jorisvandenbossche · 2019-11-13T21:37:11Z

@jbrockmendel I moved it to cython (well, it is still mostly a copy-paste of the python version into the cython file) and made the missing-value functions (isna) recognize it, so feedback on that part is very welcome

pandas/_libs/missing.pyx

jorisvandenbossche · 2019-11-13T21:44:36Z

> Anecdotally, having __bool__ raise is kinda annoying since it breaks reductions like all.

We will need to rethink any/all in general as well. In principle, those are reducing operations, and reductions skip NAs, so your example should be equivalent to all of an empty Series/array.
Although in the case of any/all, skipping NAs also doesn't fully feel like the best behaviour in many cases (e.g. for a boolean array, all([True, NA]) == True would also be somewhat strange, as the NA could be False, so you don't know the result).
This also recently came up for object dtypes IIRC, and I was planning to open a new issue for it (or find an existing one).
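To make the two candidate behaviours concrete, here is a small sketch that contrasts an `all` reduction that skips NAs with one that treats NA as unknown. Both functions and the `NA` sentinel are hypothetical names for this example; neither is the pandas implementation.

```python
class _NAType:
    """Hypothetical stand-in for the NA singleton (illustration only)."""

    def __repr__(self):
        return "NA"

    def __bool__(self):
        raise TypeError("boolean value of NA is ambiguous")


NA = _NAType()


def all_skipna(values):
    # Reduction semantics: drop NAs first, then reduce what is left.
    return all(v for v in values if v is not NA)


def all_unknown_aware(values):
    # Three-valued semantics: a single False decides the result,
    # otherwise any NA makes the result unknown.
    seen_na = False
    for v in values:
        if v is NA:
            seen_na = True
        elif not v:
            return False
    return NA if seen_na else True


print(all_skipna([True, NA]), all_unknown_aware([True, NA]))
print(all_skipna([NA, NA]), all_unknown_aware([False, NA]))
```

Under the skipping variant, `[NA, NA]` reduces like an empty sequence; under the unknown-aware variant, `[True, NA]` stays NA because the missing value could still be False.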

TomAugspurger · 2019-11-13T22:06:44Z

If anyone wants a preview of what getting this working with StringArray looks like: https://github.com/jorisvandenbossche/pandas/pull/1/files.

It'll mean a larger diff, but I think we'll want to merge that PR against Joris' branch into here, so that we can get a feel for what actually using this is going to look like before merging it into master.

jbrockmendel · 2019-11-13T22:31:27Z

> I suspect that _libs.lib.Validator.is_valid_null or util.is_nan will need to be updated to recognize pd.NA as NA. So not a big deal if it's a bunch of work to prototype.

util.is_nan is specifically for float-nan. The scalar-recognizing functions that need to recognize this are in _libs.missing.
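The distinction can be sketched as two separate checks: one that only answers "is this a float NaN?" and a broader one that recognizes every missing-value scalar. The function names below are hypothetical and do not correspond to the actual helpers in `util` or `_libs.missing`.

```python
import math


class _NAType:
    """Hypothetical stand-in for the NA singleton (illustration only)."""

    def __repr__(self):
        return "NA"


NA = _NAType()


def is_float_nan(value):
    # Narrow check: only a float NaN qualifies.
    return isinstance(value, float) and math.isnan(value)


def is_missing_scalar(value):
    # Broader check: None, the NA singleton, or a float NaN.
    return value is None or value is NA or is_float_nan(value)


print(is_float_nan(NA), is_missing_scalar(NA))
```

Keeping the narrow check unchanged means callers that rely on float-only semantics are unaffected, while the broader check is the one that learns about the new scalar.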

pandas/_libs/missing.pyx

TomAugspurger · 2019-11-14T14:15:03Z

Joris merged in my changes to make StringArray use pd.NA. It wasn't too large of a change so far.

I want to call out one behavior change we may want to make: Right now, Series[string].str methods that return numeric output (like .str.count) return either int or float dtype, depending on whether there are NAs. See https://github.com/pandas-dev/pandas/pull/29597/files#diff-683f8dacd08abda4913680fafe3a4ea7R3512.

I'm proposing that we change that behavior to always return an integer-na dtype in those cases for string dtype (no change to object dtype).

jorisvandenbossche · 2019-11-14T14:28:32Z

> I'm proposing that we change that behavior to always return an integer-na dtype in those cases for string dtype (no change to object dtype).

+1

It's also not a "breaking" change, as string dtype is still new. In general, with those new dtypes I think we should try as much as possible to also return the new dtypes for the result (e.g. also for boolean results, once the boolean array has landed).

TomAugspurger · 2019-11-25T16:58:28Z

Just to scope this PR, I think that using pd.NA in BooleanArray should be done as a followup PR.

pandas/_libs/missing.pyx

jbrockmendel · 2019-11-25T17:06:26Z

pandas/_libs/missing.pyx

+def _create_binary_propagating_op(name, divmod=False):
+
+    def method(self, other):
+        if isinstance(other, numbers.Number) or other is NA or isinstance(other, str):


can you put the numbers.Number check last, as it will be least performant

jorisvandenbossche · 2019-11-26T18:03:22Z

I added some initial docs (whatsnew + section in missing_data).

Dr-Irv · 2019-11-27T14:33:28Z

@jorisvandenbossche I've looked at your doc changes. You may want to mention that the new pd.NA type won't be used when reading data until the issue I created, #29752, is addressed.

TomAugspurger

Docs look very nice, thanks.

TomAugspurger · 2019-11-27T15:12:19Z

doc/source/user_guide/missing_data.rst

+This also means that ``pd.NA`` cannot be used in a context where it is
+evaluated to a boolean, such as ``if condition: ...`` where ``condition`` can
+potentially be ``pd.NA``. In such cases, :func:`isna` can be used to check
+for ``pd.NA`` or it could be prevented that ``condition`` can be ``pd.NA``


Suggested change

-for ``pd.NA`` or it could be prevented that ``condition`` can be ``pd.NA``
+for ``pd.NA`` or ``condition`` being ``pd.NA`` can be avoided, for example by filling missing values beforehand.

(can't apply this directly, since it wraps two lines)
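The guard that the doc text describes can be sketched as follows. This is a pure-Python stand-in: `NA` and `describe` are hypothetical names, and the identity check plays the role that `isna` has in the documentation.

```python
class _NAType:
    """Hypothetical stand-in for the NA singleton (illustration only)."""

    def __repr__(self):
        return "NA"

    def __bool__(self):
        raise TypeError("boolean value of NA is ambiguous")


NA = _NAType()


def describe(condition):
    # Check for the missing value first, so that `if condition` below
    # is never evaluated on NA.
    if condition is NA:
        return "unknown"
    return "yes" if condition else "no"


print(describe(True), describe(False), describe(NA))
```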

jorisvandenbossche · 2019-11-27T15:43:19Z

cc @pandas-dev/pandas-core more comments on this?

jreback

lgtm. some very minor doc comments. can you run the asv's to see if isna is impacted in any way here.

jreback · 2019-11-27T15:46:53Z

doc/source/user_guide/missing_data.rst

+
+Starting from pandas 1.0, an experimental ``pd.NA`` value (singleton) is
+available to represent scalar missing values. At this moment, it is used in
+the nullable integer and boolean data types and the dedicated string data type


use commas rather than 3 ands, ideally if you can link to those sections?

jreback · 2019-11-27T15:48:09Z

doc/source/user_guide/missing_data.rst

+   pd.NA | True
+
+On the other hand, if one of the operands is ``False``, the result depends
+on the value of the other operand. Therefore, in thise case ``pd.NA``


thise -> this

jreback · 2019-11-27T15:49:52Z

doc/source/user_guide/missing_data.rst

+``NA`` in a boolean context
+---------------------------
+
+Since the actual value of an NA is unknown, it is ambiguous to convert NA


I think we have a section gotchas.truth that is very similar, could link.

jorisvandenbossche · 2019-11-28T12:53:51Z

Updated for the doc comments, thanks! @Dr-Irv I added a sentence on that. But we should probably add a more extensive section on that in a page dedicated to those new data types (for a follow-up).

@jreback I ran the isnull benchmarks on this PR and on master; nothing seems off beyond the error / noise margin.

jreback · 2019-12-01T23:42:00Z

thanks @jorisvandenbossche nice addition.

unless I missed this, you didn't update nullable integers to return that, assume that's a followon.

jorisvandenbossche · 2019-12-02T07:09:57Z

> you didn't update nullable integers to return that, assume that's a followon.

Yes, as discussed; see #29556 for all follow-ups.

ENH: add NA scalar for missing value indicator

03f83bd

jorisvandenbossche added the Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate label Nov 13, 2019

jorisvandenbossche mentioned this pull request Nov 13, 2019

ROADMAP: Consistent missing value handling with new NA scalar #28095

Open

jorisvandenbossche commented Nov 13, 2019

TomAugspurger reviewed Nov 13, 2019

pandas/core/na_scalar.py (outdated)

Dr-Irv reviewed Nov 13, 2019

pandas/tests/scalar/test_na_scalar.py

jorisvandenbossche added 2 commits November 13, 2019 15:40

add np.nan to arithmetic/comparison tests

c1797d5

use id(self) for hash

3339eaa

Dr-Irv reviewed Nov 13, 2019

pandas/tests/scalar/test_na_scalar.py (outdated)

fix api test

e9d4d6a

jbrockmendel reviewed Nov 13, 2019

pandas/core/na_scalar.py (outdated)

jbrockmendel reviewed Nov 13, 2019

move to cython

4450d2d

jorisvandenbossche commented Nov 13, 2019

pandas/_libs/missing.pyx

jbrockmendel reviewed Nov 13, 2019

pandas/_libs/missing.pyx (outdated)

jorisvandenbossche and others added 2 commits November 14, 2019 09:20

add examples to isna/notna docstring

1849a23

Use NA scalar in string dtype (#1)

c72e3ee

jorisvandenbossche marked this pull request as ready for review November 14, 2019 13:54

TomAugspurger changed the title from "ENH: add NA scalar for missing value indicator" to "ENH: add NA scalar for missing value indicator, use in StringArray." Nov 14, 2019

jorisvandenbossche added 2 commits November 14, 2019 15:52

Merge remote-tracking branch 'upstream/master' into NA-scalar

3a97782

fix doctest

2302661

jbrockmendel reviewed Nov 25, 2019

pandas/_libs/missing.pyx

jbrockmendel reviewed Nov 25, 2019

jorisvandenbossche mentioned this pull request Nov 25, 2019

ENH: add BooleanArray extension array #29555

Merged

jorisvandenbossche added 3 commits November 25, 2019 18:34

Merge remote-tracking branch 'upstream/master' into NA-scalar

1cadeda

NA -> C_NA

1fcf4b7

start some docs

f6798e5

jorisvandenbossche added 4 commits November 27, 2019 08:37

futher doc updates

14c1434

Merge remote-tracking branch 'upstream/master' into NA-scalar

788a2c2

doc fixup

1bcbab2

Merge remote-tracking branch 'upstream/master' into NA-scalar

775cdfb

TomAugspurger reviewed Nov 27, 2019

jreback approved these changes Nov 27, 2019

jreback added the Enhancement label Nov 27, 2019

jreback added this to the 1.0 milestone Nov 27, 2019

further doc updates

589a961

TomAugspurger approved these changes Nov 28, 2019

jreback merged commit 7ea4e61 into pandas-dev:master Dec 1, 2019

jorisvandenbossche deleted the NA-scalar branch December 2, 2019 07:08

jorisvandenbossche mentioned this pull request Dec 2, 2019

Use new NA scalar in BooleanArray #29961

Merged

proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019

ENH: add NA scalar for missing value indicator, use in StringArray. (pandas-dev#29597)

722470a

proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019

ENH: add NA scalar for missing value indicator, use in StringArray. (pandas-dev#29597)

a2cec79

jorisvandenbossche mentioned this pull request Dec 2, 2020

API: bool(pd.NA) #38224

Open
