-
-
Notifications
You must be signed in to change notification settings - Fork 18.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: add NA scalar for missing value indicator, use in StringArray. #29597
Conversation
pandas/core/na_scalar.py
Outdated
|
||
def _create_binary_propagating_op(name): | ||
def method(self, other): | ||
if isinstance(other, numbers.Number) or other is NA or isinstance(other, str): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question is what type of objects we should recognize here.
I think we want to recognize scalar values that can be stored in dataframes. But of course we don't have control over what kind of scalars people put in a dataframe.
For now I did numbers + string (numbers also covers numpy scalars). Our own scalars (Timedelta, Timestamp, etc) could also be added, or they can also handle that on their side)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you might want to look at how NaT handles these, in particular for arraylike others
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want it to be recognized by arrays? (I somewhat purposefully left those out, but didn't think it fully through)
For our own internal arrays (eg IntegerArray), that can be handled on the array level? (so returning NotImplemented here)
For numpy arrays the question is what to do with it, as you cannot represent NAs in a numpy array (except in an object array). So not handling it might be fine?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For numpy arrays the question is what to do with it, as you cannot represent NAs in a numpy array (except in an object array). So not handling it might be fine?
Actually, by not handling it and deferring to numpy, this already happens (-> converting to object):
In [61]: np.array([1, 2, 3]) + pd.NA
Out[61]: array([NA, NA, NA], dtype=object)
In [62]: pd.NA + np.array([1, 2, 3])
Out[62]: array([NA, NA, NA], dtype=object)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thats fine for now, but longer-term this will need to be optimized
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but longer-term this will need to be optimized
Can you explain what you mean with this?
if isinstance(other, np.int64): | ||
# for numpy scalars we get a deprecation warning and False as result | ||
# for equality or error for larger/lesser than | ||
continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So numpy scalars we don't have full control over, so this means that if they are the left operand, we get some other behaviour:
In [27]: np.int64(1) == pd.NA
/home/joris/miniconda3/envs/dev/bin/ipython:1: DeprecationWarning: elementwise comparison failed; this will raise an error in the future.
#!/home/joris/miniconda3/envs/dev/bin/python
Out[27]: False
In [28]: pd.NA == np.int64(1)
Out[28]: NA
In [29]: np.int64(1) < pd.NA
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-29-87134fac2734> in <module>
----> 1 np.int64(1) < pd.NA
~/scipy/pandas/pandas/core/na_scalar.py in __bool__(self)
37
38 def __bool__(self):
---> 39 raise TypeError("boolean value of NA is ambiguous")
40
41 def __hash__(self):
TypeError: boolean value of NA is ambiguous
In [30]: pd.NA > np.int64(1)
Out[30]: NA
(for the first case, not sure what the behaviour will be once the change in numpy is done)
Any chance you could change StringDtype.na_value to use this, and see what breaks? There are a few places where we use I suspect that |
pandas/core/na_scalar.py
Outdated
elif other is False or other is NA: | ||
return NA | ||
else: | ||
return NotImplemented |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
other NA objects? NaT? NaN? None?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For logical ops, I am not sure it should handle those, as logical ops typically involve booleans (and NaT/NaN/None cannot be stored in a boolean array)
Eg for float is also raises right now:
In [69]: pd.NA | 1.5
...
TypeError: unsupported operand type(s) for |: 'NAType' and 'float'
In [70]: pd.NA | np.nan
...
TypeError: unsupported operand type(s) for |: 'NAType' and 'float'
(for comparison ops, though, we should probably add those)
core.missing and _libs.missing will need to recognize this |
Playing with using this in StringArray, will push something (here or a PR against this) in a bit. Anecdotally, having (Pdb) pp left_value
<StringArray>
[NA, NA]
Length: 2, dtype: string
(Pdb) np.all(left_value == right_value)
*** TypeError: boolean value of NA is ambiguous But maybe that's a good thing. I'd like to develop some more experience here. |
@jbrockmendel I moved it to cython (well, still mostly copy paste of the python version into the cython file) and let missing functions recognize this (isna), so feedback on that part is very welcome |
We will need to rethink any/all in general as well. In principle, those are reducing operations, and reductions skip NAs, so your example should be equivalent to |
If anyone wants a preview of what getting this working with StringArray looks like: https://github.com/jorisvandenbossche/pandas/pull/1/files. It'll mean a larger diff, but I think we'll want to merge that PR against Joris' branch into here, so that we can get a feel for what actually using this is going to look like before merging it into master. |
util.is_nan is specifically for float-nan. The scalar-recognizing functions that need to recognize this are in _libs.missing. |
Joris merged in my changes to make StringArray use I want to call out one behavior change we may want to make: Right now, I'm proposing that we change that behavior to always return an integer-na dtype in those cases for |
+1 It's also not a "breaking" change, as string dtype is still new. In general, with those new dtypes I think we should try to return as much as possible also new dtypes for the result (eg also for boolean results, once boolean array has landed) |
Just to scope this PR, I think that using |
pandas/_libs/missing.pyx
Outdated
def _create_binary_propagating_op(name, divmod=False): | ||
|
||
def method(self, other): | ||
if isinstance(other, numbers.Number) or other is NA or isinstance(other, str): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you put the numbers.Number check last, as it will be least performant
I added some initial docs (whatsnew + section in missing_data |
@jorisvandenbossche I've looked at your doc changes. You may want to say something about that when you read data, the new |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Docs look very nice, thanks.
This also means that ``pd.NA`` cannot be used in a context where it is | ||
evaluated to a boolean, such as ``if condition: ...`` where ``condition`` can | ||
potentially be ``pd.NA``. In such cases, :func:`isna` can be used to check | ||
for ``pd.NA`` or it could be prevented that ``condition`` can be ``pd.NA`` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for ``pd.NA`` or it could be prevented that ``condition`` can be ``pd.NA`` | |
for ``pd.NA`` or ``condition`` being ``pd.NA`` can be avoided, for example by filling missing values beforehand. |
(can't apply this directly, since it wraps two lines)
cc @pandas-dev/pandas-core more comments on this? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. some very minor doc comments. can you run the asv's to see if isna is impacted in any way here.
|
||
Starting from pandas 1.0, an experimental ``pd.NA`` value (singleton) is | ||
available to represent scalar missing values. At this moment, it is used in | ||
the nullable integer and boolean data types and the dedicated string data type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use commas rather than 3 ands, ideally if you can link to those sections?
pd.NA | True | ||
|
||
On the other hand, if one of the operands is ``False``, the result depends | ||
on the value of the other operand. Therefore, in thise case ``pd.NA`` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thise -> this
``NA`` in a boolean context | ||
--------------------------- | ||
|
||
Since the actual value of an NA is unknown, it is ambiguous to convert NA |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we have a section gotchas.truth that is very similiar, could link.
Updated for the doc comments, thanks! @Dr-Irv I added a sentence on that. But we should probably add a more extensive section on that in a page dedicated to those new data types (for a follow-up). @jreback I run the |
thanks @jorisvandenbossche nice addition. unless I missed this, you didn't update nullable integers to return that, assume that's a followon. |
Yes, as being discussed, see #29556 for all follow-ups |
…ndexing-1row-df * upstream/master: (49 commits) repr() (pandas-dev#29959) DOC : Typo fix in userguide/Styling (pandas-dev#29956) CLN: small things in pytables (pandas-dev#29958) API/DEPR: Change default skipna behaviour + deprecate numeric_only in Categorical.min and max (pandas-dev#27929) DEPR: DTI/TDI/PI constructor arguments (pandas-dev#29930) CLN: fix pytables passing too many kwargs (pandas-dev#29951) Typing (pandas-dev#29947) repr() (pandas-dev#29948) repr() (pandas-dev#29950) Added space at the end of the sentence (pandas-dev#29949) ENH: add NA scalar for missing value indicator, use in StringArray. (pandas-dev#29597) CLN: BlockManager.apply (pandas-dev#29825) TST: add test for rolling max/min/mean with DatetimeIndex over different frequencies (pandas-dev#29932) CLN: explicit signature for to_hdf (pandas-dev#29939) CLN: make kwargs explicit for pytables read_ methods (pandas-dev#29935) Convert core/indexes/base.py to f-strings (pandas-dev#29903) DEPR: dropna multiple axes, fillna int for td64, from_codes with floats, Series.nonzero (pandas-dev#29875) CLN: make kwargs explicit in pytables constructors (pandas-dev#29936) DEPR: tz_convert in the Timestamp constructor raises (pandas-dev#29929) STY: F-strings and repr (pandas-dev#29938) ...
xref #28095, #28778
This PR adds a
pd.NA
singleton with the behaviour as discussed in above issues. For now, it's only used in StringArray in this PR.