-
-
Notifications
You must be signed in to change notification settings - Fork 18.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use new NA scalar in BooleanArray #29961
Use new NA scalar in BooleanArray #29961
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM at a glance. Request for a few more tests / confirmation that these are already testsed:
- Test to enure that
array([True, False, None, np.nan pd.NA], dtype="boolean")
correctly sanitizes all the NA-like value to be NA. - Test in
BooleanArray.__setitem__
ensuring thatarr[0] = np.nan
, etc., always insertsNA
.
One question that comes up here (but the same question applies to string and integer arrays): what "NA value" do we want to use when converting to object dtype numpy arrays? In the initial BooleanArray PR, I used |
I'm not sure, but my initial preference is for having def to_numpy(self, dtype=object, na_value=pd.NA):
... Then if the user wants None / NaN, they can request it relatively easily.
What are the user-actions that hit this?
|
Examples where the Now, both are easy to solve (certainly since both can easily be solved on our side; the hashing code can recognize pd.NA and for the conversion to pyarrow we can ensure to use None before passing to pyarrow). But it are examples of how other code can break. Still not sure that we should therefore use None as the default in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the "what value to use when converting", I'm not sure that downstream projects not understanding pd.NA
should drive our decision here. We'll be living with this decision for a while, and projects will have time to adapt.
pandas/core/arrays/boolean.py
Outdated
@@ -281,7 +281,9 @@ def __getitem__(self, item): | |||
return self._data[item] | |||
return type(self)(self._data[item], self._mask[item]) | |||
|
|||
def _coerce_to_ndarray(self, force_bool: bool = False): | |||
def _coerce_to_ndarray( | |||
self, force_bool: bool = False, na_value: "Scalar" = lib._no_default |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason to prefer lib_no_default
to just libmissing.NA
directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, not sure. I was probably sidetracked by the idea I could not use None as default as that is a valid value as well ..
(if we want to have this generic / shared with other arrays, using self._na_value
might be useful, but I don't think we will share this with arrays that don't use pd.NA, so ..)
@TomAugspurger updated this |
See #30043 for the CI failures. OK to ignore for now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can move forward with this, despite the ongoing API discussion about .to_numpy
/ .astype(float)
/ np.asarray(arr)
, since that will be affecting all IntegerArray / StringArray / BooleanArray.
if is_integer_dtype(dtype): | ||
if self.isna().any(): | ||
raise ValueError("cannot convert NA to integer") | ||
# for float dtype, ensure we use np.nan before casting (numpy cannot |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pending the discussion in #30038.
One more note: we'll need to handle reductions like In [16]: pd.array([True, None])._reduce('all')
Out[16]: True
In [17]: pd.array([True, None])._reduce('all', skipna=False)
Out[17]: True
In [18]: pd.array([False, None])._reduce('all', )
Out[18]: False
In [19]: pd.array([False, None])._reduce('all', skipna=False)
Out[19]: False Do we want to do that here or as a followup? |
Let's do that as a follow-up, that's a remainder from the initial implementation, it's noted as a to do item in #29556 |
Sounds good to me. |
OK, going to merge this then, so the Kleene PR can then be updated. |
Once the IntegerArray PR is in, I will also take a look at consolidating both classes |
I'll update the Kleene PR now, and the Integer one after if I have a chance. |
I did a quick PR for the any/all: #30062 |
Follow-up on #29597 and #29555, now actually using the
pd.NA
scalar inBooleanArray
.