Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow fillna(value=None, method="constant") #28124

Closed
wants to merge 1 commit into from
Closed

Allow fillna(value=None, method="constant") #28124

wants to merge 1 commit into from

Conversation

valtron
Copy link

@valtron valtron commented Aug 24, 2019

Sometimes data has object columns that contain both NaNs and None and we want to standardize them to None. Currently fillna doesn't allow value=None because None is used to mean "no value", and method=None to mean "fill with the value", so when both are none it's considered invalid. (On columns with dtypes that have their own NAs, like floats, timestamps, etc., filling with None leaves them as-is.)

To get around this, I added method="constant" which should also be taken to mean "fill with the value". method="constant" is required if filling with None (so that value=None, method=None still throws), otherwise it works the same as before.

As far as I can tell, doing it this way shouldn't break any existing code. I updated/added some tests for the new behaviour. If it looks like this change might be accepted, I'll update the docs as well.

  • tests added / passed
  • passes black pandas
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

@pep8speaks
Copy link

pep8speaks commented Aug 24, 2019

Hello @valtron! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 442:89: E501 line too long (92 > 88 characters)

Comment last updated at 2019-08-24 05:29:33 UTC

@simonjayhawkins simonjayhawkins added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate API Design labels Aug 24, 2019
@gfyoung gfyoung requested a review from jreback August 24, 2019 22:56
@WillAyd
Copy link
Member

WillAyd commented Aug 26, 2019

I think I'm -1 on this as it adds complexity for a rather niche case. What real advantage are you hoping to get out of replacing NA with None?

@TomAugspurger
Copy link
Contributor

I'm also initially against this. We have a few things to work out with NA values (#28095). It's not clear how None will be handled there.

If we wanted to support this, the easier way would be

default_fill_value = object()
def fillna(..., value=default_fill_value, ...):

If the user passes None, it'd be up to the dtype to determine whether or not that's a valid fill value. But we'll need to settle on whether this is desirable before moving forward.

@valtron if you have use cases where None is a useful value, you may want to speak up in #28095.

@valtron
Copy link
Author

valtron commented Aug 26, 2019

@WillAyd I've had object columns that contain NaN (and possibly a mixture of None and Nan) and currently I do c.loc[c.isna()] = None to standardize things and simplify stuff further down the line. I prefer None to NaN (otherwise I could use c.fillna(np.nan) to standardize) because the values have to be compared later and I don't want to write (c1 == c2) or (pd.isna() and pd.isna()).

BTW, currently method=None means "fill with constant given by value" (cf. numpy.pad) so this isn't a "complex" change.

@TomAugspurger I tried implementing it that way (using a sentinel for the default value) at first, but it requires more changes, and wouldn't be backward-compatible (e.g. fillna(None, method="pad") would become an error).

@WillAyd
Copy link
Member

WillAyd commented Aug 27, 2019

@TomAugspurger I tried implementing it that way (using a sentinel for the default value) at first, but it requires more changes, and wouldn't be backward-compatible (e.g. fillna(None, method="pad") would become an error).

Can you expand on what "more changes" entails? Understood concern on the latter piece but generally this seems like a more ideal approach to get what you are after

@valtron
Copy link
Author

valtron commented Aug 28, 2019

I spent a bit more time on the alternate implementation (https://github.com/valtron/pandas/commit/66e987faca4a677928479e402302ddefb5099398); I don't have all the tests passing yet, but so far I'm guessing the remaining changes would all involve tracking down fill_values in other functions and changing them to default to MISSING, change the logic from checking from None -> MISSING, etc.

@WillAyd
Copy link
Member

WillAyd commented Sep 13, 2019

Thanks for the contribution! I think this is a hold for now though, and as mentioned before probably worth voicing thoughts in #28095 to support this feature

@WillAyd WillAyd closed this Sep 13, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants