[ENH] Add dt.fillna() function to impute missing values #3311

Merged
merged 26 commits into from
Aug 9, 2022

Conversation

samukweku
Contributor

@samukweku samukweku commented Jul 14, 2022

Add dt.fillna() function to replace missing values with the previous/subsequent non-missing value.
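
For readers unfamiliar with the terminology, the fill behaviour described here can be sketched in plain Python. This is only an illustration of the semantics, not the C++ implementation in this PR: `fill_missing` and its `reverse` parameter are hypothetical names, and `None` stands in for a missing value.

```python
from typing import List, Optional, TypeVar

T = TypeVar("T")


def fill_missing(values: List[Optional[T]], reverse: bool = False) -> List[Optional[T]]:
    """Replace each None with the nearest non-missing value (hypothetical helper).

    reverse=False: use the previous non-missing value (forward fill).
    reverse=True: use the subsequent non-missing value (backward fill).
    Missing values with no donor on that side stay None.
    """
    n = len(values)
    order = range(n - 1, -1, -1) if reverse else range(n)
    result = list(values)
    last_seen = None
    for i in order:
        if result[i] is None:
            result[i] = last_seen  # stays None if nothing has been seen yet
        else:
            last_seen = result[i]
    return result


forward = fill_missing([None, 1, None, None, 4, None])
backward = fill_missing([None, 1, None, None, 4, None], reverse=True)
print(forward)   # [None, 1, 1, 1, 4, 4]
print(backward)  # [1, 1, 4, 4, 4, None]
```

Note that a leading missing value has nothing to copy from in a forward fill, so it remains missing; the same holds for a trailing one in a backward fill.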

WIP for #3279

@samukweku samukweku added the new feature Feature requests for new functionality label Jul 14, 2022
@samukweku samukweku self-assigned this Jul 14, 2022
@samukweku
Contributor Author

I can piggyback on #3310 to resolve the padding warnings once you are done with the update. For fillna, I feel all columns should be able to take advantage of it, except maybe the array columns. If you do not mind, @oleksiyskononenko, how do I go about fixing the error when running the function on string and time/date columns? Thanks.

@samukweku
Contributor Author

I'm also thinking it might be OK, as a convenience option, to fill nulls here with a scalar, similar to what replace/if-else would do, @oleksiyskononenko.
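
To make the scalar-fill idea concrete, here is a minimal plain-Python sketch. The function name `fill_with_scalar` is hypothetical and is not claimed to match any signature in this PR; `None` stands in for a missing value.

```python
from typing import List, Optional, TypeVar

T = TypeVar("T")


def fill_with_scalar(values: List[Optional[T]], value: T) -> List[T]:
    # Element-wise "value if x is missing else x", which is what an
    # if/else expression over the column would compute.
    return [value if x is None else x for x in values]


filled = fill_with_scalar([3, None, 7, None], 0)
print(filled)  # [3, 0, 7, 0]
```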

@samukweku samukweku mentioned this pull request Jul 14, 2022
8 tasks
@oleksiyskononenko
Contributor

Yes, I think fillna() could be applicable to all the columns. Internally, DATE32 should be processed as INT32 and TIME64 as INT64.
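
A rough Python analogy of this suggestion: convert dates to an integer representation, run the integer fill, and convert back. The sketch assumes a date is stored as a count of days since 1970-01-01, and the helper names are hypothetical; the real code would reinterpret the column's storage rather than convert values one by one.

```python
from datetime import date, timedelta
from typing import List, Optional

EPOCH = date(1970, 1, 1)  # assumed reference point for the integer encoding


def forward_fill_ints(values: List[Optional[int]]) -> List[Optional[int]]:
    """Forward fill over plain integers (hypothetical helper)."""
    out: List[Optional[int]] = []
    last: Optional[int] = None
    for x in values:
        if x is not None:
            last = x
        out.append(last)
    return out


def forward_fill_dates(dates: List[Optional[date]]) -> List[Optional[date]]:
    # Encode dates as day counts, reuse the integer fill, then decode.
    days = [None if d is None else (d - EPOCH).days for d in dates]
    filled = forward_fill_ints(days)
    return [None if n is None else EPOCH + timedelta(days=n) for n in filled]


filled_dates = forward_fill_dates(
    [date(2022, 7, 14), None, date(2022, 8, 9), None]
)
```

The point of the design is that no date-specific fill logic is needed: once the values are integers, the existing integer code path handles them.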

@samukweku
Contributor Author

@oleksiyskononenko, when you can, kindly have a look at my mock code. I haven't gotten far with it, as I am still unsure about the RowIndex. My attempts are expensive when converting the boolean column to a row index: the runtime goes from microseconds to milliseconds for 5 rows.

@oleksiyskononenko
Contributor

I've merged main onto this PR, because we have some changes in the building pipeline.

@oleksiyskononenko oleksiyskononenko changed the title [ENH] Fillna [ENH] Add dt.fillna() function to impute missing values Aug 9, 2022
@oleksiyskononenko oleksiyskononenko added this to the Release 1.1.0 milestone Aug 9, 2022
@oleksiyskononenko
Contributor

@samukweku I made some changes to this PR, mostly cosmetic. See if they all make sense to you; otherwise, I guess we are ready to merge it. Thanks!

wf.get_frame_id(i),
wf.get_column_id(i)
);
if (!is_grouped){
Contributor Author

@oleksiyskononenko is the null count check irrelevant? I thought it would skip if there were no nulls. Or does the is_grouped boolean somehow cover that?

Contributor

It would be nice to have this check if it came for free. However, to actually calculate the number of nulls in the column, one needs to loop over the column data, much like the loop we implement in fillna(). Since most columns have some missing data, in most cases we would have to go through the data twice. My feeling is that in this case it is better to remove the check.

Contributor

One thing we can actually do here is check whether the stats are already computed and the number of NAs is already known, something similar to https://github.com/h2oai/datatable/blob/main/src/core/column/column_impl.cc#L175

Contributor Author

I assumed it was already precomputed and available?

Contributor

Nope, until touched it is not there and needs to be computed. Let me push an update, so that we reuse it if available.
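
The trade-off discussed in this thread can be sketched with a toy Python model. The `Column` class, its lazily cached NA count, and the `fillna` function below are hypothetical stand-ins, not the datatable internals: the sketch only shows the idea of skipping the fill when the NA count is already known to be zero, while never paying for a separate counting pass.

```python
from typing import List, Optional


class Column:
    """Toy column whose NA count is computed lazily, then cached."""

    def __init__(self, data: List[Optional[int]]):
        self.data = list(data)
        self._na_count: Optional[int] = None  # unknown until requested

    def na_count(self) -> int:
        # Computing the stat costs one full pass over the data.
        if self._na_count is None:
            self._na_count = sum(x is None for x in self.data)
        return self._na_count

    def na_count_if_known(self) -> Optional[int]:
        # Never triggers a computation; returns None when not cached.
        return self._na_count


def fillna(col: Column) -> Column:
    # Free shortcut: only taken when the stat was computed earlier.
    if col.na_count_if_known() == 0:
        return col
    out: List[Optional[int]] = []
    last: Optional[int] = None
    for x in col.data:  # the single pass that does the actual fill
        if x is not None:
            last = x
        out.append(last)
    return Column(out)


clean = Column([1, 2, 3])
clean.na_count()                 # someone "touched" the stats earlier
same = fillna(clean)             # shortcut taken, no loop over the data

gappy = Column([1, None, 3])
result = fillna(gappy)           # stats unknown, so just fill in one pass
```

In this model `fillna` never calls `na_count()` itself, so a column with unknown stats is traversed once rather than twice.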

@samukweku
Contributor Author

@oleksiyskononenko thanks for the review. I have a question regarding the null count check, which was removed. Aside from that, it is good to merge. Thanks!

@oleksiyskononenko oleksiyskononenko merged commit 7e70947 into main Aug 9, 2022
@oleksiyskononenko oleksiyskononenko deleted the samukweku/fillna branch August 9, 2022 10:15
samukweku added a commit that referenced this pull request Aug 10, 2022
Add `dt.fillna()` function to replace missing values with the previous/subsequent non-missing.

WIP for #3279
Labels
new feature Feature requests for new functionality
2 participants