-
Notifications
You must be signed in to change notification settings - Fork 160
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ENH] Add dt.fillna()
function to impute missing values
#3311
Conversation
I can piggyback on #3310 on resolving the padding warnings, once you are done with the update. For fillna, I feel all columns should be able to take advantage of it, excluding maybe the array columns. If you do not mind @oleksiyskononenko , how do I go about fixing the error when running the function on string and time/date columns? thanks |
I'm also thinking it might be ok, as a convenience option to fill nulls here with a scalar, similar to what |
Yes, I think |
@oleksiyskononenko when you can kindly have a look at my mock code; haven't gotten far with it, as I am still unsure about the RowIndex. My attempts are expensive when converting the boolean column to a row index, from microseconds to milliseconds for 5 rows. |
82725b8
to
8306a55
Compare
291241c
to
8012369
Compare
I've merged |
dt.fillna()
function to impute missing values
@samukweku I did some changes to this PR, mostly cosmetic. See if they all make sense to you, otherwise I guess we are ready to merge it. Thanks! |
src/core/expr/fexpr_fillna.cc
Outdated
wf.get_frame_id(i), | ||
wf.get_column_id(i) | ||
); | ||
if (!is_grouped){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@oleksiyskononenko is the null count check irrelevant? I thought it would skip if there were no nulls? or does the is_grouped
boolean somehow cover that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to have this check if it comes for free. However, to actually calculate the number of nulls in the column one needs to loop over the column data in a loop that is kind of similar to the loop which we implement in fillna()
. Since most of the columns have some missing data, it means that in most of the cases we will have to go through the data twice. My feeling that in this case it is better to remove the check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing we can actually do here is to check if the stats is already computed and the number of NA's is already known, something similar to https://github.com/h2oai/datatable/blob/main/src/core/column/column_impl.cc#L175
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assumed it was already precomputed and available?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope, until touched it is not there and needs to be computed. Let me push an update, so that we reuse it if available.
@oleksiyskononenko thanks for the review; I have a question regarding the null count check which was removed. asides that, it is good to merge. thanks ! |
Add `dt.fillna()` function to replace missing values with the previous/subsequent non-missing. WIP for #3279
Add
dt.fillna()
function to replace missing values with the previous/subsequent non-missing.WIP for #3279