[ENH] Add dt.fillna() function to impute missing values (#3311)
Add `dt.fillna()` function to replace missing values with the previous/subsequent non-missing value.

WIP for #3279
samukweku authored Aug 9, 2022
1 parent 888cc21 commit 7e70947
Showing 13 changed files with 479 additions and 3 deletions.
152 changes: 152 additions & 0 deletions docs/api/dt/fillna.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,152 @@

.. xfunction:: datatable.fillna
:src: src/core/expr/fexpr_fillna.cc pyfn_fillna
:tests: tests/dt/test-fillna.py
:cvar: doc_dt_fillna
:signature: fillna(cols, reverse=False)

.. x-version-added:: 1.1.0

For each column from `cols`, fill the missing values with the closest
previous or subsequent non-missing value. In the presence of :func:`by()`,
the filling is performed group-wise.

Parameters
----------
cols: FExpr
Input columns.

reverse: bool
If ``False``, each missing value is replaced with the closest
previous non-missing value. If ``True``, the closest subsequent
non-missing value is used instead.

return: FExpr
f-expression that converts input columns into the columns filled
with the previous/subsequent non-missing values.
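
The fill semantics described above can be sketched in plain Python, independent of datatable. The helper name `fill_missing` is hypothetical and `None` stands in for NA; this is an illustration of the documented behavior under those assumptions, not the library's implementation.

```python
def fill_missing(values, reverse=False):
    """Replace each None with the closest previous non-missing value,
    or with the closest subsequent one when reverse=True.
    Missing values that have no donor on that side stay None."""
    out = list(values)
    # Walk forward for a fill-down, backward for a fill-up.
    order = range(len(out) - 1, -1, -1) if reverse else range(len(out))
    last_seen = None
    for i in order:
        if out[i] is None:
            out[i] = last_seen
        else:
            last_seen = out[i]
    return out


# Same values as the var1 column used in the examples of this page.
var1 = [1.5, None, 2.1, 2.2, 1.2, 1.3, 2.4, None]
print(fill_missing(var1))
print(fill_missing(var1, reverse=True))
```

With `reverse=True` the trailing `None` is left in place because no later value exists to fill it, which matches the `NA` in the last row of the fill-up example.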


Examples
--------

Create a sample datatable frame::

>>> from datatable import dt, f, by
>>> DT = dt.Frame({'building': ['a', 'a', 'b', 'b', 'a', 'a', 'b', 'b'],
... 'var1': [1.5, None, 2.1, 2.2, 1.2, 1.3, 2.4, None],
... 'var2': [100, 110, 105, None, 102, None, 103, 107],
... 'var3': [10, 11, None, None, None, None, None, None],
... 'var4': [1, 2, 3, 4, 5, 6, 7, 8]})
>>> DT
| building var1 var2 var3 var4
| str32 float64 int32 int32 int32
-- + -------- ------- ----- ----- -----
0 | a 1.5 100 10 1
1 | a NA 110 11 2
2 | b 2.1 105 NA 3
3 | b 2.2 NA NA 4
4 | a 1.2 102 NA 5
5 | a 1.3 NA NA 6
6 | b 2.4 103 NA 7
7 | b NA 107 NA 8
[8 rows x 5 columns]

Fill down on a single column::

>>> DT[:, dt.fillna(f.var1)]
| var1
| float64
-- + -------
0 | 1.5
1 | 1.5
2 | 2.1
3 | 2.2
4 | 1.2
5 | 1.3
6 | 2.4
7 | 2.4
[8 rows x 1 column]

Fill up on a single column::

>>> DT[:, dt.fillna(f.var1, reverse = True)]
| var1
| float64
-- + -------
0 | 1.5
1 | 2.1
2 | 2.1
3 | 2.2
4 | 1.2
5 | 1.3
6 | 2.4
7 | NA
[8 rows x 1 column]


Fill down on multiple columns::

>>> DT[:, dt.fillna(f['var1':])]
| var1 var2 var3 var4
| float64 int32 int32 int32
-- + ------- ----- ----- -----
0 | 1.5 100 10 1
1 | 1.5 110 11 2
2 | 2.1 105 11 3
3 | 2.2 105 11 4
4 | 1.2 102 11 5
5 | 1.3 102 11 6
6 | 2.4 103 11 7
7 | 2.4 107 11 8
[8 rows x 4 columns]


Fill up on multiple columns::

>>> DT[:, dt.fillna(f['var1':], reverse = True)]
| var1 var2 var3 var4
| float64 int32 int32 int32
-- + ------- ----- ----- -----
0 | 1.5 100 10 1
1 | 2.1 110 11 2
2 | 2.1 105 NA 3
3 | 2.2 102 NA 4
4 | 1.2 102 NA 5
5 | 1.3 103 NA 6
6 | 2.4 103 NA 7
7 | NA 107 NA 8
[8 rows x 4 columns]


Fill down in the presence of :func:`by()`::

>>> DT[:, dt.fillna(f['var1':]), by('building')]
| building var1 var2 var3 var4
| str32 float64 int32 int32 int32
-- + -------- ------- ----- ----- -----
0 | a 1.5 100 10 1
1 | a 1.5 110 11 2
2 | a 1.2 102 11 5
3 | a 1.3 102 11 6
4 | b 2.1 105 NA 3
5 | b 2.2 105 NA 4
6 | b 2.4 103 NA 7
7 | b 2.4 107 NA 8
[8 rows x 5 columns]


Fill up in the presence of :func:`by()`::

>>> DT[:, dt.fillna(f['var1':], reverse = True), by('building')]
| building var1 var2 var3 var4
| str32 float64 int32 int32 int32
-- + -------- ------- ----- ----- -----
0 | a 1.5 100 10 1
1 | a 1.2 110 11 2
2 | a 1.2 102 NA 5
3 | a 1.3 NA NA 6
4 | b 2.1 105 NA 3
5 | b 2.2 103 NA 4
6 | b 2.4 103 NA 7
7 | b NA 107 NA 8
[8 rows x 5 columns]
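
The group-wise behavior shown in the two `by()` examples can be reproduced with a small pure-Python sketch. The function name `fill_missing_by_group` is hypothetical, `None` stands in for NA, and the output is ordered by group key to mirror the tables above; none of this is datatable code.

```python
def fill_missing_by_group(keys, values, reverse=False):
    """Fill None values separately within each group.

    Returns (key, value) pairs ordered by group key; rows inside a
    group keep their original relative order.
    """
    groups = {}
    for key, value in zip(keys, values):
        groups.setdefault(key, []).append(value)

    result = []
    for key in sorted(groups):
        vals = groups[key][::-1] if reverse else groups[key]
        filled, last_seen = [], None
        for v in vals:
            if v is not None:
                last_seen = v
            filled.append(last_seen)
        if reverse:
            filled.reverse()
        result.extend((key, v) for v in filled)
    return result


building = ['a', 'a', 'b', 'b', 'a', 'a', 'b', 'b']
var2 = [100, 110, 105, None, 102, None, 103, 107]
print(fill_missing_by_group(building, var2))
print(fill_missing_by_group(building, var2, reverse=True))
```

A value is never carried across a group boundary: in the fill-up case the last `None` of group `a` stays missing even though group `b` has later values.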
4 changes: 4 additions & 0 deletions docs/api/fexpr.rst
@@ -175,6 +175,9 @@
* - :meth:`.cumsum()`
- Same as :func:`dt.cumsum()`.

* - :meth:`.fillna()`
- Same as :func:`dt.fillna()`.

* - :meth:`.first()`
- Same as :func:`dt.first()`.

@@ -303,6 +306,7 @@
.cumprod() <fexpr/cumprod>
.cumsum() <fexpr/cumsum>
.extend() <fexpr/extend>
.fillna() <fexpr/fillna>
.first() <fexpr/first>
.last() <fexpr/last>
.len() <fexpr/len>
7 changes: 7 additions & 0 deletions docs/api/fexpr/fillna.rst
@@ -0,0 +1,7 @@

.. xmethod:: datatable.FExpr.fillna
:src: src/core/expr/fexpr.cc PyFExpr::fillna
:cvar: doc_FExpr_fillna
:signature: fillna(reverse=False)

Equivalent to :func:`dt.fillna(cols, reverse=False)`.
3 changes: 3 additions & 0 deletions docs/api/index-api.rst
@@ -175,6 +175,8 @@ Functions
- Calculate the cumulative sum of values per column
* - :func:`cov()`
- Calculate covariance between two columns
* - :func:`fillna()`
- Impute missing values
* - :func:`max()`
- Find the largest element per column
* - :func:`mean()`
@@ -252,6 +254,7 @@ Other
cut() <dt/cut>
dt <dt/dt>
f <dt/f>
fillna() <dt/fillna>
first() <dt/first>
fread() <dt/fread>
g <dt/g>
1 change: 0 additions & 1 deletion docs/manual/comparison_with_rdatatable.rst
@@ -666,7 +666,6 @@ equivalent in ``datatable`` yet, that we would likely implement

- Missing values functions

- `nafill <https://rdatatable.gitlab.io/data.table/reference/nafill.html>`__
- `fcoalesce <https://rdatatable.gitlab.io/data.table/reference/coalesce.html>`__

Also, at the moment, custom aggregations in the ``j`` section are not supported
3 changes: 3 additions & 0 deletions docs/releases/v1.1.0.rst
@@ -98,6 +98,9 @@
-[new] Class :class:`dt.FExpr` now has method :meth:`.countna()`,
which behaves exactly as the equivalent base level function :func:`dt.countna()`.
-[new] Added function :func:`dt.fillna()`, as well as :meth:`.fillna()` method,
to impute missing values. [#3279]
-[enh] Function :func:`dt.re.match()` now supports case insensitive matching. [#3216]
-[enh] Function :func:`dt.qcut()` can now be used in a groupby context. [#3165]
2 changes: 2 additions & 0 deletions src/core/documentation.h
@@ -36,6 +36,7 @@ extern const char* doc_dt_cummin;
extern const char* doc_dt_cumprod;
extern const char* doc_dt_cumsum;
extern const char* doc_dt_cut;
extern const char* doc_dt_fillna;
extern const char* doc_dt_first;
extern const char* doc_dt_fread;
extern const char* doc_dt_ifelse;
@@ -289,6 +290,7 @@ extern const char* doc_FExpr_cummin;
extern const char* doc_FExpr_cumprod;
extern const char* doc_FExpr_cumsum;
extern const char* doc_FExpr_extend;
extern const char* doc_FExpr_fillna;
extern const char* doc_FExpr_first;
extern const char* doc_FExpr_last;
extern const char* doc_FExpr_max;
15 changes: 15 additions & 0 deletions src/core/expr/fexpr.cc
@@ -365,6 +365,21 @@ DECLARE_METHOD(&PyFExpr::cumsum)
->docs(dt::doc_FExpr_cumsum);



oobj PyFExpr::fillna(const XArgs& args) {
auto fillnaFn = oobj::import("datatable", "fillna");
oobj reverse = args[0]? args[0].to_oobj() : py::obool(false);
return fillnaFn.call({this, reverse});
}

DECLARE_METHOD(&PyFExpr::fillna)
->name("fillna")
->docs(dt::doc_FExpr_fillna)
->arg_names({"reverse"})
->n_positional_or_keyword_args(1)
->n_required_args(0);


oobj PyFExpr::first(const XArgs&) {
auto firstFn = oobj::import("datatable", "first");
return firstFn.call({this});
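
The `PyFExpr::fillna` wrapper in this hunk forwards the expression itself plus the `reverse` flag to the module-level `fillna`. A toy Python analogue of that delegation pattern follows; the `Expr` class and the tuple returned by `fillna` are hypothetical stand-ins used only to make the forwarding visible, not datatable types.

```python
def fillna(cols, reverse=False):
    # Stand-in for the module-level function: it only records its
    # arguments so the delegation can be inspected.
    return ("fillna", cols.name, reverse)


class Expr:
    """Hypothetical expression object with a thin method wrapper."""

    def __init__(self, name):
        self.name = name

    def fillna(self, reverse=False):
        # The method passes itself as `cols` and forwards `reverse`,
        # so both spellings build the same result.
        return fillna(self, reverse=reverse)


e = Expr("var1")
print(e.fillna(reverse=True) == fillna(e, reverse=True))
```

Keeping the method as a one-line forwarder means the filling logic lives in a single place, and the method form cannot drift from the function form.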
1 change: 1 addition & 0 deletions src/core/expr/fexpr.h
@@ -187,6 +187,7 @@ class PyFExpr : public py::XObject<PyFExpr> {
py::oobj cumprod(const py::XArgs&);
py::oobj cumsum(const py::XArgs&);
py::oobj extend(const py::XArgs&);
py::oobj fillna(const py::XArgs&);
py::oobj first(const py::XArgs&);
py::oobj last(const py::XArgs&);
py::oobj max(const py::XArgs&);
138 changes: 138 additions & 0 deletions src/core/expr/fexpr_fillna.cc
@@ -0,0 +1,138 @@
//------------------------------------------------------------------------------
// Copyright 2022 H2O.ai
//
// Permission is hereby granted, free of charge, to any person obtaining a
// copy of this software and associated documentation files (the "Software"),
// to deal in the Software without restriction, including without limitation
// the rights to use, copy, modify, merge, publish, distribute, sublicense,
// and/or sell copies of the Software, and to permit persons to whom the
// Software is furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in
// all copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
// FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
// IN THE SOFTWARE.
//------------------------------------------------------------------------------
#include "documentation.h"
#include "expr/fexpr_func.h"
#include "expr/eval_context.h"
#include "python/xargs.h"
#include "parallel/api.h"
namespace dt {
namespace expr {


class FExpr_FillNA : public FExpr_Func {
private:
ptrExpr arg_;
bool reverse_;
size_t : 56;

public:
FExpr_FillNA(ptrExpr &&arg, bool reverse)
: arg_(std::move(arg)),
reverse_(reverse)
{}


std::string repr() const override {
std::string out = "fillna";
out += '(';
out += arg_->repr();
out += ", reverse=";
out += reverse_? "True" : "False";
out += ')';
return out;
}


template <bool REVERSE>
static RowIndex fill_rowindex(Column& col, const Groupby& gby) {
Buffer buf = Buffer::mem(static_cast<size_t>(col.nrows()) * sizeof(int32_t));
auto indices = static_cast<int32_t*>(buf.xptr());

dt::parallel_for_dynamic(
gby.size(),
[&](size_t gi) {
size_t i1, i2;
gby.get_group(gi, &i1, &i2);
size_t fill_id = REVERSE? i2 - 1 : i1;

if (REVERSE) {
for (size_t i = i2; i-- > i1;) {
bool is_valid = col.get_element_isvalid(i);
fill_id = is_valid? i : fill_id;
indices[i] = static_cast<int32_t>(fill_id);
}
} else {
for (size_t i = i1; i < i2; ++i) {
bool is_valid = col.get_element_isvalid(i);
fill_id = is_valid? i : fill_id;
indices[i] = static_cast<int32_t>(fill_id);
}
}

}
);

return RowIndex(std::move(buf), RowIndex::ARR32|RowIndex::SORTED);
}


Workframe evaluate_n(EvalContext &ctx) const override {
Workframe wf = arg_->evaluate_n(ctx);
Groupby gby = Groupby::single_group(wf.nrows());
if (ctx.has_groupby()) {
wf.increase_grouping_mode(Grouping::GtoALL);
gby = ctx.get_groupby();
}

for (size_t i = 0; i < wf.ncols(); ++i) {
Column coli = wf.retrieve_column(i);
bool is_grouped = ctx.has_group_column(
wf.get_frame_id(i),
wf.get_column_id(i)
);

auto stats = coli.get_stats_if_exist();
bool na_stats_exists = stats && stats->is_computed(Stat::NaCount);
bool has_nas = na_stats_exists? stats->nacount()
: true;

if (has_nas && !is_grouped){
RowIndex ri = reverse_? fill_rowindex<true>(coli, gby)
: fill_rowindex<false>(coli, gby);
coli.apply_rowindex(ri);
}
wf.replace_column(i, std::move(coli));
}

return wf;
}

};


static py::oobj pyfn_fillna(const py::XArgs &args) {
auto column = args[0].to_oobj();
auto reverse = args[1].to<bool>(false);
return PyFExpr::make(new FExpr_FillNA(as_fexpr(column), reverse));
}


DECLARE_PYFN(&pyfn_fillna)
->name("fillna")
->docs(doc_dt_fillna)
->arg_names({"column", "reverse"})
->n_required_args(1)
->n_positional_args(1)
->n_positional_or_keyword_args(1);


}} // dt::expr
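
`fill_rowindex` does not copy values. It builds, for every row, the index of the row whose value should appear there, and the column is then viewed through that index. A minimal Python sketch of the same idea follows; the names `fill_indices` and `take` are hypothetical, and groups are assumed to be contiguous half-open `(start, end)` ranges, as the `i1`/`i2` bounds in the C++ loop suggest.

```python
def fill_indices(is_valid, groups, reverse=False):
    """For each row, return the index of the row that supplies its value."""
    indices = [0] * len(is_valid)
    for start, end in groups:
        if reverse:
            fill_id = end - 1
            rows = range(end - 1, start - 1, -1)
        else:
            fill_id = start
            rows = range(start, end)
        for i in rows:
            # A valid row becomes the new donor; an invalid row reuses
            # the most recent donor seen in the walking direction.
            if is_valid[i]:
                fill_id = i
            indices[i] = fill_id
    return indices


def take(values, indices):
    """Gather values through an index array (the row-index view)."""
    return [values[i] for i in indices]


values = [None, 1.5, None, 2.1, None]
valid = [v is not None for v in values]
down = fill_indices(valid, [(0, len(values))])
up = fill_indices(valid, [(0, len(values))], reverse=True)
print(down, take(values, down))
print(up, take(values, up))
```

A missing value with no donor points at a row that is itself missing (the first row of the group for a fill-down, the last for a fill-up), so it stays missing after the gather. The indices never decrease, which is consistent with the `RowIndex::SORTED` flag passed in the C++ code.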