Implement cumulative functions #3279

samukweku · 2022-05-14T20:46:22Z

The list of functions to be implemented and the corresponding PR's

cumsum() [ENH] Implement cumsum() function #3257
cumprod() [ENH] Add cumprod() function for cumulative product calculation #3304
cummax() [ENH] Add cummin() and cummax() functions for cumulative min/max calculation #3288
cummin() [ENH] Add cummin() and cummax() functions for cumulative min/max calculation #3288
cumcount() [ENH] Add cumcount() and ngroup() functions #3310
ngroup() - not strictly cumulative [ENH] Add cumcount() and ngroup() functions #3310
fillna() for forward/backward fill [ENH] Add dt.fillna() function to impute missing values #3311
fillna() for filling with a value [ENH] Enhance dt.fillna() to support filling with a particular value #3344
~~- [ ] rank~~ continued on similar to pandas.Series.rank() #3148
~~- [ ] rolling aggregations~~ continued on Rolling aggregate support based on windows within a DT #1500

The text was updated successfully, but these errors were encountered:

Implement `cumsum()` function. WIP for #3279

ghost · 2022-05-17T14:26:57Z

I think these are very useful functions. Great to see them soon!

vopani · 2022-05-28T08:28:03Z

These would be great and long due.

Is there an ETA of when this would be available (next release)?

samukweku · 2022-05-28T08:32:58Z

@vopani currently working on them ... cant specify an ETA though ... contributions are welcome also. You can test the cumsum function by downloading the latest dev version

samukweku · 2022-05-31T08:11:45Z

@vopani Kindly create a minimal example of the cumsum for both datatable and pandas; it is easier to grok.

vopani · 2022-05-31T09:07:31Z

@samukweku Yeah apologies, let me run some more tests and share better examples / feedback.

Thanks again for these cumulative functions, they are super useful and help in bridging the gap to pandas.

samukweku · 2022-05-31T09:43:22Z

Thanks also to @oleksiyskononenko and @st-pasha for their guidance thru my C++ journey

Implement `cumsum()` function. WIP for #3279

samukweku · 2022-06-10T23:03:04Z

tests comparison with np.cumsum, where numpy is twice as fast, and seems to be run on a single core:

import numpy as np
from datatable import dt, f

In [12]: a = np.arange(10_000_000)

In [13]: %timeit a.cumsum()
22.2 ms ± 196 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [14]: DT = dt.Frame(a)

In [15]: %timeit DT[:, f.C0.cumsum()]
46.7 ms ± 1.82 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

oleksiyskononenko · 2022-06-10T23:08:41Z

@samukweku we need to address #3081 to improve performance in the case when there is no group-by context. For grouped frames we're fully parallel now.

For cumulative functions we actually don't have a separate issue and I'm not sure if we can call cumsum() etc to be reducers. But anyways, our logic is now parallel in terms of the groups, but not in terms of the actual data. We probably need to improve it.

There is still a question why numpy is x2 faster on a single core. We need to do some profiling to see if there are any bottlenecks in our code.

samukweku · 2022-06-10T23:35:50Z

thanks @oleksiyskononenko , maybe you can explain more what you mean by parallelisation in terms of the actual data. How is profiling done in C++? by the way, is there any way to do interactive work in C++?

…ation (#3288) WIP for #3279

samukweku · 2022-06-17T03:26:51Z

@oleksiyskononenko I was reading up on cumulative sum, and found a possible performance option with Fenwick tree. What are your thoughts on it? is it worth the effort?

As an aside, when we call get_edit_datatable, does it return a vector? where do I trace this to see its implementation

I'm also thinking of building on your example of templates and convert the cumsum implementation to support cumprod, since they are essentially the same, just the change of the operator.

oleksiyskononenko · 2022-06-17T19:33:19Z

@samukweku

maybe you can explain more what you mean by parallelisation in terms of the actual data.

It means that we parallelize loops to go over the frame rows, currently our parallel loops for cumulative functions/reducers go over the groups.

How is profiling done in C++?

To start with you can read couple of threads on SO, something like https://stackoverflow.com/questions/375913/how-can-i-profile-c-code-running-on-linux

by the way, is there any way to do interactive work in C++?

Probably yes, I don't have experience with that.

I was reading up on cumulative sum, and found a possible performance option with Fenwick tree. What are your thoughts on it? is it worth the effort?

First, let's looks through the code to see if there are any obvious bottlenecks.
Second, we should decide if the cumsum() performance is acceptable or not. Is it currently x2 slower than numpy?
Then we could discuss ways to improve performance if needed.

As an aside, when we call get_edit_datatable, does it return a vector? where do I trace this to see its implementation

We don't have get_edit_datatable() method. Perhaps you mean get_data_editable()? It returns a pointer to the column's underlying buffers. Implementation depends on the column type. See, for instance:
https://github.com/h2oai/datatable/blob/main/src/core/column/sentinel_fw.cc#L234-L238
https://github.com/h2oai/datatable/blob/main/src/core/column/sentinel_str.cc#L118-L122

I'm also thinking of building on your example of templates and convert the cumsum implementation to support cumprod, since they are essentially the same, just the change of the operator.

Sounds like a great idea. Btw, one more difference between cumsum() and cumprod() is the default value: 0 vs 1.

- add `cumprod()` function for cumulative product calculation; - refactor `cumsum()` internals to also support `cumprod()`; WIP for #3279

samukweku · 2022-06-30T00:38:02Z

@oleksiyskononenko @Peter-Pasta @vopani should we stick to the name cumcount or use row_number. I bring up row_number because of sql. checking to see which one is more intuitive

oleksiyskononenko · 2022-06-30T04:27:08Z

@samukweku What functionality you want to achieve with this function?

samukweku · 2022-06-30T05:33:30Z

@oleksiyskononenko it is essentially row numbers, but it is more helpful when you want the row number per group ... #2892 is the reason behind the cumulative functions. In pandas it is referred to as cumcount, in datatable it can be pulled off with 1..N or something similar ... in Postgres/sql it is usually referred to as row_number. I hope I explained well enough?

vopani · 2022-06-30T06:11:06Z

cumcount is a very natural (and common) term used by Data Scientists.
Which is also why pandas uses the same.

samukweku · 2022-06-30T07:22:18Z

Thanks for the feedback @vopani we are on the right track then with naming. I will go ahead and create the function with cumcount as the name

oleksiyskononenko · 2022-06-30T08:34:05Z

Actually, in datatable there is already a function called count() that is used to

Calculate the number of non-missing values for each column

see https://datatable.readthedocs.io/en/latest/api/dt/count.html for more details.

So the function named cumcount() is expected to be a cumulative equivalent of count(). But it will not be in this case. My feeling is that if we call a function smthcount() it is expected to count something, while in reality this function only returns the corresponding row id's that there is no need to count, they're already known internally.

vopani · 2022-06-30T09:50:40Z

The datatable count() is very unnatural and unintuitive, which should ideally be renamed :-)
Even SQL count means #rows.

But ok, for the benefit of this discussion, maybe using or cumrows() or cumrowcount() might be suitable. We can also have the cumcount() that is the cumulative equivalent of count(). It is a useful aggregation but many users might misinterpret its result initially. Hopefully sufficient documentation will help.

samukweku · 2022-06-30T10:16:19Z

I think count depending on its usage can also count non missing values.

I feel using cumcount will be familiar with users from pandas and data science generally. But row_number feels very explicit Unless there is a more intuitive name. cumrowcount seems odd to me.

samukweku · 2022-06-30T10:18:25Z

ROW_NUMBER function is a SQL ranking function that assigns a sequential rank number to each new record in a partition.

That's the definition from SQL server

samukweku · 2022-06-30T10:20:37Z

cumcount - Number each item in each group from 0 to the length of that group - 1.

Definition from pandas

oleksiyskononenko · 2022-06-30T10:22:50Z

@vopani Well, I'm not sure why you think it is an unnatural and unintuitive name. The same name/behavior is used in, at least, pandas and pyarrow. datatable just sticks to the same name and convention.

What I find unintuitive though, is pandas using the name count() to calculate non-missing values and cumcount() to produce a simple row counter that doesn't care about the element's validity.

As for the function name, I'm not sure what could be the good name at this point. We probably need to review what others like pandas, numpy, pyarrow, data.table, etc. are doing to have a good guess.

Cosmetic improvements of docs for `cumcount()` and `ngroups()`. WIP for #3279

Add `dt.fillna()` function to replace missing values with the previous/subsequent non-missing. WIP for #3279

samukweku · 2022-08-25T23:33:22Z

@oleksiyskononenko pending when you are done with #3333, I might continue with the remaining functions here. At the moment, I'm looking to expand on fillna to support scalars, and also the rank function. Stuck a bit in my fillna progress; I will create a PR for you to have a look at. For the rank function, which function can I use to get the groups and order when sort is applied? Do you mind pointing me in the right direction? Thanks

oleksiyskononenko · 2022-08-29T19:41:57Z

@samukweku Yeah, let me take a look at #3333. As for the

which function can I use to get the groups and order when sort is applied?

Not sure what you mean. There is a Groupby::get_group() method, is it what you need?

samukweku · 2022-09-02T21:42:33Z

@oleksiyskononenko for sorting, I believe groups are created, with a numbering (guessing). Example:

DT = dt.range([2,2,3,4,4])

if DT is sorted, it should have 3 groups : 2, 3, 4, and the numbering order should be 0,0,1,2,2. The same idea applies if a groupby is present, but the numbering order is within each group, and restarts for each group. Is there somewhere in evalcontext where this is available after sorting is called on the dataframe? I was hoping to just reuse that instead. I want to use that to implement the rank function, and make it similar in API as SQL's (which I deem to be simple enough to grasp, with a manageable complexity IMO)

oleksiyskononenko · 2022-09-02T23:24:34Z

Sorting will only return you the RowIndex and grouping will return both the RowIndex and the group information, see https://github.com/h2oai/datatable/blob/main/src/core/sort.cc#L1411-L1495

You can use the both rowindex/offsets to implement the rank() function. First, you need to group the column (if not already grouped) and then calculate rank based on the returned grouping information.

oleksiyskononenko · 2022-09-02T23:32:05Z

My feeling is that before jumping into the rank() implementation, you need to go through implementation details of a groupby: rowindex, offsets, etc. Once you understand their purpose and implementation, it won't be difficult to implement the rank(). Don't hesitate to ask if you have any questions. We can also do that on gitter if needed.

WIP for #3279

#3344) Enhance `dt.fillna()` to support filling missing values with a particular value, that could be a scalar, sequence or an `FExpr`. WIP for #3279

oleksiyskononenko · 2022-10-26T21:42:34Z

@samukweku What if we add parameter reverse to dt.cummin(), dt.cummax(), dt.cumsum() and dt.cumprod(), just like we did for dt.fillna() and dt.cumcount()?

Btw, what do we mean by the "rolling aggregations" on this issue?

samukweku · 2022-10-26T22:09:22Z

Not a bad idea!

Rolling aggregations are usually associated with time series like get mean for every 3 days or 3 hours. Pandas has rolling and expanding. SQL has it but more within the windows function `... Over () between range 3 days...``` sth like that. It would be best handled by someone who has a good understanding of time series in general, with a finance bias probably. I dont

oleksiyskononenko · 2022-10-27T20:53:06Z

I see, so this is basically some moving window calculations. Then, may be we open a new issue with listing all the rolling aggregations you have in mind?

oleksiyskononenko · 2022-10-27T20:54:59Z

As for the reverse parameter, it is still possible to do this with the current implementation:

reversing a frame;
applying cum*() function;
reversing the frame again.

But having a reverse parameter will make it cleaner and slightly faster, as the user won't need to reverse the frame twice.

samukweku · 2022-10-27T23:48:44Z

I'll make out some time to add the reverse parameter to the existing functions. Still looking forward to you help on the rowall/rowany for the nth function; i cant get around the FExpr_Rowall/rowany

oleksiyskononenko · 2022-10-28T18:10:22Z

@samukweku Yeah, sure. I will take a look at it.

samukweku · 2022-10-29T01:02:10Z

reopened; will be closed after implementation of reverse parameter for cummin, cummax, cumsum, cumprod

…and `cummax()` (#3381) Add `reverse` parameter to control direction of cumulative function's calculations: - when `False`, calculation is done from top to bottom (default); - when `True`, calculation is done from bottom to top. Сloses #3279

#3344) Enhance `dt.fillna()` to support filling missing values with a particular value, that could be a scalar, sequence or an `FExpr`. WIP for #3279

…and `cummax()` (#3381) Add `reverse` parameter to control direction of cumulative function's calculations: - when `False`, calculation is done from top to bottom (default); - when `True`, calculation is done from bottom to top. Сloses #3279

samukweku added the new feature Feature requests for new functionality label May 14, 2022

samukweku mentioned this issue May 14, 2022

[ENH] Implement cumsum() function #3257

Merged

oleksiyskononenko pushed a commit that referenced this issue May 17, 2022

[ENH] Implement cumsum() function (#3257)

34263ec

Implement `cumsum()` function. WIP for #3279

samukweku mentioned this issue May 30, 2022

[ENH] Add cummin() and cummax() functions for cumulative min/max calculation #3288

Merged

samukweku added a commit that referenced this issue Jun 9, 2022

[ENH] Implement cumsum() function (#3257)

2d1f200

Implement `cumsum()` function. WIP for #3279

oleksiyskononenko pushed a commit that referenced this issue Jun 11, 2022

Add cummin() and cummax() functions for cumulative min/max calcul…

4c1c031

…ation (#3288) WIP for #3279

samukweku mentioned this issue Jun 23, 2022

[ENH] Add cumprod() function for cumulative product calculation #3304

Merged

oleksiyskononenko pushed a commit that referenced this issue Jun 28, 2022

Add cumprod() function for cumulative product calculation (#3304)

838cf5e

- add `cumprod()` function for cumulative product calculation; - refactor `cumsum()` internals to also support `cumprod()`; WIP for #3279

samukweku pushed a commit that referenced this issue Aug 10, 2022

Improve docs for cumcount() and ngroups() (#3320)

30f9fa2

Cosmetic improvements of docs for `cumcount()` and `ngroups()`. WIP for #3279

samukweku added a commit that referenced this issue Aug 10, 2022

[ENH] Add dt.fillna() function to impute missing values (#3311)

c3ca41c

Add `dt.fillna()` function to replace missing values with the previous/subsequent non-missing. WIP for #3279

samukweku mentioned this issue Sep 1, 2022

[ENH] Enhance dt.fillna() to support filling with a particular value #3344

Merged

oleksiyskononenko mentioned this issue Sep 9, 2022

Minor docs adjustments to cumcount() and ngroup() #3349

Merged

oleksiyskononenko added a commit that referenced this issue Sep 9, 2022

Minor docs adjustments to cumcount() and ngroup() (#3349)

110cc83

WIP for #3279

samukweku pushed a commit that referenced this issue Sep 15, 2022

Minor docs adjustments to cumcount() and ngroup() (#3349)

a4d84f3

WIP for #3279

samukweku pushed a commit that referenced this issue Sep 17, 2022

Minor docs adjustments to cumcount() and ngroup() (#3349)

d0fac63

WIP for #3279

samukweku pushed a commit that referenced this issue Sep 18, 2022

Minor docs adjustments to cumcount() and ngroup() (#3349)

00faaa8

WIP for #3279

oleksiyskononenko mentioned this issue Oct 28, 2022

similar to pandas.Series.rank() #3148

Open

samukweku mentioned this issue Oct 29, 2022

Rolling Operations #3378

Closed

samukweku closed this as completed Oct 29, 2022

samukweku reopened this Oct 29, 2022

samukweku mentioned this issue Nov 1, 2022

[ENH] Add reverse parameter to cumsum(), cumprod(), cummin() and cummax() #3381

Merged

oleksiyskononenko closed this as completed Nov 29, 2022

samukweku added a commit that referenced this issue Jan 3, 2023

[ENH] Enhance dt.fillna() to support filling with a particular value (

20e3da1

#3344) Enhance `dt.fillna()` to support filling missing values with a particular value, that could be a scalar, sequence or an `FExpr`. WIP for #3279

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement cumulative functions #3279

Implement cumulative functions #3279

samukweku commented May 14, 2022 •

edited

Loading

ghost commented May 17, 2022 •

edited by ghost

Loading

vopani commented May 28, 2022

samukweku commented May 28, 2022

samukweku commented May 31, 2022

vopani commented May 31, 2022

samukweku commented May 31, 2022

samukweku commented Jun 10, 2022 •

edited

Loading

oleksiyskononenko commented Jun 10, 2022 •

edited

Loading

samukweku commented Jun 10, 2022

samukweku commented Jun 17, 2022

oleksiyskononenko commented Jun 17, 2022 •

edited

Loading

samukweku commented Jun 30, 2022

oleksiyskononenko commented Jun 30, 2022

samukweku commented Jun 30, 2022

vopani commented Jun 30, 2022

samukweku commented Jun 30, 2022 •

edited

Loading

oleksiyskononenko commented Jun 30, 2022

vopani commented Jun 30, 2022

samukweku commented Jun 30, 2022

samukweku commented Jun 30, 2022

samukweku commented Jun 30, 2022

oleksiyskononenko commented Jun 30, 2022 •

edited

Loading

samukweku commented Aug 25, 2022

oleksiyskononenko commented Aug 29, 2022

samukweku commented Sep 2, 2022 •

edited

Loading

oleksiyskononenko commented Sep 2, 2022 •

edited

Loading

oleksiyskononenko commented Sep 2, 2022 •

edited

Loading

oleksiyskononenko commented Oct 26, 2022 •

edited

Loading

samukweku commented Oct 26, 2022

oleksiyskononenko commented Oct 27, 2022 •

edited

Loading

oleksiyskononenko commented Oct 27, 2022

samukweku commented Oct 27, 2022

oleksiyskononenko commented Oct 28, 2022

samukweku commented Oct 29, 2022

Implement cumulative functions #3279

Implement cumulative functions #3279

Comments

samukweku commented May 14, 2022 • edited Loading

ghost commented May 17, 2022 • edited by ghost Loading

vopani commented May 28, 2022

samukweku commented May 28, 2022

samukweku commented May 31, 2022

vopani commented May 31, 2022

samukweku commented May 31, 2022

samukweku commented Jun 10, 2022 • edited Loading

oleksiyskononenko commented Jun 10, 2022 • edited Loading

samukweku commented Jun 10, 2022

samukweku commented Jun 17, 2022

oleksiyskononenko commented Jun 17, 2022 • edited Loading

samukweku commented Jun 30, 2022

oleksiyskononenko commented Jun 30, 2022

samukweku commented Jun 30, 2022

vopani commented Jun 30, 2022

samukweku commented Jun 30, 2022 • edited Loading

oleksiyskononenko commented Jun 30, 2022

vopani commented Jun 30, 2022

samukweku commented Jun 30, 2022

samukweku commented Jun 30, 2022

samukweku commented Jun 30, 2022

oleksiyskononenko commented Jun 30, 2022 • edited Loading

samukweku commented Aug 25, 2022

oleksiyskononenko commented Aug 29, 2022

samukweku commented Sep 2, 2022 • edited Loading

oleksiyskononenko commented Sep 2, 2022 • edited Loading

oleksiyskononenko commented Sep 2, 2022 • edited Loading

oleksiyskononenko commented Oct 26, 2022 • edited Loading

samukweku commented Oct 26, 2022

oleksiyskononenko commented Oct 27, 2022 • edited Loading

oleksiyskononenko commented Oct 27, 2022

samukweku commented Oct 27, 2022

oleksiyskononenko commented Oct 28, 2022

samukweku commented Oct 29, 2022

samukweku commented May 14, 2022 •

edited

Loading

ghost commented May 17, 2022 •

edited by ghost

Loading

samukweku commented Jun 10, 2022 •

edited

Loading

oleksiyskononenko commented Jun 10, 2022 •

edited

Loading

oleksiyskononenko commented Jun 17, 2022 •

edited

Loading

samukweku commented Jun 30, 2022 •

edited

Loading

oleksiyskononenko commented Jun 30, 2022 •

edited

Loading

samukweku commented Sep 2, 2022 •

edited

Loading

oleksiyskononenko commented Sep 2, 2022 •

edited

Loading

oleksiyskononenko commented Sep 2, 2022 •

edited

Loading

oleksiyskononenko commented Oct 26, 2022 •

edited

Loading

oleksiyskononenko commented Oct 27, 2022 •

edited

Loading