Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement cumulative functions #3279

Closed
8 tasks done
samukweku opened this issue May 14, 2022 · 42 comments
Closed
8 tasks done

Implement cumulative functions #3279

samukweku opened this issue May 14, 2022 · 42 comments
Assignees
Labels
new feature Feature requests for new functionality

Comments

@samukweku
Copy link
Contributor

samukweku commented May 14, 2022

The list of functions to be implemented and the corresponding PR's

@samukweku samukweku added the new feature Feature requests for new functionality label May 14, 2022
oleksiyskononenko pushed a commit that referenced this issue May 17, 2022
Implement `cumsum()` function.

WIP for #3279
@ghost
Copy link

ghost commented May 17, 2022

I think these are very useful functions. Great to see them soon!

@vopani
Copy link

vopani commented May 28, 2022

These would be great and long due.

Is there an ETA of when this would be available (next release)?

@samukweku
Copy link
Contributor Author

@vopani currently working on them ... cant specify an ETA though ... contributions are welcome also. You can test the cumsum function by downloading the latest dev version

@samukweku
Copy link
Contributor Author

@vopani Kindly create a minimal example of the cumsum for both datatable and pandas; it is easier to grok.

@vopani
Copy link

vopani commented May 31, 2022

@samukweku Yeah apologies, let me run some more tests and share better examples / feedback.

Thanks again for these cumulative functions, they are super useful and help in bridging the gap to pandas.

@samukweku
Copy link
Contributor Author

Thanks also to @oleksiyskononenko and @st-pasha for their guidance thru my C++ journey

samukweku added a commit that referenced this issue Jun 9, 2022
Implement `cumsum()` function.

WIP for #3279
@samukweku
Copy link
Contributor Author

samukweku commented Jun 10, 2022

tests comparison with np.cumsum, where numpy is twice as fast, and seems to be run on a single core:

import numpy as np
from datatable import dt, f

In [12]: a = np.arange(10_000_000)

In [13]: %timeit a.cumsum()
22.2 ms ± 196 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [14]: DT = dt.Frame(a)

In [15]: %timeit DT[:, f.C0.cumsum()]
46.7 ms ± 1.82 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

@oleksiyskononenko
Copy link
Contributor

oleksiyskononenko commented Jun 10, 2022

@samukweku we need to address #3081 to improve performance in the case when there is no group-by context. For grouped frames we're fully parallel now.

For cumulative functions we actually don't have a separate issue and I'm not sure if we can call cumsum() etc to be reducers. But anyways, our logic is now parallel in terms of the groups, but not in terms of the actual data. We probably need to improve it.

There is still a question why numpy is x2 faster on a single core. We need to do some profiling to see if there are any bottlenecks in our code.

@samukweku
Copy link
Contributor Author

thanks @oleksiyskononenko , maybe you can explain more what you mean by parallelisation in terms of the actual data. How is profiling done in C++? by the way, is there any way to do interactive work in C++?

@samukweku
Copy link
Contributor Author

@oleksiyskononenko I was reading up on cumulative sum, and found a possible performance option with Fenwick tree. What are your thoughts on it? is it worth the effort?

As an aside, when we call get_edit_datatable, does it return a vector? where do I trace this to see its implementation

I'm also thinking of building on your example of templates and convert the cumsum implementation to support cumprod, since they are essentially the same, just the change of the operator.

@oleksiyskononenko
Copy link
Contributor

oleksiyskononenko commented Jun 17, 2022

@samukweku

maybe you can explain more what you mean by parallelisation in terms of the actual data.

It means that we parallelize loops to go over the frame rows, currently our parallel loops for cumulative functions/reducers go over the groups.

How is profiling done in C++?

To start with you can read couple of threads on SO, something like https://stackoverflow.com/questions/375913/how-can-i-profile-c-code-running-on-linux

by the way, is there any way to do interactive work in C++?

Probably yes, I don't have experience with that.

I was reading up on cumulative sum, and found a possible performance option with Fenwick tree. What are your thoughts on it? is it worth the effort?

First, let's looks through the code to see if there are any obvious bottlenecks.
Second, we should decide if the cumsum() performance is acceptable or not. Is it currently x2 slower than numpy?
Then we could discuss ways to improve performance if needed.

As an aside, when we call get_edit_datatable, does it return a vector? where do I trace this to see its implementation

We don't have get_edit_datatable() method. Perhaps you mean get_data_editable()? It returns a pointer to the column's underlying buffers. Implementation depends on the column type. See, for instance:
https://github.com/h2oai/datatable/blob/main/src/core/column/sentinel_fw.cc#L234-L238
https://github.com/h2oai/datatable/blob/main/src/core/column/sentinel_str.cc#L118-L122

I'm also thinking of building on your example of templates and convert the cumsum implementation to support cumprod, since they are essentially the same, just the change of the operator.

Sounds like a great idea. Btw, one more difference between cumsum() and cumprod() is the default value: 0 vs 1.

oleksiyskononenko pushed a commit that referenced this issue Jun 28, 2022
- add `cumprod()` function for cumulative product calculation;
- refactor `cumsum()` internals to also support `cumprod()`;

WIP for #3279
@samukweku
Copy link
Contributor Author

@oleksiyskononenko @Peter-Pasta @vopani should we stick to the name cumcount or use row_number. I bring up row_number because of sql. checking to see which one is more intuitive

@oleksiyskononenko
Copy link
Contributor

@samukweku What functionality you want to achieve with this function?

@samukweku
Copy link
Contributor Author

@oleksiyskononenko it is essentially row numbers, but it is more helpful when you want the row number per group ... #2892 is the reason behind the cumulative functions. In pandas it is referred to as cumcount, in datatable it can be pulled off with 1..N or something similar ... in Postgres/sql it is usually referred to as row_number. I hope I explained well enough?

@vopani
Copy link

vopani commented Jun 30, 2022

cumcount is a very natural (and common) term used by Data Scientists.
Which is also why pandas uses the same.

@samukweku
Copy link
Contributor Author

samukweku commented Jun 30, 2022

Thanks for the feedback @vopani we are on the right track then with naming. I will go ahead and create the function with cumcount as the name

@oleksiyskononenko
Copy link
Contributor

Actually, in datatable there is already a function called count() that is used to

Calculate the number of non-missing values for each column

see https://datatable.readthedocs.io/en/latest/api/dt/count.html for more details.

So the function named cumcount() is expected to be a cumulative equivalent of count(). But it will not be in this case. My feeling is that if we call a function smthcount() it is expected to count something, while in reality this function only returns the corresponding row id's that there is no need to count, they're already known internally.

@vopani
Copy link

vopani commented Jun 30, 2022

The datatable count() is very unnatural and unintuitive, which should ideally be renamed :-)
Even SQL count means #rows.

But ok, for the benefit of this discussion, maybe using or cumrows() or cumrowcount() might be suitable. We can also have the cumcount() that is the cumulative equivalent of count(). It is a useful aggregation but many users might misinterpret its result initially. Hopefully sufficient documentation will help.

@samukweku
Copy link
Contributor Author

I think count depending on its usage can also count non missing values.

I feel using cumcount will be familiar with users from pandas and data science generally. But row_number feels very explicit Unless there is a more intuitive name. cumrowcount seems odd to me.

@samukweku
Copy link
Contributor Author

ROW_NUMBER function is a SQL ranking function that assigns a sequential rank number to each new record in a partition.

That's the definition from SQL server

@samukweku
Copy link
Contributor Author

cumcount - Number each item in each group from 0 to the length of that group - 1.

Definition from pandas

@oleksiyskononenko
Copy link
Contributor

oleksiyskononenko commented Jun 30, 2022

@vopani Well, I'm not sure why you think it is an unnatural and unintuitive name. The same name/behavior is used in, at least, pandas and pyarrow. datatable just sticks to the same name and convention.

What I find unintuitive though, is pandas using the name count() to calculate non-missing values and cumcount() to produce a simple row counter that doesn't care about the element's validity.

As for the function name, I'm not sure what could be the good name at this point. We probably need to review what others like pandas, numpy, pyarrow, data.table, etc. are doing to have a good guess.

samukweku pushed a commit that referenced this issue Aug 10, 2022
Cosmetic improvements of docs for `cumcount()` and `ngroups()`.

WIP for #3279
samukweku added a commit that referenced this issue Aug 10, 2022
Add `dt.fillna()` function to replace missing values with the previous/subsequent non-missing.

WIP for #3279
@samukweku
Copy link
Contributor Author

@oleksiyskononenko pending when you are done with #3333, I might continue with the remaining functions here. At the moment, I'm looking to expand on fillna to support scalars, and also the rank function. Stuck a bit in my fillna progress; I will create a PR for you to have a look at. For the rank function, which function can I use to get the groups and order when sort is applied? Do you mind pointing me in the right direction? Thanks

@oleksiyskononenko
Copy link
Contributor

@samukweku Yeah, let me take a look at #3333. As for the

which function can I use to get the groups and order when sort is applied?

Not sure what you mean. There is a Groupby::get_group() method, is it what you need?

@samukweku
Copy link
Contributor Author

samukweku commented Sep 2, 2022

@oleksiyskononenko for sorting, I believe groups are created, with a numbering (guessing). Example:

DT = dt.range([2,2,3,4,4])

if DT is sorted, it should have 3 groups : 2, 3, 4, and the numbering order should be 0,0,1,2,2. The same idea applies if a groupby is present, but the numbering order is within each group, and restarts for each group. Is there somewhere in evalcontext where this is available after sorting is called on the dataframe? I was hoping to just reuse that instead. I want to use that to implement the rank function, and make it similar in API as SQL's (which I deem to be simple enough to grasp, with a manageable complexity IMO)

@oleksiyskononenko
Copy link
Contributor

oleksiyskononenko commented Sep 2, 2022

Sorting will only return you the RowIndex and grouping will return both the RowIndex and the group information, see https://github.com/h2oai/datatable/blob/main/src/core/sort.cc#L1411-L1495

You can use the both rowindex/offsets to implement the rank() function. First, you need to group the column (if not already grouped) and then calculate rank based on the returned grouping information.

@oleksiyskononenko
Copy link
Contributor

oleksiyskononenko commented Sep 2, 2022

My feeling is that before jumping into the rank() implementation, you need to go through implementation details of a groupby: rowindex, offsets, etc. Once you understand their purpose and implementation, it won't be difficult to implement the rank(). Don't hesitate to ask if you have any questions. We can also do that on gitter if needed.

samukweku pushed a commit that referenced this issue Sep 15, 2022
samukweku pushed a commit that referenced this issue Sep 17, 2022
samukweku pushed a commit that referenced this issue Sep 18, 2022
oleksiyskononenko pushed a commit that referenced this issue Oct 26, 2022
#3344)

Enhance `dt.fillna()` to support filling missing values with a particular value, that could be a scalar, sequence or an `FExpr`.

WIP for #3279
@oleksiyskononenko
Copy link
Contributor

oleksiyskononenko commented Oct 26, 2022

@samukweku What if we add parameter reverse to dt.cummin(), dt.cummax(), dt.cumsum() and dt.cumprod(), just like we did for dt.fillna() and dt.cumcount()?

Btw, what do we mean by the "rolling aggregations" on this issue?

@samukweku
Copy link
Contributor Author

Not a bad idea!

Rolling aggregations are usually associated with time series like get mean for every 3 days or 3 hours. Pandas has rolling and expanding. SQL has it but more within the windows function `... Over () between range 3 days...``` sth like that. It would be best handled by someone who has a good understanding of time series in general, with a finance bias probably. I dont

@oleksiyskononenko
Copy link
Contributor

oleksiyskononenko commented Oct 27, 2022

I see, so this is basically some moving window calculations. Then, may be we open a new issue with listing all the rolling aggregations you have in mind?

@oleksiyskononenko
Copy link
Contributor

As for the reverse parameter, it is still possible to do this with the current implementation:

  • reversing a frame;
  • applying cum*() function;
  • reversing the frame again.

But having a reverse parameter will make it cleaner and slightly faster, as the user won't need to reverse the frame twice.

@samukweku
Copy link
Contributor Author

I'll make out some time to add the reverse parameter to the existing functions. Still looking forward to you help on the rowall/rowany for the nth function; i cant get around the FExpr_Rowall/rowany

@oleksiyskononenko
Copy link
Contributor

@samukweku Yeah, sure. I will take a look at it.

@samukweku
Copy link
Contributor Author

reopened; will be closed after implementation of reverse parameter for cummin, cummax, cumsum, cumprod

oleksiyskononenko pushed a commit that referenced this issue Nov 20, 2022
…and `cummax()` (#3381)

Add `reverse` parameter to control direction of cumulative function's calculations: 
- when `False`, calculation is done from top to bottom (default);
- when `True`, calculation is done from bottom to top.

Сloses #3279
samukweku added a commit that referenced this issue Jan 3, 2023
#3344)

Enhance `dt.fillna()` to support filling missing values with a particular value, that could be a scalar, sequence or an `FExpr`.

WIP for #3279
samukweku added a commit that referenced this issue Jan 3, 2023
…and `cummax()` (#3381)

Add `reverse` parameter to control direction of cumulative function's calculations: 
- when `False`, calculation is done from top to bottom (default);
- when `True`, calculation is done from bottom to top.

Сloses #3279
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
new feature Feature requests for new functionality
Projects
None yet
Development

No branches or pull requests

3 participants