Implement cumulative functions #3279
Implement `cumsum()` function. WIP for #3279
I think these are very useful functions. Great to see them soon! |
These would be great and long due. Is there an ETA of when this would be available (next release)? |
@vopani I'm currently working on them. I can't give an ETA, though; contributions are also welcome. You can test the |
@vopani Kindly create a minimal example of the cumsum for both datatable and pandas; it is easier to grok. |
@samukweku Yeah apologies, let me run some more tests and share better examples / feedback. Thanks again for these cumulative functions, they are super useful and help in bridging the gap to pandas. |
Thanks also to @oleksiyskononenko and @st-pasha for their guidance through my C++ journey |
Timing comparison with numpy:
import numpy as np
from datatable import dt, f
In [12]: a = np.arange(10_000_000)
In [13]: %timeit a.cumsum()
22.2 ms ± 196 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [14]: DT = dt.Frame(a)
In [15]: %timeit DT[:, f.C0.cumsum()]
46.7 ms ± 1.82 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) |
@samukweku we need to address #3081 to improve performance in the case when there is no group-by context. For grouped frames we're fully parallel now. For cumulative functions we actually don't have a separate issue and I'm not sure if we can call There is still the question of why numpy is 2x faster on a single core. We need to do some profiling to see if there are any bottlenecks in our code. |
Thanks @oleksiyskononenko, maybe you can explain more about what you mean by parallelisation in terms of the actual data. How is profiling done in C++? By the way, is there any way to do interactive work in C++? |
@oleksiyskononenko I was reading up on cumulative sum and found a possible performance option with a Fenwick tree. What are your thoughts on it? Is it worth the effort? As an aside, when we call get_edit_datatable, does it return a vector? Where do I trace this to see its implementation? I'm also thinking of building on your example of templates and converting the cumsum implementation to support cumprod, since they are essentially the same apart from the operator. |
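To illustrate the point that cumsum and cumprod differ only in the operator, here is a minimal Python sketch. It is not datatable's C++ implementation; the helper name `cumulative` is hypothetical and only stands in for a templated scan.

```python
import operator
from itertools import accumulate


def cumulative(values, op):
    """Running reduction of `values` under the binary operator `op`.

    `cumulative` is a hypothetical helper for illustration only.
    """
    return list(accumulate(values, op))


def cumsum(values):
    # Same scan, with addition as the operator
    return cumulative(values, operator.add)


def cumprod(values):
    # Same scan, with multiplication as the operator
    return cumulative(values, operator.mul)


data = [1, 2, 3, 4]
print(cumsum(data))   # [1, 3, 6, 10]
print(cumprod(data))  # [1, 2, 6, 24]
```

On the Fenwick tree question: that structure mainly pays off when values are updated between prefix-sum queries. A one-off cumulative sum over a column is already a single linear pass, as above.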
It means that we parallelize loops to go over the frame rows, currently our parallel loops for cumulative functions/reducers go over the groups.
To start with, you can read a couple of threads on SO, something like https://stackoverflow.com/questions/375913/how-can-i-profile-c-code-running-on-linux
Probably yes, I don't have experience with that.
First, let's look through the code to see if there are any obvious bottlenecks.
We don't have
Sounds like a great idea. Btw, one more difference between |
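For the row-wise parallelisation mentioned above (looping over rows rather than over groups), one common approach is a blocked two-pass prefix sum. The sketch below is an assumption about how such a scheme could look, not what datatable does. The function name `blocked_cumsum` and the `n_blocks` parameter are hypothetical, and Python threads are used only to show the structure (they will not give a real speed-up here).

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import accumulate


def blocked_cumsum(values, n_blocks=4):
    """Two-pass cumulative sum over row blocks (illustrative only)."""
    n = len(values)
    if n == 0:
        return []
    size = -(-n // n_blocks)  # ceiling division: rows per block
    blocks = [values[i:i + size] for i in range(0, n, size)]

    with ThreadPoolExecutor() as pool:
        # Pass 1: every block is scanned independently of the others
        local = list(pool.map(lambda b: list(accumulate(b)), blocks))

    # Sequential step: the offset of a block is the total of all previous blocks
    block_totals = list(accumulate(b[-1] for b in local))
    offsets = [0] + block_totals[:-1]

    with ThreadPoolExecutor() as pool:
        # Pass 2: shift each block by its offset, again independently
        shifted = pool.map(
            lambda pair: [x + pair[1] for x in pair[0]],
            zip(local, offsets),
        )
        return [x for block in shifted for x in block]


print(blocked_cumsum(list(range(10))))
# [0, 1, 3, 6, 10, 15, 21, 28, 36, 45]
```

Only the short offsets step is sequential; both passes over the data can be split across threads.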
@oleksiyskononenko @Peter-Pasta @vopani should we stick to the name |
@samukweku What functionality do you want to achieve with this function? |
@oleksiyskononenko it is essentially row numbers, but it is more helpful when you want the row number per group ... #2892 is the reason behind the cumulative functions. In pandas it is referred to as cumcount, in datatable it can be pulled off with |
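As a plain-Python illustration of "row number per group", the sketch below numbers each row within its group starting from 0, which matches the pandas `cumcount` convention. It does not use datatable, and the function here is a stand-in rather than the proposed API.

```python
from collections import defaultdict


def cumcount(keys):
    """Return, for each row, how many earlier rows share its group key."""
    seen = defaultdict(int)  # group key -> rows seen so far
    out = []
    for key in keys:
        out.append(seen[key])
        seen[key] += 1
    return out


print(cumcount(["a", "b", "a", "a", "b"]))  # [0, 0, 1, 2, 1]
```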
|
Thanks for the feedback @vopani we are on the right track then with naming. I will go ahead and create the function with cumcount as the name |
Actually, in datatable there is already a function called
see https://datatable.readthedocs.io/en/latest/api/dt/count.html for more details. So the function named |
The datatable But ok, for the benefit of this discussion, maybe using or |
I think I feel using |
That's the definition from SQL server |
Definition from pandas |
@vopani Well, I'm not sure why you think it is an unnatural and unintuitive name. The same name/behavior is used in, at least, pandas and pyarrow. datatable just sticks to the same name and convention. What I find unintuitive though, is pandas using the name As for the function name, I'm not sure what could be the good name at this point. We probably need to review what others like pandas, numpy, pyarrow, data.table, etc. are doing to have a good guess. |
Cosmetic improvements of docs for `cumcount()` and `ngroups()`. WIP for #3279
Add `dt.fillna()` function to replace missing values with the previous/subsequent non-missing. WIP for #3279
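A minimal sketch of what "replace missing values with the previous/subsequent non-missing" means, using `None` as the missing marker. The function names are hypothetical and this is not the `dt.fillna()` implementation.

```python
def fill_forward(values):
    """Replace each None with the most recent non-missing value, if any."""
    out = []
    last = None
    for v in values:
        if v is not None:
            last = v
        out.append(last)
    return out


def fill_backward(values):
    """Replace each None with the next non-missing value, if any."""
    # A backward fill is a forward fill over the reversed sequence
    return fill_forward(values[::-1])[::-1]


col = [None, 1, None, None, 4, None]
print(fill_forward(col))   # [None, 1, 1, 1, 4, 4]
print(fill_backward(col))  # [1, 1, 4, 4, 4, None]
```

Leading values (forward) or trailing values (backward) with no non-missing neighbour stay missing.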
@oleksiyskononenko pending when you are done with #3333, I might continue with the remaining functions here. At the moment, I'm looking to expand on |
@samukweku Yeah, let me take a look at #3333. As for the
Not sure what you mean. There is a |
@oleksiyskononenko for sorting, I believe groups are created, with a numbering (guessing). Example: DT = dt.range([2,2,3,4,4]) if |
Sorting will only return you the You can use the both rowindex/offsets to implement the |
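To make the rowindex/offsets idea concrete, here is a hedged pure-Python sketch: a stable sort yields a row order, group boundaries in that order yield offsets, and the two together give a group number per row. The names `sort_with_offsets` and `ngroup` are illustrative, and the layout of datatable's internal structures is an assumption.

```python
def sort_with_offsets(values):
    """Return (rowindex, offsets): sorted row order and group boundaries."""
    n = len(values)
    # Stable argsort: rows with equal values keep their original order
    rowindex = sorted(range(n), key=values.__getitem__)
    offsets = [0]
    for pos in range(1, n):
        if values[rowindex[pos]] != values[rowindex[pos - 1]]:
            offsets.append(pos)  # a new group starts at this sorted position
    if n > 0:
        offsets.append(n)
    return rowindex, offsets


def ngroup(values):
    """Assign each row the number of the group it belongs to."""
    rowindex, offsets = sort_with_offsets(values)
    ids = [None] * len(values)
    for g in range(len(offsets) - 1):
        for pos in range(offsets[g], offsets[g + 1]):
            ids[rowindex[pos]] = g
    return ids


print(ngroup([2, 2, 3, 4, 4]))  # [0, 0, 1, 2, 2]
print(ngroup([4, 2, 4, 3, 2]))  # [2, 0, 2, 1, 0]
```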
My feeling is that before jumping into the |
@samukweku What if we add parameter Btw, what do we mean by the "rolling aggregations" on this issue? |
Not a bad idea! Rolling aggregations are usually associated with time series, like getting the mean for every 3 days or 3 hours. Pandas has rolling and expanding. SQL has it too, but more within the window functions, something like `... Over () between range 3 days...`. It would be best handled by someone who has a good understanding of time series in general, probably with a finance bias. I don't |
I see, so this is basically some moving-window calculations. Then maybe we open a new issue listing all the rolling aggregations you have in mind? |
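A moving-window calculation of the kind discussed here can be sketched as a fixed-size rolling mean. This example assumes a window counted in rows rather than in time units, and `rolling_mean` is a hypothetical name, not a datatable function.

```python
from collections import deque


def rolling_mean(values, window):
    """Mean over the last `window` rows; None until the window is full."""
    if window < 1:
        raise ValueError("window must be at least 1")
    buf = deque()
    total = 0.0
    out = []
    for v in values:
        buf.append(v)
        total += v
        if len(buf) > window:
            total -= buf.popleft()  # drop the row that left the window
        out.append(total / window if len(buf) == window else None)
    return out


print(rolling_mean([1, 2, 3, 4, 5], 3))  # [None, None, 2.0, 3.0, 4.0]
```

Time-based windows ("every 3 days") would additionally need a timestamp column to decide which rows fall inside the window.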
As for the
But having a |
I'll make some time to add the reverse parameter to the existing functions. Still looking forward to your help on the rowall/rowany for the nth function; I can't get around the FExpr_Rowall/rowany |
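One possible reading of a reverse parameter is a cumulative function that accumulates from the last row towards the first. The sketch below assumes that interpretation; the exact semantics in datatable are not stated in this thread.

```python
from itertools import accumulate


def cumsum(values, reverse=False):
    """Cumulative sum; with reverse=True, accumulate from the last row up."""
    if reverse:
        # Scan the reversed sequence, then restore the original row order
        return list(accumulate(values[::-1]))[::-1]
    return list(accumulate(values))


print(cumsum([1, 2, 3, 4]))                # [1, 3, 6, 10]
print(cumsum([1, 2, 3, 4], reverse=True))  # [10, 9, 7, 4]
```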
@samukweku Yeah, sure. I will take a look at it. |
reopened; will be closed after implementation of |
The list of functions to be implemented and the corresponding PRs:
- `cumsum()`: [ENH] Implement `cumsum()` function #3257
- `cumprod()`: [ENH] Add `cumprod()` function for cumulative product calculation #3304
- `cummax()`: [ENH] Add `cummin()` and `cummax()` functions for cumulative min/max calculation #3288
- `cummin()`: [ENH] Add `cummin()` and `cummax()` functions for cumulative min/max calculation #3288
- `cumcount()`: [ENH] Add `cumcount()` and `ngroup()` functions #3310
- `ngroup()` (not strictly cumulative): [ENH] Add `cumcount()` and `ngroup()` functions #3310
- `fillna()` for forward/backward fill: [ENH] Add `dt.fillna()` function to impute missing values #3311
- `fillna()` for filling with a value: [ENH] Enhance `dt.fillna()` to support filling with a particular value #3344
- [ ] rank: continued on "Similar to `pandas.Series.rank()`" #3148
- [ ] rolling aggregations: continued on "Rolling aggregate support based on windows within a DT" #1500