ENH: apply np.ufunc.accumulate along the columns/blocks (to preserve dtypes) #39275

Open
jorisvandenbossche opened this issue Jan 19, 2021 · 5 comments
Labels
Enhancement Numeric Operations Arithmetic, Comparison, and Logical operations

Comments

@jorisvandenbossche
Member

Follow-up on #39260 (comment)

Currently, an "accumulate" ufunc is applied to the full DataFrame at once, with the consequence that it doesn't preserve dtypes if you have mixed numeric columns, e.g.:

In [4]: df = pd.DataFrame({"a": [1, 3, 2, 4], "b": [0.1, 4.0, 3.0, 2.0]})

In [5]: df
Out[5]: 
   a    b
0  1  0.1
1  3  4.0
2  2  3.0
3  4  2.0

In [6]: np.maximum.accumulate(df)
Out[6]: 
     a    b
0  1.0  0.1
1  3.0  4.0
2  3.0  4.0
3  4.0  4.0

It is certainly possible for the default case (corresponding to .accumulate(axis=0)) to apply this ufunc on each column or block, to preserve the column dtypes. When axis=1 is passed to the ufunc this is not possible.
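
To illustrate why whole-frame application loses the integer dtype, here is a minimal NumPy-only sketch. It assumes the mixed columns end up in a single 2D array before the ufunc runs, which forces a common float64 dtype. The array names are made up for illustration and do not correspond to anything in pandas itself.

```python
import numpy as np

# Illustrative stand-ins for the two columns of the example frame
a = np.array([1, 3, 2, 4])           # integer column
b = np.array([0.1, 4.0, 3.0, 2.0])   # float column

# Stacking both columns into one 2D array forces a common dtype (float64)
stacked = np.column_stack([a, b])
whole = np.maximum.accumulate(stacked, axis=0)

# Accumulating each column on its own keeps that column's dtype
per_column = [np.maximum.accumulate(col) for col in (a, b)]

print(whole.dtype)
print([col.dtype for col in per_column])
```

The values are the same in both cases; only the dtype of the first column differs.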

See the linked PR discussion above for more details on what is involved in implementing this.

@jorisvandenbossche jorisvandenbossche added Enhancement Numeric Operations Arithmetic, Comparison, and Logical operations labels Jan 19, 2021
@jorisvandenbossche jorisvandenbossche added this to the Contributions Welcome milestone Jan 19, 2021
@AnnaDaglis
Contributor

Take

@jorisvandenbossche
Member Author

@AnnaDaglis Thanks for taking a look at this! If you need any pointers, let me know

@AnnaDaglis
Contributor

AnnaDaglis commented Jan 29, 2021

@jorisvandenbossche Yes, please, I would appreciate some pointers! I found point 2) in #39260 (comment), relating to axis, somewhat challenging. E.g. if we have a DatetimeTZBlock, we would not actually need to change to axis=1, as it throws an error. So we would need to go back to axis=0 there. E.g. the following throws an error:

import numpy as np
import pandas as pd

df = pd.DataFrame(pd.date_range("20210129", periods=4, tz="UTC"))
getattr(np.maximum, "accumulate")(df._mgr.blocks[0].values, axis=1)

Changes in the code along the following lines work fine on the toy examples I tried, but break a lot of tests:

df = pd.DataFrame(pd.date_range("20210129", periods=4, tz="UTC"))
getattr(np.maximum, "accumulate")(df._mgr.blocks[0].values, axis=0)

Would be great to have your thoughts/ideas/pointers! :)

@jorisvandenbossche
Member Author

E.g. if we have a DatetimeTZBlock, we would not actually need to change to axis=1, as it throws an error. So we would need to go back to axis=0 there.

Yes, in general the ExtensionBlock (or subclasses like DatetimeTZBlock) is only 1D, and so for those the axis should not be changed, only for the blocks storing their data as 2D.
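
The axis distinction can be sketched with plain NumPy arrays. This assumes that a 2D block holds its data with shape (n_columns, n_rows), so accumulating down the rows of the frame means axis=1 on the block values, while 1D values only have axis=0. The helper name accumulate_block and the sample arrays are hypothetical and are not pandas internals.

```python
import numpy as np

# Hypothetical stand-in for 2D block values, shape (n_columns, n_rows)
block_2d = np.array([[1, 3, 2, 4], [5, 2, 6, 1]])
# Hypothetical stand-in for 1D values, as held by an extension block
values_1d = np.array([1, 3, 2, 4])


def accumulate_block(ufunc, values):
    """Accumulate along the row direction of the frame for 1D or 2D values."""
    # Only 2D values need the axis switched; 1D values have a single axis
    axis = 1 if values.ndim == 2 else 0
    return ufunc.accumulate(values, axis=axis)


print(accumulate_block(np.maximum, block_2d))
print(accumulate_block(np.maximum, values_1d))

# Passing axis=1 for 1D values is out of bounds and raises an error
try:
    np.maximum.accumulate(values_1d, axis=1)
    raised = False
except ValueError:
    raised = True
print(raised)
```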

Now, an alternative could also be to apply the ufunc column-wise instead of per block. Then we don't need to deal with this axis difference.
Dummy code would be something like result = [ufunc(arr, ...) for arr in df._iter_column_arrays()]; pd.DataFrame._from_arrays(result, df.columns, df.index, verify_integrity=False)
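
For readers who want to try the column-wise idea outside of pandas internals, here is a rough sketch using only public API. The helper accumulate_columnwise is hypothetical, it is not how pandas implements this, and it assumes a frame with at least one column whose arrays NumPy can accumulate directly.

```python
import numpy as np
import pandas as pd


def accumulate_columnwise(ufunc, df):
    """Apply ufunc.accumulate to each column separately (hypothetical helper)."""
    # Select columns by position so duplicate column labels are handled too
    results = [
        pd.Series(ufunc.accumulate(df.iloc[:, i].to_numpy()), index=df.index)
        for i in range(df.shape[1])
    ]
    # Concatenating Series side by side keeps each column's own dtype
    out = pd.concat(results, axis=1)
    out.columns = df.columns
    return out


df = pd.DataFrame({"a": [1, 3, 2, 4], "b": [0.1, 4.0, 3.0, 2.0]})
out = accumulate_columnwise(np.maximum, df)
print(out)
print(out.dtypes)
```

With the example frame from the top of this issue, column "a" stays an integer column instead of being upcast to float.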

@AnnaDaglis
Contributor

AnnaDaglis commented Jan 29, 2021

@jorisvandenbossche The alternative approach looks somewhat "cleaner" to me, thank you! Will try to implement it.

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
3 participants