
Loss of coordinate information from groupby.apply() on a stacked object #1483

Closed
byersiiasa opened this issue Jul 19, 2017 · 7 comments
Labels: bug, needs mcve (https://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports), plan to close (may be closeable, needs more eyeballs), topic-groupby

Comments

byersiiasa commented Jul 19, 2017

I use this stack / groupby / unstack pattern quite frequently, e.g. here.

An issue I have is that after groupby('allpoints').apply(), the coordinate names do not get carried through: the coordinates come back as allpoints_level_0 and allpoints_level_1. After unstacking I then rename them back to lat/lon etc. Do you ever encounter this?

Is there a way to carry the names through, and is this an issue for others?

import xarray as xr
import numpy as np

# np.ndarray((...)) returns uninitialized memory; use a random array instead,
# and pass dims explicitly.
ds = xr.DataArray(
    np.random.randn(180, 360, 2000),
    coords={'lat': np.arange(90, -90, -1), 'lon': np.arange(-180, 180), 'time': range(2000)},
    dims=['lat', 'lon', 'time'],
)
ds

<xarray.DataArray (lat: 180, lon: 360, time: 2000)>
array([[[ 0.623891, -0.044304, ...,  1.015785,  0.009088],
        [-0.7375  ,  0.380369, ...,  0.788351, -0.69295 ],
        ..., 
        [ 0.171894,  0.517164, ..., -0.946908, -0.597802],
        [ 0.353743,  0.005539, ..., -1.436965, -0.190099]],
...
Coordinates:
  * lat      (lat) int32 90 89 88 87 86 85 84 83 82 81 80 79 78 77 76 75 74 ...
  * lon      (lon) int32 -180 -179 -178 -177 -176 -175 -174 -173 -172 -171 ...
  * time     (time) int32 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ..

Now we stack the data by allpoints. Note that the info about original coordinates (lat / lon) is still there...
dst = ds.stack(allpoints=['lat','lon'])

<xarray.DataArray (time: 2000, allpoints: 64800)>
array([[ 0.623891, -0.7375  ,  0.053525, ...,  0.379701,  0.130618,  0.11094 ],
       [-0.044304,  0.380369, -0.410632, ..., -0.739881,  0.203219, -0.506303],
       [-1.762024, -1.019424,  2.580218, ...,  1.491677,  1.189149, -0.072223],
       ..., 
       [-0.896298,  0.333163, -1.751641, ...,  1.90315 ,  2.642813, -0.913787],
       [ 1.015785,  0.788351,  0.379997, ...,  0.864934,  0.889001, -1.363458],
       [ 0.009088, -0.69295 , -1.276184, ...,  1.220656,  0.895599,  0.848757]])
Coordinates:
  * time       (time) int32 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 ...
  * allpoints  (allpoints) MultiIndex
  - lat        (allpoints) int64 90 90 90 90 90 90 90 90 90 90 90 90 90 90 ...
  - lon        (allpoints) int64 -180 -179 -178 -177 -176 -175 -174 -173 ...

Now apply groupby('allpoints').apply() with a custom function:
dsg = dst.groupby('allpoints').apply(my_custom_function)

<xarray.DataArray (allpoints: 64800)>
array([ 0.013697,  0.006272,  0.009744, ..., -0.016265, -0.002108, -0.014733])
Coordinates:
  * allpoints          (allpoints) MultiIndex
  - allpoints_level_0  (allpoints) int64 -89 -89 -89 -89 -89 -89 -89 -89 -89 ...
  - allpoints_level_1  (allpoints) int64 -180 -179 -178 -177 -176 -175 -174 ...

So now we have lost 'lat' and 'lon'. However, if we skip the groupby step and unstack directly, the names are carried through:
dst.unstack('allpoints')

<xarray.DataArray (time: 2000, lat: 180, lon: 360)>
array([[[ 0.623891, -0.7375  , ...,  0.171894,  0.353743],
        [ 1.780691, -0.747431, ...,  0.038754,  0.615228],
        ..., 
Coordinates:
  * time     (time) int32 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ...
  * lat      (lat) int64 90 89 88 87 86 85 84 83 82 81 80 79 78 77 76 75 74 ...
  * lon      (lon) int64 -180 -179 -178 -177 -176 -175 -174 -173 -172 -171 ... 
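For reference, the behaviour above can be reproduced end to end with a small stand-in for my_custom_function (the function is not defined in the report, so the reduction below is a hypothetical placeholder), together with the rename workaround mentioned at the top:

```python
import numpy as np
import xarray as xr

# Hypothetical stand-in for my_custom_function (undefined in the report):
# reduce each point's time series to a scalar.
def my_custom_function(da):
    return da.mean('time')

# A small version of the array from the report.
ds = xr.DataArray(
    np.random.randn(4, 6, 10),
    coords={'lat': np.arange(4), 'lon': np.arange(6), 'time': range(10)},
    dims=['lat', 'lon', 'time'],
)

dst = ds.stack(allpoints=['lat', 'lon'])
# GroupBy.map is the newer name for GroupBy.apply.
dsg = dst.groupby('allpoints').map(my_custom_function)

# On affected versions the MultiIndex levels come back as
# allpoints_level_0 / allpoints_level_1; rename them before unstacking.
if 'allpoints_level_0' in dsg.coords:
    dsg = dsg.rename({'allpoints_level_0': 'lat', 'allpoints_level_1': 'lon'})

result = dsg.unstack('allpoints')
```

On affected versions the rename branch runs; on fixed versions it is a no-op, so the snippet ends with lat/lon dimensions either way.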
byersiiasa (Author)

Maybe this is not an issue for others, or I am missing something.
Or perhaps this is intended behaviour?
Thanks for any clarification!

darothen

Instead of computing the mean over your non-stacked dimension by

dsg = dst.groupby('allpoints').mean()

why not just instead call

dsg = dst.mean('time', keep_attrs=True)

so that you just collapse the time dimension and preserve the attributes on your data? Then you can unstack() and everything should still be there. The stack/apply/unstack idiom is really useful for fitting your data to the interface of a numpy or scipy function that does the heavy lifting with a vectorized routine; isn't using groupby in this way really slow?
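As a sketch of this suggestion (the small sizes and the units attribute are made up for illustration), collapsing time directly leaves the MultiIndex, and therefore the lat/lon level names, untouched:

```python
import numpy as np
import xarray as xr

ds = xr.DataArray(
    np.random.randn(3, 5, 8),
    coords={'lat': np.arange(3), 'lon': np.arange(5), 'time': range(8)},
    dims=['lat', 'lon', 'time'],
    attrs={'units': 'K'},  # hypothetical attribute, to show keep_attrs
)

dst = ds.stack(allpoints=['lat', 'lon'])

# No groupby: reduce over the non-stacked dimension directly.
dsg = dst.mean('time', keep_attrs=True)

# The MultiIndex still carries its lat/lon names, so unstacking
# restores the original coordinates without any renaming.
result = dsg.unstack('allpoints')
```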

byersiiasa (Author)

@darothen yes, you are right; this is definitely not a good way to compute a mean. I was just using mean as a (poor) example, trying not to over-complicate or distract from the issue.
But, as you suggest, this is what I do when I need to apply customised functions, e.g. from scipy, which can end up being slow.

shoyer (Member) commented Jul 20, 2017

This wasn't intentional. If we can fix it in a straightforward fashion, we definitely should.

stale bot commented Oct 4, 2020

In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity.

If this issue remains relevant, please comment here or remove the stale label; otherwise it will be closed automatically.

The stale bot added the stale label Oct 4, 2020
@dcherian dcherian removed the stale label Oct 4, 2020
@max-sixty added the needs mcve label Sep 18, 2024
max-sixty (Collaborator)

Sorry this didn't get traction. Can we add an MCVE (a minimal, complete, verifiable example)? Currently my_custom_function is undefined.

@max-sixty added the plan to close label Sep 18, 2024
dcherian (Contributor)

This is fixed if I use lambda x: x for my_custom_function.
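A quick check of that claim on a recent xarray (the sizes here are arbitrary, and GroupBy.map is used as the newer name for GroupBy.apply):

```python
import numpy as np
import xarray as xr

ds = xr.DataArray(
    np.random.randn(2, 3, 4),
    coords={'lat': [0, 1], 'lon': [10, 20, 30], 'time': range(4)},
    dims=['lat', 'lon', 'time'],
)

dst = ds.stack(allpoints=['lat', 'lon'])
dsg = dst.groupby('allpoints').map(lambda x: x)

# On fixed versions the MultiIndex level names survive the round trip,
# so this prints ['lat', 'lon'] rather than the *_level_* names.
print(list(dsg.indexes['allpoints'].names))
```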
