xarray=0.15.1 regression: Groupby drops multi-index #3985

Closed
DancingQuanta opened this issue Apr 20, 2020 · 4 comments

@DancingQuanta

I have written a function, process_stacked_groupby, that stacks all but one dimension of a dataset/dataarray and performs groupby-apply-combine on the stacked dimension. However, after upgrading to 0.15.1, the function stopped working.

MCVE Code Sample

import numpy as np
import xarray as xr

# Dimensions
N = xr.DataArray(np.arange(100), dims='N', name='N')
reps = xr.DataArray(np.arange(5), dims='reps', name='reps')
horizon = xr.DataArray([1, -1], dims='horizon', name='horizon')
horizon.attrs = {'long_name': 'Horizontal', 'units': 'H'}
vertical = xr.DataArray(np.arange(1, 4), dims='vertical', name='vertical')
vertical.attrs = {'long_name': 'Vertical', 'units': 'V'}

# Variables
x = xr.DataArray(np.random.randn(len(N), len(reps), len(horizon), len(vertical)),
                 dims=['N', 'reps', 'horizon', 'vertical'],
                 name='x')
y = x * 0.1
y.name = 'y'

# Merge x, y
data = xr.merge([x, y])

# Assign coords
data = data.assign_coords(reps=reps, vertical=vertical, horizon=horizon)

# Function that stacks all but one dimension and groups by the stacked dimension.
def process_stacked_groupby(ds, dim, func, *args):
    
    # Function to apply to stacked groupby
    def apply_fn(ds, dim, func, *args):
        
        # Get groupby dim
        groupby_dim = list(ds.dims)
        groupby_dim.remove(dim)
        groupby_var = ds[groupby_dim]
        
        # Unstack groupby dim
        ds2 = ds.unstack(groupby_dim).squeeze()
        
        # perform function
        ds3 = func(ds2, *args)

        # Add multi-index groupby_var to result
        ds3 = (ds3
               .reset_coords(drop=True)
               .assign_coords(groupby_var)
               .expand_dims(groupby_dim)
             )
        return ds3
    
    # Get list of dimensions
    groupby_dims = list(ds.dims)
    
    # Remove dimension not grouped
    groupby_dims.remove(dim)
    
    # Stack all but one dimension
    stack_dim = '_'.join(groupby_dims)
    ds2 = ds.stack({stack_dim: groupby_dims})
    
    # Groupby and apply
    ds2 = ds2.groupby(stack_dim, squeeze=False).map(apply_fn, args=(dim, func, *args))
    
    # Unstack
    ds2 = ds2.unstack(stack_dim)
    
    # Restore attrs
    for dim in groupby_dims:
        ds2[dim].attrs = ds[dim].attrs
    
    return ds2

# Function to apply on groupby
def fn(ds):
    return ds

# Run groupby with applied function
data.pipe(process_stacked_groupby, 'N', fn)

Expected Output

Prior to xarray=0.15.0, the above code produced the result that I wanted.

The function should be able to

  1. Stack the chosen dimensions
  2. Group by the stacked dimension
  3. Apply a function to each group
    a. This function unstacks the group coord and passes the group to another, user-supplied function
    b. It then adds the multi-index stacked group coord back to the result of that function
  4. Combine the groups
  5. Unstack the stacked dimension

Problem Description

After upgrading to 0.15.1, the above code stopped working.
The error occurred at the line

    # Unstack
    ds2 = ds2.unstack(stack_dim)

with ValueError: cannot unstack dimensions that do not have a MultiIndex: ['horizon_reps_vertical'].
This is the 5th step, where the combined object turns out not to contain any multi-index.
Somewhere in the 4th step, combining the groups loses the multi-index on the stacked dimension.
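
One way to confirm which step loses the index is to check the index type of the stacked dimension at each stage. The snippet below is only a minimal sketch: `has_multiindex` is a hypothetical helper (not part of xarray), and the small array is a stand-in for the dataset above rather than the original data.

```python
import numpy as np
import pandas as pd
import xarray as xr


def has_multiindex(obj, dim):
    """Return True if `dim` is backed by a pandas MultiIndex on `obj`."""
    # `obj.indexes` maps indexed coordinate names to pandas Index objects;
    # a name without an index is simply absent, so `.get` returns None.
    return isinstance(obj.indexes.get(dim), pd.MultiIndex)


# Small stand-in for the data in the report (sizes are illustrative).
da = xr.DataArray(
    np.arange(12.0).reshape(2, 2, 3),
    dims=("N", "reps", "vertical"),
    coords={"N": [0, 1], "reps": [0, 1], "vertical": [1, 2, 3]},
)

# Step 1: stack the grouped dimensions into one multi-indexed dimension.
stacked = da.stack(reps_vertical=("reps", "vertical"))

# Check directly after stacking; the same check can be repeated on the
# object returned by the groupby/map step.
before = has_multiindex(stacked, "reps_vertical")

# Step 5: unstacking only works while the MultiIndex is still present.
roundtrip = stacked.unstack("reps_vertical")
```

Running the same check on the object returned by `.map(...)` would show whether the combine step is where the multi-index disappears.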

Versions

0.15.1

@DancingQuanta
Author

Using git bisect, I was able to narrow the regression down to PR #3807.

@dcherian mentioned this issue May 5, 2020

@mathause
Collaborator

Could you simplify your example? I can't wrap my head around it.

@dcherian
Contributor

@DancingQuanta can you post a simplified example? There are many manipulations here, so it is hard to figure out what changed.

@DancingQuanta
Author

Sorry for the late reply.
I have been using this function in my projects, so to me it is already the minimal functional version.
However, I will try to find a simpler example that replicates the issue.

Lastly, do you perhaps have a better idea for a groupby over multiple dimensions that does not require stacking them?
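
For reference, one possible direction (not something proposed in this thread) is `xr.apply_ufunc` with `N` as the core dimension, which calls a function once per combination of the remaining dimensions without any stacking. This is only a sketch under assumptions: it presumes the applied function can work on a plain 1-D NumPy array along `N`, whereas the function in the report receives a dataset; `demean` and the small array are illustrative stand-ins.

```python
import numpy as np
import xarray as xr


def demean(values):
    # `values` is a 1-D NumPy array holding the data along "N"
    # for one combination of the other dimensions.
    return values - values.mean()


# Small stand-in for the data in the report (sizes are illustrative).
da = xr.DataArray(
    np.arange(24.0).reshape(4, 2, 3),
    dims=("N", "reps", "vertical"),
    coords={"N": np.arange(4), "reps": [0, 1], "vertical": [1, 2, 3]},
)

result = xr.apply_ufunc(
    demean,
    da,
    input_core_dims=[["N"]],   # the function consumes the "N" axis
    output_core_dims=[["N"]],  # and returns an array along "N"
    vectorize=True,            # loop over all other dimensions
)

# Core dimensions are moved to the end, so restore the original order.
result = result.transpose(*da.dims)
```

Because no stacked dimension is created, there is no multi-index to lose, at the cost of the applied function seeing NumPy arrays instead of labelled objects.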
