Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multiindex scalar coords, fixes #1408 #1412

Closed
wants to merge 9 commits into from

Conversation

fujiisoup
Copy link
Member

@fujiisoup fujiisoup commented May 17, 2017

To fix #1408,
This modification works, but actually I do not fully satisfied yet.
There are if statements in many places.

The major changes I made are

  1. variable.__getitem__ now returns an OrderedDict if a single element is selected from MultiIndex.
  2. indexing.remap_level_indexers also returns selected_dims which is a map from the original dimension to the selected dims which will be a scalar coordinate.

Change 1 keeps level-coordinates even after ds.isel(yx=0).
Change 2 enables to track which levels are selected, then the selected levels are changed to a scalar coordinate.

I guess much smarter solution should exist.
I would be happy if anyone gives me a comment.

@@ -256,23 +258,6 @@ def _replace_maybe_drop_dims(self, variable, name=__default):
if set(v.dims) <= allowed_dims)
return self._replace(variable, coords, name)

def _replace_indexes(self, indexes):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed this method and use Dataset._replace_indexes instead, to reduce duplicates.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I remember well, we used duplicates to avoid using _to_temp_dataset and _from_temp_dataset in __getitem__. But now that _replace_indexes has more logic implemented, maybe it is a good idea to reduce duplicates?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.
Yes, in DataArray.__getitem__ and also in DataAarray.isel, _to_temp_dataset and _from_temp_dataset are now being used.

@@ -362,6 +362,22 @@ def _maybe_unstack(self, obj):
del obj.coords[dim]
return obj

def _maybe_stack(self, applied):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method becomes necessary, because now we cannot do xr.concat([ds.isel(yx=i)] for i in range(*)], dim='yx') because ds.isel(yx=i) does not have yx anymore.

@shoyer
Copy link
Member

shoyer commented May 17, 2017

variable.getitem now returns an OrderedDict if a single element is selected from MultiIndex.

I don't like this change. It breaks an important invariant, which is that indexing a Variable returns another Variable.

I do agree with indexing along a MultiIndex dimension should unpacking the tuple for coordinates, but only for coordinates. So this needs to be somewhere in the Dataset.isel logic, not Variable.isel.

Consider indexing ds['yx'] from your example in the linked issue. With the current version of xarray:

In [7]: ds['yx']
Out[7]:
<xarray.DataArray 'yx' (yx: 6)>
array([('a', 1), ('a', 2), ('a', 3), ('b', 1), ('b', 2), ('b', 3)], dtype=object)
Coordinates:
  * yx       (yx) MultiIndex
  - y        (yx) object 'a' 'a' 'a' 'b' 'b' 'b'
  - x        (yx) int64 1 2 3 1 2 3

In [8]: ds['yx'][0]
Out[8]:
<xarray.DataArray 'yx' ()>
array(('a', 1), dtype=object)
Coordinates:
    yx       object ('a', 1)

We want to change the indexing behavior to this:

In [8]: ds['yx'][0]
Out[8]:
<xarray.DataArray 'yx' ()>
array(('a', 1), dtype=object)
Coordinates:
    y        object 'a'
    x        int64 1

But we don't want to change what happens to the DataArray itself -- it should still be a scalar object array.

I tested this example on your PR branch, and it actually crashes with KeyError.

…ement is selected from MultiIndex.

Instead, added _maybe_split function in Dataset
@fujiisoup
Copy link
Member Author

@shoyer
Thanks for the comment.

It breaks an important invariant, which is that indexing a Variable returns another Variable.

I totally agree with you.

In the last commit, I moved the unpacking functionality into Dataset, and restored the modification in Variable class I made.
I think the current is cleaner than my previous one, but I'm not yet comfortable with it.
There are a lot of functions or if-statements related to MultiIndex in different places.
I guess they should be bundled in one place.

Adding functions is easy but simplifying them are difficult...

If anyone show a direction, I will try the improvement.

@benbovy
Copy link
Member

benbovy commented May 18, 2017

A possible direction to reduce the if statements in many different places would be to just return pos_indexers in indexing.remap_level_indexers - as it was the case before adding multi-index support - and instead put in Dataset.isel all the logic for checking MultiIndex and maybe convert it to Index and/or scalar coordinates and maybe rename dimension.

This would simplify many things, although I haven't thought about about all other possible issues it would create (perfomance, etc.). Also, DataArray.loc doesn't seem to use Dataset.isel.

Here is another case related to this PR. From the example in the linked issue, the current behavior is

In [9]: ds.isel(yx=[0, 1])
Out[9]: 
<xarray.Dataset>
Dimensions:  (yx: 2)
Coordinates:
  * yx       (yx) MultiIndex
  - y        (yx) object 'a' 'a'
  - x        (yx) int64 1 2
Data variables:
    foo      (yx) int64 1 2

Do we want to also change the behavior to this?

In [10]: ds.isel(yx=[0, 1])
Out[10]: 
<xarray.Dataset>
Dimensions:  (x: 2)
Coordinates:
  * x        (x) int64 1 2
    y        object 'a'
Data variables:
    foo      (x) int64 1 2

To me it looks like it is a bit too magical, but just wondering what you think...

@shoyer
Copy link
Member

shoyer commented May 18, 2017

To me it looks like it is a bit too magical, but just wondering what you think...

Agreed, this also seems too magical to me.

@fujiisoup
Copy link
Member Author

I also agree. It seems too magical.

But I slightly changed my mind.
I notice what I really want to have is not particular scalar coordinate in MultiIndex,
but 'unified' interface between normal Vraiable and MultiIndex.

The current structure is illustrated as follows,

current_coord

The MultiIndex has different characteristics from normal Variable.
For example, if we do ds.sel(x=2), it makes a scalar coordinate and normal Variable.
The backward process might be .expand_dims().stack().
This is different from normal Variable behavior.
And because of it, MultiIndex should be treated in special way in every place.
(Deprecating the automatic-renaming does not change things so much.)

I am wondering if we could have the following class structure things become simpler

my_proposal

In this picture, MultiIndex can have scalar as its level and .isel() produces it.
This process can be traced backward by .expand_dims() or .concat() as in normal Variable.

I understand it is different from pandas.MultiIndex structure, and we need to expand our wrapper extensively if we decide to realize it (as written in red).
But I feel this symmetric structure could make it easy to expand MultiIndex functionalities in future.

Any thoughts are welcome.
(Should move discussion to another issue?)

@shoyer
Copy link
Member

shoyer commented May 20, 2017

@fujiisoup Yes, the solution of writing a MultiIndex wrapper for xarray looks much cleaner to me. I like the look of this proposal! (Those diagrams are also very helpful)

I guess this could be implemented as a pandas.MultiIndex along with a list of scalar coordinates?

@benbovy
Copy link
Member

benbovy commented May 20, 2017

I also agree that a MultiIndex wrapper would be to be the way to go. I think I understand the diagrams, but what remains a bit unclear to me is how this could be implemented.

In particular, how would this wrapper work with IndexVariable?

Currently, IndexVariable warps either a pandas.Index or a pandas.MultiIndex and for the latter case IndexVariable.get_level_variable can generate new IndexVariable objects so that MultiIndex levels are accessible as "virtual coordinates".

Would IndexVariable warp a MultiIndex wrapper instead (levels + scalars), and also be able to generate new scalar Variable objects that will be accessible as virtual coordinates?

This is maybe slightly off topic, but more generally I'm also wondering how this would co-exist with potentially other kinds of multi-level indexes (see this comment).

@fujiisoup
Copy link
Member Author

@benbovy
Thanks for the valuable comments.
Actually I can not fully imagine how the actual implementation looks like currently,
but I also think the virtual variable access needs some tricks.
This is an essential functionality of the MultiIndex-coordinate,
I will try to investigate it.

Thanks.

@fujiisoup fujiisoup mentioned this pull request May 25, 2017
9 tasks
@fujiisoup
Copy link
Member Author

Replaced by a new PR #1426 .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

.sel does not keep selected coordinate value in case with MultiIndex
3 participants