Multiindex scalar coords, fixes #1408 #1412

fujiisoup · 2017-05-17T14:25:50Z

Closes .sel does not keep selected coordinate value in case with MultiIndex #1408
Tests added / passed
Passes git diff upstream/master | flake8 --diff
Fully documented, including whats-new.rst for all changes and api.rst for new API

This modification fixes #1408 and works, but I am not fully satisfied with it yet:
there are if statements in many places.

The major changes I made are

1. variable.__getitem__ now returns an OrderedDict if a single element is selected from a MultiIndex.
2. indexing.remap_level_indexers also returns selected_dims, a map from the original dimension to the selected dims, which will become scalar coordinates.

Change 1 keeps the level coordinates even after ds.isel(yx=0).
Change 2 makes it possible to track which levels are selected, so that the selected levels can be converted to scalar coordinates.

I guess a much smarter solution should exist.
I would be happy if anyone could give me comments.

fujiisoup · 2017-05-17T14:28:55Z

xarray/core/dataarray.py

@@ -256,23 +258,6 @@ def _replace_maybe_drop_dims(self, variable, name=__default):
                                 if set(v.dims) <= allowed_dims)
        return self._replace(variable, coords, name)

-    def _replace_indexes(self, indexes):


I removed this method and use Dataset._replace_indexes instead, to reduce duplicates.

benbovy · 2017-05-18

If I remember correctly, we used duplication to avoid calling _to_temp_dataset and _from_temp_dataset in __getitem__. But now that _replace_indexes has more logic implemented, maybe it is a good idea to reduce the duplication?

fujiisoup · 2017-05-18

Thanks.
Yes, _to_temp_dataset and _from_temp_dataset are now used in DataArray.__getitem__ and also in DataArray.isel.

fujiisoup · 2017-05-17T14:36:19Z

xarray/core/groupby.py

@@ -362,6 +362,22 @@ def _maybe_unstack(self, obj):
                    del obj.coords[dim]
        return obj

+    def _maybe_stack(self, applied):


This method becomes necessary because we can no longer do xr.concat([ds.isel(yx=i) for i in range(*)], dim='yx'), since ds.isel(yx=i) does not have yx anymore.

shoyer · 2017-05-17T17:42:58Z

variable.__getitem__ now returns an OrderedDict if a single element is selected from MultiIndex.

I don't like this change. It breaks an important invariant, which is that indexing a Variable returns another Variable.

I do agree that indexing along a MultiIndex dimension should unpack the tuple for coordinates, but only for coordinates. So this needs to be somewhere in the Dataset.isel logic, not Variable.isel.

Consider indexing ds['yx'] from your example in the linked issue. With the current version of xarray:

In [7]: ds['yx']
Out[7]:
<xarray.DataArray 'yx' (yx: 6)>
array([('a', 1), ('a', 2), ('a', 3), ('b', 1), ('b', 2), ('b', 3)], dtype=object)
Coordinates:
  * yx       (yx) MultiIndex
  - y        (yx) object 'a' 'a' 'a' 'b' 'b' 'b'
  - x        (yx) int64 1 2 3 1 2 3

In [8]: ds['yx'][0]
Out[8]:
<xarray.DataArray 'yx' ()>
array(('a', 1), dtype=object)
Coordinates:
    yx       object ('a', 1)

We want to change the indexing behavior to this:

In [8]: ds['yx'][0]
Out[8]:
<xarray.DataArray 'yx' ()>
array(('a', 1), dtype=object)
Coordinates:
    y        object 'a'
    x        int64 1

But we don't want to change what happens to the DataArray itself -- it should still be a scalar object array.

I tested this example on your PR branch, and it actually crashes with KeyError.
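
As an illustration only (not code from this PR), the desired behavior can be sketched with plain Python objects. The helper name split_scalar_multiindex_coord is hypothetical, and tuples and dicts stand in for xarray objects; the point is that only the coordinate is unpacked into levels while the data value stays a tuple.

```python
from collections import OrderedDict


def split_scalar_multiindex_coord(coords, dim, level_names):
    """Replace the scalar tuple coordinate `dim` by one scalar per level.

    Hypothetical helper for illustration; this is not part of xarray's API.
    """
    value = coords.get(dim)
    if not isinstance(value, tuple) or len(value) != len(level_names):
        # Not a selected MultiIndex element: leave the coordinates untouched.
        return OrderedDict(coords)
    new_coords = OrderedDict((k, v) for k, v in coords.items() if k != dim)
    new_coords.update(zip(level_names, value))
    return new_coords


# The data of ds['yx'][0] stays a scalar tuple ...
data = ("a", 1)
# ... and only the coordinates are unpacked into the levels y and x.
coords = split_scalar_multiindex_coord(
    OrderedDict(yx=("a", 1)), "yx", ["y", "x"]
)
print(data, dict(coords))
```

Keeping this step at the dataset level, rather than in the variable, is what preserves the invariant that indexing a Variable returns a Variable.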

fujiisoup · 2017-05-18T04:48:29Z

@shoyer
Thanks for the comment.

It breaks an important invariant, which is that indexing a Variable returns another Variable.

I totally agree with you.

In the last commit, I moved the unpacking functionality into Dataset and reverted the modification I had made to the Variable class.
I think the current version is cleaner than my previous one, but I'm not yet comfortable with it.
There are a lot of functions and if statements related to MultiIndex in different places.
I guess they should be bundled in one place.

Adding functions is easy, but simplifying them is difficult...

If anyone can suggest a direction, I will try to improve it.

benbovy · 2017-05-18T09:56:28Z

A possible way to reduce the number of if statements scattered in many places would be to just return pos_indexers from indexing.remap_level_indexers - as was the case before multi-index support was added - and instead put in Dataset.isel all the logic for checking for a MultiIndex, possibly converting it to an Index and/or scalar coordinates, and possibly renaming the dimension.

This would simplify many things, although I haven't thought about all the other issues it might create (performance, etc.). Also, DataArray.loc doesn't seem to use Dataset.isel.

Here is another case related to this PR. From the example in the linked issue, the current behavior is

In [9]: ds.isel(yx=[0, 1])
Out[9]: 
<xarray.Dataset>
Dimensions:  (yx: 2)
Coordinates:
  * yx       (yx) MultiIndex
  - y        (yx) object 'a' 'a'
  - x        (yx) int64 1 2
Data variables:
    foo      (yx) int64 1 2

Do we want to also change the behavior to this?

In [10]: ds.isel(yx=[0, 1])
Out[10]: 
<xarray.Dataset>
Dimensions:  (x: 2)
Coordinates:
  * x        (x) int64 1 2
    y        object 'a'
Data variables:
    foo      (x) int64 1 2

To me it looks like it is a bit too magical, but just wondering what you think...

shoyer · 2017-05-18T14:42:57Z

To me it looks like it is a bit too magical, but just wondering what you think...

Agreed, this also seems too magical to me.

fujiisoup · 2017-05-19T12:47:28Z

I also agree. It seems too magical.

But I have slightly changed my mind.
I realize that what I really want is not scalar coordinates in a MultiIndex in particular,
but a 'unified' interface between a normal Variable and a MultiIndex.

The current structure is illustrated as follows:

[Diagram: current class structure]

A MultiIndex has different characteristics from a normal Variable.
For example, if we do ds.sel(x=2), it produces a scalar coordinate and a normal Variable.
The reverse process might be .expand_dims().stack().
This is different from normal Variable behavior,
and because of it, MultiIndex has to be treated specially everywhere.
(Deprecating the automatic renaming would not change this much.)

I am wondering whether things would become simpler with the following class structure:

[Diagram: proposed class structure]

In this picture, a MultiIndex can have a scalar as one of its levels, and .isel() produces it.
This process can be reversed by .expand_dims() or .concat(), as for a normal Variable.

I understand that this differs from the pandas.MultiIndex structure, and that we would need to extend our wrapper extensively if we decide to implement it (as written in red).
But I feel this symmetric structure could make it easier to extend MultiIndex functionality in the future.

Any thoughts are welcome.
(Should we move this discussion to another issue?)

shoyer · 2017-05-20T15:11:24Z

@fujiisoup Yes, the solution of writing a MultiIndex wrapper for xarray looks much cleaner to me. I like the look of this proposal! (Those diagrams are also very helpful)

I guess this could be implemented as a pandas.MultiIndex along with a list of scalar coordinates?
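
One way to picture this idea is the toy model below. It is an illustration under stated assumptions, not a proposed implementation: a list of tuples stands in for pandas.MultiIndex, and the class name ScalarLevelIndex and its methods are hypothetical.

```python
class ScalarLevelIndex:
    """Toy model: indexed levels (as tuples) plus levels fixed to scalars."""

    def __init__(self, names, tuples, scalars=None):
        self.names = list(names)                  # levels still indexed
        self.tuples = [tuple(t) for t in tuples]  # stand-in for a MultiIndex
        self.scalars = dict(scalars or {})        # level name -> scalar value

    def __len__(self):
        return len(self.tuples)

    def isel(self, position):
        """Select one element; every level is kept, now as a scalar."""
        scalars = dict(self.scalars)
        scalars.update(zip(self.names, self.tuples[position]))
        return ScalarLevelIndex([], [], scalars)

    @classmethod
    def concat(cls, items, names):
        """Reverse of isel: rebuild the indexed levels from scalar levels."""
        tuples = [tuple(item.scalars[name] for name in names) for item in items]
        return cls(names, tuples)


index = ScalarLevelIndex(
    ["y", "x"],
    [("a", 1), ("a", 2), ("a", 3), ("b", 1), ("b", 2), ("b", 3)],
)
first = index.isel(0)
rebuilt = ScalarLevelIndex.concat(
    [index.isel(i) for i in range(len(index))], ["y", "x"]
)
print(first.scalars)
print(rebuilt.tuples == index.tuples)
```

Because the selected element keeps its level values as scalars, selecting each element and concatenating the results recovers the original index, which is the symmetry with normal Variable behavior discussed above.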

benbovy · 2017-05-20T16:31:43Z

I also agree that a MultiIndex wrapper would be the way to go. I think I understand the diagrams, but what remains a bit unclear to me is how this could be implemented.

In particular, how would this wrapper work with IndexVariable?

Currently, IndexVariable wraps either a pandas.Index or a pandas.MultiIndex, and in the latter case IndexVariable.get_level_variable can generate new IndexVariable objects so that MultiIndex levels are accessible as "virtual coordinates".

Would IndexVariable wrap a MultiIndex wrapper instead (levels + scalars), and also be able to generate new scalar Variable objects that would be accessible as virtual coordinates?

This is maybe slightly off topic, but more generally I'm also wondering how this would co-exist with potentially other kinds of multi-level indexes (see this comment).

fujiisoup · 2017-05-21T07:13:37Z

@benbovy
Thanks for the valuable comments.
Actually, I cannot yet fully imagine what the actual implementation would look like,
but I also think that virtual variable access will need some tricks.
Since this is an essential feature of MultiIndex coordinates,
I will try to investigate it.

Thanks.

fujiisoup · 2017-05-25T11:04:55Z

Replaced by a new PR, #1426.

fujiisoup added 8 commits May 15, 2017 21:44

4876ee8  Make IndexVariable.__getitem__ return an OrderedDict of variables if a single element is selected from a MultiIndex.
800a7d9  Still error in test_groupby, since it requres .isel() and concat for MultiIndex.
3fac440  Added Groupby._maybe_stack.
df96d8d  Added small docstring.
6db36a8  starting sel()
0af9da2  Remove DataArray._replace_index and make use of Dataset._replace_index.
cdd839d  Add whatsnew
42f4f2e  Remove unecessary flatten call.

185abd0  Restore Variable.__getitem__ to return a Variable even if a signle element is selected from MultiIndex. Instead, added _maybe_split function in Dataset

fujiisoup mentioned this pull request May 25, 2017

scalar_level in MultiIndex #1426

Closed

9 tasks

fujiisoup closed this May 25, 2017

fujiisoup mentioned this pull request May 28, 2017

inconsistent behavior in stack/unstack along one dimension #1431

Closed
