Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error when supplying a tuple of dimensions to DataArray.sortby() #4821

Open
simonkeys opened this issue Jan 18, 2021 · 18 comments
Open

Error when supplying a tuple of dimensions to DataArray.sortby() #4821

simonkeys opened this issue Jan 18, 2021 · 18 comments

Comments

@simonkeys
Copy link

What happened:

A KeyError is thrown when supplying a tuple of dimensions to DataArray.sortby()

What you expected to happen:

The dataarray to be sorted according to the dimensions.

Minimal Complete Verifiable Example:

import xarray as xr
import numpy as np

da=xr.DataArray(np.random.rand(3,3), coords=(('x', range(3, 0, -1)), ('y', range(3, 0, -1))))
da.sortby(da.dims)

Anything else we need to know?:

If the tuple is cast to a list it works correctly:

da.sortby(list(da.dims))

Environment:

Output of xr.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.12 (default, Nov 11 2020, 22:22:08)
[GCC 4.2.1 Compatible Apple LLVM 10.0.1 (clang-1001.0.46.4)]
python-bits: 64
OS: Darwin
OS-release: 18.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: en_US.UTF-8
libhdf5: None
libnetcdf: None

xarray: 0.16.2
pandas: 1.1.4
numpy: 1.19.4
scipy: 1.5.4
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.3.2
dask: 2020.12.0
distributed: None
matplotlib: 3.3.3
cartopy: None
seaborn: None
numbagg: None
pint: None
setuptools: 50.3.2
pip: 20.3.3
conda: None
pytest: 6.2.1
IPython: 7.16.1
sphinx: None

@mathause
Copy link
Collaborator

Yes, I am marking that as a bug. The problem is here:

xarray/xarray/core/dataset.py

Lines 5612 to 5613 in 2956067

if not isinstance(variables, list):
variables = [variables]

as the tuple is not a list it ends up as [('x', 'y')]. That should probably be

if isinstance(variables, str):
    variables = [variables]

or

if not isinstance(variables, (list, tuple)):
    variables = [variables]

@keewis
Copy link
Collaborator

keewis commented Jan 19, 2021

I think it's the way it is because tuple is hashable, so it can be used as a name:

ds = xr.Dataset({("a", 0): ("x", [1, 0])})
ds.sortby(("a", 0))

not sure if there's a particular use case where supporting this is crucial

@mathause
Copy link
Collaborator

Ah yes you are right... However, we are not consistent with this, e.g. that works for reductions air.mean(("lat", "lon"))

@mesejo
Copy link
Contributor

mesejo commented Jan 23, 2021

I would like to work on this

@max-sixty
Copy link
Collaborator

I would propose we either:

  • Explicitly allow tuples as dimension and variable names. This is probably the most mypy-compatible, since we use Hashable in lots of places
  • Force dimension names to be strings, but allow variable names to be Hashable. This might be the most practical.
  • Disallow tuples for dimension names, but otherwise allow Hashable. Then tuples work the same as lists when supplied as arguments.

@keewis
Copy link
Collaborator

keewis commented Jan 23, 2021

All of these options are breaking changes so we would probably have to go through a deprecation cycle. 1 might be the most consistent and least confusing, but if we want to keep the current behavior we could require wrapping a tuple in another sequence to access dimension names.

@thomashirtz
Copy link
Contributor

@mesejo Do you still work on it ? If not, I am interested to work on it

@max-sixty
Copy link
Collaborator

Any thoughts from others on what we should do, of the three options above? CC @pydata/xarray

@thomashirtz
Copy link
Contributor

thomashirtz commented Jun 5, 2021

(@max-sixty , I originally took this one because I wanted to do one more ~easy PR. I just saw you modified the tags, lmk if this one is problematic or if you know other issues I could tackle)

@dcherian
Copy link
Contributor

dcherian commented Jun 5, 2021

I think 2 and 3 become really confusing with swap_dims (for e.g.) So my vote is for 1. In all three cases, the solution to this issue is to suggest passing lists in the error message?

@max-sixty
Copy link
Collaborator

max-sixty commented Jun 5, 2021

Great, agree @dcherian . So maybe we:

  • Check whether the dim exists (there could be a dim with tuple), which is already done in 1371 below
  • If it doesn't exist, then check whether it's a string before passing it to _get_virtual_variable below. If it's a tuple, that message can suggest passing it as a list.
    • Or we do the check in _get_virtual_variable?
   1366     def _construct_dataarray(self, name: Hashable) -> "DataArray":
   1367         """Construct a DataArray by indexing this dataset"""
   1368         from .dataarray import DataArray
   1369
   1370         try:
   1371             variable = self._variables[name]
   1372         except KeyError:
-> 1373             _, name, variable = _get_virtual_variable(
   1374                 self._variables, name, self._level_coords, self.dims
   1375             )

@thomashirtz changing the labels back! Thanks in advance!

@mesejo
Copy link
Contributor

mesejo commented Jun 8, 2021

@thomashirtz Feel free to work on it.

@thomashirtz
Copy link
Contributor

You seemed to all agree on the solution 1:

Explicitly allow tuples as dimension and variable names. This is probably the most mypy-compatible, since we use Hashable in lots of places

If you allow explicitly tuple, doesn't it mean I need to edit this part (as mentioned by @mathause) :

if not isinstance(variables, (list, tuple)):
    variables = [variables]

(I tried and it works well)

But you said after :

If it's a tuple, that message can suggest passing it as a list.

(I find it a little bit weird to give dimensions in list, no ? I thought that generally tuples were more adapted to provide dims)

Can you tell me which test-case I can implement to fulfill all the requirements needed ? I can't think of test cases that I can implement except the one suggested first, and that one passes with the first modification

Also the code you mentioned is for one dim, but in this case we can have a list of dims, it does means that I need to do a 'for loop' for checking the existence of the dims, right ?

@max-sixty
Copy link
Collaborator

If you allow explicitly tuple, doesn't it mean I need to edit this part

Yes!

(I find it a little bit weird to give dimensions in list, no ? I thought that generally tuples were more adapted to provide dims)

Yes, I feel similarly. For one xr.Dataset(dict(var=(('a','b'), np.random.rand(3)))) is an error, rather than a one-dimensioned dataset with a single dim called ('a','b'). But OTOH I can very much empathize with accepting any Hashable for dim names most functions, which includes tuples. So I am fine formalizing this (apart from the constructor).

Can you tell me which test-case I can implement to fulfill all the requirements needed ? I can't think of test cases that I can implement except the one suggested first, and that one passes with the first modification

How about a test that checks the example above suggests a list, rather than raising a vanilla KeyError?

Cheers @thomashirtz

@thomashirtz
Copy link
Contributor

I think I am quite confused by the fact that 'one' dimension can be a tuple (also I couldn't find a way to successfully create a dataset with a dim being a tuple, even by tweaking your example using list)

Was it what you imagined for the dataset construction ? (If I understand right, a virtual variable can't be a tuple)

    def _construct_dataarray(self, name: Hashable) -> "DataArray":
        """Construct a DataArray by indexing this dataset"""
        from .dataarray import DataArray

        try:
            variable = self._variables[name]
        except KeyError:
            if isinstance(name, tuple):
                raise TypeError(f'The dimension "{name}" provided is a tuple, you may be intended to pass it as a list')
            else:
            _, name, variable = _get_virtual_variable(
                self._variables, name, self._level_coords, self.dims
            )

@keewis
Copy link
Collaborator

keewis commented Jun 11, 2021

here's an example:

In [2]: xr.Dataset({"a": ([("a", "b")], [1])})
Out[2]: 
<xarray.Dataset>
Dimensions:  (('a', 'b'): 1)
Dimensions without coordinates: ('a', 'b')
Data variables:
    a        (('a', 'b')) int64 1

@thomashirtz
Copy link
Contributor

thomashirtz commented Jun 13, 2021

Ok, I think I understand better now
Almost finished, I just have one test that fails currently, it is this one:

def test_getitem_hashable(self):
data = create_test_data()
data[(3, 4)] = data["var1"] + 1
expected = data["var1"] + 1
expected.name = (3, 4)
assert_identical(expected, data[(3, 4)])
with pytest.raises(KeyError, match=r"('var1', 'var2')"):
data[("var1", "var2")]

The error changed from a KeyError to the error that I wrote in the constructor :

>               raise TypeError(f'The dimension provided is a tuple, you may intended to pass a list')
E               TypeError: The dimension provided is a tuple, you may intended to pass a list

I am not sure what I should do with it, should I delete it ?

(Thanks for your help :))

@max-sixty
Copy link
Collaborator

Yes, we want to replace that test with your test — either in the same test function or a new test function.

Though it maybe it should still be a KeyError, I think probably it should be. But feel free to submit the PR and it's easiest to review there.

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants