Labeled repr #1044

Closed
chris-b1 opened this issue Oct 12, 2016 · 8 comments

@chris-b1
Contributor

It may be nice to take advantage of labels to show a different, labeled repr. Especially for more than 3 dimensions, I personally find the numpy array repr hard to read.

Some sample data and the current repr

In [103]: d = xr.DataArray(np.arange(200).reshape((2,5,2,10)), dims=('a', 'b', 'c', 'd'),
     ...:                  coords={'a': ['A', 'B'], 'b': ['Cat 1', 'Cat 2', 'Cat 3', 'Cat 4', 'Cat 5'],
     ...:                          'c': ['J', 'K']})

In [104]: d
Out[104]: 
<xarray.DataArray (a: 2, b: 5, c: 2, d: 10)>
array([[[[  0,   1,   2,   3,   4,   5,   6,   7,   8,   9],
         [ 10,  11,  12,  13,  14,  15,  16,  17,  18,  19]],

        [[ 20,  21,  22,  23,  24,  25,  26,  27,  28,  29],
         [ 30,  31,  32,  33,  34,  35,  36,  37,  38,  39]],

        [[ 40,  41,  42,  43,  44,  45,  46,  47,  48,  49],
         [ 50,  51,  52,  53,  54,  55,  56,  57,  58,  59]],

        [[ 60,  61,  62,  63,  64,  65,  66,  67,  68,  69],
         [ 70,  71,  72,  73,  74,  75,  76,  77,  78,  79]],

        [[ 80,  81,  82,  83,  84,  85,  86,  87,  88,  89],
         [ 90,  91,  92,  93,  94,  95,  96,  97,  98,  99]]],


       [[[100, 101, 102, 103, 104, 105, 106, 107, 108, 109],
         [110, 111, 112, 113, 114, 115, 116, 117, 118, 119]],

        [[120, 121, 122, 123, 124, 125, 126, 127, 128, 129],
         [130, 131, 132, 133, 134, 135, 136, 137, 138, 139]],

        [[140, 141, 142, 143, 144, 145, 146, 147, 148, 149],
         [150, 151, 152, 153, 154, 155, 156, 157, 158, 159]],

        [[160, 161, 162, 163, 164, 165, 166, 167, 168, 169],
         [170, 171, 172, 173, 174, 175, 176, 177, 178, 179]],

        [[180, 181, 182, 183, 184, 185, 186, 187, 188, 189],
         [190, 191, 192, 193, 194, 195, 196, 197, 198, 199]]]])
Coordinates:
  * a        (a) <U1 'A' 'B'
  * b        (b) <U5 'Cat 1' 'Cat 2' 'Cat 3' 'Cat 4' 'Cat 5'
  * c        (c) <U1 'J' 'K'
  * d        (d) int64 0 1 2 3 4 5 6 7 8 9

The labeled repr could instead look something (not exactly) like this?

<xarray.DataArray (a: 2, b: 5, c: 2, d: 10)>

a: 'A'
b: 'Cat 1'
c x d: 
         0   1   2   3   4   5   6   7   8   9
     J   0   1   2   3   4   5   6   7   8   9
     K  10  11  12  13  14  15  16  17  18  19


a: 'A'
b: 'Cat 2'
c x d
    <repeat>
...

Coordinates:
  * a        (a) <U1 'A' 'B'
  * b        (b) <U5 'Cat 1' 'Cat 2' 'Cat 3' 'Cat 4' 'Cat 5'
  * c        (c) <U1 'J' 'K'
  * d        (d) int64 0 1 2 3 4 5 6 7 8 9
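
A minimal sketch of how such a labeled repr might be produced, assuming pandas is used to format each 2-D block. The helper name `labeled_repr` and its `max_blocks` parameter are hypothetical and not part of xarray; the sketch only illustrates the idea of one labeled table per combination of the leading dimensions.

```python
import itertools

import numpy as np
import pandas as pd
import xarray as xr


def labeled_repr(da, max_blocks=2):
    """Hypothetical helper: render a DataArray as a series of labeled 2-D tables."""
    if da.ndim < 2:
        # Nothing to tabulate, so fall back to the regular repr.
        return repr(da)
    *outer, row_dim, col_dim = da.dims
    lines = []
    # One table per combination of positions along the leading dimensions.
    positions = itertools.product(*(range(da.sizes[dim]) for dim in outer))
    for count, position in enumerate(positions):
        if count == max_blocks:
            lines.append("...")
            break
        for dim, i in zip(outer, position):
            # Look the label up on the original array so that dimensions
            # without an explicit coordinate still show their integer position.
            lines.append(f"{dim}: {da[dim][i].item()!r}")
        block = da.isel(dict(zip(outer, position)))
        table = pd.DataFrame(
            block.values,
            index=da[row_dim].values,
            columns=da[col_dim].values,
        )
        lines.append(f"{row_dim} x {col_dim}:")
        lines.append(table.to_string())
        lines.append("")
    return "\n".join(lines)


# Same sample data as in the issue description.
d = xr.DataArray(
    np.arange(200, dtype="int64").reshape((2, 5, 2, 10)),
    dims=("a", "b", "c", "d"),
    coords={
        "a": ["A", "B"],
        "b": ["Cat 1", "Cat 2", "Cat 3", "Cat 4", "Cat 5"],
        "c": ["J", "K"],
    },
)
print(labeled_repr(d))
```

Capping the number of blocks keeps the output short for arrays with many leading positions; how many blocks to show would be a natural display option.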
@shoyer
Member

shoyer commented Oct 12, 2016

Agreed, I've never been really happy with our use of the NumPy repr for >2 dimensions. It's quite hard to match up the labels.

Something like this would be a meaningful improvement! I would encourage experimentation on this.

@fmaussion
Member

fmaussion commented Oct 12, 2016

Good idea! I am in favor of keeping the repr as short as possible, i.e. maybe showing only the first few values in each dimension.

@max-sixty
Collaborator

I think dupe of #680

@benbovy
Member

benbovy commented Oct 13, 2016

After seeing the discussion in #680, I'm wondering if showing the first values of the flattened array wouldn't be enough here, e.g., something like this:

>>> d
<xarray.DataArray (a: 2, b: 5, c: 2, d: 10)>
  array          int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 ...
Coordinates:
  * a        (a) <U1 'A' 'B'
  * b        (b) <U5 'Cat 1' 'Cat 2' 'Cat 3' 'Cat 4' 'Cat 5'
  * c        (c) <U1 'J' 'K'
  * d        (d) int64 0 1 2 3 4 5 6 7 8 9

This example is more consistent with the repr of Dataset data variables, and similarly we could customize the repr of dask arrays and lazy arrays (loaded from netcdf files) like this:

>>> d.chunk((10, 5, 5, 10))
<xarray.DataArray (a: 2, b: 5, c: 2, d: 10)>
  dask.array     int64 chunksize=(10, 5, 5, 10)
Coordinates:
  * a        (a) <U1 'A' 'B'
  * b        (b) <U5 'Cat 1' 'Cat 2' 'Cat 3' 'Cat 4' 'Cat 5'
  * c        (c) <U1 'J' 'K'
  * d        (d) int64 0 1 2 3 4 5 6 7 8 9
>>> d.name = 'myvar'
>>> d.to_netcdf('data.nc')
>>> xr.open_dataset('data.nc').myvar
<xarray.DataArray 'myvar' (a: 2, b: 5, c: 2, d: 10)>
  lazy-array     int64
Coordinates:
  * a        (a) <U1 'A' 'B'
  * b        (b) <U5 'Cat 1' 'Cat 2' 'Cat 3' 'Cat 4' 'Cat 5'
  * c        (c) <U1 'J' 'K'
  * d        (d) int64 0 1 2 3 4 5 6 7 8 9
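
The one-line summary of flattened values could be sketched roughly as follows. This is only an illustration: `flat_summary` and `max_width` are hypothetical names, and the sketch assumes the data is already in memory (it calls `.values`), so it does not cover the dask or lazy-array cases shown above.

```python
import numpy as np
import xarray as xr


def flat_summary(da, max_width=80):
    """Hypothetical helper: dtype plus the first flattened values on one line."""
    values = da.values.ravel()  # assumption: the data is already loaded in memory
    line = f"  {'array':<15}{da.dtype}"
    shown = 0
    for value in values:
        candidate = f"{line} {value}"
        # Always leave room for a trailing " ..." marker.
        if len(candidate) + len(" ...") > max_width:
            break
        line = candidate
        shown += 1
    if shown < values.size:
        line += " ..."
    return line


d = xr.DataArray(
    np.arange(200, dtype="int64").reshape((2, 5, 2, 10)),
    dims=("a", "b", "c", "d"),
)
print(flat_summary(d))
```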

@fmaussion
Member

I agree, but I see one or two cases where it could be useful to have the first few values for each dim. For example with geopotential data on pressure levels, it could be interesting to see how the data varies with height on the third dim. But this is a detail, not very important.
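
As a rough sketch of "the first few values for each dim", positional slicing with `isel` can trim every dimension before anything is displayed. The helper name `first_values` and its parameter `n` are hypothetical.

```python
import numpy as np
import xarray as xr


def first_values(da, n=3):
    """Hypothetical helper: keep at most the first n positions along every dimension."""
    return da.isel({dim: slice(n) for dim in da.dims})


d = xr.DataArray(
    np.arange(200, dtype="int64").reshape((2, 5, 2, 10)),
    dims=("a", "b", "c", "d"),
    coords={
        "a": ["A", "B"],
        "b": ["Cat 1", "Cat 2", "Cat 3", "Cat 4", "Cat 5"],
        "c": ["J", "K"],
    },
)
# Dimensions shorter than n are kept whole; longer ones are cut to n entries.
print(first_values(d, n=3).sizes)
```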

@chris-b1
Contributor Author

chris-b1 commented Oct 13, 2016

There could be some display options exposed to manage this. For instance, I personally would not like a flat array, but I see how it could make sense.

Additionally / alternatively, the repr I'm talking about (a small slice of values laid out with coordinate labels) could be called something other than __repr__, something like pandas .head(), although there may be a better name to use here.

@benbovy
Member

benbovy commented Oct 13, 2016

In most cases I found the DataArray repr useful for quickly checking the dimensions (both names and sizes), the attributes and the types/values of both data and labels (I mean just checking here if the values are consistent regarding their units, acceptable ranges, etc.), but rarely for in-depth checking of the data values along each dimension, hence my suggestion of a flat (subset) array.

To inspect the data of high-dimensional DataArrays, I've mainly used the indexing logic of xarray to extract slices of <3 dimensions. However, I admit that for quick inspection purposes I actually like your suggestion of having a specific repr method that would show small data slices as labeled tables, especially if we choose to always use a flat array for the repr of DataArray (i.e., even when the number of dimensions is <3). Why not something like:

>>> d.slice_repr(a=0, b=0)
d   0   1   2   3   4   5   6   7   8   9
c                                        
J   0   1   2   3   4   5   6   7   8   9
K  10  11  12  13  14  15  16  17  18  19

This is equivalent to

>>> dslice = d.isel(a=0, b=0)
>>> pd.DataFrame(data=dslice.data, index=dslice.c, columns=dslice.d)

Except that slice_repr() would return a string instead of a data object (or an array or a dataframe).
Not sure about the name and/or signature of slice_repr(), though.
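
One possible shape for this, written as a standalone function because `slice_repr` does not exist in xarray. The name and signature follow the suggestion above and remain assumptions; the sketch simply wraps the `isel` plus `pd.DataFrame` equivalent and returns the formatted string.

```python
import numpy as np
import pandas as pd
import xarray as xr


def slice_repr(da, **indexers):
    """Hypothetical helper: select by position, then format the 2-D remainder as text."""
    dslice = da.isel(indexers)
    if dslice.ndim != 2:
        raise ValueError(
            f"expected a 2-D slice after indexing, got dimensions {dslice.dims}"
        )
    row_dim, col_dim = dslice.dims
    table = pd.DataFrame(
        dslice.values,
        # Naming the index and columns makes the dimension names appear in the table.
        index=pd.Index(dslice[row_dim].values, name=row_dim),
        columns=pd.Index(dslice[col_dim].values, name=col_dim),
    )
    return table.to_string()


d = xr.DataArray(
    np.arange(200, dtype="int64").reshape((2, 5, 2, 10)),
    dims=("a", "b", "c", "d"),
    coords={
        "a": ["A", "B"],
        "b": ["Cat 1", "Cat 2", "Cat 3", "Cat 4", "Cat 5"],
        "c": ["J", "K"],
    },
)
print(slice_repr(d, a=0, b=0))
```

Raising an error when the remaining slice is not 2-D is one choice among several; such a method could instead fall back to the regular repr.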

@stale

stale bot commented Jan 25, 2019

In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity.
If this issue remains relevant, please comment here; otherwise it will be marked as closed automatically.

@stale stale bot added the stale label Jan 25, 2019
@stale stale bot closed this as completed Feb 24, 2019