Netcdf char array not being decoded to string in compound dtype #1977

Open
jacklovell opened this issue Mar 9, 2018 · 4 comments

@jacklovell

Code Sample, a copy-pastable example if possible

import xarray as xr

script_nc_file = "bolo_geom_fromscript.nc"
with xr.open_dataset(script_nc_file, group='/bolo/sxd', concat_characters=True) as ds:
    ds = ds.copy(deep=True)
print(ds.slits)
<xarray.DataArray 'slits' (slit_no: 4)>
array([ ([b'B', b'o', b'l', b'o', b'm', b'e', b't', b'e', b'r', b'S', b'l', b'i', b't', b'', b'', b'', b'', b'', b'', b'', b'', b'', b'', b'', b'', b'', b'', b'', b'', b''], [b'M', b'A', b'S', b'T', b'-', b'U', b' ', b'S', b'X', b'D', b' ', b'-', b' ', b'O', b'u', b't', b'e', b'r', b' ', b'S', b'l', b'i', b't', b' ', b'1', b'', b'', b'', b'', b''], 1, (-0.06458486,  0.21803484, -0.97380162), ( 0.95881973,  0.28401534,  0.), (-0.52069675,  1.77104629, -1.564     ),  0.005,  0.005, [b'M', b'A', b'S', b'T', b'-', b'U', b' ', b'S', b'X', b'D', b' ', b'-', b' ', b'O', b'u', b't', b'e', b'r', b' ', b'S', b'l', b'i', b't', b' ', b'1', b'', b'', b'', b'', b''], 0),
       ([b'B', b'o', b'l', b'o', b'm', b'e', b't', b'e', b'r', b'S', b'l', b'i', b't', b'', b'', b'', b'', b'', b'', b'', b'', b'', b'', b'', b'', b'', b'', b'', b'', b''], [b'M', b'A', b'S', b'T', b'-', b'U', b' ', b'S', b'X', b'D', b' ', b'-', b' ', b'O', b'u', b't', b'e', b'r', b' ', b'S', b'l', b'i', b't', b' ', b'2', b'', b'', b'', b'', b''], 1, (-0.16038567,  0.54145294, -0.82529095), ( 0.95881973,  0.28401534,  0.), (-0.5278879 ,  1.76891617, -1.564     ),  0.005,  0.005, [b'M', b'A', b'S', b'T', b'-', b'U', b' ', b'S', b'X', b'D', b' ', b'-', b' ', b'O', b'u', b't', b'e', b'r', b' ', b'S', b'l', b'i', b't', b' ', b'2', b'', b'', b'', b'', b''], 1),
       ([b'B', b'o', b'l', b'o', b'm', b'e', b't', b'e', b'r', b'S', b'l', b'i', b't', b'', b'', b'', b'', b'', b'', b'', b'', b'', b'', b'', b'', b'', b'', b'', b'', b''], [b'M', b'A', b'S', b'T', b'-', b'U', b' ', b'S', b'X', b'D', b' ', b'-', b' ', b'U', b'p', b'p', b'e', b'r', b' ', b'S', b'l', b'i', b't', b' ', b'3', b'', b'', b'', b'', b''], 1, (-0.26470454,  0.89362754, -0.36243804), ( 0.95881973,  0.28401534,  0.), (-0.31231469,  1.06756025, -1.57072314),  0.005,  0.005, [b'M', b'A', b'S', b'T', b'-', b'U', b' ', b'S', b'X', b'D', b' ', b'-', b' ', b'U', b'p', b'p', b'e', b'r', b' ', b'S', b'l', b'i', b't', b' ', b'3', b'', b'', b'', b'', b''], 2),
       ([b'B', b'o', b'l', b'o', b'm', b'e', b't', b'e', b'r', b'S', b'l', b'i', b't', b'', b'', b'', b'', b'', b'', b'', b'', b'', b'', b'', b'', b'', b'', b'', b'', b''], [b'M', b'A', b'S', b'T', b'-', b'U', b' ', b'S', b'X', b'D', b' ', b'-', b' ', b'U', b'p', b'p', b'e', b'r', b' ', b'S', b'l', b'i', b't', b' ', b'4', b'', b'', b'', b'', b''], 1, (-0.19640032,  0.66303636,  0.72236396), ( 0.95881973,  0.28401534,  0.), (-0.31950584,  1.06543013, -1.57072314),  0.005,  0.005, [b'M', b'A', b'S', b'T', b'-', b'U', b' ', b'S', b'X', b'D', b' ', b'-', b' ', b'U', b'p', b'p', b'e', b'r', b' ', b'S', b'l', b'i', b't', b' ', b'4', b'', b'', b'', b'', b''], 3)], 
      dtype={'names':['Object_type','ID','Version','basis_1','basis_2','centre_point','width','height','slit_id','slit_no'], 'formats':[('S1', (30,)),('S1', (30,)),'<i4',[('x', '<f8'), ('y', '<f8'), ('z', '<f8')],[('x', '<f8'), ('y', '<f8'), ('z', '<f8')],[('x', '<f8'), ('y', '<f8'), ('z', '<f8')],'<f8','<f8',('S1', (30,)),'<i4'], 'offsets':[0,30,60,64,88,112,136,144,152,184], 'itemsize':192, 'aligned':True})
Coordinates:
  * slit_no  (slit_no) int64 0 1 2 3
Attributes:
    units: m

Problem description

When opening the attached dataset, the char arrays in the compound dtype are not converted into strings, despite concat_characters=True.
bolo_geom_fromscript.nc.zip

Expected Output

The char arrays should be converted into strings (or at the very least, bytes if an encoding is not present):

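import netCDF4 as nc
import numpy as np
import xarray as xr

# Continues from the snippet above: `ds` is the dataset opened with xr.open_dataset.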
longest_str = np.max([ds.slits.values[field].shape[-1] for field in ds.slits.values.dtype.fields 
                      if ds.slits.values[field].dtype.kind in ('S', 'U')])
str_dtype = '<U{}'.format(longest_str)
cartesian_coord = np.dtype([('x', np.float64), ('y', np.float64), ('z', np.float64)])
aperture_dtype_str = np.dtype([('Object_type', str_dtype), 
                               ('ID', str_dtype), ('Version', np.int32),
                               ('basis_1', cartesian_coord), ('basis_2', cartesian_coord), 
                               ('centre_point', cartesian_coord),
                               ('width', np.float64), ('height', np.float64),
                               ('slit_id', str_dtype), ('slit_no', np.int32)])
ds['slits_str'] = xr.DataArray(np.empty(ds.slits.size, aperture_dtype_str), 
                               coords=[('slit_no', ds.coords['slit_no'])])
for key in ds.slits.values.dtype.fields:
    if key in ('Object_type', 'ID', 'slit_id'):
        string_key = nc.chartostring(ds.slits.values[key])
        ds.slits_str.values[key] = string_key
    else:
        ds.slits_str.values[key] = ds.slits.values[key]
print(ds.slits_str)

<xarray.DataArray 'slits_str' (slit_no: 4)>
array([ ('BolometerSlit', 'MAST-U SXD - Outer Slit 1', 1, (-0.06458486,  0.21803484, -0.97380162), ( 0.95881973,  0.28401534,  0.), (-0.52069675,  1.77104629, -1.564     ),  0.005,  0.005, 'MAST-U SXD - Outer Slit 1', 0),
       ('BolometerSlit', 'MAST-U SXD - Outer Slit 2', 1, (-0.16038567,  0.54145294, -0.82529095), ( 0.95881973,  0.28401534,  0.), (-0.5278879 ,  1.76891617, -1.564     ),  0.005,  0.005, 'MAST-U SXD - Outer Slit 2', 1),
       ('BolometerSlit', 'MAST-U SXD - Upper Slit 3', 1, (-0.26470454,  0.89362754, -0.36243804), ( 0.95881973,  0.28401534,  0.), (-0.31231469,  1.06756025, -1.57072314),  0.005,  0.005, 'MAST-U SXD - Upper Slit 3', 2),
       ('BolometerSlit', 'MAST-U SXD - Upper Slit 4', 1, (-0.19640032,  0.66303636,  0.72236396), ( 0.95881973,  0.28401534,  0.), (-0.31950584,  1.06543013, -1.57072314),  0.005,  0.005, 'MAST-U SXD - Upper Slit 4', 3)], 
      dtype=[('Object_type', '<U30'), ('ID', '<U30'), ('Version', '<i4'), ('basis_1', [('x', '<f8'), ('y', '<f8'), ('z', '<f8')]), ('basis_2', [('x', '<f8'), ('y', '<f8'), ('z', '<f8')]), ('centre_point', [('x', '<f8'), ('y', '<f8'), ('z', '<f8')]), ('width', '<f8'), ('height', '<f8'), ('slit_id', '<U30'), ('slit_no', '<i4')])
Coordinates:
  * slit_no  (slit_no) int64 0 1 2 3
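
The workaround above hard-codes the three string field names. As a rough sketch of the same idea without that assumption, the helper below uses only numpy to find every field stored as a 1-D array of single bytes, i.e. `('S1', (n,))`, and replace it with a fixed-width unicode field. The name `decode_char_fields` and the default `encoding="utf-8"` are assumptions made for illustration; neither comes from xarray or netCDF4.

```python
import numpy as np


def decode_char_fields(arr, encoding="utf-8"):
    """Return a copy of structured array `arr` in which every ('S1', (n,))
    field is replaced by a fixed-width unicode field of length n.

    Hypothetical helper for illustration; not part of xarray or netCDF4.
    """
    new_fields = []
    char_fields = set()
    for name in arr.dtype.names:
        field_dtype = arr.dtype[name]
        # A char-array field is a 1-D sub-array dtype whose base type is 'S1'.
        is_char_array = (
            field_dtype.subdtype is not None
            and field_dtype.base == np.dtype("S1")
            and len(field_dtype.shape) == 1
        )
        if is_char_array:
            new_fields.append((name, "U{}".format(field_dtype.shape[0])))
            char_fields.add(name)
        else:
            new_fields.append((name, field_dtype))

    out = np.empty(arr.shape, dtype=np.dtype(new_fields))
    for name in arr.dtype.names:
        if name in char_fields:
            chars = arr[name]  # shape: arr.shape + (n,), dtype 'S1'
            n = chars.shape[-1]
            # Reinterpret each run of n single bytes as one n-byte string,
            # then drop the trailing axis, which now has length 1.
            joined = np.ascontiguousarray(chars).view("S{}".format(n))[..., 0]
            out[name] = np.char.decode(joined, encoding)
        else:
            out[name] = arr[name]
    return out
```

Under these assumptions, calling `decode_char_fields(ds.slits.values)` on the array from this issue should turn the `Object_type`, `ID` and `slit_id` fields into `<U30` strings and copy the remaining fields unchanged. The result uses a packed dtype rather than the aligned layout of the original.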

Output of xr.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Linux
OS-release: 4.13.0-32-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8

xarray: 0.10.0
pandas: 0.22.0
numpy: 1.13.3
scipy: 1.0.0
netCDF4: 1.3.1
h5netcdf: None
Nio: None
bottleneck: 1.2.1
cyordereddict: None
dask: 0.15.3
matplotlib: 2.1.0
cartopy: None
seaborn: 0.8.0
setuptools: 38.5.1
pip: 9.0.1
conda: 4.4.11
pytest: 3.2.1
IPython: 6.2.1
sphinx: 1.6.3

@shoyer
Member

shoyer commented Mar 10, 2018

We currently don't have any special code for handling of compound data types in xarray, but I'm open to adding that if necessary. Note that you currently cannot write compound dtypes to netCDF files.

I think netCDF4-python is doing something sub-optimal here, but I'll comment in the other issue tracker.

@jacklovell
Author

Now that Unidata/netcdf4-python#778 has been merged, it should be a bit easier to support this in xarray too. Though as previously mentioned, it will require no longer calling var.set_auto_chartostring(False) for compound types.

@stale

stale bot commented Feb 14, 2020

In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity

If this issue remains relevant, please comment here or remove the stale label; otherwise it will be marked as closed automatically

@stale stale bot added the stale label Feb 14, 2020
@dcherian dcherian removed the stale label Feb 14, 2020
@kmuehlbauer
Contributor

For the sake of completeness, this is still an issue in xarray 2024.10.0, but the characters are returned concatenated when using engine="h5netcdf".
