Nondeterministic bug with bytestring decoding #3991

Closed
lamorton opened this issue Apr 21, 2020 · 2 comments

Comments

@lamorton

I have an HDF5 dataset with a scalar variable called 'name' that is actually a 0-D NumPy array with dtype '|S8'. (Not my choice, this is what I get from someone else...) Occasionally, loading it fails.

MCVE Code Sample

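# Set up the file
import h5py
import numpy as np

f = h5py.File("error_demo.h5", mode='w')
# 8-byte value in an 8-byte field: no room left for a terminating zero byte
f.create_dataset('name', shape=(), dtype="|S8", data=np.array([b'f(Pt,TE)'], dtype='|S8'))
f.close()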

#Produce the error -- you may need to adjust the number of times you run the loop
import xarray as xr
for i in range(10):
    xr.load_dataset("error_demo.h5")

Expected Output

<xarray.Dataset>
Dimensions:  ()
Data variables:
    name     <U8 'f(Pt,TE)'

Problem Description

The resulting error message
Traceback (most recent call last):

  File "<ipython-input-3-b8e48f28a262>", line 1, in <module>
    mcout62 = xr.load_dataset("57062/mcout000011.h5",group=r"part/ions/dE(r,z,D)")

  File "/Users/lmorton/opt/anaconda3/lib/python3.7/site-packages/xarray/backends/api.py", line 261, in load_dataset
    return ds.load()

  File "/Users/lmorton/opt/anaconda3/lib/python3.7/site-packages/xarray/core/dataset.py", line 659, in load
    v.load()

  File "/Users/lmorton/opt/anaconda3/lib/python3.7/site-packages/xarray/core/variable.py", line 375, in load
    self._data = np.asarray(self._data)

  File "/Users/lmorton/opt/anaconda3/lib/python3.7/site-packages/numpy/core/_asarray.py", line 85, in asarray
    return array(a, dtype, copy=False, order=order)

  File "/Users/lmorton/opt/anaconda3/lib/python3.7/site-packages/xarray/core/indexing.py", line 677, in __array__
    self._ensure_cached()

  File "/Users/lmorton/opt/anaconda3/lib/python3.7/site-packages/xarray/core/indexing.py", line 674, in _ensure_cached
    self.array = NumpyIndexingAdapter(np.asarray(self.array))

  File "/Users/lmorton/opt/anaconda3/lib/python3.7/site-packages/numpy/core/_asarray.py", line 85, in asarray
    return array(a, dtype, copy=False, order=order)

  File "/Users/lmorton/opt/anaconda3/lib/python3.7/site-packages/xarray/core/indexing.py", line 653, in __array__
    return np.asarray(self.array, dtype=dtype)

  File "/Users/lmorton/opt/anaconda3/lib/python3.7/site-packages/numpy/core/_asarray.py", line 85, in asarray
    return array(a, dtype, copy=False, order=order)

  File "/Users/lmorton/opt/anaconda3/lib/python3.7/site-packages/xarray/core/indexing.py", line 557, in __array__
    return np.asarray(array[self.key], dtype=None)

  File "/Users/lmorton/opt/anaconda3/lib/python3.7/site-packages/xarray/backends/netCDF4_.py", line 73, in __getitem__
    key, self.shape, indexing.IndexingSupport.OUTER, self._getitem

  File "/Users/lmorton/opt/anaconda3/lib/python3.7/site-packages/xarray/core/indexing.py", line 837, in explicit_indexing_adapter
    result = raw_indexing_method(raw_key.tuple)

  File "/Users/lmorton/opt/anaconda3/lib/python3.7/site-packages/xarray/backends/netCDF4_.py", line 85, in _getitem
    array = getitem(original_array, key)

  File "netCDF4/_netCDF4.pyx", line 4408, in netCDF4._netCDF4.Variable.__getitem__

  File "netCDF4/_netCDF4.pyx", line 5384, in netCDF4._netCDF4.Variable._get

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc1 in position 9: invalid start byte

Versions

Output of xr.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.6 (default, Jan 8 2020, 13:42:34)
[Clang 4.0.1 (tags/RELEASE_401/final)]
python-bits: 64
OS: Darwin
OS-release: 19.4.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.4
libnetcdf: 4.7.3

xarray: 0.15.0
pandas: 1.0.1
numpy: 1.18.1
scipy: 1.4.1
netCDF4: 1.5.3
pydap: None
h5netcdf: None
h5py: 2.10.0
Nio: None
zarr: None
cftime: 1.0.4.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.3.2
dask: 2.11.0
distributed: 2.11.0
matplotlib: 3.1.3
cartopy: None
seaborn: 0.10.0
numbagg: None
setuptools: 46.0.0.post20200309
pip: 20.0.2
conda: 4.8.3
pytest: 5.3.5
IPython: 7.12.0
sphinx: 2.4.0

@kmuehlbauer
Contributor

@lamorton You can look at this in two different ways.
First, try using dtype="|S9" in both places (the dataset and the NumPy array). The extra byte leaves room for the zero-termination that netcdf needs in order to interpret the string correctly (IIRC).

import h5py
import numpy as np
import xarray as xr

f = h5py.File("error_demo.h5", mode='w')
f.create_dataset('name', shape=(), dtype="|S9", data=np.array([b'f(Pt,TE)'], dtype='|S9'))
f.close()
for i in range(20):
    xr.load_dataset("error_demo.h5")
<xarray.Dataset>
Dimensions:  ()
Data variables:
    name     <U8 'f(Pt,TE)'
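
To picture why the extra byte might matter, here is a rough sketch in plain Python. It is only a model of the suspected mechanism, not what netCDF4 actually does internally: it assumes the reader scans for a zero byte the way a C string read would, and that arbitrary bytes sit right after the field in memory. `read_c_string` and the trailing bytes `A\xc1` are made up for illustration.

```python
# Illustrative model only: a fixed-width string field followed by whatever
# bytes happen to sit next to it in memory. `read_c_string` is a hypothetical
# helper, not a netCDF4 or HDF5 function.

def read_c_string(memory: bytes, start: int = 0) -> bytes:
    """Return the bytes from `start` up to (not including) the first NUL."""
    end = memory.index(b"\x00", start)
    return memory[start:end]

value = b"f(Pt,TE)"            # exactly 8 bytes
neighbours = b"A\xc1\x00"      # made-up bytes that follow the field in memory

# |S9: the field has one spare byte, stored as NUL padding, so the read stops there.
s9_field = value.ljust(9, b"\x00")
s9_text = read_c_string(s9_field + neighbours).decode("utf-8")
print(s9_text)

# |S8: no NUL inside the field, so the read runs on into the neighbouring bytes.
s8_field = value.ljust(8, b"\x00")
s8_error = None
try:
    read_c_string(s8_field + neighbours).decode("utf-8")
except UnicodeDecodeError as err:
    s8_error = err
    print(err)
```

If this picture is right, it would also explain the nondeterminism: whether the |S8 read succeeds depends on what the neighbouring bytes happen to be on a given run.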

Second, use engine='h5netcdf' to read the file, sidestepping netcdf:

import h5py
import numpy as np
import xarray as xr

f = h5py.File("error_demo.h5", mode='w')
f.create_dataset('name', shape=(), dtype="|S8", data=np.array([b'f(Pt,TE)'], dtype='|S8'))
f.close()
for i in range(20):
    ds = xr.load_dataset("error_demo.h5", engine='h5netcdf')
<xarray.Dataset>
Dimensions:  ()
Data variables:
    name     |S8 b'f(Pt,TE)'
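
With h5netcdf the value comes back as bytes (|S8) rather than as str. A minimal sketch of decoding it by hand, assuming the stored bytes are valid UTF-8. The array below is a stand-in for `ds["name"].values`, built directly with NumPy so the snippet runs without the file:

```python
import numpy as np

# Stand-in for ds["name"].values from the h5netcdf read (0-D array, dtype |S8).
raw = np.array(b"f(Pt,TE)", dtype="|S8")

# .item() pulls the Python bytes object out of the 0-D array;
# .decode() turns it into str. Assumes the stored bytes are valid UTF-8.
text = raw.item().decode("utf-8")

# Wrap it back into a 0-D array if a NumPy value is needed.
decoded = np.array(text)
print(decoded.dtype.kind, decoded.shape, text)
```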

Judging from your expected output, it seems that the zero-padding is the culprit. Note that the output of the 'h5netcdf' backend doesn't quite match what you expected either: it returns bytes (|S8) rather than a decoded string (<U8).
Unfortunately I don't have links at hand with further comments on the string handling differences between netcdf and hdf5. There are quite a few...

@lamorton
Author

Thanks, I'll close this, since it looks like an issue of bad input. I can't use h5netcdf due to conda env nonsense, but I've worked around it by just dropping the 'name' variable during loading.
