Numeric scalar variable attributes (including fill_value, scale_factor, add_offset) are 1-d instead of 0-d with h5netcdf engine, triggering ValueError: non-broadcastable output on application when loading single elements #4471

Closed
gerritholl opened this issue Sep 29, 2020 · 13 comments · Fixed by #4485

Comments

@gerritholl (Contributor) commented Sep 29, 2020

What happened:

When I try to open a NetCDF file using the h5netcdf engine, accessing a single data point before scale factors have been applied fails with ValueError: non-broadcastable output operand with shape () doesn't match the broadcast shape (1,). The MCVE (see below) produces:

Traceback (most recent call last):
  File "mwe93.py", line 4, in <module>
    ds["Rad"][400, 300].load()
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/xarray/core/dataarray.py", line 808, in load
    ds = self._to_temp_dataset().load(**kwargs)
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/xarray/core/dataset.py", line 662, in load
    v.load()
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/xarray/core/variable.py", line 439, in load
    self._data = np.asarray(self._data)
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/numpy/core/_asarray.py", line 83, in asarray
    return array(a, dtype, copy=False, order=order)
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/xarray/core/indexing.py", line 685, in __array__
    self._ensure_cached()
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/xarray/core/indexing.py", line 682, in _ensure_cached
    self.array = NumpyIndexingAdapter(np.asarray(self.array))
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/numpy/core/_asarray.py", line 83, in asarray
    return array(a, dtype, copy=False, order=order)
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/xarray/core/indexing.py", line 655, in __array__
    return np.asarray(self.array, dtype=dtype)
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/numpy/core/_asarray.py", line 83, in asarray
    return array(a, dtype, copy=False, order=order)
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/xarray/core/indexing.py", line 560, in __array__
    return np.asarray(array[self.key], dtype=None)
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/numpy/core/_asarray.py", line 83, in asarray
    return array(a, dtype, copy=False, order=order)
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/xarray/coding/variables.py", line 70, in __array__
    return self.func(self.array)
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/xarray/coding/variables.py", line 220, in _scale_offset_decoding
    data *= scale_factor
ValueError: non-broadcastable output operand with shape () doesn't match the broadcast shape (1,)

What you expected to happen:

I expect data access to work the same as when opening the file with other engines.

Minimal Complete Verifiable Example:

import xarray
fn = "/data/gholl/cache/fogtools/abi/2017/03/14/20/06/7/OR_ABI-L1b-RadF-M3C07_G16_s20170732006100_e20170732016478_c20170732016514.nc"
with xarray.open_dataset(fn, engine="h5netcdf") as ds:
    ds["Rad"][400, 300].load()

Anything else we need to know?:

An earlier version of this issue, and some comments, refer to fsspec or working on open files, but that proved to have nothing to do with the problem.

Environment:

I've confirmed this issue with xarray installed from latest master, i.e. xarray 0.16.2.dev11+gf821fe20 at the time of writing.

Output of xr.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.8.5 | packaged by conda-forge | (default, Sep 24 2020, 16:55:52)
[GCC 7.5.0]
python-bits: 64
OS: Linux
OS-release: 4.12.14-lp150.12.82-default
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8
libhdf5: 1.10.6
libnetcdf: 4.7.4

xarray: 0.16.2.dev11+gf821fe20
pandas: 1.1.2
numpy: 1.19.1
scipy: 1.5.2
netCDF4: 1.5.4
pydap: None
h5netcdf: 0.8.1
h5py: 2.10.0
Nio: None
zarr: 2.4.0
cftime: 1.2.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.1.6
cfgrib: None
iris: None
bottleneck: None
dask: 2.27.0
distributed: 2.27.0
matplotlib: 3.3.2
cartopy: None
seaborn: None
numbagg: None
pint: None
setuptools: 49.6.0.post20200917
pip: 20.2.3
conda: None
pytest: 6.0.2
IPython: None
sphinx: None

@djhoese (Contributor) commented Sep 29, 2020

Just tested this with decode_cf=False and the rest of the loading process seems fine (note: I used engine='h5netcdf' too).

@gerritholl (Contributor, Author) commented:

I just tested this with some more combinations:

  • decode_cf=True, mask_and_scale=False, everything seems fine.
  • decode_cf=False, mask_and_scale=True, everything seems fine.
  • decode_cf=True, mask_and_scale=True results in the ValueError and associated traceback.
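
A quick sweep over those combinations (a sketch; fn is the GOES-16 file from the MCVE above):

import itertools

import xarray

fn = "OR_ABI-L1b-RadF-M3C07_G16_s20170732006100_e20170732016478_c20170732016514.nc"

# Only decode_cf=True together with mask_and_scale=True should fail.
for decode_cf, mask_and_scale in itertools.product([True, False], repeat=2):
    try:
        with xarray.open_dataset(
            fn,
            engine="h5netcdf",
            decode_cf=decode_cf,
            mask_and_scale=mask_and_scale,
        ) as ds:
            ds["Rad"][400, 300].load()
        print(decode_cf, mask_and_scale, "ok")
    except ValueError as exc:
        print(decode_cf, mask_and_scale, "ValueError:", exc)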

@gerritholl (Contributor, Author) commented:

Probably related: when reading an open file through a file system instance, the _FillValue, scale_factor, and add_offset are arrays of length one. When opening by passing a filename, those are all scalar (as expected):

import xarray
from fsspec.implementations.local import LocalFileSystem
fn = "/data/gholl/cache/fogtools/abi/2017/03/14/20/06/7/OR_ABI-L1b-RadF-M3C07_G16_s20170732006100_e20170732016478_c20170732016514.nc"
ds1 = xarray.open_dataset(fn, decode_cf=True, mask_and_scale=False)
print(ds1["esun"].attrs["_FillValue"])
print(ds1["Rad"].attrs["scale_factor"])
with LocalFileSystem().open(fn) as of:
    ds2 = xarray.open_dataset(of, decode_cf=True, mask_and_scale=False)
    print(ds2["esun"].attrs["_FillValue"])
    print(ds2["Rad"].attrs["scale_factor"])

Result:

-999.0
0.001564351
[-999.]
[0.00156435]

I strongly suspect that this is what causes the ValueError, and in any case it also causes downstream problems even if opening succeeds as per the previous comment.

@gerritholl changed the title from “Reading from fsspec s3 results in ValueError: non-broadcastable output” to “Reading from open file with fsspec.open gives 1-d fill_value, scale_factor, and add_offset, triggering ValueError: non-broadcastable output on application” on Oct 1, 2020
@gerritholl (Contributor, Author) commented:

Some further digging shows it's due to differences between the h5netcdf and netcdf4 backends:

import xarray
fn = "/data/gholl/cache/fogtools/abi/2017/03/14/20/06/7/OR_ABI-L1b-RadF-M3C07_G16_s20170732006100_e20170732016478_c20170732016514.nc"
with xarray.open_dataset(fn, decode_cf=False, mask_and_scale=False, engine="netcdf4") as ds:
    print(ds["esun"].attrs["_FillValue"])
    print(ds["Rad"].attrs["scale_factor"])
with xarray.open_dataset(fn, decode_cf=False, mask_and_scale=False, engine="h5netcdf") as ds:
    print(ds["esun"].attrs["_FillValue"])
    print(ds["Rad"].attrs["scale_factor"])

Results in:

-999.0
0.001564351
[-999.]
[0.00156435]

@gerritholl changed the title from “Reading from open file with fsspec.open gives 1-d fill_value, scale_factor, and add_offset, triggering ValueError: non-broadcastable output on application” to “Numeric scalar variable attributes (including fill_value, scale_factor, add_offset) are 1-d with h5netcdf backend, 0-d with netcdf4 backend, which may trigger ValueError: non-broadcastable output on application” on Oct 1, 2020
@gerritholl (Contributor, Author) commented:

However, a simple `xarray.open_dataset(fn, engine="h5netcdf")` still fails with ValueError only if passed an open file, so there appear to be other backend-dependent differences besides the dimensionality of the variable attributes.

@gerritholl (Contributor, Author) commented Oct 2, 2020

My last comment was inaccurate. Although the open succeeds, the non-scalar scale factor does trigger a failure upon accessing the data (due to lazy loading), even without any open file:

import xarray
fn = "OR_ABI-L1b-RadF-M3C07_G16_s20170732006100_e20170732016478_c20170732016514.nc"
with xarray.open_dataset(fn, engine="h5netcdf") as ds:
    print(ds["Rad"][400, 300])

The data file is publicly available at:

s3://noaa-goes16/ABI-L1b-RadF/2017/073/20/OR_ABI-L1b-RadF-M3C07_G16_s20170732006100_e20170732016478_c20170732016514.nc
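
For anyone reproducing this, one way to fetch the file is with s3fs and anonymous access to the public bucket (a sketch; assumes s3fs is installed):

import s3fs

fs = s3fs.S3FileSystem(anon=True)  # unsigned requests; the bucket is public
fs.get(
    "noaa-goes16/ABI-L1b-RadF/2017/073/20/"
    "OR_ABI-L1b-RadF-M3C07_G16_s20170732006100_e20170732016478_c20170732016514.nc",
    "OR_ABI-L1b-RadF-M3C07_G16_s20170732006100_e20170732016478_c20170732016514.nc",
)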

@gerritholl (Contributor, Author) commented:

Interestingly, the problem is prevented if one adds

ds.load()

before the print statement.
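
In context (a sketch; fn as in the MCVE above):

import xarray

fn = "OR_ABI-L1b-RadF-M3C07_G16_s20170732006100_e20170732016478_c20170732016514.nc"
with xarray.open_dataset(fn, engine="h5netcdf") as ds:
    ds.load()                   # eagerly decode the full n-d variable first...
    print(ds["Rad"][400, 300])  # ...then single-element access works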

@gerritholl (Contributor, Author) commented:

The ds.load() prevents the traceback because the entire n-d data variable is then multiplied by the 1-d scale factor. Similarly, requesting a slice (ds["Rad"][400:402, 300:302]) also avoids the traceback. The traceback occurs only when a single value is requested, because numpy then refuses to multiply a 0-d array in place by a 1-d array: a * b broadcasts both operands to shape (1,), but a *= b has to write the result back into the 0-d a, which cannot hold a shape-(1,) result:

In [7]: a = np.array(0)

In [8]: b = np.array([0])

In [9]: a * b
Out[9]: array([0])

In [10]: a *= b
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-10-0d04f348f081> in <module>
----> 1 a *= b

ValueError: non-broadcastable output operand with shape () doesn't match the broadcast shape (1,)

@gerritholl changed the title from “Numeric scalar variable attributes (including fill_value, scale_factor, add_offset) are 1-d with h5netcdf backend, 0-d with netcdf4 backend, which may trigger ValueError: non-broadcastable output on application” to “Numeric scalar variable attributes (including fill_value, scale_factor, add_offset) are 1-d instead of 0-d with h5netcdf backend, may trigger ValueError: non-broadcastable output on application when loading single elements” on Oct 2, 2020
@gerritholl changed the title from “Numeric scalar variable attributes (including fill_value, scale_factor, add_offset) are 1-d instead of 0-d with h5netcdf backend, may trigger ValueError: non-broadcastable output on application when loading single elements” to “Numeric scalar variable attributes (including fill_value, scale_factor, add_offset) are 1-d instead of 0-d with h5netcdf engine, triggering ValueError: non-broadcastable output on application when loading single elements” on Oct 2, 2020
@gerritholl (Contributor, Author) commented:

According to The NetCDF User's Guide, attributes are supposed to be vectors:

The current version treats all attributes as vectors; scalar values are treated as single-element vectors.

That suggests that, strictly speaking, the h5netcdf engine is right and the netcdf4 engine is wrong, and that other components (such as where the scale factor and add_offset are applied) need to be adapted to handle arrays of length 1 for those values.

@dcherian (Contributor) commented Oct 2, 2020

other components (such as where the scale factor and add_offset are applied) need to be adapted to handle arrays of length 1 for those values.

Great diagnosis, @gerritholl. This could be fixed here (I think):

class CFScaleOffsetCoder(VariableCoder):
    """Scale and offset variables according to CF conventions.

    Follows the formula:
        decode_values = encoded_values * scale_factor + add_offset
    """

    def encode(self, variable, name=None):
        dims, data, attrs, encoding = unpack_for_encoding(variable)
        if "scale_factor" in encoding or "add_offset" in encoding:
            dtype = _choose_float_dtype(data.dtype, "add_offset" in encoding)
            data = data.astype(dtype=dtype, copy=True)
        if "add_offset" in encoding:
            data -= pop_to(encoding, attrs, "add_offset", name=name)
        if "scale_factor" in encoding:
            data /= pop_to(encoding, attrs, "scale_factor", name=name)
        return Variable(dims, data, attrs, encoding)

Are you up for sending in a PR?
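
For context, the error itself surfaces in the decoding counterpart, where the attributes are applied in place. A simplified sketch of the kind of guard the decode path could grow (the real _scale_offset_decoding in xarray/coding/variables.py has more bookkeeping; np.ndim distinguishes true scalars from length-1 vectors):

import numpy as np

def scale_offset_decode(data, scale_factor=None, add_offset=None, dtype=np.float64):
    # Simplified stand-in for xarray's _scale_offset_decoding.
    data = np.array(data, dtype=dtype, copy=True)
    if scale_factor is not None:
        if np.ndim(scale_factor) > 0:
            # Collapse a one-element attribute array to a scalar so the
            # in-place multiplication below also works on 0-d data.
            scale_factor = np.asarray(scale_factor).item()
        data *= scale_factor
    if add_offset is not None:
        if np.ndim(add_offset) > 0:
            add_offset = np.asarray(add_offset).item()
        data += add_offset
    return data

# 0-d data with a length-1 scale factor no longer raises:
print(scale_offset_decode(np.array(100, dtype=np.int16),
                          scale_factor=np.array([0.00156435])))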

@gerritholl (Contributor, Author) commented:

I can try to fix this in a PR; I just need to be sure what the fix should look like: either change the dimensionality of the attributes (which has the potential to break backward compatibility), or adapt other components to handle either scalars or length-1 arrays (the safer alternative, but the issue may occur in more locations both inside and outside xarray, so in that case a note in the documentation could be in order as well). I don't know whether xarray strives for consistency between what the different engines expose when opening the same file.

@dcherian (Contributor) commented Oct 4, 2020

adapt other components to handle either scalars or length 1 arrays

I think we can make this change safely in the decoding machinery. As you point out, it will be backwards compatible.

@shoyer (Member) commented Oct 5, 2020

I agree, xarray's decoding should be robust to whether these attributes are scalars or vectors of length one.

This should probably be considered a bug in h5netcdf, which I guess should adopt the assumption from netCDF4-python that vector attributes of length 1 are scalars.

(h5netcdf can store true scalar attributes in HDF5 files, but it's probably better to be consistent with netCDF)
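
The layering makes the mismatch easy to demonstrate (a sketch, assuming write access to a scratch file demo.nc; netCDF stores scalar attributes as one-element vectors, which netCDF4-python collapses back to scalars while h5py reports the stored shape):

import h5py
import netCDF4
import numpy as np

with netCDF4.Dataset("demo.nc", "w") as nc:  # NETCDF4 format, i.e. HDF5
    nc.createDimension("x", 1)
    var = nc.createVariable("v", "f4", ("x",))
    var.scale_factor = 0.001  # stored as a one-element vector attribute

with netCDF4.Dataset("demo.nc") as nc:
    print(np.ndim(nc["v"].scale_factor))       # 0: collapsed to a scalar

with h5py.File("demo.nc", "r") as f:
    print(f["v"].attrs["scale_factor"].shape)  # (1,): the stored shape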

gerritholl added a commit to gerritholl/xarray that referenced this issue Oct 5, 2020
The h5netcdf engine exposes single-valued attributes as arrays of shape
(1,), which is correct according to the NetCDF standard, but may cause
a problem when reading a value of shape () before the scale_factor and
add_offset have been applied.  This PR adds a check for the dimensionality
of add_offset and scale_factor and ensures they are scalar before they
are used for further processing, adds a unit test to verify that this
works correctly, and a note to the documentation to warn users of this
difference between the h5netcdf and netcdf4 engines.

Fixes pydata#4471.
gerritholl added a commit to gerritholl/xarray that referenced this issue Oct 6, 2020
Add a whats-new entry for the fix to issue pydata#4471, corresponding to PR pydata#4485.
dcherian added a commit that referenced this issue Oct 11, 2020
* Handle scale_factor and add_offset as scalar

The h5netcdf engine exposes single-valued attributes as arrays of shape
(1,), which is correct according to the NetCDF standard, but may cause
a problem when reading a value of shape () before the scale_factor and
add_offset have been applied.  This PR adds a check for the dimensionality
of add_offset and scale_factor and ensures they are scalar before they
are used for further processing, adds a unit test to verify that this
works correctly, and a note to the documentation to warn users of this
difference between the h5netcdf and netcdf4 engines.

Fixes #4471.

* DOC: Add whats-new entry for fixing 4471

Add a whats-new entry for the fix to issue #4471, corresponding to PR #4485.

* Update doc/io.rst

Co-authored-by: Mathias Hauser <mathause@users.noreply.github.com>

Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
Co-authored-by: Mathias Hauser <mathause@users.noreply.github.com>