Numeric scalar variable attributes (including fill_value, scale_factor, add_offset) are 1-d instead of 0-d with h5netcdf engine, triggering ValueError: non-broadcastable output on application when loading single elements #4471

Closed
gerritholl opened this issue Sep 29, 2020 · 13 comments · Fixed by #4485

Comments

@gerritholl (Contributor) commented Sep 29, 2020

What happened:

When I try to open a NetCDF file using the h5netcdf engine, accessing a single data point before scale factors have been applied fails with ValueError: non-broadcastable output operand with shape () doesn't match the broadcast shape (1,). The MCVE (see below) produces:

Traceback (most recent call last):
  File "mwe93.py", line 4, in <module>
    ds["Rad"][400, 300].load()
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/xarray/core/dataarray.py", line 808, in load
    ds = self._to_temp_dataset().load(**kwargs)
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/xarray/core/dataset.py", line 662, in load
    v.load()
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/xarray/core/variable.py", line 439, in load
    self._data = np.asarray(self._data)
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/numpy/core/_asarray.py", line 83, in asarray
    return array(a, dtype, copy=False, order=order)
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/xarray/core/indexing.py", line 685, in __array__
    self._ensure_cached()
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/xarray/core/indexing.py", line 682, in _ensure_cached
    self.array = NumpyIndexingAdapter(np.asarray(self.array))
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/numpy/core/_asarray.py", line 83, in asarray
    return array(a, dtype, copy=False, order=order)
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/xarray/core/indexing.py", line 655, in __array__
    return np.asarray(self.array, dtype=dtype)
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/numpy/core/_asarray.py", line 83, in asarray
    return array(a, dtype, copy=False, order=order)
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/xarray/core/indexing.py", line 560, in __array__
    return np.asarray(array[self.key], dtype=None)
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/numpy/core/_asarray.py", line 83, in asarray
    return array(a, dtype, copy=False, order=order)
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/xarray/coding/variables.py", line 70, in __array__
    return self.func(self.array)
  File "/data/gholl/miniconda3/envs/py38b/lib/python3.8/site-packages/xarray/coding/variables.py", line 220, in _scale_offset_decoding
    data *= scale_factor
ValueError: non-broadcastable output operand with shape () doesn't match the broadcast shape (1,)

What you expected to happen:

I expect data access to work the same as when opening the file with other engines.

Minimal Complete Verifiable Example:

import xarray
fn = "/data/gholl/cache/fogtools/abi/2017/03/14/20/06/7/OR_ABI-L1b-RadF-M3C07_G16_s20170732006100_e20170732016478_c20170732016514.nc"
with xarray.open_dataset(fn, engine="h5netcdf") as ds:
    ds["Rad"][400, 300].load()

Anything else we need to know?:

An earlier version of this issue, and some comments, refer to fsspec or working on open files, but that proved to have nothing to do with the problem.

Environment:

I've confirmed this issue with xarray installed from latest master, i.e. xarray 0.16.2.dev11+gf821fe20 at the time of writing.

Output of xr.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.8.5 | packaged by conda-forge | (default, Sep 24 2020, 16:55:52)
[GCC 7.5.0]
python-bits: 64
OS: Linux
OS-release: 4.12.14-lp150.12.82-default
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8
libhdf5: 1.10.6
libnetcdf: 4.7.4

xarray: 0.16.2.dev11+gf821fe20
pandas: 1.1.2
numpy: 1.19.1
scipy: 1.5.2
netCDF4: 1.5.4
pydap: None
h5netcdf: 0.8.1
h5py: 2.10.0
Nio: None
zarr: 2.4.0
cftime: 1.2.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.1.6
cfgrib: None
iris: None
bottleneck: None
dask: 2.27.0
distributed: 2.27.0
matplotlib: 3.3.2
cartopy: None
seaborn: None
numbagg: None
pint: None
setuptools: 49.6.0.post20200917
pip: 20.2.3
conda: None
pytest: 6.0.2
IPython: None
sphinx: None

@djhoese (Contributor) commented Sep 29, 2020

Just tested this with decode_cf=False and the rest of the loading process seems fine (note: I used engine='h5netcdf' too).

@gerritholl (Contributor, Author) commented:

I just tested this with some more combinations:

  • decode_cf=True, mask_and_scale=False, everything seems fine.
  • decode_cf=False, mask_and_scale=True, everything seems fine.
  • decode_cf=True, mask_and_scale=True results in the ValueError and associated traceback.
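
A quick sweep over those combinations (a sketch; fn is the GOES-16 file from the MCVE above):

import itertools

import xarray

fn = "OR_ABI-L1b-RadF-M3C07_G16_s20170732006100_e20170732016478_c20170732016514.nc"

# Only decode_cf=True together with mask_and_scale=True should fail.
for decode_cf, mask_and_scale in itertools.product([True, False], repeat=2):
    try:
        with xarray.open_dataset(
            fn,
            engine="h5netcdf",
            decode_cf=decode_cf,
            mask_and_scale=mask_and_scale,
        ) as ds:
            ds["Rad"][400, 300].load()
        print(decode_cf, mask_and_scale, "ok")
    except ValueError as exc:
        print(decode_cf, mask_and_scale, "ValueError:", exc)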

@gerritholl (Contributor, Author) commented:

Probably related: when reading an open file through a file system instance, the _FillValue, scale_factor, and add_offset are arrays of length one. When opening by passing a filename, those are all scalar (as expected):

import xarray
from fsspec.implementations.local import LocalFileSystem
fn = "/data/gholl/cache/fogtools/abi/2017/03/14/20/06/7/OR_ABI-L1b-RadF-M3C07_G16_s20170732006100_e20170732016478_c20170732016514.nc"
ds1 = xarray.open_dataset(fn, decode_cf=True, mask_and_scale=False)
print(ds1["esun"].attrs["_FillValue"])
print(ds1["Rad"].attrs["scale_factor"])
with LocalFileSystem().open(fn) as of:
    ds2 = xarray.open_dataset(of, decode_cf=True, mask_and_scale=False)
    print(ds2["esun"].attrs["_FillValue"])
    print(ds2["Rad"].attrs["scale_factor"])

Result:

-999.0
0.001564351
[-999.]
[0.00156435]

I strongly suspect that this is what causes the ValueError, and in any case it also causes downstream problems even if opening succeeds as per the previous comment.

@gerritholl changed the title from “Reading from fsspec s3 results in ValueError: non-broadcastable output” to “Reading from open file with fsspec.open gives 1-d fill_value, scale_factor, and add_offset, triggering ValueError: non-broadcastable output on application” on Oct 1, 2020
@gerritholl (Contributor, Author) commented:

Some further digging shows it's due to differences between the h5netcdf and netcdf4 backends:

import xarray
fn = "/data/gholl/cache/fogtools/abi/2017/03/14/20/06/7/OR_ABI-L1b-RadF-M3C07_G16_s20170732006100_e20170732016478_c20170732016514.nc"
with xarray.open_dataset(fn, decode_cf=False, mask_and_scale=False, engine="netcdf4") as ds:
    print(ds["esun"].attrs["_FillValue"])
    print(ds["Rad"].attrs["scale_factor"])
with xarray.open_dataset(fn, decode_cf=False, mask_and_scale=False, engine="h5netcdf") as ds:
    print(ds["esun"].attrs["_FillValue"])
    print(ds["Rad"].attrs["scale_factor"])

Results in:

-999.0
0.001564351
[-999.]
[0.00156435]

@gerritholl changed the title from “Reading from open file with fsspec.open gives 1-d fill_value, scale_factor, and add_offset, triggering ValueError: non-broadcastable output on application” to “Numeric scalar variable attributes (including fill_value, scale_factor, add_offset) are 1-d with h5netcdf backend, 0-d with netcdf4 backend, which may trigger ValueError: non-broadcastable output on application” on Oct 1, 2020
@gerritholl (Contributor, Author) commented:

However, a simple `xarray.open_dataset(fn, engine="h5netcdf")` still fails with ValueError only if passed an open file, so there appear to be other backend-dependent differences besides the dimensionality of the variable attributes.

@gerritholl (Contributor, Author) commented Oct 2, 2020

My last comment was inaccurate. Although the open succeeds, the non-scalar scale factor does trigger a failure upon accessing the data (due to lazy loading), even without any open file:

import xarray
fn = "OR_ABI-L1b-RadF-M3C07_G16_s20170732006100_e20170732016478_c20170732016514.nc"
with xarray.open_dataset(fn, engine="h5netcdf") as ds:
    print(ds["Rad"][400, 300])

The data file is publicly available at:

s3://noaa-goes16/ABI-L1b-RadF/2017/073/20/OR_ABI-L1b-RadF-M3C07_G16_s20170732006100_e20170732016478_c20170732016514.nc
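
For anyone reproducing this, one way to fetch the file is with s3fs and anonymous access to the public bucket (a sketch; assumes s3fs is installed):

import s3fs

fs = s3fs.S3FileSystem(anon=True)  # unsigned requests; the bucket is public
fs.get(
    "noaa-goes16/ABI-L1b-RadF/2017/073/20/"
    "OR_ABI-L1b-RadF-M3C07_G16_s20170732006100_e20170732016478_c20170732016514.nc",
    "OR_ABI-L1b-RadF-M3C07_G16_s20170732006100_e20170732016478_c20170732016514.nc",
)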

@gerritholl (Contributor, Author) commented:

Interestingly, the problem is prevented if one adds

ds.load()

before the print statement.
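
In context (a sketch; fn as in the MCVE above):

import xarray

fn = "OR_ABI-L1b-RadF-M3C07_G16_s20170732006100_e20170732016478_c20170732016514.nc"
with xarray.open_dataset(fn, engine="h5netcdf") as ds:
    ds.load()                   # eagerly decode the full n-d variable first...
    print(ds["Rad"][400, 300])  # ...then single-element access works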

@gerritholl (Contributor, Author) commented:

The ds.load() prevents the traceback because the entire n-d data variable is then multiplied by the 1-d scale factor. Similarly, requesting a slice (ds["Rad"][400:402, 300:302]) also avoids the traceback. The traceback occurs only when a single value is requested, because numpy then refuses to multiply a 0-d array in place by a 1-d array: a * b broadcasts both operands to shape (1,), but a *= b has to write the result back into the 0-d a, which cannot hold a shape-(1,) result:

In [7]: a = np.array(0)

In [8]: b = np.array([0])

In [9]: a * b
Out[9]: array([0])

In [10]: a *= b
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-10-0d04f348f081> in <module>
----> 1 a *= b

ValueError: non-broadcastable output operand with shape () doesn't match the broadcast shape (1,)

@gerritholl changed the title from “Numeric scalar variable attributes (including fill_value, scale_factor, add_offset) are 1-d with h5netcdf backend, 0-d with netcdf4 backend, which may trigger ValueError: non-broadcastable output on application” to “Numeric scalar variable attributes (including fill_value, scale_factor, add_offset) are 1-d instead of 0-d with h5netcdf backend, may trigger ValueError: non-broadcastable output on application when loading single elements” on Oct 2, 2020
@gerritholl changed the title from “Numeric scalar variable attributes (including fill_value, scale_factor, add_offset) are 1-d instead of 0-d with h5netcdf backend, may trigger ValueError: non-broadcastable output on application when loading single elements” to “Numeric scalar variable attributes (including fill_value, scale_factor, add_offset) are 1-d instead of 0-d with h5netcdf engine, triggering ValueError: non-broadcastable output on application when loading single elements” on Oct 2, 2020
@gerritholl (Contributor, Author) commented:

According to The NetCDF User's Guide, attributes are supposed to be vectors:

The current version treats all attributes as vectors; scalar values are treated as single-element vectors.

That suggests that, strictly speaking, the h5netcdf engine is right and the netcdf4 engine is wrong, and that other components (such as where the scale factor and add_offset are applied) need to be adapted to handle arrays of length 1 for those values.

@dcherian (Contributor) commented Oct 2, 2020

other components (such as where the scale factor and add_offset are applied) need to be adapted to handle arrays of length 1 for those values.

Great diagnosis, @gerritholl. This could be fixed here (I think):

class CFScaleOffsetCoder(VariableCoder):
    """Scale and offset variables according to CF conventions.

    Follows the formula:
        decode_values = encoded_values * scale_factor + add_offset
    """

    def encode(self, variable, name=None):
        dims, data, attrs, encoding = unpack_for_encoding(variable)
        if "scale_factor" in encoding or "add_offset" in encoding:
            dtype = _choose_float_dtype(data.dtype, "add_offset" in encoding)
            data = data.astype(dtype=dtype, copy=True)
        if "add_offset" in encoding:
            data -= pop_to(encoding, attrs, "add_offset", name=name)
        if "scale_factor" in encoding:
            data /= pop_to(encoding, attrs, "scale_factor", name=name)
        return Variable(dims, data, attrs, encoding)

Are you up for sending in a PR?
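
For context, the error itself surfaces in the decoding counterpart, where the attributes are applied in place. A simplified sketch of the kind of guard the decode path could grow (the real _scale_offset_decoding in xarray/coding/variables.py has more bookkeeping; np.ndim distinguishes true scalars from length-1 vectors):

import numpy as np

def scale_offset_decode(data, scale_factor=None, add_offset=None, dtype=np.float64):
    # Simplified stand-in for xarray's _scale_offset_decoding.
    data = np.array(data, dtype=dtype, copy=True)
    if scale_factor is not None:
        if np.ndim(scale_factor) > 0:
            # Collapse a one-element attribute array to a scalar so the
            # in-place multiplication below also works on 0-d data.
            scale_factor = np.asarray(scale_factor).item()
        data *= scale_factor
    if add_offset is not None:
        if np.ndim(add_offset) > 0:
            add_offset = np.asarray(add_offset).item()
        data += add_offset
    return data

# 0-d data with a length-1 scale factor no longer raises:
print(scale_offset_decode(np.array(100, dtype=np.int16),
                          scale_factor=np.array([0.00156435])))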

@gerritholl (Contributor, Author) commented:

I can try to fix this in a PR; I just need to be sure what the fix should look like: either change the dimensionality of the attributes (which has the potential to break backward compatibility), or adapt other components to handle either scalars or length-1 arrays (the safer alternative, but the issue may occur in more locations both inside and outside xarray, so in that case a note in the documentation could be in order as well). I don't know whether xarray strives for consistency between what the different engines expose when opening the same file.

@dcherian (Contributor) commented Oct 4, 2020

adapt other components to handle either scalars or length 1 arrays

I think we can make this change safely in the decoding machinery. As you point out, it will be backwards compatible.

@shoyer (Member) commented Oct 5, 2020

I agree, xarray's decoding should be robust to whether these attributes are scalars or vectors of length one.

This should probably be considered a bug in h5netcdf, which I guess should adopt the assumption from netCDF4-python that vector attributes of length 1 are scalars.

(h5netcdf can store true scalar attributes in HDF5 files, but it's probably better to be consistent with netCDF)
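
The layering makes the mismatch easy to demonstrate (a sketch, assuming write access to a scratch file demo.nc; netCDF stores scalar attributes as one-element vectors, which netCDF4-python collapses back to scalars while h5py reports the stored shape):

import h5py
import netCDF4
import numpy as np

with netCDF4.Dataset("demo.nc", "w") as nc:  # NETCDF4 format, i.e. HDF5
    nc.createDimension("x", 1)
    var = nc.createVariable("v", "f4", ("x",))
    var.scale_factor = 0.001  # stored as a one-element vector attribute

with netCDF4.Dataset("demo.nc") as nc:
    print(np.ndim(nc["v"].scale_factor))       # 0: collapsed to a scalar

with h5py.File("demo.nc", "r") as f:
    print(f["v"].attrs["scale_factor"].shape)  # (1,): the stored shape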

gerritholl added a commit to gerritholl/xarray that referenced this issue Oct 5, 2020
The h5netcdf engine exposes single-valued attributes as arrays of shape
(1,), which is correct according to the NetCDF standard, but may cause
a problem when reading a value of shape () before the scale_factor and
add_offset have been applied.  This PR adds a check for the dimensionality
of add_offset and scale_factor and ensures they are scalar before they
are used for further processing, adds a unit test to verify that this
works correctly, and a note to the documentation to warn users of this
difference between the h5netcdf and netcdf4 engines.

Fixes pydata#4471.
gerritholl added a commit to gerritholl/xarray that referenced this issue Oct 6, 2020
Add a whats-new entry for the fix to issue pydata#4471, corresponding to PR pydata#4485.
dcherian added a commit that referenced this issue Oct 11, 2020
* Handle scale_factor and add_offset as scalar

The h5netcdf engine exposes single-valued attributes as arrays of shape
(1,), which is correct according to the NetCDF standard, but may cause
a problem when reading a value of shape () before the scale_factor and
add_offset have been applied.  This PR adds a check for the dimensionality
of add_offset and scale_factor and ensures they are scalar before they
are used for further processing, adds a unit test to verify that this
works correctly, and a note to the documentation to warn users of this
difference between the h5netcdf and netcdf4 engines.

Fixes #4471.

* DOC: Add whats-new entry for fixing 4471

Add a whats-new entry for the fix to issue #4471, corresponding to PR #4485.

* Update doc/io.rst

Co-authored-by: Mathias Hauser <mathause@users.noreply.github.com>

Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
Co-authored-by: Mathias Hauser <mathause@users.noreply.github.com>