Random crashes in netcdf when dask client has multiple threads #7990

Closed
4 tasks done
yt87 opened this issue Jul 16, 2023 · 1 comment
Labels
bug, needs triage (issue that has not been reviewed by an xarray team member)

Comments

yt87 commented Jul 16, 2023

What happened?

The data files can be found here: https://noaadata.apps.nsidc.org/NOAA/G02202_V4/north/monthly/. The example code below crashes randomly; the file being processed when the crash occurs differs between runs. This happens only when threads_per_worker is > 1 in the Client() call. n_workers does not seem to matter; at least, I could not make it crash by changing n_workers alone. The traceback points to HDF5.
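
Since the crash was only seen with threads_per_worker > 1, one thing to try is keeping the same total parallelism but spreading it over single-threaded worker processes. The following is a minimal sketch under that assumption, not a confirmed fix; `single_threaded_layout` and `make_client` are hypothetical helper names, not part of dask.

```python
def single_threaded_layout(total_threads: int) -> dict:
    """Return Client keyword arguments giving `total_threads` one-thread workers.

    Hypothetical helper: it trades threads for processes so that each worker
    process runs only one task at a time.
    """
    if total_threads < 1:
        raise ValueError("total_threads must be >= 1")
    return {"n_workers": total_threads, "threads_per_worker": 1}


def make_client(total_threads: int = 4):
    """Start a local dask.distributed Client using the layout above."""
    # Imported here so the helper above can be used without dask installed.
    from dask.distributed import Client

    return Client(**single_threaded_layout(total_threads))
```

Calling `make_client(4)` in place of `Client(n_workers=1, threads_per_worker=4)` would then start four worker processes with one thread each.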

What did you expect to happen?

No response

Minimal Complete Verifiable Example

from pathlib import Path

import pandas as pd
from dask.distributed import Client

import xarray as xr

client = Client(n_workers=1, threads_per_worker=4)

DATADIR = Path("/mnt/sdc1/icec/NSIDC")
year = 2020

times = pd.date_range(f"{year}-01-01", f"{year}-12-01", freq="MS", name="time")
paths = [
    DATADIR / "monthly" / f"seaice_conc_monthly_nh_{t.strftime('%Y%m')}_f17_v04r00.nc"
    for t in times
]
for n in range(10):
    ds = xr.open_mfdataset(
        paths,
        combine="nested",
        concat_dim="tdim",
        parallel=True,
        engine="netcdf4",
    )
    del ds

HDF5-DIAG: Error detected in HDF5 (1.14.0) thread 0:
  #000: H5G.c line 442 in H5Gopen2(): unable to synchronously open group
    major: Symbol table
    minor: Unable to create file
  #001: H5G.c line 399 in H5G__open_api_common(): can't set object access arguments
    major: Symbol table
    minor: Can't set value
  #002: H5VLint.c line 2669 in H5VL_setup_acc_args(): invalid location identifier
    major: Invalid arguments to routine
    minor: Inappropriate type
  #003: H5VLint.c line 1787 in H5VL_vol_object(): invalid identifier type to function
    major: Invalid arguments to routine
    minor: Inappropriate type
HDF5-DIAG: Error detected in HDF5 (1.14.0) thread 0:
  #000: H5G.c line 887 in H5Gclose(): not a group ID
    major: Invalid arguments to routine
    minor: Inappropriate type
2023-07-16 00:35:47,833 - distributed.worker - WARNING - Compute Failed
Key:       open_dataset-09a155bb-5079-406a-83c4-737933c409c7
Function:  execute_task
args:      ((<function apply at 0x7f0001edf520>, <function open_dataset at 0x7effe3e35c60>, ['/mnt/sdc1/icec/NSIDC/monthly/seaice_conc_monthly_nh_202001_f17_v04r00.nc'], (<class 'dict'>, [['engine', 'netcdf4'], ['chunks', (<class 'dict'>, [])]])))
kwargs:    {}
Exception: "OSError(-101, 'NetCDF: HDF error')"

2023-07-16 00:35:47,834 - distributed.worker - WARNING - Compute Failed
Key:       open_dataset-14e239f4-7e16-4891-a350-b55979d4a754
Function:  execute_task
args:      ((<function apply at 0x7f0001edf520>, <function open_dataset at 0x7effe3e35c60>, ['/mnt/sdc1/icec/NSIDC/monthly/seaice_conc_monthly_nh_202011_f17_v04r00.nc'], (<class 'dict'>, [['engine', 'netcdf4'], ['chunks', (<class 'dict'>, [])]])))
kwargs:    {}
Exception: "OSError(-101, 'NetCDF: HDF error')"

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
Cell In[1], line 19
     14 paths = [
     15     DATADIR / "monthly" / f"seaice_conc_monthly_nh_{t.strftime('%Y%m')}_f17_v04r00.nc"
     16     for t in times
     17 ]
     18 for n in range(10):
---> 19     ds = xr.open_mfdataset(
     20         paths,
     21         combine="nested",
     22         concat_dim="tdim",
     23         parallel=True,
     24         engine="netcdf4",
     25     )
     26     del ds

File ~/mambaforge/envs/icec/lib/python3.10/site-packages/xarray/backends/api.py:1050, in open_mfdataset(paths, chunks, concat_dim, compat, preprocess, engine, data_vars, coords, combine, parallel, join, attrs_file, combine_attrs, **kwargs)
   1045     datasets = [preprocess(ds) for ds in datasets]
   1047 if parallel:
   1048     # calling compute here will return the datasets/file_objs lists,
   1049     # the underlying datasets will still be stored as dask arrays
-> 1050     datasets, closers = dask.compute(datasets, closers)
   1052 # Combine all datasets, closing them in case of a ValueError
   1053 try:

File ~/mambaforge/envs/icec/lib/python3.10/site-packages/xarray/backends/api.py:570, in open_dataset()
    558 decoders = _resolve_decoders_kwargs(
    559     decode_cf,
    560     open_backend_dataset_parameters=backend.open_dataset_parameters,
   (...)
    566     decode_coords=decode_coords,
    567 )
    569 overwrite_encoded_chunks = kwargs.pop("overwrite_encoded_chunks", None)
--> 570 backend_ds = backend.open_dataset(
    571     filename_or_obj,
    572     drop_variables=drop_variables,
    573     **decoders,
    574     **kwargs,
    575 )
    576 ds = _dataset_from_backend_dataset(
    577     backend_ds,
    578     filename_or_obj,
   (...)
    588     **kwargs,
    589 )
    590 return ds

File ~/mambaforge/envs/icec/lib/python3.10/site-packages/xarray/backends/netCDF4_.py:590, in open_dataset()
    569 def open_dataset(  # type: ignore[override]  # allow LSP violation, not supporting **kwargs
    570     self,
    571     filename_or_obj: str | os.PathLike[Any] | BufferedIOBase | AbstractDataStore,
   (...)
    587     autoclose=False,
    588 ) -> Dataset:
    589     filename_or_obj = _normalize_path(filename_or_obj)
--> 590     store = NetCDF4DataStore.open(
    591         filename_or_obj,
    592         mode=mode,
    593         format=format,
    594         group=group,
    595         clobber=clobber,
    596         diskless=diskless,
    597         persist=persist,
    598         lock=lock,
    599         autoclose=autoclose,
    600     )
    602     store_entrypoint = StoreBackendEntrypoint()
    603     with close_on_error(store):

File ~/mambaforge/envs/icec/lib/python3.10/site-packages/xarray/backends/netCDF4_.py:391, in open()
    385 kwargs = dict(
    386     clobber=clobber, diskless=diskless, persist=persist, format=format
    387 )
    388 manager = CachingFileManager(
    389     netCDF4.Dataset, filename, mode=mode, kwargs=kwargs
    390 )
--> 391 return cls(manager, group=group, mode=mode, lock=lock, autoclose=autoclose)

File ~/mambaforge/envs/icec/lib/python3.10/site-packages/xarray/backends/netCDF4_.py:338, in __init__()
    336 self._group = group
    337 self._mode = mode
--> 338 self.format = self.ds.data_model
    339 self._filename = self.ds.filepath()
    340 self.is_remote = is_remote_uri(self._filename)

File ~/mambaforge/envs/icec/lib/python3.10/site-packages/xarray/backends/netCDF4_.py:400, in ds()
    398 @property
    399 def ds(self):
--> 400     return self._acquire()

File ~/mambaforge/envs/icec/lib/python3.10/site-packages/xarray/backends/netCDF4_.py:394, in _acquire()
    393 def _acquire(self, needs_lock=True):
--> 394     with self._manager.acquire_context(needs_lock) as root:
    395         ds = _nc4_require_group(root, self._group, self._mode)
    396     return ds

File ~/mambaforge/envs/icec/lib/python3.10/contextlib.py:135, in __enter__()
    133 del self.args, self.kwds, self.func
    134 try:
--> 135     return next(self.gen)
    136 except StopIteration:
    137     raise RuntimeError("generator didn't yield") from None

File ~/mambaforge/envs/icec/lib/python3.10/site-packages/xarray/backends/file_manager.py:199, in acquire_context()
    196 @contextlib.contextmanager
    197 def acquire_context(self, needs_lock=True):
    198     """Context manager for acquiring a file."""
--> 199     file, cached = self._acquire_with_cache_info(needs_lock)
    200     try:
    201         yield file

File ~/mambaforge/envs/icec/lib/python3.10/site-packages/xarray/backends/file_manager.py:217, in _acquire_with_cache_info()
    215     kwargs = kwargs.copy()
    216     kwargs["mode"] = self._mode
--> 217 file = self._opener(*self._args, **kwargs)
    218 if self._mode == "w":
    219     # ensure file doesn't get overridden when opened again
    220     self._mode = "a"

File src/netCDF4/_netCDF4.pyx:2464, in netCDF4._netCDF4.Dataset.__init__()

File src/netCDF4/_netCDF4.pyx:2027, in netCDF4._netCDF4._ensure_nc_success()

OSError: [Errno -101] NetCDF: HDF error: '/mnt/sdc1/icec/NSIDC/monthly/seaice_conc_monthly_nh_202011_f17_v04r00.nc'

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

No response

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS

commit: None
python: 3.10.12 | packaged by conda-forge | (main, Jun 23 2023, 22:40:32) [GCC 12.3.0]
python-bits: 64
OS: Linux
OS-release: 6.1.38-1-MANJARO
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.14.0
libnetcdf: 4.9.2

xarray: 2023.6.0
pandas: 2.0.3
numpy: 1.24.4
scipy: 1.11.1
netCDF4: 1.6.4
pydap: None
h5netcdf: None
h5py: 3.9.0
Nio: None
zarr: 2.15.0
cftime: 1.6.2
nc_time_axis: None
PseudoNetCDF: None
iris: None
bottleneck: None
dask: 2023.7.0
distributed: 2023.7.0
matplotlib: 3.7.1
cartopy: 0.21.1
seaborn: None
numbagg: None
fsspec: 2023.6.0
cupy: None
pint: None
sparse: 0.14.0
flox: None
numpy_groupies: None
setuptools: 68.0.0
pip: 23.2
conda: None
pytest: None
mypy: None
IPython: 8.14.0
sphinx: None

yt87 added the bug and needs triage labels on Jul 16, 2023

yt87 commented Aug 23, 2023

This seems to be related to #2494 and Unidata/netcdf4-python#844. Unfortunately, the latter is still open. Setting parallel=False works for me. This is not an xarray problem, so I am closing the issue.
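
For reference, the parallel=False workaround applied to the example from the report could look like the sketch below. The file-name pattern and the open_mfdataset arguments are taken from the original example; `monthly_paths` and `open_year` are hypothetical helper names, and the paths are built with the standard library rather than pandas purely for brevity.

```python
from pathlib import Path


def monthly_paths(datadir: Path, year: int) -> list:
    """Build the twelve monthly file paths used in the report for one year."""
    return [
        datadir / "monthly" / f"seaice_conc_monthly_nh_{year}{month:02d}_f17_v04r00.nc"
        for month in range(1, 13)
    ]


def open_year(datadir: Path, year: int):
    """Open one year of files with parallel=False, as in the workaround."""
    # Imported here so monthly_paths can be used without xarray installed.
    import xarray as xr

    return xr.open_mfdataset(
        monthly_paths(datadir, year),
        combine="nested",
        concat_dim="tdim",
        parallel=False,  # open the files one at a time instead of as dask tasks
        engine="netcdf4",
    )
```

For the setup in the report this would be called as `open_year(Path("/mnt/sdc1/icec/NSIDC"), 2020)`.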

yt87 closed this as completed on Aug 23, 2023