
Concurrent access with multiple processes using open_mfdataset #2494

Closed
lanougue opened this issue Oct 19, 2018 · 4 comments
Comments

@lanougue

Hi everyone,

First: thanks to the developers for this amazing xarray library! Great piece of work!
Here is my trouble:
I run several (about 500) independent processes (dask distributed) that need simultaneous read-only access to the same (group of) netCDF files. I only pass the file-path strings to the processes, to avoid pickling a netCDF Python object (issue).

In each process, I run

with xr.open_mfdataset(myfiles_path, concat_dim='t', engine='h5netcdf') as myfile:
    x = myfile['x'].data
    y = myfile['y'].data

but many of the concurrent accesses fail with typical errors such as Invalid id or Exception: CancelledError("('mul-484a58bf5830233021e08456b45eb60d', 0, 0)",), ...

When playing with a single netCDF file, I was using the netCDF4 module with the parallel option set to True, and it ran fine:

myfile = Dataset(seedsurf_path,'r', parallel=True)
x = myfile['x']
y = myfile['y']
myfile.close()

The parallel option for open_mfdataset() seems to be dedicated to multithreaded access only. Is there something that can be done for multi-process access?
Thanks

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Linux
OS-release: 3.12.53-60.30-default
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

xarray: 0.10.8
pandas: 0.23.4
numpy: 1.12.1
scipy: 0.19.1
netCDF4: 1.2.4
h5netcdf: 0.6.2
h5py: 2.7.0
Nio: None
zarr: 2.2.0
bottleneck: 1.2.1
cyordereddict: None
dask: 0.19.0
distributed: 1.23.0
matplotlib: 2.2.3
cartopy: 0.16.0
seaborn: None
setuptools: 40.2.0
pip: 18.0
conda: None
pytest: None
IPython: 6.5.0
sphinx: None

@shoyer
Member

shoyer commented Oct 19, 2018

This may be fixed if you try the development version of xarray -- we did a major refactor of xarray's handling of netCDF files, e.g., try pip install https://github.com/pydata/xarray/archive/master.zip.

@jhamman
Member

jhamman commented Oct 19, 2018

To clear a few things up, the parallel option in netCDF4.Dataset is not the same as the parallel option in xarray.open_mfdataset. In xarray, that option is meant to speed up opening many files at once. If you are using dask distributed, this should be done using that scheduler.

If you are only seeing thread parallelism in the open_mfdataset(..., parallel=True) call, I would start by looking at your dask distributed setup.

Can you try this workflow with and without the parallel option and report back:

client = Client(...)
ds = xr.open_mfdataset(myfiles_path, concat_dim='t', engine='h5netcdf', parallel=...)
x = ds['x'].load().data
y = ds['y'].load().data
ds.close()

Provided that you are setting up distributed to use multiple processes, you should get parallelism from multiple processes in this case.
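For example, a minimal sketch of a local setup where the parallelism comes from worker processes rather than threads (the worker counts here are illustrative, not a recommendation):

from dask.distributed import Client, LocalCluster

# illustrative: 8 worker processes, one thread each
cluster = LocalCluster(n_workers=8, threads_per_worker=1, processes=True)
client = Client(cluster)
print(client)  # the repr should report 8 worker processes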

@lanougue
Author

@jhamman I was aware of the difference between the two parallel options. I was thus wondering whether I could pass a parallel option to the netCDF4 library via the open_mfdataset() call. I tried changing the engine to netcdf4 and adding the backend kwarg
backend_kwargs={'parallel': True}
but I get the same error.
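For reference, roughly the call I tried (a sketch, not my exact script; myfiles_path is my own list of file paths):

import xarray as xr

# sketch of the attempt: forward netCDF4's parallel flag through backend_kwargs
ds = xr.open_mfdataset(myfiles_path, concat_dim='t', engine='netcdf4',
                       backend_kwargs={'parallel': True})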
I'll try Stephan's suggestion to see how it behaves and will report back.
Thanks

@lanougue
Author

Hi all,
I finally figured out my problem. In each independent process, xr.open_mfdataset() seems to naturally attempt some multi-threaded access (even without the parallel option?). Each node of my cluster was configured in such a way that multi-threading was possible (my mistake). Here is the yaml config file used by PBSCluster():

jobqueue:
  pbs:
    name: dask-worker
    # Dask worker options
    cores: 56
    processes: 28

I thought that the parallel=True option enabled parallelized access for my independent processes. It actually enables parallelized access for the possible threads of each process.
Now I have removed parallel=True from the xr.open_mfdataset() call and ensured one thread per process by changing my config file (a sketch of the corresponding cluster setup follows the snippet below):

    cores: 28
    processes: 28
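For completeness, a minimal sketch of the corresponding cluster setup (illustrative, assuming dask_jobqueue; the number of jobs requested is made up):

from dask_jobqueue import PBSCluster
from dask.distributed import Client

# cores == processes, so each worker process runs a single thread and
# file access stays single-threaded within a process
cluster = PBSCluster(cores=28, processes=28)
cluster.scale(10)  # illustrative number of workers
client = Client(cluster)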

Thanks again for your help
