
Concurrent access with multiple processes using open_mfdataset #2494

Closed
lanougue opened this issue Oct 19, 2018 · 4 comments
Comments

@lanougue

Hi everyone,

First: thanks to the developers for this amazing xarray library! Great piece of work!
Here is my trouble:
I run several (about 500) independent processes (dask distributed) that need simultaneous read-only access to the same (group of) netCDF files. I only pass the file-path strings to the processes, to avoid pickling a netCDF Python object (issue).

In each process, I run

with xr.open_mfdataset(myfiles_path, concat_dim='t', engine='h5netcdf') as myfile:
    x = myfile['x'].data
    y = myfile['y'].data

but many of the concurrent accesses fail with typical errors such as Invalid id or Exception: CancelledError("('mul-484a58bf5830233021e08456b45eb60d', 0, 0)",), ...

When playing with a single netCDF file, I was using the netCDF4 module with the parallel option set to True, and it ran fine:

myfile = Dataset(seedsurf_path,'r', parallel=True)
x = myfile['x']
y = myfile['y']
myfile.close()

The parallel option for open_mfdataset() seems to be dedicated to multithreaded access only. Is there something that can be done for multi-process access?
Thanks

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Linux
OS-release: 3.12.53-60.30-default
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

xarray: 0.10.8
pandas: 0.23.4
numpy: 1.12.1
scipy: 0.19.1
netCDF4: 1.2.4
h5netcdf: 0.6.2
h5py: 2.7.0
Nio: None
zarr: 2.2.0
bottleneck: 1.2.1
cyordereddict: None
dask: 0.19.0
distributed: 1.23.0
matplotlib: 2.2.3
cartopy: 0.16.0
seaborn: None
setuptools: 40.2.0
pip: 18.0
conda: None
pytest: None
IPython: 6.5.0
sphinx: None

@shoyer
Member

shoyer commented Oct 19, 2018

This may be fixed if you try the development version of xarray -- we did a major refactor of xarray's handling of netCDF files, e.g., try pip install https://github.com/pydata/xarray/archive/master.zip.

@jhamman
Member

jhamman commented Oct 19, 2018

To clear a few things up, the parallel option in netCDF4.Dataset is not the same as the parallel option in xarray.open_mfdataset. In xarray, that option is meant to speed up opening many files at once. If you are using dask distributed, this should be done using that scheduler.

If you are only seeing thread parallelism in the open_mfdataset(..., parallel=True) call, I would start by looking at your dask distributed setup.

Can you try this workflow with and without the parallel option and report back:

client = Client(...)
ds = xr.open_mfdataset(myfiles_path, concat_dim='t', engine='h5netcdf', parallel=...)
x = ds['x'].load().data
y = ds['y'].load().data
ds.close()

Provided that you are setting up distributed to use multiple processes, you should get parallelism from multiple processes in this case.
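For example, a minimal sketch of a local setup where the parallelism comes from worker processes rather than threads (the worker counts here are illustrative, not a recommendation):

from dask.distributed import Client, LocalCluster

# illustrative: 8 worker processes, one thread each
cluster = LocalCluster(n_workers=8, threads_per_worker=1, processes=True)
client = Client(cluster)
print(client)  # the repr should report 8 worker processes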

@lanougue
Author

@jhamman I was aware of the difference between the two parallel options. I was thus wondering whether I could pass a parallel option to the netCDF4 library via the open_mfdataset() call. I tried changing the engine to netcdf4 and adding the backend kwarg
backend_kwargs={'parallel': True}
but I get the same error.
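For reference, roughly the call I tried (a sketch, not my exact script; myfiles_path is my own list of file paths):

import xarray as xr

# sketch of the attempt: forward netCDF4's parallel flag through backend_kwargs
ds = xr.open_mfdataset(myfiles_path, concat_dim='t', engine='netcdf4',
                       backend_kwargs={'parallel': True})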
I'll try Stephan's suggestion to see how it behaves and will report back.
Thanks

@lanougue
Author

Hi all,
I finally figured out my problem. In each independent process, xr.open_mfdataset() seems to naturally attempt some multi-threaded access (even without the parallel option?). Each node of my cluster was configured in such a way that multi-threading was possible (my mistake). Here is the yaml config file used by PBSCluster():

jobqueue:
  pbs:
    name: dask-worker
    # Dask worker options
    cores: 56
    processes: 28

I thought that the parallel=True option enabled parallelized access for my independent processes. It actually enables parallelized access for the possible threads of each process.
Now I have removed parallel=True from the xr.open_mfdataset() call and ensured one thread per process by changing my config file (a sketch of the corresponding cluster setup follows the snippet below):

    cores: 28
    processes: 28
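For completeness, a minimal sketch of the corresponding cluster setup (illustrative, assuming dask_jobqueue; the number of jobs requested is made up):

from dask_jobqueue import PBSCluster
from dask.distributed import Client

# cores == processes, so each worker process runs a single thread and
# file access stays single-threaded within a process
cluster = PBSCluster(cores=28, processes=28)
cluster.scale(10)  # illustrative number of workers
client = Client(cluster)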

Thanks again for your help
