Failing Multiprocess Reads #640
Comments
This occurs with both Python 2.7 and 3.4.
There is an example in examples/threaded_read.py that uses the queue and threading modules. I haven't tried the multiprocessing module. I did note, however, that in the threaded_read.py example the Dataset is opened within each thread. Some discussion on threading and the GIL can be found in issue #369.
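Roughly, the pattern in that example looks like the sketch below - each worker thread opens its own Dataset handle, reads one variable, and pushes the result onto a shared queue. This is paraphrased with placeholder names, not copied verbatim from threaded_read.py:

```python
import threading
import queue

from netCDF4 import Dataset

def worker(results, fname, varname):
    # each thread opens its own handle rather than sharing one Dataset
    ds = Dataset(fname, 'r')
    results.put((varname, ds.variables[varname][:]))
    ds.close()

def read_all_threaded(fname, varnames):
    results = queue.Queue()
    threads = [threading.Thread(target=worker, args=(results, fname, v))
               for v in varnames]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return dict(results.get() for _ in varnames)
```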
Are you using NETCDF4? From the HDF5 docs (http://www.hdfgroup.org/hdf5-quest.html#gconc):

Users are often surprised to learn that (1) concurrent access to different datasets in a single HDF5 file and (2) concurrent access to different HDF5 files both require a thread-safe version of the HDF5 library. Although each thread in these examples is accessing different data, the HDF5 library modifies global data structures that are independent of a particular HDF5 dataset or HDF5 file. HDF5 relies on a semaphore around the library API calls in the thread-safe version of the library to protect the data structure from corruption by simultaneous manipulation from different threads. Examples of HDF5 library global data structures that must be protected are the free space manager and open file lists.

Concurrent reads from the same NETCDF3 file should be fine though. For it to work with NETCDF4, you may have to build a thread-safe version of HDF5.
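To check whether a given file is actually going through HDF5 at all (i.e. is NETCDF4 or NETCDF4_CLASSIC rather than one of the NETCDF3 variants), you can look at the Dataset's data_model attribute - a minimal sketch, using the test file name from later in this thread:

```python
from netCDF4 import Dataset

nc = Dataset('testmp.nc', 'r')
# one of 'NETCDF4', 'NETCDF4_CLASSIC', 'NETCDF3_CLASSIC', 'NETCDF3_64BIT', ...
print(nc.data_model)
nc.close()
```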
This fails on both NetCDF 3 and NetCDF 4 files with this library, though I'm somewhat more concerned about it failing on the NetCDF 4 files, given the lack of an alternative library.
I don't understand why this would be unsafe - is the netCDF library modifying its own global data structures on a per-file-handle basis? Also, that link is broken for me and some others - is there someplace else I could access the docs?
It's the HDF5 library that has global data structures - which the netcdf4 library uses as the underlying storage layer. Looks like that link has disappeared. I can only find https://support.hdfgroup.org/HDF5/faq/threadsafe.html now. Not 100% sure this is the issue at all, just pointing out the possibility.
BTW - doesn't the fact that opening the Dataset increases the run time so much indicate that there is not much data being read in your example? In a real-world case, when you're reading much more data, might the overhead of opening the Dataset inside each process be much less?
My real-world data suggests that this cost is still significant - the utility I'm writing processes all of the numeric variables in 800 MB files in just a few seconds with the scipy netcdf library.
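One way to quantify that overhead is to time the open separately from the reads - a rough sketch against a placeholder test file, not taken from either script in this thread:

```python
import time
from netCDF4 import Dataset

fname = 'testmp.nc'  # placeholder test file

t0 = time.time()
dataset = Dataset(fname, 'r')
open_time = time.time() - t0

t0 = time.time()
for name in dataset.variables:
    _ = dataset.variables[name][:]
read_time = time.time() - t0
dataset.close()

print("open: {:.4f} s, read: {:.4f} s".format(open_time, read_time))
```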
What do you mean by 'deleting the netcdf dataset'?
Sorry, I wasn't clear. I meant deleting the Python object (not deleting the file).
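Presumably something along these lines, i.e. dropping the writer's Dataset object before starting the worker processes (an illustrative guess, not the snippet from the original comment):

```python
nc = Dataset(fname, 'w', format='NETCDF4')
# ... create dimensions/variables and write data ...
nc.close()
del nc  # delete the Python object itself; the file on disk is untouched
```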
Here's a modified version of your script that works for me, without deleting the Dataset instance. I don't think that should be necessary.

from netCDF4 import Dataset
import numpy as np
from multiprocessing import Process, Queue
fname = 'testmp.nc'
nc = Dataset(fname, 'w', format='NETCDF4')
data1 = np.random.randn(500, 500, 500)
data2 = np.random.randn(500, 500, 500)
data3 = np.random.randn(500, 500, 500)
nc.createDimension('x', 500)
nc.createDimension('y', 500)
nc.createDimension('z', 500)
var1 = nc.createVariable('grid1', np.float64, ('x', 'y', 'z'))
var2 = nc.createVariable('grid2', np.float64, ('x', 'y', 'z'))
var3 = nc.createVariable('grid3', np.float64, ('x', 'y', 'z'))
var1[:] = data1
var2[:] = data2
var3[:] = data3
nc.close()
def read_test(results, fname, varname):
    dataset = Dataset(fname, 'r')
    results.put((varname, dataset.variables[varname][:]))
    dataset.close()
dataset = Dataset(fname, 'r')
var_list = dataset.variables.keys()
canon_values = {}
for v in var_list:
    canon_values[v] = dataset.variables[v][:]
dataset.close()
results = Queue()
procs = []
for v in var_list:
    p = Process(target=read_test, args=(results, fname, v))
    p.start()
    procs.append(p)
result_values = {}
for i in range(len(var_list)):
    v, comp = results.get()
    result_values[v] = comp
for v in var_list:
    diff = (result_values[v] - canon_values[v]) != 0.0
    if np.any(diff):
        print("Failed to read correctly! {}".format(result_values[v] - canon_values[v]))
I haven't had a chance to compare the performance of doing it that way yet, but will do so soon.
I don't see any differences in behaviour in the example above between NETCDF3 and NETCDF4. Can you be more specific? The scipy netcdf interface uses numpy memory-mapped arrays - so it will always be faster for certain kinds of problems than the more general interface that supports the new data structures and different underlying disk formats.
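For comparison, the scipy interface referred to here is scipy.io.netcdf_file, which memory-maps classic-format (NETCDF3) files; a minimal sketch with placeholder file and variable names:

```python
from scipy.io import netcdf_file  # handles classic (NETCDF3) formats only

f = netcdf_file('some_netcdf3_file.nc', 'r', mmap=True)
data = f.variables['var_0'][:].copy()  # copy out of the memory map before closing
f.close()
```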
With your code I wouldn't expect there to be, but with my original example there are differences, which shouldn't happen given the lack of shared memory usage with NETCDF3.
Also, I use the following code to generate my dataset; 3 variables is not enough to replicate this issue for me.
Here's my original code modified to use your file with 200 variables. Works fine for me, and runs in about 1.2 seconds on my Mac.

from netCDF4 import Dataset
import numpy as np
from multiprocessing import Process, Queue
fname = 'testmp.nc'
nc = Dataset(fname, 'w', format='NETCDF3_64BIT')
dimshape = {'time': 1000}
for name, size in dimshape.items():
    nc.createDimension(name, size)
vname_fmt = 'var_{}'
for i in range(200):
    vname = vname_fmt.format(i)
    v = nc.createVariable(vname, np.float64, tuple(dimshape.keys()))
    v[:] = np.random.uniform(0.00390625, 256.0, tuple(dimshape.values()))
nc.close()
def read_test(results, fname, varname):
    dataset = Dataset(fname, 'r')
    results.put((varname, dataset.variables[varname][:]))
    dataset.close()
dataset = Dataset(fname, 'r')
var_list = dataset.variables.keys()
results = Queue()
procs = []
for v in var_list:
    p = Process(target=read_test, args=(results, fname, v))
    p.start()
    procs.append(p)
result_values = {}
for i in range(len(var_list)):
    v, comp = results.get()
    result_values[v] = comp
canon_values = {}
for v in var_list:
    canon_values[v] = dataset.variables[v][:]
    diff = (result_values[v] - canon_values[v]) != 0.0
    if np.any(diff):
        print("Failed to read correctly! {}".format(result_values[v] - canon_values[v]))
I've already started implementing this solution; my previous comment was meant as a bug report rather than a request for help. |
I'm having issues reading from a file with multiple processes, using netcdf-4.3.3.1-5.el7.x86_64, hdf5-1.8.12-8.el7.x86_64, and the Python interface netCDF4-1.2.4. I'm not certain the hdf5 library listed here is compiled with parallel I/O support, but I've also tested this on other machines which do have it enabled.
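For reference, the library versions the Python module was actually built against can be checked with something like:

```python
import netCDF4

print(netCDF4.__version__)            # version of the Python module itself
print(netCDF4.__netcdf4libversion__)  # underlying netCDF C library version
print(netCDF4.__hdf5libversion__)     # underlying HDF5 library version
```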
To test this, I use a netcdf file with 200 variables, each with 1000 values. My reproduction code is as follows: