
Failing Multiprocess Reads #640

Closed
mfdeakin-sandia opened this issue Mar 14, 2017 · 17 comments

mfdeakin-sandia commented Mar 14, 2017

I'm having issues reading from a file with multiple processes, using netcdf-4.3.3.1-5.el7.x86_64, hdf5-1.8.12-8.el7.x86_64, and the Python interface netCDF4-1.2.4. I'm not certain the HDF5 library listed here is compiled with parallel I/O support, but I've also tested this on other machines that do have it enabled.

To test this, I use a netCDF file with 200 variables, each with 1000 values. My reproduction code is as follows:

from netCDF4 import Dataset
import numpy as np
from multiprocessing import Process, Queue
import sys

def read_test(results, varname):
    # Reads through the Dataset handle opened in the parent process,
    # which the forked child processes inherit.
    results.put((varname, dataset.variables[varname][:]))

filename = sys.argv[1]
dataset = Dataset(filename, 'r')
var_list = list(dataset.variables.keys())
results = Queue()
procs = []

for v in var_list:
    p = Process(target=read_test, args=(results, v))
    p.start()
    procs.append(p)

result_values = {}

for i in range(len(var_list)):
    v, comp = results.get()
    result_values[v] = comp

canon_values = {}

for v in var_list:
    canon_values[v] = dataset.variables[v][:]
    diff = (result_values[v] - canon_values[v]) != 0.0
    if np.any(diff):
        print("Failed to read correctly! {}".format(result_values[v] - canon_values[v]))

mfdeakin-sandia (Author) commented

This occurs with both Python 2.7 and 3.4.
A (poor) fix for the code above is to instantiate the Dataset object inside the multiprocessing code. Unfortunately, this is also significantly slower than the dysfunctional code above: the script's runtime goes from 0.69 s to 2.4 s on my machine.
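
Concretely, that workaround replaces read_test and the spawn loop in the script above with something like this (a sketch of the fix just described, reusing the names from the reproduction script):

def read_test(results, filename, varname):
    # Open a fresh Dataset inside the child process instead of
    # inheriting the parent's handle across the fork.
    dataset = Dataset(filename, 'r')
    results.put((varname, dataset.variables[varname][:]))
    dataset.close()

for v in var_list:
    p = Process(target=read_test, args=(results, filename, v))
    p.start()
    procs.append(p)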

jswhit commented Mar 16, 2017

There is an example in examples/threaded_read.py that uses the queue and threading modules. I haven't tried the multiprocessing module. I did note, however, that in the threaded_read.py example the Dataset is opened within each thread. Some discussion of threading and the GIL can be found in issue #369.
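
The pattern there is roughly the following (a minimal sketch, not the actual examples/threaded_read.py; the file and variable names are placeholders):

import threading
from netCDF4 import Dataset
try:
    import queue            # Python 3
except ImportError:
    import Queue as queue   # Python 2

def reader(results, fname, varname):
    # Each thread opens (and closes) its own Dataset handle.
    nc = Dataset(fname, 'r')
    results.put((varname, nc.variables[varname][:]))
    nc.close()

fname = 'test.nc'  # placeholder file name
nc = Dataset(fname, 'r')
varnames = list(nc.variables.keys())
nc.close()

results = queue.Queue()
threads = [threading.Thread(target=reader, args=(results, fname, v))
           for v in varnames]
for t in threads:
    t.start()
for t in threads:
    t.join()
data = dict(results.get() for _ in varnames)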

jswhit commented Mar 16, 2017

Are you using NETCDF4? From the HDF5 docs (http://www.hdfgroup.org/hdf5-quest.html#gconc):

Users are often surprised to learn that (1) concurrent access to different datasets in a single HDF5 file and (2) concurrent access to different HDF5 files both require a thread-safe version of the HDF5 library. Although each thread in these examples is accessing different data, the HDF5 library modifies global data structures that are independent of a particular HDF5 dataset or HDF5 file. HDF5 relies on a semaphore around the library API calls in the thread-safe version of the library to protect the data structure from corruption by simultaneous manipulation from different threads. Examples of HDF5 library global data structures that must be protected are the freespace manager and open file lists.

Concurrent reads from the same NETCDF3 file should be fine though. For it to work with NETCDF4, you may have to build a thread-safe version of HDF5.

mfdeakin-sandia (Author) commented

This fails on both NetCDF 3 and 4 files with this library, though I'm somewhat more concerned about it failing on the NetCDF 4 files, given the lack of an alternative library.

mfdeakin-sandia (Author) commented

I don't understand why this would be unsafe - is the NetCDF library modifying its own global data structures on a per-file-handle basis?
We're using multiple processes because that's supposed to be easier to make safe, and it sidesteps the GIL issue, but this bug requires us to restructure things completely.

Also, that link is broken for me and some others - is there somewhere else I could access the docs?
Thanks

jswhit commented Mar 16, 2017

It's the HDF5 library that has the global data structures; the netCDF4 library uses HDF5 as its underlying storage layer.

Looks like that link has disappeared. I can only find https://support.hdfgroup.org/HDF5/faq/threadsafe.html now.

Not 100% sure this is the issue at all, just pointing out the possibility.

jswhit commented Mar 16, 2017

BTW - the fact that opening the Dataset increases the run time so much indicates that there is not much data being read in your example? In a real-world case, when you're reading much more data, the overhead of opening the Dataset inside each process might be much smaller.
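
One quick way to check is to time a bare open/close against an open plus a single variable read - a hypothetical measurement sketch (the file and variable names are placeholders):

import timeit
from netCDF4 import Dataset

fname = 'testmp.nc'  # placeholder test file

def open_close():
    Dataset(fname, 'r').close()

def open_read(varname='var_0'):  # placeholder variable name
    nc = Dataset(fname, 'r')
    _ = nc.variables[varname][:]
    nc.close()

n = 100
print('open/close: {:.4f} s'.format(timeit.timeit(open_close, number=n) / n))
print('open+read:  {:.4f} s'.format(timeit.timeit(open_read, number=n) / n))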

mfdeakin-sandia (Author) commented

My real-world data suggests that this cost is still significant - the utility I'm writing processes all of the numeric variables in 800 MB files in just a few seconds with the scipy netcdf library.
Admittedly, this cost might mostly come from deleting the netCDF Dataset - I haven't investigated that yet, but I am working on it.

jswhit commented Mar 16, 2017

what do you mean by 'deleting the netcdf dataset'?

mfdeakin-sandia (Author) commented

Sorry, I wasn't clear. I meant deleting the Python object (not deleting the file):
del dataset
This is required before starting the other processes (or opening the Dataset inside them); otherwise there are still concurrency errors. I should note that I only tested this part with a netCDF 4 file.
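
In context, the arrangement that works looks like this (an illustrative fragment, not a complete program; filename is the same command-line argument as in the reproduction script):

from netCDF4 import Dataset

dataset = Dataset(filename, 'r')
var_list = list(dataset.variables.keys())
del dataset  # release the parent's handle before any Process.start()
# ... then fork the reader processes, each opening its own Dataset ...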

jswhit commented Mar 17, 2017

Here's a modified version of your script that works for me, without deleting the Dataset instance. I don't think that should be necessary.

from netCDF4 import Dataset
import numpy as np
from multiprocessing import Process, Queue

# Create a test file with three large variables.
fname = 'testmp.nc'
nc = Dataset(fname, 'w', format='NETCDF4')
data1 = np.random.randn(500, 500, 500)
data2 = np.random.randn(500, 500, 500)
data3 = np.random.randn(500, 500, 500)
nc.createDimension('x', 500)
nc.createDimension('y', 500)
nc.createDimension('z', 500)
var1 = nc.createVariable('grid1', np.float64, ('x', 'y', 'z'))
var2 = nc.createVariable('grid2', np.float64, ('x', 'y', 'z'))
var3 = nc.createVariable('grid3', np.float64, ('x', 'y', 'z'))
var1[:] = data1
var2[:] = data2
var3[:] = data3
nc.close()

def read_test(results, fname, varname):
    # Each worker opens (and closes) its own Dataset handle.
    dataset = Dataset(fname, 'r')
    results.put((varname, dataset.variables[varname][:]))
    dataset.close()

# Read the reference values in the parent, then close the file
# before forking any workers.
dataset = Dataset(fname, 'r')
var_list = list(dataset.variables.keys())
canon_values = {}
for v in var_list:
    canon_values[v] = dataset.variables[v][:]
dataset.close()

results = Queue()
procs = []

for v in var_list:
    p = Process(target=read_test, args=(results, fname, v))
    p.start()
    procs.append(p)

# Drain the queue before joining, to avoid a deadlock on full pipes.
result_values = {}
for i in range(len(var_list)):
    v, comp = results.get()
    result_values[v] = comp
for p in procs:
    p.join()

for v in var_list:
    diff = (result_values[v] - canon_values[v]) != 0.0
    if np.any(diff):
        print("Failed to read correctly! {}".format(result_values[v] - canon_values[v]))

mfdeakin-sandia (Author) commented

I haven't had a chance to compare the performance of doing it that way yet, but I will soon.
Should I open a separate issue for this with NetCDF 3 specifically? I'm not certain they're separate issues, so I've been hesitant to do so, especially since I mostly care about NetCDF 4.
Thanks for working with me on this

jswhit commented Mar 20, 2017

I don't see any differences in behaviour between NETCDF3 and NETCDF4 in the example above. Can you be more specific?

The scipy netcdf interface uses numpy memory-mapped arrays, so it will always be faster for certain kinds of problems than the more general interface that supports the new data structures and the different underlying disk formats.
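
For comparison, a read through the scipy interface looks roughly like this (a minimal sketch; scipy.io.netcdf_file handles NETCDF3 files only, and memory-maps them by default when reading from a file name):

from scipy.io import netcdf_file

f = netcdf_file('testmp.nc', 'r', mmap=True)
var = f.variables['var_0']  # placeholder variable name
data = var[:].copy()        # copy, since the returned array references the mmap
f.close()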

mfdeakin-sandia (Author) commented

With your code I wouldn't expect there to be, but with my original example there are differences, which shouldn't happen given the lack of shared-memory usage with netCDF 3.

mfdeakin-sandia (Author) commented

Also, I use the following code to generate my dataset. Three variables are not enough to replicate this issue for me.

from netCDF4 import Dataset
import numpy as np

fname = 'testmp.nc'
dataset = Dataset(fname, 'w', format='NETCDF3_CLASSIC')
dimshape = {'time': 1000}
for name, size in dimshape.items():
    dataset.createDimension(name, size)

vname_fmt = 'var_{}'
for i in range(200):
    vname = vname_fmt.format(i)
    v = dataset.createVariable(vname, np.float64, tuple(dimshape.keys()))
    v[:] = np.random.uniform(0.00390625, 256.0, list(dimshape.values()))
dataset.close()

jswhit commented Mar 28, 2017

Here's my original code modified to use your file with 200 variables. Works fine for me, and runs in about 1.2 seconds on my mac.

from netCDF4 import Dataset
import numpy as np
from multiprocessing import Process, Queue

# Create the 200-variable test file (NETCDF3 this time).
fname = 'testmp.nc'
nc = Dataset(fname, 'w', format='NETCDF3_64BIT')
dimshape = {'time': 1000}
for name, size in dimshape.items():
    nc.createDimension(name, size)

vname_fmt = 'var_{}'
for i in range(200):
    vname = vname_fmt.format(i)
    v = nc.createVariable(vname, np.float64, tuple(dimshape.keys()))
    v[:] = np.random.uniform(0.00390625, 256.0, list(dimshape.values()))
nc.close()

def read_test(results, fname, varname):
    # Each worker opens its own Dataset handle.
    dataset = Dataset(fname, 'r')
    results.put((varname, dataset.variables[varname][:]))
    dataset.close()

dataset = Dataset(fname, 'r')
var_list = list(dataset.variables.keys())
results = Queue()
procs = []

for v in var_list:
    p = Process(target=read_test, args=(results, fname, v))
    p.start()
    procs.append(p)

result_values = {}
for i in range(len(var_list)):
    v, comp = results.get()
    result_values[v] = comp
for p in procs:
    p.join()

canon_values = {}
for v in var_list:
    canon_values[v] = dataset.variables[v][:]
    diff = (result_values[v] - canon_values[v]) != 0.0
    if np.any(diff):
        print("Failed to read correctly! {}".format(result_values[v] - canon_values[v]))
dataset.close()

mfdeakin-sandia (Author) commented

I've already started implementing this solution; my previous comment was meant as a bug report rather than a request for help.
Sorry if I wasn't clear.
