
Failing Multiprocess Reads #640

Closed
mfdeakin-sandia opened this issue Mar 14, 2017 · 17 comments

mfdeakin-sandia commented Mar 14, 2017

I'm having issues reading from a file with multiple processes, using netcdf-4.3.3.1-5.el7.x86_64, hdf5-1.8.12-8.el7.x86_64, and the Python interface netCDF4-1.2.4. I'm not certain the HDF5 library listed here is compiled with parallel I/O support, but I've also tested this on other machines that do have it enabled.

To test this, I use a netCDF file with 200 variables, each with 1000 values. My reproduction code is as follows:

from netCDF4 import Dataset
import numpy as np
from multiprocessing import Process, Queue
import sys

def read_test(results, varname):
    # Reads through the Dataset handle opened in the parent process,
    # which the forked child processes inherit.
    results.put((varname, dataset.variables[varname][:]))

filename = sys.argv[1]
dataset = Dataset(filename, 'r')
var_list = list(dataset.variables.keys())
results = Queue()
procs = []

for v in var_list:
    p = Process(target=read_test, args=(results, v))
    p.start()
    procs.append(p)

result_values = {}

for i in range(len(var_list)):
    v, comp = results.get()
    result_values[v] = comp

canon_values = {}

for v in var_list:
    canon_values[v] = dataset.variables[v][:]
    diff = (result_values[v] - canon_values[v]) != 0.0
    if np.any(diff):
        print("Failed to read correctly! {}".format(result_values[v] - canon_values[v]))

mfdeakin-sandia (Author) commented

This occurs with both Python 2.7 and 3.4.
A (poor) fix for the code above is to instantiate the Dataset object inside the multiprocessing code. Unfortunately, this is also significantly slower than the dysfunctional code above: the script's runtime goes from 0.69 s to 2.4 s on my machine.
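
Concretely, that workaround replaces read_test and the spawn loop in the script above with something like this (a sketch of the fix just described, reusing the names from the reproduction script):

def read_test(results, filename, varname):
    # Open a fresh Dataset inside the child process instead of
    # inheriting the parent's handle across the fork.
    dataset = Dataset(filename, 'r')
    results.put((varname, dataset.variables[varname][:]))
    dataset.close()

for v in var_list:
    p = Process(target=read_test, args=(results, filename, v))
    p.start()
    procs.append(p)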

jswhit commented Mar 16, 2017

There is an example in examples/threaded_read.py that uses the queue and threading modules. I haven't tried the multiprocessing module. I did note, however, that in the threaded_read.py example the Dataset is opened within each thread. Some discussion of threading and the GIL can be found in issue #369.
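
The pattern there is roughly the following (a minimal sketch, not the actual examples/threaded_read.py; the file and variable names are placeholders):

import threading
from netCDF4 import Dataset
try:
    import queue            # Python 3
except ImportError:
    import Queue as queue   # Python 2

def reader(results, fname, varname):
    # Each thread opens (and closes) its own Dataset handle.
    nc = Dataset(fname, 'r')
    results.put((varname, nc.variables[varname][:]))
    nc.close()

fname = 'test.nc'  # placeholder file name
nc = Dataset(fname, 'r')
varnames = list(nc.variables.keys())
nc.close()

results = queue.Queue()
threads = [threading.Thread(target=reader, args=(results, fname, v))
           for v in varnames]
for t in threads:
    t.start()
for t in threads:
    t.join()
data = dict(results.get() for _ in varnames)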

jswhit commented Mar 16, 2017

Are you using NETCDF4? From the HDF5 docs (http://www.hdfgroup.org/hdf5-quest.html#gconc):

Users are often surprised to learn that (1) concurrent access to different datasets in a single HDF5 file and (2) concurrent access to different HDF5 files both require a thread-safe version of the HDF5 library. Although each thread in these examples is accessing different data, the HDF5 library modifies global data structures that are independent of a particular HDF5 dataset or HDF5 file. HDF5 relies on a semaphore around the library API calls in the thread-safe version of the library to protect the data structure from corruption by simultaneous manipulation from different threads. Examples of HDF5 library global data structures that must be protected are the freespace manager and open file lists.

Concurrent reads from the same NETCDF3 file should be fine though. For it to work with NETCDF4, you may have to build a thread-safe version of HDF5.

mfdeakin-sandia (Author) commented

This fails on both NetCDF 3 and 4 files with this library, though I'm somewhat more concerned about it failing on the NetCDF 4 files, given the lack of an alternative library.

mfdeakin-sandia (Author) commented

I don't understand why this would be unsafe - is the NetCDF library modifying its own global data structures on a per-file-handle basis?
We're using multiple processes because that's supposed to be easier to make safe, and it sidesteps the GIL issue, but this bug requires us to restructure things completely.

Also, that link is broken for me and some others - is there somewhere else I could access the docs?
Thanks

jswhit commented Mar 16, 2017

It's the HDF5 library that has the global data structures; the netCDF4 library uses HDF5 as its underlying storage layer.

Looks like that link has disappeared. I can only find https://support.hdfgroup.org/HDF5/faq/threadsafe.html now.

Not 100% sure this is the issue at all, just pointing out the possibility.

jswhit commented Mar 16, 2017

BTW - the fact that opening the Dataset increases the run time so much indicates that there is not much data being read in your example? In a real-world case, when you're reading much more data, the overhead of opening the Dataset inside each process might be much smaller.
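
One quick way to check is to time a bare open/close against an open plus a single variable read - a hypothetical measurement sketch (the file and variable names are placeholders):

import timeit
from netCDF4 import Dataset

fname = 'testmp.nc'  # placeholder test file

def open_close():
    Dataset(fname, 'r').close()

def open_read(varname='var_0'):  # placeholder variable name
    nc = Dataset(fname, 'r')
    _ = nc.variables[varname][:]
    nc.close()

n = 100
print('open/close: {:.4f} s'.format(timeit.timeit(open_close, number=n) / n))
print('open+read:  {:.4f} s'.format(timeit.timeit(open_read, number=n) / n))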

mfdeakin-sandia (Author) commented

My real-world data suggests that this cost is still significant - the utility I'm writing processes all of the numeric variables in 800 MB files in just a few seconds with the scipy netcdf library.
Admittedly, this cost might mostly come from deleting the netCDF Dataset - I haven't investigated that yet, but I am working on it.

jswhit commented Mar 16, 2017

what do you mean by 'deleting the netcdf dataset'?

mfdeakin-sandia (Author) commented

Sorry, I wasn't clear. I meant deleting the Python object (not deleting the file):
del dataset
This is required before starting the other processes (or opening the Dataset inside them); otherwise there are still concurrency errors. I should note that I only tested this part with a netCDF 4 file.
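
In context, the arrangement that works looks like this (an illustrative fragment, not a complete program; filename is the same command-line argument as in the reproduction script):

from netCDF4 import Dataset

dataset = Dataset(filename, 'r')
var_list = list(dataset.variables.keys())
del dataset  # release the parent's handle before any Process.start()
# ... then fork the reader processes, each opening its own Dataset ...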

jswhit commented Mar 17, 2017

Here's a modified version of your script that works for me, without deleting the Dataset instance. I don't think that should be necessary.

from netCDF4 import Dataset
import numpy as np
from multiprocessing import Process, Queue

# Create a test file with three large variables.
fname = 'testmp.nc'
nc = Dataset(fname, 'w', format='NETCDF4')
data1 = np.random.randn(500, 500, 500)
data2 = np.random.randn(500, 500, 500)
data3 = np.random.randn(500, 500, 500)
nc.createDimension('x', 500)
nc.createDimension('y', 500)
nc.createDimension('z', 500)
var1 = nc.createVariable('grid1', np.float64, ('x', 'y', 'z'))
var2 = nc.createVariable('grid2', np.float64, ('x', 'y', 'z'))
var3 = nc.createVariable('grid3', np.float64, ('x', 'y', 'z'))
var1[:] = data1
var2[:] = data2
var3[:] = data3
nc.close()

def read_test(results, fname, varname):
    # Each worker opens (and closes) its own Dataset handle.
    dataset = Dataset(fname, 'r')
    results.put((varname, dataset.variables[varname][:]))
    dataset.close()

# Read the reference values in the parent, then close the file
# before forking any workers.
dataset = Dataset(fname, 'r')
var_list = list(dataset.variables.keys())
canon_values = {}
for v in var_list:
    canon_values[v] = dataset.variables[v][:]
dataset.close()

results = Queue()
procs = []

for v in var_list:
    p = Process(target=read_test, args=(results, fname, v))
    p.start()
    procs.append(p)

# Drain the queue before joining, to avoid a deadlock on full pipes.
result_values = {}
for i in range(len(var_list)):
    v, comp = results.get()
    result_values[v] = comp
for p in procs:
    p.join()

for v in var_list:
    diff = (result_values[v] - canon_values[v]) != 0.0
    if np.any(diff):
        print("Failed to read correctly! {}".format(result_values[v] - canon_values[v]))

mfdeakin-sandia (Author) commented

I haven't had a chance to compare the performance of doing it that way yet, but I will soon.
Should I open a separate issue for this with NetCDF 3 specifically? I'm not certain they're separate issues, so I've been hesitant to do so, especially since I mostly care about NetCDF 4.
Thanks for working with me on this

jswhit commented Mar 20, 2017

I don't see any differences in behaviour between NETCDF3 and NETCDF4 in the example above. Can you be more specific?

The scipy netcdf interface uses numpy memory-mapped arrays, so it will always be faster for certain kinds of problems than the more general interface that supports the new data structures and the different underlying disk formats.
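
For comparison, a read through the scipy interface looks roughly like this (a minimal sketch; scipy.io.netcdf_file handles NETCDF3 files only, and memory-maps them by default when reading from a file name):

from scipy.io import netcdf_file

f = netcdf_file('testmp.nc', 'r', mmap=True)
var = f.variables['var_0']  # placeholder variable name
data = var[:].copy()        # copy, since the returned array references the mmap
f.close()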

mfdeakin-sandia (Author) commented

With your code I wouldn't expect there to be, but with my original example there are differences, which shouldn't happen given the lack of shared-memory usage with netCDF 3.

mfdeakin-sandia (Author) commented

Also, I use the following code to generate my dataset. Three variables are not enough to replicate this issue for me.

from netCDF4 import Dataset
import numpy as np

fname = 'testmp.nc'
dataset = Dataset(fname, 'w', format='NETCDF3_CLASSIC')
dimshape = {'time': 1000}
for name, size in dimshape.items():
    dataset.createDimension(name, size)

vname_fmt = 'var_{}'
for i in range(200):
    vname = vname_fmt.format(i)
    v = dataset.createVariable(vname, np.float64, tuple(dimshape.keys()))
    v[:] = np.random.uniform(0.00390625, 256.0, list(dimshape.values()))
dataset.close()

jswhit commented Mar 28, 2017

Here's my original code modified to use your file with 200 variables. Works fine for me, and runs in about 1.2 seconds on my mac.

from netCDF4 import Dataset
import numpy as np
from multiprocessing import Process, Queue

# Create the 200-variable test file (NETCDF3 this time).
fname = 'testmp.nc'
nc = Dataset(fname, 'w', format='NETCDF3_64BIT')
dimshape = {'time': 1000}
for name, size in dimshape.items():
    nc.createDimension(name, size)

vname_fmt = 'var_{}'
for i in range(200):
    vname = vname_fmt.format(i)
    v = nc.createVariable(vname, np.float64, tuple(dimshape.keys()))
    v[:] = np.random.uniform(0.00390625, 256.0, list(dimshape.values()))
nc.close()

def read_test(results, fname, varname):
    # Each worker opens its own Dataset handle.
    dataset = Dataset(fname, 'r')
    results.put((varname, dataset.variables[varname][:]))
    dataset.close()

dataset = Dataset(fname, 'r')
var_list = list(dataset.variables.keys())
results = Queue()
procs = []

for v in var_list:
    p = Process(target=read_test, args=(results, fname, v))
    p.start()
    procs.append(p)

result_values = {}
for i in range(len(var_list)):
    v, comp = results.get()
    result_values[v] = comp
for p in procs:
    p.join()

canon_values = {}
for v in var_list:
    canon_values[v] = dataset.variables[v][:]
    diff = (result_values[v] - canon_values[v]) != 0.0
    if np.any(diff):
        print("Failed to read correctly! {}".format(result_values[v] - canon_values[v]))
dataset.close()

mfdeakin-sandia (Author) commented

I've already started implementing this solution; my previous comment was meant as a bug report rather than a request for help.
Sorry if I wasn't clear.
