-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Large pickle overhead in ds.to_netcdf() involving dask.delayed functions #2389
Comments
Offhand, I don't know why I'm not super familiar with profiling dask, but it might be worth looking at dask's diagnostics tools (http://dask.pydata.org/en/latest/understanding-performance.html) to understand what's going on here. The appearance of It would also be interesting to see if this changes with the xarray backend refactor from #2261. |
pangeo-data/gangeo#266 sounds somewhat similar. If you increase the size of the involved arrays here, you also end up with warnings about the size of the graph: https://stackoverflow.com/questions/52039697/how-to-avoid-large-objects-in-task-graph I haven't tried with #2261 applied, but I can try that tomorrow. If we interpret the time spent in
I don't really know how they work, but maybe pickeling those NetCDF4ArrayWrapper objects is expensive (ie they contain a reference to something they shouldn't)? |
This seems plausible to me, though the situation is likely improved with #2261. It would be nice if dask had a way to consolidate the serialization of these objects, rather than separately serializing them in each task. It's not obvious to me how to do that in xarray short of manually building task graphs so those CC @mrocklin in case he has thoughts here |
You can make it a separate task (often done by wrapping with dask.delayed) and then use that key within other objets. This does create a data dependency though, which can make the graph somewhat more complex. In normal use of Pickle these things are cached and reused. Unfortunately we can't do this because we're sending the tasks to different machines, each of which will need to deserialize independently. |
If I understand the heuristics used by dask's schedulers correctly, a data
dependency might actually be a good idea here because it would encourage
colocating write tasks on the same machines. We should probably give this a
try.
…On Wed, Aug 29, 2018 at 12:15 PM Matthew Rocklin ***@***.***> wrote:
It would be nice if dask had a way to consolidate the serialization of
these objects, rather than separately serializing them in each task.
You can make it a separate task (often done by wrapping with dask.delayed)
and then use that key within other objets. This does create a data
dependency though, which can make the graph somewhat more complex.
In normal use of Pickle these things are cached and reused. Unfortunately
we can't do this because we're sending the tasks to different machines,
each of which will need to deserialize independently.
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#2389 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/ABKS1q8fMKCsVKmxjvANnMFS2Rn_6_6Jks5uVug-gaJpZM4WSBVj>
.
|
I wouldn't expect this to sway things too much, but yes, there is a chance that that would happen. |
Give #2391 a try -- in my testing, it speeds up both examples to only take about 3 second each. |
Ah, that seems to do the trick. There still seems to be some inefficiency in the pickeled graph output, I'm getting a warning about large objects in the graph:
The size scales linearly with the number of chunks (it is 13MB if there are 5000 chunks). |
It seems the xarray object that is sent to the workers contains a reference to the complete graph: vals = da.random.random((5, 1), chunks=(1, 1))
ds = xr.Dataset({'vals': (['a', 'b'], vals)})
write = ds.to_netcdf('file2.nc', compute=False)
key = [val for val in write.dask.keys()
if isinstance(val, str) and val.startswith('NetCDF')][0]
wrapper = write.dask[key]
len(pickle.dumps(wrapper))
# 14652
delayed_store = wrapper.datastore.delayed_store
len(pickle.dumps(delayed_store))
# 14652
dask.visualize(delayed_store) The size jumps to the 1.3MB if I use 500 chunks again. The warning about the large object in the graph disappears if we delete that reference before we execute the graph:
It doesn't to change the runtime though. |
OK, so it seems like the complete solution here should involve refactoring our backend classes to avoid any references to objects storing dask graphs. This is a cleaner solution even regardless of the pickle overhead because it allows us to eliminate all state stored in backend classes. I'll get on that in #2261. |
Removing the self-references to the dask graphs in #2261 seems to resolve the performance issue on its own. I would be interested if #2391 still improves performance in any real world yes cases -- perhaps it helps when working with a real cluster or on large datasets? I can't see any difference in my local benchmarks using dask-distributed. |
If we write a dask array that doesn't involve
dask.delayed
functions usingds.to_netcdf
, there is only little overhead from pickle:But if we use results from
dask.delayed
, pickle takes up most of the time:This additional pickle overhead does not happen if we compute the dataset without writing it to a file.
Output of
%prun -stime -l10 ds.compute()
withoutdask.delayed
:With
dask.delayed
:I am using
dask.distributed
. I haven't tested it with anything else.Software versions
The text was updated successfully, but these errors were encountered: