Why is xarray.to_zarr slow sometimes? #150
I would try to diagnose this with the dashboard. Do progress bars arrive on the dashboard? If not then my guess is that it's quite expensive to send the graph itself. If they do arrive but nothing completes then I would look at the profile page to see what things are working on or the stack traces in the info pages of the workers to get a sense of what they're crunching on. |
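For concreteness, a minimal sketch of this kind of introspection might look like the following (the scheduler address is a placeholder, and this assumes a dask.distributed cluster is already running):

from dask.distributed import Client

client = Client('tcp://scheduler-address:8786')   # placeholder address
print(client.dashboard_link)                      # open this URL to watch the progress bars
client.profile(filename='profile.html')           # save an aggregated worker profile to HTML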
When we were first doing this, we found it useful to separate the creation of the dataset from the writing of the data. That was the motivation for pydata/xarray#1811. It would be good to figure out if the creation of the zarr dataset is slow or writing data is slow. So you might consider trying out the feature in this PR. If no data is ever sent, the dask progress bar should not progress and that would make me think something is getting snagged during the creation of the dataset. |
FYI, if I try with distributed, I get an error. edit: see fsspec/gcsfs#90 |
I can look at the bucket directly via the GCS console and see that a bunch of objects have been created. |
There is really some kind of threshold behavior here...below a certain size, everything is fine. Above that size, nothing. I'll continue to try to debug. |
So I was able to run this with the distributed scheduler. What I am finding is very weird: with a small subset of the data (e.g. a few tens of timesteps) everything works, but with the full dataset nothing seems to happen.
Not very informative, obviously. I would love some more suggestions on how to debug. This is a showstopper in terms of moving forward. |
Strangely, after interrupting |
Ok, so after waiting a long time (~20 minutes), the computation finally started. What could be responsible for this long wait? If it takes 20 minutes for 50 timesteps, I fear that it will take forever for the full 730 timesteps. |
@rabernat - have you tried using pydata/xarray#1811? I'd like to see a time breakdown from each of these three steps:

%time ds = xr.open_mfdataset('your files *')
%time delayed_obj = ds.to_zarr(store=store, ..., compute=False)
%time delayed_obj.compute() |
Great suggestion Joe. I will give it a try later today. |
I tried with pydata/xarray#1811.
The initialization of the dataset is a bit more complicated than this, since it involves several sets of files and merge steps; all of that takes no more than 10 s. The to_zarr(..., compute=False) step reports:
CPU times: user 4.84 s, sys: 633 ms, total: 5.48 s
The final compute then sits for ~5-10 minutes with no scheduler activity before the computation starts, and potentially takes many hours / days, since there is a lot of data to push. |
So the fundamental question here is: what is happening during the time between calling persist and the moment the computation actually starts? |
I stumbled upon something similar a few weeks ago, although it was not related to zarr, only to dask distributed. I submitted 100 tasks on distributed, and computations would kick in only after a few minutes, with no information on the status page of the UI. Here is what fixed it: https://stackoverflow.com/questions/41471248/how-to-efficiently-submit-tasks-with-large-arguments-in-dask-distributed (the problem was inefficient serialization/distribution of large task arguments from the client to the workers). |
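For concreteness, the pattern that StackOverflow answer recommends is to pre-scatter large arguments instead of embedding them in every task. A minimal sketch, where big_array and process_chunk are illustrative names only:

from dask.distributed import Client

client = Client()
# ship the large object to the workers once, instead of serializing it into every task
big_future = client.scatter(big_array, broadcast=True)     # big_array: illustrative
futures = [client.submit(process_chunk, big_future, i)     # process_chunk: illustrative
           for i in range(100)]
results = client.gather(futures)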
Update on this issue: I ran the full store operation, and after more than 18 hours the computation had still not started. My understanding, based on @mrocklin's earlier comment above, is that this indicates it is "quite expensive to send the graph itself", as in >18 hours expensive! How can I examine the graph to determine which parts are expensive to serialize? This is starting to feel like deja vu with pydata/xarray#1770 and fsspec/gcsfs#49. I am using gcsfs 0.5.0, so presumably that earlier gcsfs serialization issue should already be fixed. |
Here is my feeble attempt to figure out how big the gcsmap object is:

import cloudpickle
import sys

%time gcsmap_pickled = cloudpickle.dumps(gcsmap)
sys.getsizeof(gcsmap_pickled)

The result is small, so this is not the culprit. |
Here is an attempt to peek into the enormous dask graph associated with the full dataset:

delayed_store = ds.to_zarr(store=gcsmap, encoding=encoding, compute=False)
for k, v in delayed_store.dask.items():
    print(k, v)
    break

It looks like there is a |
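A possible next step, sketched here under the assumption that delayed_store from the snippet above is available, is to estimate how expensive the whole graph is to serialize rather than just the gcsmap:

import cloudpickle

graph = dict(delayed_store.dask)   # the full task graph as a plain dict
total_bytes = sum(len(cloudpickle.dumps(task)) for task in graph.values())
print('number of tasks:', len(graph))
print('approx. serialized size: %.1f MB' % (total_bytes / 1e6))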
You might consider calling persist under a profiler. In a Jupyter notebook this might look like the following:

%%prun
x = x.persist()

Or replace prun with snakeviz (after calling %load_ext snakeviz). |
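Outside a notebook, roughly the same measurement can be made with the standard-library profiler; a sketch, assuming x is the collection being persisted:

import cProfile
import pstats

cProfile.run('x = x.persist()', 'persist.prof')   # profile only the graph submission
pstats.Stats('persist.prof').sort_stats('cumulative').print_stats(20)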
I ran this. I posted the raw profile data here, if anyone wants to take a look. |
So this is the line of dask that is taking all the time: https://github.com/dask/dask/blob/master/dask/order.py#L111 |
That's quite interesting. I've been playing with this code recently and am happy to see results like this that might motivate future decisions. I'll be reworking this section again in the next couple weeks, but won't have anything done for at least another week. |
Not that it's informative currently, but you might want to watch this PR: dask/dask#3271 |
Just glad we are getting to the bottom of it! I am currently blocked on my use case until I can move this full dataset into GCS (which I cannot do now due to the potentially infinite wait time associated with this step). |
I hear you and I agree that this is important to fix. However, I am unlikely to fix it for at least a week and maybe two. If you wanted to proceed without me, you could revert the order.py file to something around what was in dask 0.16. It should drop in easily without affecting the rest of the system. You also only need to do this locally on the environment from which your client runs; the scheduler and workers won't need this change. You could also try to engage someone like @eriknw, who also has the expertise to deal with task ordering issues but is commonly busy on other projects. |
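Purely as an illustration of that suggestion (not something tested in this thread), one could drop the old order.py into the client environment. The tag and raw URL below are assumptions that should be verified first:

import urllib.request
import dask.order

# assumed location of the dask 0.16.1 version of order.py; check that the tag exists
old_url = 'https://raw.githubusercontent.com/dask/dask/0.16.1/dask/order.py'
urllib.request.urlretrieve(old_url, dask.order.__file__)   # overwrite only in the client env
# restart Python (or reimport dask) afterwards so the replaced module is picked up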
Or, better yet, investigate the order.py file and the associated tests (I think that the tests are decently informative) and try to find a solution! It's a pretty separable problem that is well suited to Math or CS students if you have any around. The theoretical name of this problem is, I think, the "pebbling problem" |
I probably need to be working less on Pangeo stuff and more on other stuff. So I am quite glad to have a reason to just sit tight for a while! |
You can probably also safely remove that single line. |
Now I am working with an even bigger dataset than that one (by about a factor of 2; 876081 tasks total). But I think the underlying performance issue is the same, so that profile should still be useful. |
I don't suppose you have something representative that I can run easily locally for testing without having access to data or custom scripts, do you? (expectation is no, but thought I'd ask) |
The example [random] datasets described in pydata/xarray#1770 should be relevant here. Might have to increase the size / number of variables until you hit these bottlenecks. |
Which one in particular? The dask array one or the xarray one? Are zarr and gcsfs required to observe the poor performance in order? I apologize for the questions here, but you're probably the most qualified person to help identify what is necessary to reproduce the poor performance. |
The process I am using to store the data is the following:

import zarr
import gcsfs

compressor = zarr.Blosc(cname='zstd', clevel=3, shuffle=2)
encoding = {vname: {'compressor': compressor} for vname in ds.variables}

token = 'cache'
fs = gcsfs.GCSFileSystem(project='pangeo-181919', token=token)
gcsmap = gcsfs.mapping.GCSMap('pangeo-data/<dataset name>',
                              gcs=fs, check=True, create=False)

delayed_store = ds.to_zarr(store=gcsmap, encoding=encoding, compute=False)
# retries was just added
persist_store = delayed_store.persist(retries=10) |
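One small, hedged addition to that recipe: the distributed progress bar can be attached to the persisted object, so you can see whether anything is actually being written:

from dask.distributed import progress

# displays a progress bar for the persisted store in the notebook or terminal
progress(persist_store)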
I don't know the answer, because I have not done the experiment. I think it would be very useful to compare raw zarr store vs xarray store to understand where the bottlenecks lie. Xarray can generate a lot of chatter in the dask graph, and perhaps this is part of the issue. Here I am, perhaps prematurely, trying to actually push a real dataset before we have optimized this workflow. I hoped I could just brute force it through and move on, but these intermittent errors, coupled with slow upload speed (see #166), means that I am not really able to. |
I think that we should expect speed bumps like this whenever we increase scale by an order of magnitude or try a new technology. This identification and solution of performance issues is, I suspect, one of the major products of this collaboration. I'm quite happy to help guide people to diagnose issues generally and to fix them directly within Dask specifically. However, I am probably not very capable when it comes to running full-stack experiments like this. |
The problem with debugging these big data problems is that the iteration is so slow. For my current dataset, it takes many hours just to get through the persist step. I will gladly follow your advice on how to proceed to debug in a systematic way. My thinking would be a hierarchy of experiments.
(Please tell us what debugging info / logs / etc. should be stored to make the most out of this.) The problem is that it's very hard for me to find time to do this. Maybe @kaipak can squeeze some of it in as part of his related work on storage benchmarking. |
My hope is that if you're running into problems with order.py then it doesn't engage gcsfs at all, and hopefully not to_zarr either (though I don't know how that works). Ideally we would find an easy-to-reproduce example with xarray or dask.array and random data that is similar to your problem and spends an inordinate amount of time in order.py (say, greater than 1 ms per task). |
Hopefully, through looking at scaling relations (takes 1s at 10000 tasks, 10s at 100000 tasks, 100s at 1000000 tasks, so must be linear) we can extrapolate from smaller examples to larger cases. |
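In that spirit, here is a rough sketch (not from this thread) of how one might time just dask's ordering step on synthetic graphs of increasing size; dask.order.order is an internal API whose signature may change between versions:

from time import time

import dask.array as dsa
from dask.order import order   # internal API

for nchunks in [1000, 10000, 100000]:
    x = dsa.random.random((nchunks, 100), chunks=(1, 100))   # one task per chunk
    graph = dict(x.__dask_graph__())
    tic = time()
    order(graph)
    elapsed = time() - tic
    print('%d tasks: %.2f s (%.3f ms/task)'
          % (len(graph), elapsed, 1000 * elapsed / len(graph)))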
I have created a self-contained script which benchmarks this issue using just dask, zarr, and gcsfs. A plot of the execution time shows some nonlinearity in the scaling at large sizes. (I have not yet tried to wrap this in xarray, but it should be straightforward.) I have reached the end of my knowledge and time for working on this (edit: for a few days at least). |
Cool. I appreciate you spending time on this. I'm curious, have you tried this without zarr or without gcsfs and does it make a difference? "No, I didn't try this" is a fine answer, I'm just wondering if they're essential to seeing this problem or not. |
No, I did not try without zarr / gcsfs. The benchmark script would be easy to modify for that scenario, should someone else wish to undertake those experiments. |
Here is my attempt to run this script:
|
@rabernat does this faithfully represent your problem?

from time import time

import numpy as np
from dask.distributed import Client
import dask.array as dsa


class Empty(object):
    """Dummy store target that only exposes shape and dtype."""
    def __init__(self, shape, dtype):
        self.shape = shape
        self.dtype = dtype


def time_store_persist(nt):
    """Time persisting the store of a dataset with nt time steps."""
    shape = (nt, 50, 1080, 2160)
    chunkshape = (1, 1, 1080, 2160)
    ar = dsa.random.random(shape, chunks=chunkshape)
    store = Empty(shape=ar.shape, dtype=ar.dtype)
    delayed_store = ar.store(store, lock=False, compute=False)
    with Client(processes=False) as client:
        tic = time()
        persist_store = delayed_store.persist()
        toc = time()
        del persist_store
    return toc - tic


# the sizes of datasets to try; adjust as you like
data_sizes = 2**(np.arange(0, 8))
# how many profiles to make
nsamples = 4

timing = np.zeros((len(data_sizes), nsamples))
for n, nt in enumerate(data_sizes):
    for ns in range(nsamples):
        timing[n, ns] = time_store_persist(nt) |
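A possible follow-up, assuming matplotlib is available (this was not part of the original comment), is to plot the resulting timings to see the scaling directly:

import matplotlib.pyplot as plt

plt.loglog(data_sizes, timing.mean(axis=1), marker='o')
plt.xlabel('nt (number of time steps)')
plt.ylabel('time to persist (s)')
plt.title('graph submission time vs. dataset size')
plt.show()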
For history here I first tried using |
See the inline comment about initializing the gcs filesystem:

# need to run once with `token='browser'`, and then switch to `token='cache'`,
# otherwise the scheduler can't get valid credentials
# (see https://github.com/dask/gcsfs/issues/90)
fs = gcsfs.GCSFileSystem(project='pangeo-181919', token='cache')

edit: I recommend pasting the script into a notebook, which is how I developed it in the first place. Some interactivity may be needed to get the gcsfs credentials set up. |
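Spelled out as a sketch, the two-step credential workaround described in that comment looks like this (same project name as above; the first step requires an interactive browser):

import gcsfs

# first run: authenticate interactively in a browser and cache the credentials locally
fs = gcsfs.GCSFileSystem(project='pangeo-181919', token='browser')

# later runs (and anything shipped to the scheduler or workers): reuse the cached token
fs = gcsfs.GCSFileSystem(project='pangeo-181919', token='cache')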
That looks about right, yes. But will the Empty store actually work once the tasks start writing into it? |
Things will fail when they actually start computing, but my understanding is that we're only looking to profile the time to get things to the scheduler, not the time to actually compute them. Is this correct? |
additional possibly relevant observation: when calling |
When I ran snakeviz on that computation I noticed that it was spending a lot of time in sorting inside dask's ordering code. |
When performing task ordering we sort tasks based on the number of dependents/dependencies they have. This is critical to low-memory processing. However, sometimes individual tasks have millions of dependencies, for which an n*log(n) sort adds significant overhead. In these cases we give up on sorting, and just hope that the tasks are well ordered naturally (such as is often the case in Python 3.6+ due to insertion-ordered dicts and the natural ordering that exists when constructing common graphs). See pangeo-data/pangeo#150 (comment) for a real-world case
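Illustrative only, and not the actual dask source: the guard that commit message describes amounts to something like the following.

def maybe_sorted(items, key, limit=10000):
    # give up on the n*log(n) sort when a task has a very large number of
    # dependencies, and rely on the natural (insertion) order instead
    if len(items) > limit:
        return list(items)
    return sorted(items, key=key)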
The computation spends around half its time in dumps and about half its time in order. The order time is around 1ms per task, which seemed high. I think I've resolved this in dask/dask#3303 |
I just tried my benchmark with dask/dask#3303 (now in dask master) and compared the old timing results against the new ones. In terms of the real-world example: persisting the store operation used to take ~4 hours; now it takes 2 minutes. I would say that this has been nailed. Thanks @mrocklin so much for your help squashing this really important bottleneck. |
Thanks for doing work to help identify the bottleneck. I plan to release things today. |
woo, awesome job all. Nice plots @rabernat. I've been lurking (sorry for not having the time to help), and just wanted to chime in to give my +1. |
I'm not sure if this is an xarray issue or a dask issue or a gcsfs issue, so I am posting it here.
I have a big dataset (~10TB) stored on my Lamont data storage server that looks like this:
I am trying to push it up to GCS using zarr via gcsfs.
Here is how I do it
This works great if I first subset the data to something smaller, e.g. ds = ds.isel(time=slice(0,5)). I have netdata installed on the server, which allows me to see a fairly steady outbound data rate of around 145,000 kilobits/s. But if I try it on the whole dataset, the computation starts, but no data are sent. The timer on the progress bar just sits there and doesn't move. I can't figure out what is happening during this time. This persists for ~10 minutes, at which point I get bored and restart the notebook.
cc @kaipak, since this is related to uploading data