Manipulating Zarr metadata with GCSFS is slow #112
Here is a profile output suitable for snakeviz |
@mrocklin - are you still using xarray@master and zarr@master? |
Yes
|
This is from my local machine:

In [1]: import gcsfs
In [2]: %time gcsmap = gcsfs.mapping.GCSMap('pangeo-data/newman-met-ensemble')
CPU times: user 4.45 s, sys: 1.39 s, total: 5.84 s
Wall time: 47 s
In [3]: %time len(gcsmap)
CPU times: user 117 ms, sys: 4.39 ms, total: 121 ms
Wall time: 122 ms
Out[3]: 1933
In [4]: import pickle
In [5]: %time b = pickle.dumps(gcsmap)
CPU times: user 484 µs, sys: 114 µs, total: 598 µs
Wall time: 609 µs
In [6]: len(b)
Out[6]: 2047
In [7]: %time pickle.loads(b)
CPU times: user 0 ns, sys: 3.75 ms, total: 3.75 ms
Wall time: 3.01 s
Out[7]: <gcsfs.mapping.GCSMap at 0x7f1f8c1a9400>

The 3s when calling pickle.loads is in |
I guess this is calling
If I recall, we specifically made pickling of the gcsfs object abandon its cached file listing, because it could use a lot of memory and take time to serialize. If you do |
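As an illustration only (a toy class, not gcsfs's actual code): the pattern described above is a __getstate__ that drops the cached listing when pickling, which keeps the payload tiny (compare the ~2 kB pickle above) but means the unpickled copy pays the listing cost again on first use.

```python
import pickle


class CachedStore:
    """Toy stand-in for a filesystem object with an expensive listing cache."""

    def __init__(self, root):
        self.root = root
        self._listing_cache = {}  # potentially huge; slow to rebuild

    def __getstate__(self):
        # Drop the cache when pickling: small payload, but the unpickled
        # copy must re-list the bucket on first use.
        state = self.__dict__.copy()
        state["_listing_cache"] = {}
        return state


store = CachedStore("pangeo-data/newman-met-ensemble")
store._listing_cache["some/key"] = {"size": 123}   # pretend we listed the bucket
restored = pickle.loads(pickle.dumps(store))
print(restored._listing_cache)                     # {} -> cache must be rebuilt
```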
Connecting can be slow the first time per worker, as the various credentials are tried by google. It may be faster to provide a concrete token, if available, or specify
Are we making use of the |
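A minimal sketch of the "provide a concrete token" suggestion (the project and bucket names are placeholders, and token="cloud" is just one possible value, the one used later on pangeo):

```python
import gcsfs

# Passing an explicit project and token skips the round of credential
# guessing that makes the first connection per worker slow.
gcs = gcsfs.GCSFileSystem(project="my-project", token="cloud")
gcsmap = gcsfs.mapping.GCSMap("my-bucket/my-dataset", gcs=gcs)
```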
In my experience (#117), the slow "first time" has to be repeated for every computation. So if I do

vp = ds.somevar.mean().persist()
del vp
vp = ds.somevar.mean().persist()

I am faced with the extra overhead (presumably each worker connecting to GCS) twice, or more generally, every time I make a calculation. This is something I would really like to avoid. Any thoughts on the best way to work around this? |
@mrocklin , is there a way we can verify that the gcsfs singleton is being found by the workers, rather than connecting afresh? It should keep hold of the dir entries too, to avoid many calls to info. |
I would probably ask you this question :) Maybe maintain a global count somewhere of the number of connect calls, run a computation, and then check that number across all workers. I think that @rabernat is more affected by 20s delays than by 3s delays (correct me if I'm wrong), which I think puts the problem at listing files rather than at connections. |
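A rough sketch of that counting idea, assuming dask.distributed's client.run is the tool for checking every worker and assuming GCSFileSystem.connect is where the authentication work happens (neither detail is confirmed in this thread):

```python
from dask.distributed import Client

client = Client("tcp://scheduler:8786")  # placeholder address


def install_connect_counter():
    """Wrap GCSFileSystem.connect on this worker so every call is counted."""
    import gcsfs

    cls = gcsfs.core.GCSFileSystem
    if not hasattr(cls, "_connect_count"):
        original = cls.connect

        def counting_connect(self, *args, **kwargs):
            cls._connect_count += 1
            return original(self, *args, **kwargs)

        cls._connect_count = 0
        cls.connect = counting_connect


client.run(install_connect_counter)

# ... run the computation under investigation ...

# Read the counter back from every worker: {worker_address: n_calls, ...}
counts = client.run(lambda: __import__("gcsfs").core.GCSFileSystem._connect_count)
print(counts)
```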
When going to larger calculations and cluster sizes, 20s is becoming 60s or more. So it is a real bottleneck |
The following diff seems to be useful for sharing state between like instances in different threads on the same worker:
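The diff itself is not reproduced here; purely as an illustration of the general pattern (not the actual gcsfs change), "sharing state between like instances" usually means a class-level cache keyed by the constructor arguments, so instances created with the same parameters in different threads reuse one session and one set of directory listings:

```python
import threading


class SharedStateFS:
    """Toy example: instances built with the same key share one state dict."""

    _instances = {}          # class-level, shared across all instances
    _lock = threading.Lock()

    def __init__(self, project, token=None):
        key = (project, token)
        with self._lock:
            # setdefault returns the existing shared state if another thread
            # (or another instance) already created it for this key.
            self.state = self._instances.setdefault(
                key, {"dirs": {}, "session": None}
            )


a = SharedStateFS("my-project")
b = SharedStateFS("my-project")          # same key -> same shared state
a.state["dirs"]["bucket/"] = ["x.zarr", "y.zarr"]
print(b.state["dirs"])                   # {'bucket/': ['x.zarr', 'y.zarr']}
```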
I am not sure, @rabernat, if you are using any of the experimental code in fsspec/gcsfs#57 or fsspec/gcsfs#67, but that makes individual calls instead of listing the whole bucket up front. A small thing to also try: do you see any better performance with
|
The docker containers are currently running on git-master as of three days ago. @rabernat, if you wanted to try out other versions, you would both pip install locally and also include
@martindurant - using
I'll see if I can find the time to check out one of these branches. |
I'm using fsspec/gcsfs#57 for production runs in a setup that's somewhat similar to yours. Our data buckets, for example, contain 10e5 to 10e6 objects and the global bucket caching approach simply doesn't work. I've been pushing in-progress commits over the last several days for testing, but if you'd like to test it I can stabilize it. |
@asford - Thanks! I'm curious to try your fix. Let me know when you think it is in a state I can use. |
fsspec/gcsfs#57 did not have any noticeable effect on performance. Still taking about 20s of communication to get each worker going. 😢 |
Edit: specifically referring to fsspec/gcsfs#57.
Do you have a reasonably easy repro case of this slowdown? Would it be possible for you to run one with the |
@asford, assuming you can access our pangeo storage buckets (if not, send me your google account username and I will grant you access), you can run this to reproduce the issue. This is exactly what I'm doing via http://pangeo.pydata.org. I have already run
from my notebook.

from dask.distributed import Client
from daskernetes import KubeCluster
import xarray as xr
import gcsfs
# do I need an --upgrade flag here?
cluster = KubeCluster(n_workers=20,
env={'EXTRA_PIP_PACKAGES': 'git+https://github.com/asford/gcsfs.git@per_dir_cache'})
client = Client(cluster)
# (token='cloud') doesn't work with experimental branch
gcs = gcsfs.GCSFileSystem()
gcsmap = gcsfs.mapping.GCSMap('pangeo-data/dataset-duacs-rep-global-merged-allsat-phy-l4-v3-alt', gcs=gcs)
# from the notebook this is quite fast, just a few s
ds = xr.open_zarr(gcsmap)
# compute something
sla_timeseries = ds.sla.mean(dim=('latitude', 'longitude')).persist()

How do I know if it worked? If you tell me how to enable the gcsfs loggers I can try to do it. |
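One way to check whether the workers actually picked up the experimental gcsfs, reusing the client from the snippet above (this is a suggestion of mine, not something proposed in the thread):

```python
# Each call returns a dict keyed by worker address, so a mismatched worker
# stands out immediately.
print(client.run(lambda: __import__("gcsfs").__version__))
print(client.run(lambda: __import__("gcsfs").__file__))   # where it was installed from
```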
In a basic case you can use standard Python logging to enable logging on the gcsfs channels; a sketch follows below. I'm sorry, but I don't have sufficient experience using |
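The exact snippet is not shown above; a minimal sketch using the standard logging module, with the channel name taken from the gcsfs.core log lines quoted later in this thread:

```python
import logging

# Send DEBUG records from the gcsfs loggers to stderr, in roughly the same
# "<timestamp> LEVEL channel message" shape seen in the quoted output.
handler = logging.StreamHandler()
handler.setFormatter(
    logging.Formatter("%(created).3f %(levelname)s %(name)s %(message)s")
)

for name in ("gcsfs", "gcsfs.core"):
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.addHandler(handler)
```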
Below is a tiny subset of the logging information that came out on the client. (Edit: this happens when I call
|
When I call

1518772540.631 DEBUG gcsfs.core Serialize with state: {'project': 'pangeo-181919', 'access': 'full_control', 'scope': 'https://www.googleapis.com/auth/devstorage.full_control', 'consistency': 'none', 'token': None, 'session': <google.auth.transport.requests.AuthorizedSession object at 0x7f8db1e527b8>, 'method': 'google_default', 'cache_timeout': 60, '_listing_cache': {}}

I tried using |
When I actually load persisted data, I get these log messages
|
See #112 (comment) |
|
Agreed that zarr should call through. I also noticed the logging issue and fixed it in fsspec/gcsfs#73. |
Question: if I |
This will persist; it actually writes into the site-packages folder of wherever Python is installed. This can sometimes cause a problem when installing from conda, pip, and setup.py, although in theory they should just clobber one another. |
fsspec/gcsfs#57 has been updated with fixes for @martindurant's feedback above. |
I tried your latest branch and unfortunately it did not change the performance. But I am still not 100% convinced that I am correctly propagating the extra pip packages to the workers. I don't know if/how I need to specify the |
I'm assuming your build is a pretty recent version of dask docker. In that case it has not installed the updated image. See prepare.sh. This is just a shot in the dark, but can you try:
|
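The command being suggested did not survive here; purely as a guess at its general shape (the --upgrade flag and the branch URL are assumptions, the URL reused from the repro snippet earlier, not the actual suggestion from this comment):

```python
from daskernetes import KubeCluster

# Hypothetical: force pip on each worker to replace the gcsfs already baked
# into the docker image rather than skipping it as "already satisfied".
cluster = KubeCluster(
    n_workers=20,
    env={"EXTRA_PIP_PACKAGES":
         "--upgrade git+https://github.com/asford/gcsfs.git@per_dir_cache"},
)
```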
Adding |
👏 We're planning on merging that pull into |
Now that fsspec/gcsfs#57 has been merged, I guess we are just waiting for a dask release before marking this closed? |
Does this require a release of dask or of gcsfs?
|
Sorry it's gcsfs. Corrected above. |
There will be no need to change dask here. gcsfs can be released if that is useful, but I don't think all of the conditions in #112 (comment) have been met yet. The merge means that you can now pip install from the repo without specifying a branch. |
@martindurant: thanks for the clarification. I will consider this resolved when the fix has been propagated to the default pangeo.pydata.org environment (rather than using the |
Closed by #128 |
Something similar to this gets felt by the first read on every machine. This makes the cluster feel like it's hanging for 20s or so. Somewhat reproducible with the following:
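The snippet referred to here is not included above; a hedged stand-in that times the "first read" against one of the buckets mentioned earlier in the thread (the project ID and auth method are taken from the quoted logs, everything else is an assumption):

```python
import time

import gcsfs
import xarray as xr

gcs = gcsfs.GCSFileSystem(project="pangeo-181919", token="cloud")
gcsmap = gcsfs.mapping.GCSMap("pangeo-data/newman-met-ensemble", gcs=gcs)

t0 = time.time()
ds = xr.open_zarr(gcsmap)                       # metadata reads against GCS
first_var = list(ds.data_vars)[0]
# Pull a single element so the first real chunk read happens here.
_ = ds[first_var].isel({d: 0 for d in ds[first_var].dims}).load()
print("first read took %.1fs" % (time.time() - t0))
```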