Improve FUSE #67

Conversation
NB: these remain print statements for now
Add decorator-based method tracing to `gcsfuse.GCSFS` and `core.GCSFileSystem` interface methods. Add `--verbose` command-line option to `gcsfuse` to support debug logging control.
Prototype `per_dir_cache` integration for gcsfuse. Minimal fixup to gcsfuse to support directory listing.
Fix error in GCSFS::read() cache key resolution.
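Regarding the `--verbose` commit above, a minimal sketch of how such a flag might be wired up with click (argument names, defaults, and log levels here are illustrative, not the PR's actual code):

```python
import logging

import click

@click.command()
@click.argument('bucket')
@click.argument('mount_point')
@click.option('--verbose', is_flag=True, default=False,
              help="Emit debug-level logs from traced filesystem methods.")
def main(bucket, mount_point, verbose):
    # Route traced method calls through logging rather than bare prints.
    logging.basicConfig(level=logging.DEBUG if verbose else logging.WARNING)
```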
Experimental development at https://github.com/martindurant/gcsfs/tree/dir_fuse may be merged here. (@asford, that includes your version of logging)
Also, reduce default file block size (reduces read times)
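As a concrete illustration of the block-size trade-off, a per-file block size can also be passed at open time (a sketch; `block_size` is the fsspec-style parameter name, and the 1 MB value is purely illustrative):

```python
import gcsfs

fs = gcsfs.GCSFileSystem()
# Smaller blocks reduce time-to-first-byte for metadata-style reads, but
# increase the number of round-trips when scanning large ranges.
with fs.open('newmann-met-ensemble-netcdf/conus_ens_001.nc', 'rb',
             block_size=2**20) as f:
    header = f.read(4096)
```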
I'm seeing around 1s round-trips for single elements and around 10MB/s total read speed.

My approach:
```python
In [1]: import xarray as xr

In [2]: %time ds = xr.open_dataset('gcs/newmann-met-ensemble-netcdf/conus_ens_001.nc', chunks={'time': 20})
CPU times: user 48 ms, sys: 8 ms, total: 56 ms
Wall time: 1.19 s

In [3]: %time ds = xr.open_dataset('gcs/newmann-met-ensemble-netcdf/conus_ens_001.nc', chunks={'time': 20})
CPU times: user 27 ms, sys: 4 ms, total: 31 ms
Wall time: 476 ms

In [4]: x = ds.t_mean.data

In [5]: %time x[100:200, :, :].compute().nbytes / 1e6
CPU times: user 1.77 s, sys: 251 ms, total: 2.02 s
Wall time: 7.28 s
Out[5]: 83.1488

In [6]: %time x[100:200, :, :].compute().nbytes / 1e6
CPU times: user 1.84 s, sys: 118 ms, total: 1.95 s
Wall time: 1.95 s
Out[6]: 83.1488

In [7]: %time x[:, 0, 0].compute().nbytes / 1e6
# still going after a few minutes

In [8]: %time x[10000, 0, 0].compute()
CPU times: user 13 ms, sys: 1 ms, total: 14 ms
Wall time: 721 ms
Out[8]: array(nan)
```
I need to find the root cause of the difference between the following.
I am having trouble using http://pangeo.pydata.org; I find things become unresponsive after a short time (although this may be a problem with my local network).
Latest: it seems that setting the block size smaller, to increase the speed of metadata lookups, caused problems with GCS's 5MB boundary restriction. This is a code problem I can fix.
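One possible shape for that fix, sketched under the assumption that writes must not produce parts below the 5MB boundary described above (the constant and function names are hypothetical):

```python
MIN_BLOCK_SIZE = 5 * 2**20  # the GCS boundary mentioned above

def effective_block_size(requested):
    # Clamp user-requested block sizes up to the service minimum so that
    # uploads never produce parts smaller than the allowed boundary.
    return max(requested, MIN_BLOCK_SIZE)
```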
You might consider running the gcsfuse cli application with the cProfile module.
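For example, a sketch of doing that programmatically (the bucket name and mount point are placeholders; `GCSFS` and fusepy's `FUSE` are as used elsewhere in this thread):

```python
import cProfile
import pstats

from fuse import FUSE  # fusepy
from gcsfs.gcsfuse import GCSFS

profiler = cProfile.Profile()
profiler.enable()
try:
    # Blocks until the filesystem is unmounted or interrupted.
    FUSE(GCSFS('mybucket'), '/mnt/gcs', nothreads=True, foreground=True)
finally:
    profiler.disable()
    pstats.Stats(profiler).sort_stats('cumulative').print_stats(20)
```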
However, we'll need to find a way for the process to terminate cleanly, perhaps the following:

```python
try:
    FUSE(GCSFS(bucket, token=token, project=project_id),
         mount_point, nothreads=True, foreground=foreground)
except KeyboardInterrupt:
    import sys
    sys.exit(0)
```
Alternatively, we might consider annotating relevant functions in the FUSE application to print out their timing information:

```python
import functools
import json
import sys
import time

def log(func):
    @functools.wraps(func)
    def _(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        end = time.time()
        # NB: records are concatenated without newlines; append '\n' if the
        # output should be parseable as JSON lines.
        sys.stdout.write(json.dumps({'name': func.__name__, 'duration': end - start}))
        return result
    return _

@log
def ls(path):
    ...
```
Probably not; it's a worst-case access pattern.
…On Wed, Feb 7, 2018 at 11:17 AM, Martin Durant wrote:

> `x[:, 0, 0].compute()` - this requires 12054 reads, 224*464*8 = 812kB apart -> 9.3GB, all of which needs to be scanned. Does this complete with zarr in a reasonable time?
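A quick check of those numbers (a sketch; the 224×464 shape and 8-byte floats are inferred from the stride quoted above):

```python
# Stride between consecutive elements along the first axis of x[:, 0, 0]:
stride = 224 * 464 * 8   # = 831488 bytes, i.e. ~812 kB
reads = 12054            # length of the first axis, one read per element
total = reads * stride   # ~1.0e10 bytes, ~9.3 GiB to scan
print(stride, round(total / 2**30, 1))
```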
For reference, this is how things look at the moment: https://gist.github.com/martindurant/deb36c8fb4692df23f27a201e71d4c89
If someone can figure out why the big calculation spawns ~1940 open_dataset tasks, and why the memory usage of the gcsfuse process grows into the many GB, I would love to hear it.
@martindurant - thanks for the notebook showing where things are at currently. Is this a valid summary of the current status here:
- caching/read-ahead seems to be working well
- first reads are slow, particularly if there are many unnecessary reads (e.g. xr.open_dataset)
gcsfs/cli/gcsfuse.py (outdated)

```diff
@@ -14,21 +14,34 @@
              help="Billing Project ID")
 @click.option('--foreground/--background', default=True,
               help="Run in the foreground or as a background process")
 @click.option('--threads/--no-threads', default=True,
               help="Run in the foreground or as a background process")
```
this help description looks to have been copied from above.
gcsfs/gcsfuse.py (outdated)

```python
import cProfile
import atexit

if True:
```
guessing this is a debug statement that will be removed eventually?
@jhamman, yes, I think that the work here is probably good enough to be cleaned up and merged. I was disappointed not to have done better, and I still don't understand why the memory usage within the gcsfuse process doesn't plateau, but this is certainly better than it was before.
Sorry, but I won't be able to do a thorough code review in a timely fashion. At a high level and from a quick overview it certainly gets my 👍, and I would suggest merging.