
Improve FUSE #67

Merged
merged 37 commits into from
Feb 26, 2018

Conversation

martindurant

NB: these remain print statements for now

@martindurant martindurant mentioned this pull request Feb 1, 2018
Martin Durant and others added 7 commits February 1, 2018 16:16
Add decorator-based method tracing to `gcsfuse.GCSFS` and
`core.GCSFileSystem` interface methods. Add `--verbose` command-line
option to `gcsfuse` to support debug logging control.
Prototype `per_dir_cache` integration for gcsfuse. Minimal fixup to
gcsfuse to support directory listing.
Fix error in GCSFS::read() cache key resolution.
@martindurant

martindurant commented Feb 3, 2018

Experimental development at https://github.com/martindurant/gcsfs/tree/dir_fuse may be merged here.

(@asford, that includes your version of logging)

Martin Durant added 2 commits February 4, 2018 14:08
Also, reduce default file block size (reduces read times)
@mrocklin

mrocklin commented Feb 6, 2018

I'm seeing around 1s round-trips for single elements and around 10MB/s total read speed.

Operationally things like ls seem to work great.

My approach

  1. Log into http://pangeo.pydata.org
  2. Open up a terminal (Ctrl-C "new terminal" enter)
  3. `pip install git+https://github.com/martindurant/gcsfs@dir_fuse --upgrade`
  4. `mkdir gcs && gcsfuse pangeo-data gcs`
  5. `ipython`
In [1]: import xarray as xr

In [2]: %time ds = xr.open_dataset('gcs/newmann-met-ensemble-netcdf/conus_ens_001.nc', chunks={'time': 20})
CPU times: user 48 ms, sys: 8 ms, total: 56 ms
Wall time: 1.19 s

In [3]: %time ds = xr.open_dataset('gcs/newmann-met-ensemble-netcdf/conus_ens_001.nc', chunks={'time': 20})
CPU times: user 27 ms, sys: 4 ms, total: 31 ms
Wall time: 476 ms

In [4]: x = ds.t_mean.data

In [5]: %time x[100:200, :, :].compute().nbytes / 1e6
CPU times: user 1.77 s, sys: 251 ms, total: 2.02 s
Wall time: 7.28 s
Out[5]: 83.1488

In [6]: %time x[100:200, :, :].compute().nbytes / 1e6
CPU times: user 1.84 s, sys: 118 ms, total: 1.95 s
Wall time: 1.95 s
Out[6]: 83.1488

In [7]: %time x[:, 0, 0].compute().nbytes / 1e6
# still going after a few minutes

In [8]: %time x[10000, 0, 0].compute()
CPU times: user 13 ms, sys: 1 ms, total: 14 ms
Wall time: 721 ms
Out[8]: array(nan)

@martindurant

I need to find the root cause of the difference between the following:

In [5]: %%time
   ...: with gcs.open('pangeo-data/newmann-met-ensemble-netcdf/conus_ens_001.nc', 'rb') as f:
   ...:     for _ in range(100):
   ...:         f.read(2**20)
   ...:
CPU times: user 786 ms, sys: 631 ms, total: 1.42 s
Wall time: 3.8 s

In [11]: %%time
    ...: with open('gcs/newmann-met-ensemble-netcdf/conus_ens_001.nc', 'rb') as f:
    ...:     for _ in range(100):
    ...:         f.read(2**20)
    ...:
CPU times: user 39 ms, sys: 77 ms, total: 116 ms
Wall time: 22.3 s
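
The two loops above can be wrapped in a small helper that times sequential 1 MiB reads from any file-like object. This is only a sketch (demonstrated here against an in-memory buffer, since the bucket paths above are specific to pangeo-data):

```python
import io
import time

def time_reads(f, n_reads=100, chunk=2**20):
    """Time `n_reads` sequential reads of `chunk` bytes from file-like `f`."""
    start = time.perf_counter()
    for _ in range(n_reads):
        f.read(chunk)
    return time.perf_counter() - start

# Demo with an in-memory buffer; the transcript above runs the same loop
# against gcs.open(...) and against the FUSE-mounted path.
buf = io.BytesIO(b"\x00" * (10 * 2**20))
elapsed = time_reads(buf, n_reads=10)
print(f"10 x 1 MiB reads: {elapsed:.4f} s")
```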

I am having trouble using http://pangeo.pydata.org; I find things become unresponsive after a short time (although this may be a problem with my local network).

@martindurant

martindurant commented Feb 6, 2018

Results of timing the read method of the gcsfuse class:

figure_1
(x-axis: read call sequence number; y-axis: time taken in gcsfuse.read(), in seconds).

The small spikes are every 5MB as one would expect, and seem to take ~0.2s typically (25MB/s) which isn't great, but similar to the raw time with gcsfs. All the other reads should be essentially free, and I cannot explain why they are not.

Note that reading the gcsfs file directly with smaller chunks (128kB), as the OS/FUSE layer does, results in only moderately poorer performance.

How would I profile what the fuse.read() function is doing, when called like this within a CLI program?

Edit: there is some significant CPU usage in the gcsfuse process throughout the ~20s while data is being transferred.

@martindurant

Latest

In [2]: %time ds = xr.open_dataset('gcs/newmann-met-ensemble-netcdf/conus_ens_001.nc', chunks={'time': 20})
CPU times: user 53 ms, sys: 11 ms, total: 64 ms
Wall time: 1.67 s

In [3]: x = ds.t_mean.data

In [4]: %time x[100:200, :, :].compute().nbytes / 1e6
CPU times: user 1.74 s, sys: 207 ms, total: 1.94 s
Wall time: 4.46 s
Out[4]: 83.1488

It seems that setting the block-size smaller, to increase the speed of metadata lookups, caused problems with GCS's 5MB boundary restriction. This is a code problem I can fix.
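
One way to reconcile a small default block size with a coarser backend granularity is to round every fetch range outward to backend-block boundaries. This is only a sketch of the idea (the 5MB figure comes from the comment above; this is not gcsfs's actual implementation):

```python
BACKEND_BLOCK = 5 * 2**20  # the 5MB boundary mentioned above

def aligned_range(offset, length, block=BACKEND_BLOCK):
    """Round the byte range [offset, offset + length) outward to block boundaries."""
    start = (offset // block) * block
    end = -(-(offset + length) // block) * block  # ceiling division
    return start, end

# A 128kB read at offset 6MB maps onto the 5-10MB backend block:
print(aligned_range(6 * 2**20, 128 * 1024))  # (5242880, 10485760)
```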

@mrocklin

mrocklin commented Feb 6, 2018

You might consider running the gcsfuse cli application with the cProfile module

python -m cProfile -o foo.prof gcsfs/cli/gcsfuse.py pangeo-data gcs

However, we'll need to find a way for the process to terminate cleanly, perhaps the following:

    try:
        FUSE(GCSFS(bucket, token=token, project=project_id),
             mount_point, nothreads=True, foreground=foreground)
    except KeyboardInterrupt:
        import sys
        sys.exit(0)
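
Once the run terminates, the `foo.prof` dump can be inspected with the standard-library pstats module. A minimal sketch (profiling a stand-in workload here rather than the gcsfuse CLI):

```python
import cProfile
import pstats

# Stand-in workload; in practice foo.prof comes from the
# `python -m cProfile -o foo.prof ...` invocation above.
cProfile.run("sum(i * i for i in range(10000))", "foo.prof")

# Load the dump and show the 20 most expensive calls by cumulative time.
stats = pstats.Stats("foo.prof")
stats.sort_stats("cumulative").print_stats(20)
```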

@mrocklin

mrocklin commented Feb 6, 2018

Alternatively we might consider annotating relevant functions in the FUSE application to print out their timing information:

import functools
import json
import sys
import time

def log(func):
    @functools.wraps(func)
    def _(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        end = time.time()
        sys.stdout.write(json.dumps({'name': func.__name__,
                                     'duration': end - start}) + '\n')
        return result
    return _

@log
def ls(...):
    ...

@martindurant

martindurant commented Feb 7, 2018

x[:, 0, 0].compute() - this requires 12054 reads, 224x464x8=812kB apart -> 9.3GB, all of which needs to be scanned. Does this complete with zarr in a reasonable time?
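
The arithmetic behind those numbers (each time step is a full 224x464 spatial slab of 8-byte values, so a point-wise time-series read strides through the whole variable):

```python
stride = 224 * 464 * 8    # bytes between consecutive point reads (~812 kB)
n_reads = 12054           # length of the time axis
total = stride * n_reads  # bytes that must be scanned
print(stride, round(total / 2**30, 1))  # 831488 9.3
```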

@mrocklin

mrocklin commented Feb 7, 2018 via email

@martindurant martindurant changed the title Add more logging to fuse Improve FUSE Feb 22, 2018
@martindurant

martindurant commented Feb 23, 2018

For reference, this is how things look at the moment https://gist.github.com/martindurant/deb36c8fb4692df23f27a201e71d4c89

If someone can figure out why the big calculation spawns ~1940 open_dataset tasks, and why the memory usage of the gcsfuse process goes into the many GB, I would love to hear it.

@jhamman jhamman left a comment

@martindurant - thanks for the notebook showing where things are at currently. Is this a valid summary of the current status here:

  • caching/read-ahead seems to be working well
  • first reads are slow, particularly if there are many unnecessary reads (e.g. xr.open_dataset)

@@ -14,21 +14,34 @@
help="Billing Project ID")
@click.option('--foreground/--background', default=True,
help="Run in the foreground or as a background process")
@click.option('--threads/--no-threads', default=True,
help="Run in the foreground or as a background process")

this help description looks to have been copied from above.

gcsfs/gcsfuse.py Outdated
import cProfile
import atexit

if True:

guessing this is a debug statement that will be removed eventually?

@martindurant

@jhamman , yes, I think that the work here is probably good enough to be cleaned up and merged. I was disappointed not to have done better, and I still don't understand why the memory usage within the gcsfuse process doesn't plateau, but this is certainly better than it was before.

@martindurant

@jhamman , I fixed the couple of things you mention.
@asford , did you have any thought on this at all? Note that I took the opportunity to fix a couple of style things here, and removed logging statements that included string processing of potentially large datasets like directory listings.

@asford

asford commented Feb 26, 2018

Sorry, but I won't be able to do a thorough code review in a timely fashion. At a high level and from a quick overview it certainly gets my 👍 and I would suggest merging.

@martindurant martindurant merged commit a7c7901 into fsspec:master Feb 26, 2018
@martindurant martindurant deleted the more_fuse branch February 26, 2018 16:23