
Improve FUSE #67

Merged
merged 37 commits into from
Feb 26, 2018

Conversation

martindurant

NB: these remain print statements for now

@martindurant martindurant mentioned this pull request Feb 1, 2018
Martin Durant and others added 7 commits February 1, 2018 16:16
Add decorator-based method tracing to `gcsfuse.GCSFS` and
`core.GCSFileSystem` interface methods. Add `--verbose` command-line
option to `gcsfuse` to support debug logging control.
Prototype `per_dir_cache` integration for gcsfuse. Minimal fixup to
gcsfuse to support directory listing.
Fix error in GCSFS::read() cache key resolution.
@martindurant

martindurant commented Feb 3, 2018

Experimental development at https://github.com/martindurant/gcsfs/tree/dir_fuse may be merged here.

(@asford, that includes your version of logging)

Martin Durant added 2 commits February 4, 2018 14:08
Also, reduce default file block size (reduces read times)
@mrocklin

mrocklin commented Feb 6, 2018

I'm seeing around 1s round-trips for single elements and around 10MB/s total read speed.

Operationally things like ls seem to work great.

My approach

  1. Log into http://pangeo.pydata.org
  2. Open up a terminal (Ctrl-C "new terminal" enter)
  3. `pip install git+https://github.com/martindurant/gcsfs@dir_fuse --upgrade`
  4. `mkdir gcs && gcsfuse pangeo-data gcs`
  5. `ipython`
In [1]: import xarray as xr

In [2]: %time ds = xr.open_dataset('gcs/newmann-met-ensemble-netcdf/conus_ens_001.nc', chunks={'time': 20})
CPU times: user 48 ms, sys: 8 ms, total: 56 ms
Wall time: 1.19 s

In [3]: %time ds = xr.open_dataset('gcs/newmann-met-ensemble-netcdf/conus_ens_001.nc', chunks={'time': 20})
CPU times: user 27 ms, sys: 4 ms, total: 31 ms
Wall time: 476 ms

In [4]: x = ds.t_mean.data

In [5]: %time x[100:200, :, :].compute().nbytes / 1e6
CPU times: user 1.77 s, sys: 251 ms, total: 2.02 s
Wall time: 7.28 s
Out[5]: 83.1488

In [6]: %time x[100:200, :, :].compute().nbytes / 1e6
CPU times: user 1.84 s, sys: 118 ms, total: 1.95 s
Wall time: 1.95 s
Out[6]: 83.1488

In [7]: %time x[:, 0, 0].compute().nbytes / 1e6
# still going after a few minutes

In [8]: %time x[10000, 0, 0].compute()
CPU times: user 13 ms, sys: 1 ms, total: 14 ms
Wall time: 721 ms
Out[8]: array(nan)

@martindurant

I need to find the root cause of the difference between the following:

In [5]: %%time
   ...: with gcs.open('pangeo-data/newmann-met-ensemble-netcdf/conus_ens_001.nc', 'rb') as f:
   ...:     for _ in range(100):
   ...:         f.read(2**20)
   ...:
CPU times: user 786 ms, sys: 631 ms, total: 1.42 s
Wall time: 3.8 s

In [11]: %%time
    ...: with open('gcs/newmann-met-ensemble-netcdf/conus_ens_001.nc', 'rb') as f:
    ...:     for _ in range(100):
    ...:         f.read(2**20)
    ...:
CPU times: user 39 ms, sys: 77 ms, total: 116 ms
Wall time: 22.3 s
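
The two loops above can be wrapped in a small helper that times sequential 1 MiB reads from any file-like object. This is only a sketch (demonstrated here against an in-memory buffer, since the bucket paths above are specific to pangeo-data):

```python
import io
import time

def time_reads(f, n_reads=100, chunk=2**20):
    """Time `n_reads` sequential reads of `chunk` bytes from file-like `f`."""
    start = time.perf_counter()
    for _ in range(n_reads):
        f.read(chunk)
    return time.perf_counter() - start

# Demo with an in-memory buffer; the transcript above runs the same loop
# against gcs.open(...) and against the FUSE-mounted path.
buf = io.BytesIO(b"\x00" * (10 * 2**20))
elapsed = time_reads(buf, n_reads=10)
print(f"10 x 1 MiB reads: {elapsed:.4f} s")
```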

I am having trouble using http://pangeo.pydata.org; I find things become unresponsive after a short time (although this may be a problem with my local network).

@martindurant

martindurant commented Feb 6, 2018

Results of timing the read method of the gcsfuse class:

figure_1
(x-axis: read call sequence number; y-axis: time taken in gcsfuse.read(), in seconds).

The small spikes are every 5MB as one would expect, and seem to take ~0.2s typically (25MB/s) which isn't great, but similar to the raw time with gcsfs. All the other reads should be essentially free, and I cannot explain why they are not.

Note that reading the gcsfs file directly with smaller chunks (128kB), as the OS/FUSE layer does, results in only moderately poorer performance.

How would I profile what the fuse.read() function is doing, when called like this within a CLI program?

Edit: there is some significant CPU usage in the gcsfuse process throughout the ~20s while data is being transferred.

@martindurant

Latest

In [2]: %time ds = xr.open_dataset('gcs/newmann-met-ensemble-netcdf/conus_ens_001.nc', chunks={'time': 20})
CPU times: user 53 ms, sys: 11 ms, total: 64 ms
Wall time: 1.67 s

In [3]: x = ds.t_mean.data

In [4]: %time x[100:200, :, :].compute().nbytes / 1e6
CPU times: user 1.74 s, sys: 207 ms, total: 1.94 s
Wall time: 4.46 s
Out[4]: 83.1488

It seems that setting the block-size smaller, to increase the speed of metadata lookups, caused problems with GCS's 5MB boundary restriction. This is a code problem I can fix.
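
One way to reconcile a small default block size with a coarser backend granularity is to round every fetch range outward to backend-block boundaries. This is only a sketch of the idea (the 5MB figure comes from the comment above; this is not gcsfs's actual implementation):

```python
BACKEND_BLOCK = 5 * 2**20  # the 5MB boundary mentioned above

def aligned_range(offset, length, block=BACKEND_BLOCK):
    """Round the byte range [offset, offset + length) outward to block boundaries."""
    start = (offset // block) * block
    end = -(-(offset + length) // block) * block  # ceiling division
    return start, end

# A 128kB read at offset 6MB maps onto the 5-10MB backend block:
print(aligned_range(6 * 2**20, 128 * 1024))  # (5242880, 10485760)
```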

@mrocklin

mrocklin commented Feb 6, 2018

You might consider running the gcsfuse cli application with the cProfile module

python -m cProfile -o foo.prof gcsfs/cli/gcsfuse.py pangeo-data gcs

However, we'll need to find a way for the process to terminate cleanly, perhaps the following:

    try:
        FUSE(GCSFS(bucket, token=token, project=project_id),
             mount_point, nothreads=True, foreground=foreground)
    except KeyboardInterrupt:
        import sys
        sys.exit(0)
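
Once the run terminates, the `foo.prof` dump can be inspected with the standard-library pstats module. A minimal sketch (profiling a stand-in workload here rather than the gcsfuse CLI):

```python
import cProfile
import pstats

# Stand-in workload; in practice foo.prof comes from the
# `python -m cProfile -o foo.prof ...` invocation above.
cProfile.run("sum(i * i for i in range(10000))", "foo.prof")

# Load the dump and show the 20 most expensive calls by cumulative time.
stats = pstats.Stats("foo.prof")
stats.sort_stats("cumulative").print_stats(20)
```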

@mrocklin

mrocklin commented Feb 6, 2018

Alternatively we might consider annotating relevant functions in the FUSE application to print out their timing information:

import functools
import json
import sys
import time

def log(func):
    @functools.wraps(func)
    def _(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        end = time.time()
        sys.stdout.write(json.dumps({'name': func.__name__,
                                     'duration': end - start}) + '\n')
        return result
    return _

@log
def ls(...):
    ...

@martindurant

martindurant commented Feb 7, 2018

x[:, 0, 0].compute() - this requires 12054 reads, 224x464x8=812kB apart -> 9.3GB, all of which needs to be scanned. Does this complete with zarr in a reasonable time?
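
The arithmetic behind those numbers (each time step is a full 224x464 spatial slab of 8-byte values, so a point-wise time-series read strides through the whole variable):

```python
stride = 224 * 464 * 8    # bytes between consecutive point reads (~812 kB)
n_reads = 12054           # length of the time axis
total = stride * n_reads  # bytes that must be scanned
print(stride, round(total / 2**30, 1))  # 831488 9.3
```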

@mrocklin

mrocklin commented Feb 7, 2018 via email

@martindurant martindurant changed the title Add more logging to fuse Improve FUSE Feb 22, 2018
@martindurant

martindurant commented Feb 23, 2018

For reference, this is how things look at the moment https://gist.github.com/martindurant/deb36c8fb4692df23f27a201e71d4c89

If someone can figure out why the big calculation spawns ~1940 open_dataset tasks, and why the memory usage of the gcsfuse process goes into the many GB, I would love to hear it.

@jhamman jhamman left a comment

@martindurant - thanks for the notebook showing where things are at currently. Is this a valid summary of the current status here:

  • caching/read-ahead seems to be working well
  • first reads are slow, particularly if there are many unnecessary reads (e.g. xr.open_dataset)

@@ -14,21 +14,34 @@
help="Billing Project ID")
@click.option('--foreground/--background', default=True,
help="Run in the foreground or as a background process")
@click.option('--threads/--no-threads', default=True,
help="Run in the foreground or as a background process")

this help description looks to have been copied from above.

gcsfs/gcsfuse.py Outdated
import cProfile
import atexit

if True:

guessing this is a debug statement that will be removed eventually?

@martindurant

@jhamman , yes, I think that the work here is probably good enough to be cleaned up and merged. I was disappointed not to have done better, and I still don't understand why the memory usage within the gcsfuse process doesn't plateau, but this is certainly better than it was before.

@martindurant

@jhamman , I fixed the couple of things you mention.
@asford , did you have any thought on this at all? Note that I took the opportunity to fix a couple of style things here, and removed logging statements that included string processing of potentially large datasets like directory listings.

@asford

asford commented Feb 26, 2018

Sorry, but I won't be able to do a thorough code review in a timely fashion. At a high level and from a quick overview it certainly gets my 👍 and I would suggest merging.

@martindurant martindurant merged commit a7c7901 into fsspec:master Feb 26, 2018
@martindurant martindurant deleted the more_fuse branch February 26, 2018 16:23