This repository has been archived by the owner on May 11, 2019. It is now read-only.

Update: April 15, 2018 #27

kaipak opened this issue Apr 15, 2018 · 2 comments

@kaipak (Contributor) commented Apr 15, 2018

Getting KubeCluster/Dask/GCP tests to work properly took more elbow grease than I expected, but I'm finally getting consistent results. The previously published tests from my forked repo have been updated with this more comprehensive set. Bear in mind that these tests can take a long time to run, since they conduct many runs in order to be statistically meaningful; many of them are truncated at the moment and may display odd results. Longer tests are currently running on GCP and my laptop and will be uploaded when they finish.

https://kaipak.github.io/storage-benchmarks/#/

Here's what we've written so far:

  • 10 GB Zarr/Dask Array/KubeCluster/GCSFS on GCP
    • Params: chunk sizes: chunks=([5, 10, 50], 1000, 1000), n workers: [5, 10, 20, 40, 80]
    • Tests
      • read
      • write
      • compute mean
  • 250 MB Synthetic Geosciences-like dataset Zarr/Xarray/Single Machine
    • Storage/Format (all Zarr)
      • POSIX
      • GCSFS
      • FUSE
    • Tests
      • read
      • write
      • mean
  • 250 MB Synthetic random Numpy arrays.
    • Storage/Format
      • Zarr/POSIX
      • Zarr/GCSFS
      • Zarr/FUSE
      • HDF5/POSIX
      • HDF5/HSDS
    • Tests
      • read
      • write
      • mean
  • 350 LOCA Numpy
    • Storage/Format
      • HDF5/HSDS
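The read/write/mean tests above follow ASV's benchmark conventions (a class with `params`, `param_names`, `setup`, and `time_*` methods). As a minimal sketch of that shape, here is a hypothetical benchmark class using a plain `.npy` file on local POSIX storage and a small array, rather than the 10 GB Zarr/GCSFS/Dask runs the actual suite targets; the class and file names are illustrative, not taken from the repo.

```python
import os
import tempfile

import numpy as np


class IOReadWriteMean:
    """ASV-style benchmark sketch: write / read / mean on a synthetic array.

    The real suite targets Zarr/GCSFS/KubeCluster on GCP; this sketch uses
    NumPy on local disk so it runs anywhere.
    """

    # ASV parameterizes benchmarks via `params` / `param_names`;
    # these shapes are hypothetical, far smaller than the real runs.
    params = [(100, 1000), (1000, 1000)]
    param_names = ["shape"]

    def setup(self, shape):
        # Runs before each timed method for each parameter combination.
        self.data = np.random.random(shape)
        self.tmpdir = tempfile.mkdtemp()
        self.path = os.path.join(self.tmpdir, "synthetic.npy")
        np.save(self.path, self.data)

    def time_write(self, shape):
        np.save(os.path.join(self.tmpdir, "out.npy"), self.data)

    def time_read(self, shape):
        np.load(self.path)

    def time_mean(self, shape):
        self.data.mean()
```

ASV discovers `time_*` methods automatically and repeats each one enough times to get stable statistics, which is why the full parameter grids above take so long to run.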

Due to problems I ran into getting consistent runs in ASV with the plethora of pieces we're dealing with, I didn't get around to documentation or prettying up the plots as I had planned, but I will be focusing on that for the next couple of days. The ASV docs are also quite sparse, so I think it'll be worthwhile to have something more comprehensive here, especially since the behavior of some of its settings is not necessarily obvious.

I'd like to more fully detail the tests we have so far and what we plan to work on next. There is no set schedule per se, but I've been roughly favoring getting results out of Dask/Xarray/GCP. In the immediate future, I plan on writing tests that use real data (likely LLC4320 ocean general circulation simulation output). Since we have all these tests working for synthetic data, it should be relatively straightforward to point them at actual datasets. Here's my rough idea of a schedule for the next round of test writing.

If there's a particular use case someone is dying to see, I'd be happy to take requests.

@rabernat (Member)

Kai, this is great progress!

As discussed in our meeting today, here are some next priorities:

  • make sure that you are actually using the number of workers that you expect
  • make sure the test datasets have enough chunks to actually saturate the workers (if you only have 50 chunks but 80 workers, 30 workers will be idle)
  • compare your benchmark results to informal benchmarking based on real data analysis on pangeo.pydata.org, just to make sure it's consistent
  • finally, figure out how to read the .json file of the ASV results directly from a notebook, to make custom plots: start developing a notebook to summarize the results from all your experiments
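For the last item, a minimal sketch of loading an ASV results file from a notebook might look like the following. Note that ASV's on-disk results format is not well documented and varies across versions, so the key names assumed here (`"results"` at the top level, mapping benchmark names to timings) should be checked against the `.json` files your own runs produce under `results/<machine>/`; the function name is hypothetical.

```python
import json


def load_asv_results(path):
    """Load one ASV results .json file into {benchmark_name: timing}.

    A sketch only: assumes a top-level "results" mapping of benchmark
    names to recorded timings, which may differ in your ASV version.
    """
    with open(path) as f:
        payload = json.load(f)
    # Keep whatever value ASV stored per benchmark (a number, or a list
    # of numbers for parameterized benchmarks); plot from here with any
    # library you like.
    return dict(payload.get("results", {}))
```

From there, a notebook cell can collect the dicts from all machines/commits into one table and build custom plots, rather than relying on ASV's built-in web interface.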

@rabernat (Member)

And be careful about load vs persist!
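The distinction matters for timing: a rough sketch of the difference, using a small array on the default local scheduler (the real benchmarks run on a distributed KubeCluster, where the contrast is sharper):

```python
import dask.array as da

# `.compute()` ("load") pulls the concrete result back to the client as a
# NumPy object, so the timing includes transferring the result. `.persist()`
# materializes the data on the workers but returns another (still chunked)
# Dask array, and on a distributed cluster it returns before the work is
# done. Timing one when you meant the other skews a benchmark badly.

x = da.random.random((1000, 1000), chunks=(250, 250))

loaded = x.mean().compute()   # concrete scalar on the client
persisted = x.persist()       # still a dask.array, chunks held in memory

# To time persist() fairly on a cluster, block until the graph finishes,
# e.g. with dask.distributed.wait(persisted).
```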
