
Zarr loader #1297

Open · rabernat opened this issue Mar 31, 2021 · 24 comments

@rabernat

Hello and thanks for all of your work on this incredible open source package and ecosystem!

At our Pangeo Community meeting today, we discussed our wish to integrate cloud-based weather and climate data stored in the Zarr format with the deck.gl ecosystem. (I noticed this has been discussed before in #1140.) There is a Zarr reader in javascript (zarr.js), so maybe that makes it easier. I understand that #1140 probably has to be resolved to make this possible, but I thought I'd just open a dedicated issue to track the idea.

Tagging @kylebarron, @point9repeating and @manzt who may be interested.

@manzt
Collaborator

manzt commented Mar 31, 2021

Thanks for pinging me @rabernat. Hoping I can share lessons learned from zarr.js & Viv.

@kylebarron
Collaborator

kylebarron commented Apr 1, 2021

Hey @rabernat, happy to see you in this neck of the Github woods!

I think Zarr is a good fit for a new loader. I would expect it would be a thin wrapper around zarr.js (or maybe zarr-lite). Ideally we'll have a two-step loader so that the first step can instantiate the store and read the metadata (once) and the second step will load a single chunk, like `const chunk = await z.getRawChunk('0.0.0')`.
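To make that concrete, here is a minimal sketch of the two steps with zarr.js (the `openArray` options follow its docs; the store URL and variable name are hypothetical, and treat the `getRawChunk` return shape as an assumption):

```js
import { openArray } from "zarr"; // zarr.js

// Step 1 (once): instantiate the store and fetch the .zarray metadata.
const z = await openArray({
  store: "https://example-bucket.s3.amazonaws.com/data.zarr", // hypothetical URL
  path: "analysed_sst", // hypothetical variable
  mode: "r",
});
console.log(z.shape, z.chunks, z.dtype);

// Step 2 (per chunk): fetch and decode a single chunk on demand.
const chunk = await z.getRawChunk("0.0.0");
```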

Questions:

  • Do we need finely-tuned indexing? Other tiled loaders, such as the MVT, Terrain, and Quantized Mesh loaders, only have a use case of loading entire tiles at a time, but it may be different here. In this case that would mean "load an entire zarr chunk at a time".
  • If we're ok with loading entire tiles/zarr chunks at a time, could we use something like what zarr-lite/core provides, for a lower bundle size?
  • Rendering is a separate question from loading. Since Zarr chunks can be in any projection/tiling, if you want to render in deck, you might need to override the TileLayer's tile referencing. You might want to read/follow/comment on Expose Tileset2D instance deck.gl#5504.

@ibgreen likely has some feedback, but he may be slow to respond the next few days.

@point9repeating

Hello!

I spent a little time getting reacquainted with deck.gl / luma.gl (it's been a couple years since I've used these libraries) and was able to throw together a proof of concept rendering a geospatial zarr data set using deck.gl + zarr.js:

[screenshot: proof-of-concept rendering of a geospatial Zarr dataset with deck.gl + zarr.js]

This data set is the NCEP NAM forecast I had available locally [2m relative humidity with a simple red/blue color bar].

I thought it might be useful to throw up a version of this pointed at an open cloud-hosted data set to help us discuss any potential additions to luma.gl or deck.gl. Unfortunately, I'm struggling to find a data set that has a CORS-enabled http front-end. I tried MUR and HRRR.

It looks like the HRRR bucket is configured with static website hosting:
http://hrrrzarr.s3-website-us-west-1.amazonaws.com/sfc/20210414/20210414_00z_fcst.zarr/surface/TMP/projection_y_coordinate/.zarray

but CORS has not been configured to allow requests from other origins.
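For reference, the bucket-side fix is a small CORS policy. Something like the following (applied by the bucket owner, e.g. via `aws s3api put-bucket-cors`) should be enough for browser GETs; the exact origins/max-age are up to the owner:

```json
{
  "CORSRules": [
    {
      "AllowedOrigins": ["*"],
      "AllowedMethods": ["GET", "HEAD"],
      "AllowedHeaders": ["*"],
      "MaxAgeSeconds": 3000
    }
  ]
}
```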

@rabernat
Author

rabernat commented Apr 15, 2021

One issue (noted in zarr-developers/community#37) is that much of the existing cloud-based Zarr data is optimized for backend analytics and consequently has chunk sizes of ~100 MB. This is probably much too big for interactive browser-based visualization.

What is an optimal chunk size for deck.gl? I'll try to prepare some Zarr data with much smaller chunks and set up an appropriate CORS policy.

@kylebarron
Collaborator

kylebarron commented Apr 15, 2021

This is probably much too big for interactive browser-based visualization.

One question is whether the compression algorithm applied to each block supports streaming decompression. For example, if a block were gzip-compressed, you could write a streaming loader that emits an async generator of arrays along the third/last dimension: if the block's shape were (10, 10, 1000), the generator might emit arrays of shape (10, 10, 10). Then it would be possible to work with existing data with a large block size, as long as the application knows how to handle this streaming array data. (Though I'm not sure whether some common codecs like blosc support streaming decompression, so this might be moot.)
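A rough sketch of that idea using the browser's built-in DecompressionStream (gzip only, so this is illustrative rather than a real zarr.js API; note that in a C-order chunk the contiguous byte slabs correspond to slices along the first dimension):

```js
// Stream-decompress one gzip'd chunk and yield fixed-size slabs as they arrive.
// For a C-order (10, 10, 1000) block, one leading-dimension slice is
// slabBytes = 10 * 1000 * bytesPerElement.
async function* streamChunk(url, slabBytes) {
  const resp = await fetch(url);
  const decompressed = resp.body.pipeThrough(new DecompressionStream("gzip"));
  const reader = decompressed.getReader();
  let buffered = new Uint8Array(0);
  for (;;) {
    const { value, done } = await reader.read();
    if (value) {
      // Append the new bytes to whatever is left over from the last read.
      const merged = new Uint8Array(buffered.length + value.length);
      merged.set(buffered);
      merged.set(value, buffered.length);
      buffered = merged;
      while (buffered.length >= slabBytes) {
        // Caller wraps these bytes in the appropriate TypedArray.
        yield buffered.slice(0, slabBytes);
        buffered = buffered.slice(slabBytes);
      }
    }
    if (done) break;
  }
  if (buffered.length) yield buffered; // trailing partial slab, if any
}
```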

What is an optimal chunk size for deck.gl?

deck.gl's own requirements are set by the processing speed of the client and the amount of GPU memory it has. Handling at least a couple of 100MB blocks at a time should be fine for deck.gl. I think the optimal chunk size is driven more by network download time: if you had a Zarr store on fast network-attached storage, 100MB block sizes would be fine; for general internet access you'd probably want smaller block sizes.

Block size also matters for how many blocks you want to display at once. Is your preference, for example, to tile the entire screen or to show a smaller area over a longer time horizon? For example, the MUR SST dataset on AWS comes in two block layouts:

  1. time: 6443, lat: 100, lon: 100
  2. time: 5, lat: 1799, lon: 3600

You could envision preferring no. 2 at low zooms where you care more about seeing the entire globe and no. 1 at higher zooms where you display a single block at a time, but care more about the animation over time.

@ibgreen
Collaborator

ibgreen commented Apr 28, 2021

@point9repeating Your PoC looks very promising, and Zarr support in loaders.gl + deck.gl makes a lot of sense.

Are you willing to share the code so we can start digging in to the details of how this could be done in a general way?

Perhaps @kylebarron and myself could help you set up a quick proxy service to get around the CORS issue?

@zflamig

zflamig commented Apr 28, 2021

Hi @point9repeating, the CORS settings on the HRRR Zarr bucket have been adjusted, so please try it now.

4/29 update: MUR CORS now also supports this use case

@point9repeating

@zflamig I just saw this. Thank you so much!

FYI, it looks like mur-sst isn't set up for static website hosting: http://mur-sst.s3-website-us-west-2.amazonaws.com/zarr-v1/.zmetadata

@point9repeating

And, it turns out HRRR zarr is stored as half-float arrays [<f2], which isn't compatible with zarr.js because there isn't a native TypedArray in javascript that maps to half-floats (we have Float32Array and Float64Array).

I did a quick attempt at updating zarr.js using this javascript implementation for a Float16Array, but zarr.js is written in typescript and it wasn't trivial to add Float16Array (different base type).
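One workaround that sidesteps zarr.js internals is to fetch and decompress the chunk bytes yourself, then widen each half-float to a float32. The bit-unpacking below is standard IEEE 754 half-precision decoding (assumes a little-endian platform, which matches "<f2" in practice):

```js
// Widen a buffer of IEEE 754 half-floats ("<f2") to a Float32Array.
function halfToFloat32(buffer) {
  const u16 = new Uint16Array(buffer);
  const out = new Float32Array(u16.length);
  for (let i = 0; i < u16.length; i++) {
    const h = u16[i];
    const exp = (h & 0x7c00) >> 10; // 5-bit exponent
    const frac = h & 0x03ff; // 10-bit mantissa
    let val;
    if (exp === 0) {
      val = frac * 2 ** -24; // subnormal: (frac / 2^10) * 2^-14
    } else if (exp === 0x1f) {
      val = frac ? NaN : Infinity; // NaN or +/-Infinity
    } else {
      val = (1 + frac / 1024) * 2 ** (exp - 15); // normal numbers
    }
    out[i] = h & 0x8000 ? -val : val; // apply the sign bit
  }
  return out;
}
```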

@zflamig

zflamig commented May 10, 2021

Do you need the website endpoint for this, @point9repeating? You should be able to just use https://mur-sst.s3.amazonaws.com/zarr-v1/.zmetadata and have it work the same, I would think.

@point9repeating

That endpoint works great, @zflamig

I didn't realize you could enable CORS without enabling the static website hosting.

Wow. MUR is big. It looks like pulling the full global domain for a single time step will mean requesting 100 chunks that are ~40MB each.

@rabernat
Author

It looks like pulling the full global domain for a single time step will mean requesting 100 chunks that are ~40MB each.

This is one reason why we would really like to explicitly support image pyramids in Zarr. (@joshmoore / @manzt and company do have some microscopy datasets that use image pyramids, but afaik there is no standard / convention.)

@manzt
Collaborator

manzt commented May 10, 2021

The microscopy community has started to unify around a standard / convention:

(think @joshmoore will be talking about this at Dask Summit?)

Some sample datasets from the Image Data Resource can be found here, all implementing the Zarr multiscales extension. Visualized in the browser using a combination of Zarr.js & deck.gl.

@point9repeating

@manzt wow. this is so rad

@kylebarron
Collaborator

In order to work with existing Zarr stores with a large block size, you could also take a more server-side approach where you write something like a rio-tiler adapter for Zarr, and then connect to a dynamic tiling server like Titiler. But there are clearly some drawbacks to that approach, and it isn't as scalable as directly fetching data from blocks on S3.

@rabernat
Author

In order to work with existing Zarr stores with a large block size, you could also take a more server-side approach

Big 👍 to this idea. Dynamic rechunking is definitely needed in the Zarr ecosystem. Simple server-side rechunking should be possible with xpublish.

For testing / demonstration, it would also be easy to create a static Zarr dataset that is optimally chunked for visualization (rather than analysis).

@kylebarron
Collaborator

kylebarron commented May 11, 2021

Dynamic rechunking is definitely needed in the Zarr ecosystem. Simple server-side rechunking should be possible with xpublish.

This is straying a bit from loaders.gl, but I wanted to add a couple notes here.

I think https://github.com/developmentseed/titiler is becoming a pretty popular project for serving geospatial raster assets on the fly, and I think it could work well with Zarr too. The easiest way to set that up would be to make a new rio-tiler reader, like the COGReader class. Happy to discuss this more, maybe on an issue there?

You can imagine two Zarr adapters in titiler: one to read Zarr collections just like it reads GDAL datasets and another to expose an API with a "virtual" Zarr collection that's rechunked on demand. Then a ZarrLoader in loaders.gl could connect to that rechunked collection through the server.

As for rendering: it seems like most geospatial Zarr datasets are in a global WGS84 projection? deck.gl supports rendering lat/lon and Web Mercator tiles natively, but for data in any other projection, tiles would need to be reprojected at some stage in the process. Note that the TileLayer doesn't currently support non-Web-Mercator indexing. I'd love to advise anyone interested in making a PR for the TileLayer to support arbitrary indexing systems (see also visgl/deck.gl#5504).
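For data that already lines up with lat/lon or Web Mercator tiles, the deck.gl side could look roughly like this (`fetchZarrTile` is a hypothetical helper that maps a tile index to the matching Zarr chunk, decodes it, and colormaps it into an image):

```js
import { TileLayer } from "@deck.gl/geo-layers";
import { BitmapLayer } from "@deck.gl/layers";

const layer = new TileLayer({
  // Resolve each (x, y, z) tile to a Zarr chunk and return it as an image.
  getTileData: ({ x, y, z }) => fetchZarrTile(x, y, z),
  // Draw each loaded tile as a georeferenced bitmap.
  renderSubLayers: (props) => {
    const { west, south, east, north } = props.tile.bbox;
    return new BitmapLayer(props, {
      data: null,
      image: props.data,
      bounds: [west, south, east, north],
    });
  },
});
```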

@joshmoore

(think @joshmoore will be talking about this at Dask Summit?)

"Talking" is a bit much. I'll be annoying people with "multiscales" during the Life Sciences workshop but happy to discuss elsewhen, too. Short-short pitch, as @rabernat and @manzt know, I'd very much like more libraries to adopt the same strategy of defining multiscale/multiresolution images. It just makes life so much simpler. (Even just in microscopy we had N different formats which is where NGFF started unifying)

@manzt
Collaborator

manzt commented May 12, 2021

I absolutely love the idea of dynamic re-chunking, and it's something I've been experimenting with myself. It's also easy to imagine swapping compression on the server, e.g. from lossless to lossy.

It just makes life so much simpler.

+1 to this. Thinking of Zarr as an API rather than a file format, a static Zarr dataset in cloud storage is indistinguishable on the client from one created dynamically on the server. The multiscale extension for Zarr essentially describes the endpoints a tile server would be responsible for implementing and that the Zarr client will ask for. Changing chunk size, compression, etc., can all be expressed by changing the array metadata. If something like titiler adopted the multiscales extension, that would be very exciting.

IMO the ZarrLoader should be completely agnostic to the backend if possible. "Improving" a Zarr dataset for visualization can all be performed on the server and communicated in Zarr metadata. By default, it would be nice if the loader recognized the multiscales extension, but I could also see accepting explicit URLs for the separate ZarrArrays in the pyramid as an option.

e.g. "https://my-multicales-dataset.zarr" vs ["https://my-multicales-dataset.zarr/0", "https://my-multicales-dataset.zarr/1", "https://my-multicales-dataset.zarr/2"]

@kylebarron
Collaborator

swapping compression on the server, e.g. from lossless to lossy.

Aside: could be interesting to test out LERC with some Zarr data. Seems a good candidate for compressing data to bring to the browser where you have some defined precision. Only see one mention of Zarr + LERC though.

If something like titiler adopted the multiscales extension, that would be very exciting.

Titiler doesn't currently expose a Zarr API, but a Zarr extension is something we could discuss on that repo.

IMO the ZarrLoader should be completely agnostic to the backend if possible

Agreed. Doesn't seem like a difficult requirement; don't see why a ZarrLoader would even need to know if the Zarr dataset is static or dynamic.

@manzt
Collaborator

manzt commented May 12, 2021

Agreed. Doesn't seem like a difficult requirement; don't see why a ZarrLoader would even need to know if the Zarr dataset is static or dynamic.

Totally agree. I've just noticed that xpublish and other tools may introduce additional REST API endpoints beyond the chunk/metadata keys, and I'd like to avoid relying on any custom endpoints in a loader implementation.

@rabernat
Author

Xpublish has extra endpoints for convenience, but clients don't have to use them. To the client, the data are indistinguishable from static zarr files served over http.

@kylebarron
Collaborator

Quick mention here that we're discussing with @manzt creating an initial Zarr loader as part of #1441, based on what he already built for the Viv project.

@kylebarron
Collaborator

kylebarron commented Jun 4, 2021

I was also just made aware (thanks @vincentsarago) that a GDAL driver for Zarr is progressing in OSGeo/gdal#3896. We should keep tabs on that to make sure the ZarrLoader here can read that data seamlessly.
