
add to xarray backends docs #461

Closed
dcherian opened this issue Jun 13, 2024 · 24 comments
@dcherian
Contributor

Now that there is a kerchunk engine, it'd be nice to mention it in the Xarray docs:

https://docs.xarray.dev/en/stable/user-guide/io.html

And maybe a super simple example in the tutorial repo: https://tutorial.xarray.dev that then directs people to a more in-depth cookbook?

@martindurant
Member

@Anu-Ra-g , would you like to do this?

@Anu-Ra-g
Contributor

Yes. Will this tutorial be related to the Zarr data format?

@martindurant
Member

will this tutorial be related to the zarr data format

It would show how to use engine="kerchunk" for a simple example reference set.

@dcherian
Contributor Author

In https://docs.xarray.dev/en/stable/user-guide/io.html I would just add a kerchunk section and maybe link here. We just want people to know that it's possible and to have a place to go for more information.

@Anu-Ra-g
Contributor

I've made this pull request, pydata/xarray#9146

@dwest77a

I've done some performance testing comparing Kerchunk with something the NCAS-CMS team have developed called CFA, and I've discovered something interesting about the xarray kerchunk engine compared to the old get_mapper/open_zarr method and CFA. It looks like, with the kerchunk engine, data loading happens at the point of slicing rather than at compute time?

[screenshot: timing comparison between Kerchunk and CFA]

@martindurant
Member

Can I see some minimal code to get this behaviour, please?

@martindurant
Member

For reference, the kerchunk backend essentially does

    m = fsspec.get_mapper("reference://", fo=filename_or_obj, **storage_options)
    return xr.open_dataset(m, engine="zarr", consolidated=False, **open_dataset_options)

Are you passing chunks= at all?

In general, I am happy to help try and get the best performance out of kerchunk and your implementation too - we have the same overall goal.

@dwest77a

dwest77a commented Jul 23, 2024

I'm running individual cells in a Jupyter notebook to obtain the different values for those different sections:

%%time 
import xarray as xr
kfile = 'example_CMIP_file.json'
ds = xr.open_dataset(kfile, engine='kerchunk')

Wall time: 7.81 s

%%time
h1 = ds['huss'].sel(lat=slice(-60,0), lon=slice(80,180)).isel(time=slice(10000,12000)).mean(dim='time')

Wall time: 2.75 s

%%time
h2 = h1.compute()

Wall time: 0.8 ms

%%time
h2.plot()

Wall time: 984 ms

@martindurant
Member

Can I have the "example_CMIP_file.json"?

@dwest77a

dwest77a commented Jul 23, 2024

example_CMIP_file.json.zip

Uncompressed this file is 64MB, I can probably make a smaller version if needed.

This file has https:// links now so you should have no issues with accessing the data.

@dwest77a

For reference, the kerchunk backend essentially does

    m = fsspec.get_mapper("reference://", fo=filename_or_obj, **storage_options)
    return xr.open_dataset(m, engine="zarr", consolidated=False, **open_dataset_options)

Are you passing chunks= at all?

In general, I am happy to help try and get the best performance out of kerchunk and your implementation too - we have the same overall goal.

I will test to see if using open_dataset with the zarr engine gives a different result. I've set up the second method as:

m = fsspec.get_mapper("reference://", fo=filename_or_obj, **storage_options)
ds = xr.open_zarr(m, consolidated=False)

@martindurant
Member

martindurant commented Jul 23, 2024

Ah, I see these are local files - so not much testing I can do about that.

Three recommendations:

  • the height, lat and lon arrays are very small, and should be inlined. The time array is bigger, but always loaded eagerly, so maybe should be inlined too. This will greatly improve the dataset open time.
  • the data chunksize is <100kB, so you would want to set chunks={...} where you increase the size by a decent integer factor along the axis you intend to analyse on, which will decrease the read latency. This is essential for remote (cloud) storage, might not make any difference for local files.
  • the JSON file isn't huge, but reading and decoding the JSON does take some time; so parquet reference storage might be better. This is only a micro-optimisation at this scale.
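The inlining suggestion can be illustrated with a toy reference set. In kerchunk's reference format, a value can be a literal string (or "base64:"-prefixed bytes) instead of a [url, offset, length] pointer, so reading it never touches remote storage. The keys below are illustrative, not taken from the file above:

```python
import fsspec

# Toy reference set: metadata and a small coordinate chunk are inlined
# as literal/base64 strings rather than [url, offset, length] pointers.
refs = {
    "version": 1,
    "refs": {
        ".zgroup": '{"zarr_format": 2}',
        "lat/0": "base64:AAAAAAAAAAA=",  # eight zero bytes, inlined
    },
}
fs = fsspec.filesystem("reference", fo=refs)
print(fs.cat(".zgroup"))  # no network access needed for inlined entries
```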

@dwest77a

For reference, the CFA conventions/specification can be found here: https://github.com/NCAS-CMS/cfa-conventions/blob/main/source/cfa.md and my developing implementation is here: https://github.com/dwest77a/CFAPyX

My module is just a CFA reader backend for xarray. It reads CFA-netCDF files like normal files, but with additional 'decoding' of aggregated variables into a wrapper object that gets passed to dask, so the data is only fetched when the dask array is indexed. This all uses the netCDF4 library, so it will only work with local files, or in future with files on S3-compatible object store.

@dwest77a

FYI, the variable h1 that I used to store the result of the sel/slice/mean operations includes a data variable containing the sliced data. I may be wrong in my thinking, but shouldn't that only happen once I've run either h1.compute() or h1.plot()? I'll try the other kerchunk access methods to see what happens with those.

@martindurant
Member

Correct, the data should not be touched (except the coordinates) until you ask for it, whether via xarray's built-in laziness or backed by dask.

@dwest77a

dwest77a commented Jul 23, 2024

With the use of get_mapper/open_zarr instead, the object h1.data is a dask.array<mean_agg-aggregate.

[screenshot: h1.data repr showing a lazy dask.array<mean_agg-aggregate>]

Using the Kerchunk Engine with the exact same slice/mean operation I get this:

[screenshot: h1 repr with the kerchunk engine]

Edit: Interestingly the Zarr engine xr.open_dataset(m, engine='zarr'...) shows the same as the Kerchunk engine, so it looks like both are loading the data at the point of performing the mean().

@martindurant
Member

As you can see in this notebook (at output 5) from our recent blog, in that case, the object definitely stays as an uncomputed dask array after a sel() call.

@dwest77a

dwest77a commented Jul 23, 2024

It looks like at the moment it's either the mean() operation or isel() which triggers the data loading. Strangely, I can show an object after sel/isel that has a dask array representation, but then if I compute the mean() and pass that to a different object, the original object is also affected. See the screenshot below:

[screenshot: object reprs before and after computing the mean]

That second look at h2, where we can see the array values, does not happen when you use the old get_mapper/open_zarr method instead of open_dataset. I think there must be something different in the xarray backend shared by the kerchunk and zarr engines where the mean is applied eagerly and is still tied to other objects, which is why h2 is also affected.

@martindurant
Member

@norlandrhagen , @TomNicholas knowing something about xarray engines, have you any idea how to explain what is described above? Why would calling mean() (no compute or access to .values) cause a compute of the underlying dask array?

@dcherian
Contributor Author

Why would calling mean() (no compute or access to .values) cause a compute of the underlying dask array?

It would not for a dask array. However, what you have is not a dask array but one of Xarray's internal lazy arrays. These will compute for any operation but indexing. I think you're missing a chunks={} in the open_dataset call.
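The dask side of this distinction can be shown with a purely in-memory stand-in (requires dask; the chunked array here substitutes for a variable opened with chunks={}):

```python
import numpy as np
import xarray as xr

# A dask-backed DataArray stands in for a variable opened with chunks={}
da = xr.DataArray(np.arange(12.0), dims="time").chunk({"time": 4})
h1 = da.isel(time=slice(0, 8)).mean(dim="time")
# With dask backing, mean() builds a task graph instead of computing:
assert type(h1.data).__module__.startswith("dask")
h2 = h1.compute()  # values are materialized only here
print(h2.item())   # 3.5
```

Without chunks={}, the variables are xarray's internal lazy arrays instead, and any operation other than indexing triggers a load, which matches the behaviour seen above.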

@martindurant
Member

think you're missing a chunks={}

I mentioned chunks= too above; but some of the embedded array objects seem to be dask? Maybe not, and I was mistaken.

@TomNicholas

Was this closed by the merging of pydata/xarray#9163?

@martindurant
Member

It appears so, yes.
