
add to xarray backends docs #461

Closed
dcherian opened this issue Jun 13, 2024 · 24 comments
@dcherian
Contributor

Now that there is a kerchunk engine, it'd be nice to mention it in the Xarray docs:

https://docs.xarray.dev/en/stable/user-guide/io.html

And maybe a super simple example in the tutorial repo: https://tutorial.xarray.dev that then directs people to a more in-depth cookbook?

@martindurant
Member

@Anu-Ra-g , would you like to do this?

@Anu-Ra-g
Contributor

Yes. Will this tutorial be related to the Zarr data format?

@martindurant
Member

will this tutorial be related to the zarr data format

It would show how to use engine="kerchunk" for a simple example reference set.

@dcherian
Contributor Author

In https://docs.xarray.dev/en/stable/user-guide/io.html I would just add a kerchunk section and maybe link here. We just want people to know that it's possible and to have a place to go for more information.

@Anu-Ra-g
Contributor

I've made this pull request, pydata/xarray#9146

@dwest77a

I've done some performance testing comparing Kerchunk with something the NCAS-CMS team have developed called CFA, and I've discovered something interesting about the xarray kerchunk engine compared to the old get_mapper/open_zarr method and CFA. It looks like, with the kerchunk engine, data loading happens at the point of slicing rather than at compute time?

[screenshot: timing comparison between Kerchunk and CFA]

@martindurant
Member

Can I see some minimal code to get this behaviour, please?

@martindurant
Member

For reference, the kerchunk backend essentially does

    m = fsspec.get_mapper("reference://", fo=filename_or_obj, **storage_options)
    return xr.open_dataset(m, engine="zarr", consolidated=False, **open_dataset_options)

Are you passing chunks= at all?

In general, I am happy to help try and get the best performance out of kerchunk and your implementation too - we have the same overall goal.

@dwest77a

dwest77a commented Jul 23, 2024

I'm running individual cells in a Jupyter notebook to obtain the different values for those different sections:

%%time 
import xarray as xr
kfile = 'example_CMIP_file.json'
ds = xr.open_dataset(kfile, engine='kerchunk')

Wall time: 7.81 s

%%time
h1 = ds['huss'].sel(lat=slice(-60,0), lon=slice(80,180)).isel(time=slice(10000,12000)).mean(dim='time')

Wall time: 2.75 s

%%time
h2 = h1.compute()

Wall time: 0.8 ms

%%time
h2.plot()

Wall time: 984 ms

@martindurant
Member

Can I have the "example_CMIP_file.json"?

@dwest77a

dwest77a commented Jul 23, 2024

example_CMIP_file.json.zip

Uncompressed this file is 64MB, I can probably make a smaller version if needed.

This file has https:// links now so you should have no issues with accessing the data.

@dwest77a

For reference, the kerchunk backend essentially does

    m = fsspec.get_mapper("reference://", fo=filename_or_obj, **storage_options)
    return xr.open_dataset(m, engine="zarr", consolidated=False, **open_dataset_options)

Are you passing chunks= at all?

In general, I am happy to help try and get the best performance out of kerchunk and your implementation too - we have the same overall goal.

I will test to see if using open_dataset with the zarr engine gives a different result. I've set up the second method as:

m = fsspec.get_mapper("reference://", fo=filename_or_obj, **storage_options)
ds = xr.open_zarr(m, consolidated=False)

@martindurant
Member

martindurant commented Jul 23, 2024

Ah, I see these are local files - so not much testing I can do about that.

Three recommendations:

  • the height, lat and lon arrays are very small, and should be inlined. The time array is bigger, but always loaded eagerly, so maybe should be inlined too. This will greatly improve the dataset open time.
  • the data chunksize is <100kB, so you would want to set chunks={...} where you increase the size by a decent integer factor along the axis you intend to analyse on, which will decrease the read latency. This is essential for remote (cloud) storage, might not make any difference for local files.
  • the JSON file isn't huge, but reading and decoding the JSON does take some time; so parquet reference storage might be better. This is only a micro-optimisation at this scale.
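The inlining suggestion can be illustrated with a toy reference set. In kerchunk's reference format, a value can be a literal string (or "base64:"-prefixed bytes) instead of a [url, offset, length] pointer, so reading it never touches remote storage. The keys below are illustrative, not taken from the file above:

```python
import fsspec

# Toy reference set: metadata and a small coordinate chunk are inlined
# as literal/base64 strings rather than [url, offset, length] pointers.
refs = {
    "version": 1,
    "refs": {
        ".zgroup": '{"zarr_format": 2}',
        "lat/0": "base64:AAAAAAAAAAA=",  # eight zero bytes, inlined
    },
}
fs = fsspec.filesystem("reference", fo=refs)
print(fs.cat(".zgroup"))  # no network access needed for inlined entries
```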

@dwest77a

For reference, the CFA conventions/specification can be found here: https://github.com/NCAS-CMS/cfa-conventions/blob/main/source/cfa.md and my developing implementation is here: https://github.com/dwest77a/CFAPyX

My module is just a CFA reader backend for xarray. It reads CFA-netCDF files like normal files, but with additional 'decoding' of aggregated variables into a wrapper object that gets passed to dask, so the data is only fetched when the dask array is indexed. This all uses the netCDF4 library, so it will only work with local files, or in future with files on S3-compatible object store.

@dwest77a

FYI, the variable h1 that I used to store the result of the sel/slice/mean operations includes a data variable containing the sliced data. I may be wrong in my thinking, but shouldn't that only happen once I've run either h1.compute() or h1.plot()? I'll try the other kerchunk access methods to see what happens with those.

@martindurant
Member

Correct, the data should not be touched (except the coordinates) until you ask for it, whether via xarray's built-in laziness or backed by dask.

@dwest77a

dwest77a commented Jul 23, 2024

With the use of get_mapper/open_zarr instead, the object h1.data is a dask.array<mean_agg-aggregate.

[screenshot: h1.data repr showing a lazy dask.array<mean_agg-aggregate>]

Using the Kerchunk Engine with the exact same slice/mean operation I get this:

[screenshot: h1 repr with the kerchunk engine]

Edit: Interestingly the Zarr engine xr.open_dataset(m, engine='zarr'...) shows the same as the Kerchunk engine, so it looks like both are loading the data at the point of performing the mean().

@martindurant
Member

As you can see in this notebook (at output 5) from our recent blog, in that case, the object definitely stays as an uncomputed dask array after a sel() call.

@dwest77a

dwest77a commented Jul 23, 2024

It looks like at the moment it's either the mean() operation or isel() which triggers the data loading. Strangely, I can show an object after sel/isel that has a dask array representation, but then if I compute the mean() and pass that to a different object, the original object is also affected. See the screenshot below:

[screenshot: object reprs before and after computing the mean]

That second look at h2, where we can see the array values, does not happen when you use the old get_mapper/open_zarr method instead of open_dataset. I think there must be something different in the xarray backend shared by the kerchunk and zarr engines where the mean is applied eagerly and is still tied to other objects, which is why h2 is also affected.

@martindurant
Member

@norlandrhagen , @TomNicholas knowing something about xarray engines, have you any idea how to explain what is described above? Why would calling mean() (no compute or access to .values) cause a compute of the underlying dask array?

@dcherian
Contributor Author

Why would calling mean() (no compute or access to .values) cause a compute of the underlying dask array?

It would not for a dask array. However, what you have is not a dask array but one of Xarray's internal lazy arrays. These will compute for any operation but indexing. I think you're missing a chunks={} in the open_dataset call.
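The dask side of this distinction can be shown with a purely in-memory stand-in (requires dask; the chunked array here substitutes for a variable opened with chunks={}):

```python
import numpy as np
import xarray as xr

# A dask-backed DataArray stands in for a variable opened with chunks={}
da = xr.DataArray(np.arange(12.0), dims="time").chunk({"time": 4})
h1 = da.isel(time=slice(0, 8)).mean(dim="time")
# With dask backing, mean() builds a task graph instead of computing:
assert type(h1.data).__module__.startswith("dask")
h2 = h1.compute()  # values are materialized only here
print(h2.item())   # 3.5
```

Without chunks={}, the variables are xarray's internal lazy arrays instead, and any operation other than indexing triggers a load, which matches the behaviour seen above.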

@martindurant
Member

think you're missing a chunks={}

I mentioned chunks= too above; but some of the embedded array objects seem to be dask? Maybe not, and I was mistaken.

@TomNicholas

Was this closed by the merging of pydata/xarray#9163?

@martindurant
Member

It appears so, yes.
