Often the chunk sizes in `zarr` (recommended 1 MB uncompressed: https://zarr.readthedocs.io/en/stable/tutorial.html) are smaller than the 100-1000 MB recommended for efficient computation with dask (https://blog.dask.org/2021/11/02/choosing-dask-chunk-sizes). In dask, `da.from_zarr` with `chunks="auto"` chooses appropriate chunk sizes for dask by combining multiple zarr chunks. However, this isn't the default behaviour of `"auto"` in `xarray.open_zarr`, which uses the native `zarr` chunk size. To rechunk sensibly in xarray it's therefore necessary to calculate new chunk sizes by hand. But would it be possible to do so by specifying the number of zarr chunks to merge over instead? I have a helper function for this, but it seems a common enough use case to be a standard feature? e.g. something along the lines of the sketch below.
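A minimal sketch of such a helper, assuming a dataset opened with xarray's default zarr chunking; the function name `merge_zarr_chunks`, its `factors` argument, and the store path are illustrative, not existing xarray API:

```python
import xarray as xr


def merge_zarr_chunks(ds: xr.Dataset, factors: dict) -> xr.Dataset:
    """Merge an integer number of zarr chunks into each dask chunk.

    `factors` maps a dimension name to the number of on-disk zarr
    chunks to combine along that dimension.
    """
    new_chunks = {}
    for dim, factor in factors.items():
        # ds.chunks maps each dimension to a tuple of dask chunk sizes;
        # for a freshly opened zarr store these match the zarr chunks,
        # so scale the first (regular) chunk size along `dim`.
        new_chunks[dim] = ds.chunks[dim][0] * factor
    return ds.chunk(new_chunks)


# Usage: open with zarr-native chunking, then merge 10 zarr chunks
# along "time" into each dask chunk ("store.zarr" is a placeholder).
ds = xr.open_zarr("store.zarr")
ds = merge_zarr_chunks(ds, {"time": 10})
```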
Replies: 1 comment

This seems like a bug given the documentation of `open_dataset`. Can you open an issue with a reproducible example, please?