Manifest arrays use arrayv3metadata #429

Draft
wants to merge 20 commits into
base: main

abarciauskas-bgse
Collaborator

@abarciauskas-bgse abarciauskas-bgse commented Feb 6, 2025

This is still very much a WIP - many tests and implementations still need to be fixed.

A few notes:

  • @sharkinsspatial suggested we remove ZArray completely as part of this work, rather than just converting ZArrays to ArrayV3Metadata. Digging into some of the test examples and reader code, there is complexity that will require changes to the ZArray implementation anyway, for example in how filters and compressors are handled. The dmrpp reader, for instance, labels every compressionType it extracts as a filter, but I think they may be compressors. I think we need some logic to distinguish filters from compressors, and I'm not sure yet what type of object the filters need to be for ArrayV3Metadata to be happy.
  • In writers/kerchunk.py, variable_to_kerchunk_arr_refs requires a ManifestArray, so we will need to convert its metadata to valid kerchunk v2 metadata.
  • I don't think the netcdf3 reader will work with ArrayV3Metadata, because its data types are big-endian and these types may not be supported by ArrayV3Metadata (this also appears in the v3 migration documentation of zarr-python).
  • It was suggested not to use zarr's _parse_chunk_encoding_v3 function, since it is private and may change, which is why some of that logic is replicated in convert_to_codec_pipeline.

Questions

  • Do we prefer zarr codec objects ({"name": "zlib", "configuration": {"level": 1}}) or numcodecs configs ({"id": "zlib", "level": 1})? I think we have to convert to zarr codec objects to use zarr's get_codec_class anyway.
    • Will this be obsolete once zarr codecs are registered in numcodecs?
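
For reference, the two dict shapes in the question differ only in layout, so a conversion between them can be sketched in a few lines. This is a minimal sketch, not the PR's implementation: `numcodecs_to_zarr_v3` and `zarr_v3_to_numcodecs` are hypothetical helper names, and the sketch assumes the codec name carries over unchanged between the two conventions, which may not hold for every codec.

```python
from typing import Any


def numcodecs_to_zarr_v3(config: dict[str, Any]) -> dict[str, Any]:
    """Reshape a numcodecs-style config into a Zarr v3-style codec dict.

    Assumption: the value of "id" can be reused as the v3 "name".
    """
    config = dict(config)  # copy so the caller's dict is not mutated
    name = config.pop("id")
    return {"name": name, "configuration": config}


def zarr_v3_to_numcodecs(codec: dict[str, Any]) -> dict[str, Any]:
    """Inverse reshaping, under the same naming assumption."""
    return {"id": codec["name"], **codec.get("configuration", {})}


v3_style = numcodecs_to_zarr_v3({"id": "zlib", "level": 1})
v2_style = zarr_v3_to_numcodecs(v3_style)
```

Here `v3_style` has the zarr codec layout from the question and `v2_style` round-trips back to the numcodecs layout. Whether a given name resolves to a registered codec class is a separate question that this reshaping does not answer.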

Checklist

  • Closes ManifestArray should use zarr-python's ArrayV3Metadata #424
  • Manifest tests passing
  • Library (codecs, etc) tests passing
  • Reader tests passing
  • test_integration tests passing
  • test_xarray tests passing
  • Writer tests passing
  • Tests added for new functions
  • Full type hint coverage
  • Changes are documented in docs/releases.rst
  • New functions/methods are listed in api.rst
  • New functionality has documentation

"""
Individual chunk size by number of elements.
"""
if isinstance(self._metadata.chunk_grid, RegularChunkGrid):
Member

I think this should be checked at construction time, because the concatenation methods assume a regular chunk grid too.

"""
Array shape by number of elements along each dimension.
"""
return tuple(int(length) for length in list(self.metadata.shape))
Member

Can you not just return self.metadata.shape?

new_zarray = first_arr.zarray.replace(
shape=tuple(new_shape),
)
metadata_copy = first_arr.metadata.to_dict().copy()
Member

This seems overly verbose, and something we will want in multiple places. Let's make a replace constructor function, before adding a method upstream.
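
A minimal sketch of what such a replace helper might look like, assuming the metadata object round-trips through `to_dict()` and a `from_dict()` classmethod (the diff above already calls `to_dict()`). `replace_metadata` and `FakeMetadata` are hypothetical names used only for illustration; they are not part of VirtualiZarr or zarr-python, and the stand-in class exists so the example runs without zarr installed.

```python
from dataclasses import dataclass
from typing import Any, TypeVar

M = TypeVar("M")


def replace_metadata(metadata: M, **changes: Any) -> M:
    """Return a new metadata object with the given top-level fields replaced.

    Assumes `metadata` exposes to_dict() and its class exposes from_dict().
    """
    fields = dict(metadata.to_dict())  # shallow copy of the serialized form
    unknown = set(changes) - set(fields)
    if unknown:
        raise TypeError(f"unknown metadata fields: {sorted(unknown)}")
    fields.update(changes)
    return type(metadata).from_dict(fields)


# Hypothetical stand-in for a metadata class, for demonstration only.
@dataclass(frozen=True)
class FakeMetadata:
    shape: tuple[int, ...]
    data_type: str

    def to_dict(self) -> dict[str, Any]:
        return {"shape": self.shape, "data_type": self.data_type}

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "FakeMetadata":
        return cls(shape=tuple(data["shape"]), data_type=data["data_type"])


old = FakeMetadata(shape=(10, 20), data_type="int32")
new = replace_metadata(old, shape=(30, 20))  # `old` is left untouched
```

With a helper like this, the concatenation code could update the shape in one call instead of copying and editing the dict by hand. Rejecting unknown field names is a design choice here, so that a typo in a keyword fails loudly rather than being silently ignored.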

@TomNicholas added the zarr-python (Relevant to zarr-python upstream) and internals labels Feb 6, 2025
@@ -87,8 +87,7 @@ def variable_to_kerchunk_arr_refs(var: Variable, var_name: str) -> KerchunkArrRe
for chunk_key, entry in marr.manifest.dict().items()
}

zarray = marr.zarray.replace(zarr_format=2)

array_v2_metadata = convert_v3_to_v2_metadata(marr.metadata)
Collaborator Author

I don't think we want to go with this approach, perhaps we should allow for manifest arrays to have ArrayV2Metadata for writing to kerchunk.

Member

> don't think we want to go with this approach

Why not?

> perhaps we should allow for manifest arrays to have ArrayV2Metadata for writing to kerchunk.

I really don't see how that would make anything easier. Inside variable_to_kerchunk_arr_refs is the only place you should need to deal with v2 metadata, so why try to generalize it in the rest of the package if you know you won't need it anywhere else?

Collaborator Author

That makes sense to me. I think I was just concerned about bugs when going between ArrayV3Metadata and ArrayV2Metadata. But having a single interface for metadata in memory is probably safer overall.

Collaborator Author

@abarciauskas-bgse abarciauskas-bgse Feb 7, 2025

Oh, I remember now: I am not sure yet about the best way to go from ArrayV3Metadata codecs to V2 compressors and filters. I don't see anything in zarr-python yet for doing that. I welcome any suggestions.

Collaborator Author

@abarciauskas-bgse abarciauskas-bgse Feb 7, 2025

Looking closer at https://github.com/zarr-developers/zarr-python/blob/main/src/zarr/core/array.py#L4290-L4334 it looks like we can parse the codecs in the ArrayV3Metadata:

filters, serializer, codecs = array_v3_metadata.codecs

Although serializer has no significance in Zarr v2, as far as I can tell.

from zarr.abc.codec import ArrayArrayCodec, ArrayBytesCodec, BytesBytesCodec


def identify_codec(codec):
Collaborator Author

@d-v-b we have to extract compressors and filters from ArrayV3Metadata for use in writing V2 metadata to kerchunk and also for use in the icechunk writer where we call zarr-python's group.require_array - is this method the best way to do so?


Yes, that function reflects the mapping between words and class names we use in zarr-python. But this conceptualization is specific to zarr v3: in zarr v2, any function could be a compressor, and any function could be a filter. It wasn't strictly typed at all. So there is no guarantee that something which was a filter in zarr v2 is a valid filter in zarr v3.
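
The mapping described here can be sketched as a split of a codec sequence by base class: array-to-array codecs as filters, the array-to-bytes codec as the serializer, and bytes-to-bytes codecs as compressors. This is a hedged illustration rather than the PR's `identify_codec`: `split_codecs` is a hypothetical name, and the three base classes below are local stand-ins that reuse the names imported from `zarr.abc.codec` in the diff so the example runs without zarr installed. The demo codec classes are likewise made up.

```python
from abc import ABC


# Local stand-ins named after the classes in zarr.abc.codec (assumption:
# in real use these would be imported from zarr instead of defined here).
class ArrayArrayCodec(ABC): ...
class ArrayBytesCodec(ABC): ...
class BytesBytesCodec(ABC): ...


def split_codecs(codecs):
    """Split codecs into (filters, serializer, compressors) by base class."""
    filters, compressors = [], []
    serializer = None
    for codec in codecs:
        if isinstance(codec, ArrayArrayCodec):
            filters.append(codec)
        elif isinstance(codec, ArrayBytesCodec):
            if serializer is not None:
                raise ValueError("expected at most one array-to-bytes codec")
            serializer = codec
        elif isinstance(codec, BytesBytesCodec):
            compressors.append(codec)
        else:
            raise TypeError(f"unrecognized codec type: {type(codec).__name__}")
    return tuple(filters), serializer, tuple(compressors)


# Hypothetical demo codecs, one per category.
class DemoTranspose(ArrayArrayCodec): ...
class DemoBytes(ArrayBytesCodec): ...
class DemoGzip(BytesBytesCodec): ...


pipeline = [DemoTranspose(), DemoBytes(), DemoGzip()]
filters, serializer, compressors = split_codecs(pipeline)
```

As the comment above notes, this classification only makes sense in the v3 direction. A v2 filter has no declared input or output type, so going from v2 filters and compressors back to these three categories needs per-codec knowledge that a type check cannot supply.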

Labels
internals, zarr-python (Relevant to zarr-python upstream)
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

ManifestArray should use zarr-python's ArrayV3Metadata
3 participants