Manifest arrays use arrayv3metadata #429

abarciauskas-bgse · 2025-02-06T03:30:02Z

This is still very much a WIP - many tests and implementations still need to be fixed.

A few notes:

- @sharkinsspatial suggested we remove ZArray completely as part of this work, as opposed to just converting ZArrays to ArrayV3Metadata. Digging into some of the test examples and reader code, it seems there is complexity that will require changes to the ZArray implementation anyway, for example in how filters and compressors are handled. dmrpp calls all the compressionTypes extracted from dmrpp "filters", but I think they may be compressors. I think we need some logic to distinguish filters from compressors, and I'm not sure yet what type of object the filters need to be for ArrayV3Metadata to be happy.
- ~~writers/kerchunk.py - variable_to_kerchunk_arr_refs requires a ManifestArray, but we will need to convert metadata to valid kerchunk v2 metadata~~
- I don't think the netcdf3 reader will work with ArrayV3Metadata because the data type is big-endian, and these types may not be supported by ArrayV3Metadata (this also appears in the zarr-python v3 migration documentation).
- It was suggested not to use zarr's _parse_chunk_encoding_v3 function, since it is private and may change, which is why some of that logic is replicated in convert_to_codec_pipeline.

Questions

- Do we prefer zarr codec objects ({"name": "zlib", "configuration": {"level": 1}}) or numcodecs ({"id": "zlib", "level": 1})? I think we have to convert to zarr codec objects to use zarr's get_codec_class anyway.
  - Will this be obsolete once zarr codecs are registered in numcodecs?
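
For reference, the two shapes in the question differ only in where the codec name and its parameters live. Below is a minimal sketch of the mechanical reshaping in both directions, using plain dicts. The helper names are hypothetical, and it assumes the codec name is spelled the same on both sides, which may not hold for every codec.

```python
from typing import Any


def numcodecs_to_v3_style(config: dict[str, Any]) -> dict[str, Any]:
    """Reshape {"id": ..., **params} into {"name": ..., "configuration": params}."""
    params = {key: value for key, value in config.items() if key != "id"}
    return {"name": config["id"], "configuration": params}


def v3_style_to_numcodecs(codec: dict[str, Any]) -> dict[str, Any]:
    """Inverse reshape; assumes the v3 name is also a valid numcodecs id."""
    return {"id": codec["name"], **codec.get("configuration", {})}


# The example from the question above.
v3_codec = numcodecs_to_v3_style({"id": "zlib", "level": 1})
round_tripped = v3_style_to_numcodecs(v3_codec)
```

This only moves keys around; it does not check that the resulting dict names a codec that zarr or numcodecs can actually resolve.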

Checklist

TomNicholas · 2025-02-06T03:53:34Z

virtualizarr/manifests/array.py

+        """
+        Individual chunk size by number of elements.
+        """
+        if isinstance(self._metadata.chunk_grid, RegularChunkGrid):


I think this should be checked at construction time, because the concatenation methods assume a regular chunk grid too.
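
A rough sketch of what checking at construction time could look like. The classes below are stand-ins written for illustration (they are not zarr's RegularChunkGrid or ArrayV3Metadata, and ManifestArraySketch is a hypothetical name); the point is only that the isinstance check moves from the property into `__init__`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegularGridStandIn:
    """Stand-in for a regular chunk grid (hypothetical, for illustration)."""
    chunk_shape: tuple[int, ...]


@dataclass(frozen=True)
class MetadataStandIn:
    """Stand-in for array metadata carrying only the fields used here."""
    shape: tuple[int, ...]
    chunk_grid: object


class ManifestArraySketch:
    def __init__(self, metadata: MetadataStandIn) -> None:
        # Fail early, so that `chunks` and any concatenation helpers
        # can rely on the grid being regular.
        if not isinstance(metadata.chunk_grid, RegularGridStandIn):
            raise ValueError(
                "only regular chunk grids are supported, "
                f"got {type(metadata.chunk_grid).__name__}"
            )
        self._metadata = metadata

    @property
    def chunks(self) -> tuple[int, ...]:
        # No isinstance check needed here: the constructor enforced it.
        return self._metadata.chunk_grid.chunk_shape


marr = ManifestArraySketch(
    MetadataStandIn(shape=(10, 10), chunk_grid=RegularGridStandIn((5, 5)))
)
```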

TomNicholas · 2025-02-06T03:54:36Z

virtualizarr/manifests/array.py

+        """
+        Array shape by number of elements along each dimension.
+        """
+        return tuple(int(length) for length in list(self.metadata.shape))


Can you not just return self.metadata.shape?

TomNicholas · 2025-02-06T03:56:27Z

virtualizarr/manifests/array_api.py

-    new_zarray = first_arr.zarray.replace(
-        shape=tuple(new_shape),
-    )
+    metadata_copy = first_arr.metadata.to_dict().copy()


This seems overly verbose, and something we will want in multiple places. Let's make a replace constructor function, before adding a method upstream.
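
One possible shape for such a helper, sketched against a stand-in class. The name copy_and_replace_metadata is hypothetical, and the sketch assumes the metadata class offers a to_dict/from_dict round trip; the diff above only shows to_dict being used.

```python
from dataclasses import asdict, dataclass
from typing import Any


@dataclass(frozen=True)
class MetadataStandIn:
    """Stand-in exposing the to_dict/from_dict round trip (illustration only)."""
    shape: tuple[int, ...]
    data_type: str

    def to_dict(self) -> dict[str, Any]:
        return asdict(self)

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "MetadataStandIn":
        return cls(shape=tuple(data["shape"]), data_type=data["data_type"])


def copy_and_replace_metadata(metadata: Any, **changes: Any) -> Any:
    """Return a new metadata object with the selected fields replaced."""
    as_dict = dict(metadata.to_dict())  # shallow copy of the serialized form
    unknown = set(changes) - set(as_dict)
    if unknown:
        raise KeyError(f"unknown metadata fields: {sorted(unknown)}")
    as_dict.update(changes)
    return type(metadata).from_dict(as_dict)


old = MetadataStandIn(shape=(2, 3), data_type="int32")
new = copy_and_replace_metadata(old, shape=(4, 3))
```

The unknown-field check is a design choice of this sketch: it catches typos, but it would be too strict for a metadata class whose to_dict omits optional keys.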

abarciauskas-bgse · 2025-02-07T01:12:14Z

virtualizarr/writers/kerchunk.py

@@ -87,8 +87,7 @@ def variable_to_kerchunk_arr_refs(var: Variable, var_name: str) -> KerchunkArrRe
            for chunk_key, entry in marr.manifest.dict().items()
        }

-        zarray = marr.zarray.replace(zarr_format=2)
-
+        array_v2_metadata = convert_v3_to_v2_metadata(marr.metadata)


I don't think we want to go with this approach; perhaps we should allow for manifest arrays to have ArrayV2Metadata for writing to kerchunk.

TomNicholas · 2025-02-07

> don't think we want to go with this approach

Why not?

> perhaps we should allow for manifest arrays to have ArrayV2Metadata for writing to kerchunk.

I really don't see how that would make anything easier. Inside variable_to_kerchunk_arr_refs is the only place you should need to deal with v2 metadata, so why try to generalize it in the rest of the package if you know you won't need it anywhere else?

abarciauskas-bgse · 2025-02-07

That makes sense to me. I think I was just concerned about bugs when going between ArrayV3Metadata and ArrayV2Metadata, but having a single interface for metadata in memory is probably safer overall.

abarciauskas-bgse · 2025-02-07

Oh, now I recall: I am not sure yet about the best way to go from ArrayV3Metadata codecs to V2 compressors and filters. I don't see anything in zarr-python yet for how to do that. I welcome any suggestions.

abarciauskas-bgse · 2025-02-07

Looking closer at https://github.com/zarr-developers/zarr-python/blob/main/src/zarr/core/array.py#L4290-L4334, it looks like we can parse the codecs in the ArrayV3Metadata:

filters, serializer, codecs = array_v3_metadata.codecs

Although serializer has no significance in Zarr v2, as far as I can tell.

abarciauskas-bgse · 2025-02-07T20:18:52Z

virtualizarr/zarr.py

+from zarr.abc.codec import ArrayArrayCodec, ArrayBytesCodec, BytesBytesCodec
+
+
+def identify_codec(codec):


@d-v-b we have to extract compressors and filters from ArrayV3Metadata for use in writing V2 metadata to kerchunk and also for use in the icechunk writer where we call zarr-python's group.require_array - is this method the best way to do so?

d-v-b · 2025-02-08

Yes, that function reflects the mapping between words and class names we use in zarr-python. But this conceptualization is specific to zarr v3: in zarr v2, any function could be a compressor, and any function could be a filter. It wasn't strictly typed at all. So there is no guarantee that something which was a filter in zarr v2 is a valid filter in zarr v3.
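
To make the mapping discussed above concrete, here is a minimal sketch that groups a flat codec sequence by base class. The three base classes are local stand-ins for the ones imported from zarr.abc.codec in the diff, and split_codecs and the concrete codec classes are hypothetical names; this is not the PR's identify_codec implementation.

```python
from typing import Sequence


class ArrayArrayCodec:
    """Stand-in: array-to-array codecs, called filters."""


class ArrayBytesCodec:
    """Stand-in: the array-to-bytes codec, called the serializer."""


class BytesBytesCodec:
    """Stand-in: bytes-to-bytes codecs, called compressors."""


def split_codecs(codecs: Sequence[object]):
    """Group a flat codec sequence into (filters, serializer, compressors)."""
    filters, serializers, compressors = [], [], []
    for codec in codecs:
        if isinstance(codec, ArrayArrayCodec):
            filters.append(codec)
        elif isinstance(codec, ArrayBytesCodec):
            serializers.append(codec)
        elif isinstance(codec, BytesBytesCodec):
            compressors.append(codec)
        else:
            raise TypeError(f"unrecognised codec type: {type(codec).__name__}")
    if len(serializers) > 1:
        raise ValueError("expected at most one array-to-bytes codec")
    serializer = serializers[0] if serializers else None
    return tuple(filters), serializer, tuple(compressors)


# Hypothetical concrete codecs, one per category.
class TransposeStandIn(ArrayArrayCodec): ...
class BytesStandIn(ArrayBytesCodec): ...
class ZlibStandIn(BytesBytesCodec): ...


example = [TransposeStandIn(), BytesStandIn(), ZlibStandIn()]
filters, serializer, compressors = split_codecs(example)
```

Grouping by type works for any number of codecs, whereas positional unpacking of the codec tuple only works when it has exactly three entries. As noted above, the result says nothing about whether a v2 filter is valid as a v3 filter.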

abarciauskas-bgse added 10 commits February 4, 2025 14:38

Added zarray_to_v3metadata and test

2a01bfa

Working on manifest array tests

17fd547

Fix test_manifests/test_array#TestConcat tests

e5666ab

Passing TestStack tests and add fixture

5a8cc4c

All test_manifests/test_array tests passing

4c0b616

Compressors should be list

ac2f787

Passing dmrpp tests

5503c60

Merge branch 'main' into manifest-arrays-use-arrayv3metadata

1272051

Passing test_hdf.py tests

1f36755

Start to work on kerchunk tests

7098803

abarciauskas-bgse had a problem deploying to test-release February 6, 2025 03:30 — with GitHub Actions Failure

abarciauskas-bgse mentioned this pull request Feb 6, 2025

ManifestArray should use zarr-python's ArrayV3Metadata #424

Open

TomNicholas reviewed Feb 6, 2025

TomNicholas added zarr-python Relevant to zarr-python upstream internals labels Feb 6, 2025

Add method to convert array v3 metadata to v2 metadata for kerchunk (not happy about this)

ce2284c

abarciauskas-bgse had a problem deploying to test-release February 7, 2025 01:11 — with GitHub Actions Failure

abarciauskas-bgse commented Feb 7, 2025

Fix fixtures and mark xfail netcdf3

c9853d5

abarciauskas-bgse had a problem deploying to test-release February 7, 2025 01:20 — with GitHub Actions Failure

Test for convert_v3_to_v2_metadata

209dae3

abarciauskas-bgse had a problem deploying to test-release February 7, 2025 15:19 — with GitHub Actions Failure

Deduplicate fixture for array v3 metadata

e7205ef

abarciauskas-bgse had a problem deploying to test-release February 7, 2025 15:22 — with GitHub Actions Failure

Parse filters and compressors from v3 metdata for v2 metadata

d65e457

abarciauskas-bgse had a problem deploying to test-release February 7, 2025 15:45 — with GitHub Actions Failure

Rewrite extract_codecs

190c20f

abarciauskas-bgse had a problem deploying to test-release February 7, 2025 20:16 — with GitHub Actions Failure

abarciauskas-bgse commented Feb 7, 2025

Refactor convert_to_codec_pipeline

47f5ddd

abarciauskas-bgse had a problem deploying to test-release February 8, 2025 00:14 — with GitHub Actions Failure

Fix hdf integration tests

5d15608

abarciauskas-bgse had a problem deploying to test-release February 8, 2025 17:39 — with GitHub Actions Failure

Test for convert_to_codec_pipeline

908bc52

abarciauskas-bgse had a problem deploying to test-release February 8, 2025 18:13 — with GitHub Actions Failure

Refactor get_codecs and its tests

4a8bfdd

abarciauskas-bgse had a problem deploying to test-release February 8, 2025 18:56 — with GitHub Actions Failure


