
Fix fill_value serialization issues #2802

Open

wants to merge 3 commits into base: main
Conversation

moradology
Contributor

@moradology moradology commented Feb 5, 2025

The current serialization of fill_value in ArrayV2Metadata does not fully conform to the spec, particularly for:

  • NaN and Infinity values, which must be serialized as strings ("NaN", "Infinity", "-Infinity").
  • Complex numbers (np.complex64, np.complex128), which must be stored as two-element arrays [real, imag], with each component following the NaN/Infinity rules above.
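A minimal sketch of the intended mapping (the helper name echoes the PR's _sanitize_fill_value, but this standalone version is illustrative, not the actual implementation):

```python
import math
import numbers

def sanitize_fill_value(value):
    """Map a fill value to a spec-compliant JSON representation.

    Illustrative sketch only: NaN and the infinities become the strings
    "NaN" / "Infinity" / "-Infinity", and complex values become
    two-element [real, imag] lists with the same rules applied to each
    component.
    """
    # Complex (but not real) numbers: serialize as [real, imag].
    if isinstance(value, numbers.Complex) and not isinstance(value, numbers.Real):
        return [sanitize_fill_value(value.real), sanitize_fill_value(value.imag)]
    if isinstance(value, float):
        if math.isnan(value):
            return "NaN"
        if math.isinf(value):
            return "Infinity" if value > 0 else "-Infinity"
    return value
```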

Changes

  • Updated _sanitize_fill_value() to enforce correct JSON serialization.
  • Fixed test_v2meta_fill_value_serialization() to compare expected and actual JSON using a normalized representation.
  • Introduced property-based testing with Hypothesis to generate valid input cases and verify compliance.
  • Enforced compliance (to some degree) by setting allow_nan=False in json.dumps
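For context on the last point: Python's json module will happily emit the bare tokens NaN/Infinity, which are not valid JSON; allow_nan=False turns that silent non-compliance into an error. A quick demonstration:

```python
import json

# By default, json.dumps emits bare NaN/Infinity tokens, which strict
# JSON parsers reject.
nonconforming = json.dumps(float("nan"))  # produces the bare token NaN

# With allow_nan=False, the same call raises ValueError instead of
# producing non-compliant output.
try:
    json.dumps(float("nan"), allow_nan=False)
    raised = False
except ValueError:
    raised = True
```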

Resolves: #2741

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/user-guide/*.rst
  • Changes documented as a new file in changes/
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@github-actions github-actions bot added the needs release notes Automatically applied to PRs which haven't added release notes label Feb 5, 2025
@moradology
Contributor Author

The bad news is that this issue is going to be slightly more involved than I've described here. The good news is that the property-based tests caught some edge cases.

@dcherian
Contributor

dcherian commented Feb 5, 2025

Nice, I've been meaning to add this to Zarr:

import hypothesis.strategies as st
import hypothesis.extra.numpy as npst

# NB: array_shapes, simple_text, simple_attrs, zrst (zarr's dtype
# strategies), and default_buffer_prototype are module-level helpers
# assumed to be in scope, as in zarr's testing strategies module.
@st.composite
def v3_array_metadata(draw: st.DrawFn) -> bytes:
    from zarr.codecs.bytes import BytesCodec
    from zarr.core.chunk_grids import RegularChunkGrid
    from zarr.core.chunk_key_encodings import DefaultChunkKeyEncoding
    from zarr.core.metadata.v3 import ArrayV3Metadata

    # separator = draw(st.sampled_from(['/', '\\']))
    shape = draw(array_shapes)
    ndim = len(shape)
    chunk_shape = draw(npst.array_shapes(min_dims=ndim, max_dims=ndim))
    dtype = draw(zrst.v3_dtypes())
    fill_value = draw(npst.from_dtype(dtype))
    dimension_names = draw(
        st.none() | st.lists(st.none() | simple_text, min_size=ndim, max_size=ndim)
    )

    metadata = ArrayV3Metadata(
        shape=shape,
        data_type=dtype,
        chunk_grid=RegularChunkGrid(chunk_shape=chunk_shape),
        fill_value=fill_value,
        attributes=draw(simple_attrs),
        dimension_names=dimension_names,
        chunk_key_encoding=DefaultChunkKeyEncoding(separator="/"),  # FIXME
        codecs=[BytesCodec()],
        storage_transformers=(),
    )

    return metadata.to_buffer_dict(prototype=default_buffer_prototype())["zarr.json"]

What do you think of an array_metadata_json(zarr_formats...) strategy that just returns the JSON, so we can test whether it satisfies the spec for v2 and v3?

@moradology
Contributor Author

moradology commented Feb 5, 2025

I love the idea. I was thinking the other day that the obvious path out of the bugs that are currently popping up would be property-based testing, so I was pretty pleased to see that there's already some work in that direction.

Out of curiosity, do we have something like a JSON schema that we could apply against the outputs to at least verify structure? We'd still need to define all the rules for value/type dependencies, etc., but it's an easy win if it exists somewhere.

@dcherian
Contributor

dcherian commented Feb 5, 2025

Out of curiosity, do we have something like a JSON schema that we could apply against the outputs to at least verify structure?

Don't know. ping @jhamman @d-v-b

@d-v-b
Contributor

d-v-b commented Feb 6, 2025

I'm not aware of a JSON schema definition for the array metadata. If one existed, it would necessarily only support partial validation, because JSON schema can't express certain invariants in the metadata document, like the requirement that dimensional attributes (shape, chunk_shape, etc) be consistent.
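To illustrate the kind of invariant a schema cannot express, here is a hypothetical cross-field check (the function name and v2-style keys are made up for the example):

```python
def check_dimensional_consistency(meta: dict) -> bool:
    """Check a cross-field invariant that JSON schema can't express:
    shape and chunks must have the same number of dimensions."""
    shape = meta.get("shape")
    chunks = meta.get("chunks")
    if not (isinstance(shape, list) and isinstance(chunks, list)):
        return False
    return len(shape) == len(chunks)
```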

@dcherian
Contributor

dcherian commented Feb 6, 2025

A fairly easy alternative way to handle this would be to simply write a test that takes the arrays strategy, extracts the metadata, converts to JSON, and then asserts that the JSON meets spec (as best as we can).

I still think a generic metadata strategy is probably useful.
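One hedged sketch of such a spec assertion (the helper name and checked fields are illustrative; a real version would validate much more of the metadata document):

```python
import json

def assert_valid_v2_fill_value(doc: str) -> None:
    """Assert that a serialized metadata document encodes fill_value
    in strictly valid JSON.

    Parsing with parse_constant set to a raising callback rejects the
    bare NaN/Infinity/-Infinity tokens that Python's default encoder
    would otherwise emit.
    """
    def reject_constant(name: str):
        raise ValueError(f"non-compliant JSON constant: {name}")

    meta = json.loads(doc, parse_constant=reject_constant)
    fv = meta.get("fill_value")
    assert fv is None or isinstance(fv, (str, int, float, list))
```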

@moradology
Contributor Author

Yeah, you'd definitely need a JSON schema plus custom validation rules that encode the relationships among different fields.

@moradology
Contributor Author

Did some refactoring and organization of the property-based testing code and added round-trip testing for ArrayV2Metadata (which hopefully captures some important expectations about these things). Still need to fully satisfy the linter and make sure similar guarantees are tested for ArrayV3Metadata.

@dcherian
Contributor

Hey @moradology, I added an array_metadata strategy in

def array_metadata(
Would you mind merging your changes in with that strategy, please?

@moradology
Contributor Author

moradology commented Feb 18, 2025

Definitely. Do you mind my pulling the generators up into their own submodule, as done here, given their related but distinct functionality?

@dcherian
Contributor

I think testing/strategies.py is the right place. Everything in there already handles the v2 vs. v3 complexity.

@moradology
Contributor Author

moradology commented Feb 18, 2025

@dcherian Maybe I got out ahead of myself. Do you think the small amount of extra organization I added is undesirable? Basically, I broke strategies.py out into strategies/array_metadata_generators.py, strategies/array_generators.py, strategies/dtype_generators.py while using strategies/__init__.py to keep imports the same.

My thought was just that continuing to add property-based tests is probably a good idea, and the single strategies.py file will likely get a bit long.

@dcherian
Contributor

dcherian commented Feb 18, 2025

It feels a bit premature to me. In any case, I find it nice to keep such refactoring PRs separate so that we can review (much) smaller diffs. In my experience, keeping the diffs small is the best way to keep open-source contributions easy to merge.

How about just keeping them in strategies for now and we can refactor later as needed.

@moradology
Contributor Author

Sounds like a plan. As a relatively new contributor to this library, I appreciate the firm opinions!

@dcherian
Contributor

Thank you for considering the opinions in a constructive manner!

@moradology moradology force-pushed the fix/v2meta_fill_serialization branch 2 times, most recently from 9cd0ccd to 251253e Compare February 18, 2025 19:37
@moradology moradology marked this pull request as ready for review February 18, 2025 19:37
@github-actions github-actions bot removed the needs release notes Automatically applied to PRs which haven't added release notes label Feb 18, 2025
@moradology
Contributor Author

OK, so to get the property-based tests working I had to make some decisions about what the serialization strategies look like. I tried to follow the instructions available here, but other issues that have surfaced in zarr-python's history suggest there are likely blind spots in this document. This part, in particular, deserves special attention.

@moradology
Contributor Author

A bit confused about the code coverage going down. If anything, the tested constraints are definitely a bit tighter than they were.



@given(npst.from_dtype(dtype=np.dtype("float64"), allow_nan=True, allow_infinity=True))
def test_v2meta_nan_and_infinity(fill_value):
Contributor

This could look more like https://github.com/zarr-developers/zarr-python/pull/2847/files#diff-d318cba7c9e4a6983338cf21df1db66aab796137a2fb4a76ce48c0afa17de2f9

We could abstract out an assert_valid_v2_json_dict and similarly for v3.

Contributor Author

@moradology moradology Feb 19, 2025

Makes sense to me - would you like me to wait for that branch to go in and then add these validations to the broader test you have in mind?

Contributor

Let's just copy the test over and you can expand them here. I can handle the merge conflicts. It may very well be that this one goes in first :)

Contributor Author

🫡

Contributor Author

@moradology moradology Feb 19, 2025

OK, so I'm curious about something. See this line: https://github.com/zarr-developers/zarr-python/blob/main/src/zarr/core/metadata/v3.py#L413C1-L414C1
Is that type correct? If so, it seems that numpy types should be serialized after to_dict. As of now, they're not (this PR changes that for v2 but not yet for v3). So what's the desired behavior?

Contributor

don't know. ping @d-v-b

Contributor

No, that type is wrong and it bugs me! For v3 metadata, to_dict returns an instance of DataType for the data_type key, and I think we registered that type with a custom JSON encoder.

Contributor

I don't remember why this decision was made, but we should definitely fix it. FWIW, dict[str, JSON] is also sub-optimal: the keys of the metadata document are (almost) entirely static, so we could use TypedDicts here.

Contributor Author

@moradology moradology Feb 20, 2025

OK, so the immediately obvious fix is that the serialization logic I've added to to_dict needs to be pushed down into to_buffer_dict, and to_dict should retain its (potentially) not-directly-serializable Python types. I assumed serialization should happen in to_dict because of JSON in the type signature, but also because some serialization already happens in that function: https://github.com/zarr-developers/zarr-python/blob/main/src/zarr/core/metadata/v2.py#L199-L204. I will address that, too.

From there, a TypedDict can be implemented for the to_dict output.
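The separation being described might look roughly like this (FakeMetadata and its method names are illustrative stand-ins, not zarr-python's actual classes):

```python
import json
from dataclasses import dataclass

@dataclass
class FakeMetadata:
    fill_value: float

    def to_dict(self) -> dict:
        # Keep the raw Python value; no JSON concerns at this layer.
        return {"fill_value": self.fill_value}

    def to_json_bytes(self) -> bytes:
        # JSON-specific encoding happens only at the serialization
        # boundary (analogous to to_buffer_dict).
        doc = self.to_dict()
        fv = doc["fill_value"]
        if isinstance(fv, float) and fv != fv:  # NaN check without math
            doc["fill_value"] = "NaN"
        return json.dumps(doc, allow_nan=False).encode()
```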

@moradology moradology force-pushed the fix/v2meta_fill_serialization branch from 404213a to 935ac71 Compare February 19, 2025 15:26
@moradology moradology force-pushed the fix/v2meta_fill_serialization branch from 807470f to 6301b15 Compare February 19, 2025 17:41
@moradology moradology changed the title Fix fill_value serialization of NaN; add property-based tests Fix fill_value serialization of NaN Feb 19, 2025
@moradology moradology changed the title Fix fill_value serialization of NaN Fix fill_value serialization issues Feb 19, 2025
* main:
  don't serialize empty tuples for v2 filters, and warn when reading such metadata (zarr-developers#2847)
@dcherian
Contributor

As promised, I fixed the conflict :)

Successfully merging this pull request may close these issues.

Broken NaN encoding when writing v2 storage format from v3 library
3 participants