-
-
Notifications
You must be signed in to change notification settings - Fork 307
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix fill_value serialization issues #2802
base: main
Are you sure you want to change the base?
Fix fill_value serialization issues #2802
Conversation
The bad news is, this issue is going to be slightly more than I've said here. The good news is that the property-based tests caught some edge cases. |
Nice, I've been meaning to add this to Zarr: @st.composite
def v3_array_metadata(draw: st.DrawFn) -> bytes:
from zarr.codecs.bytes import BytesCodec
from zarr.core.chunk_grids import RegularChunkGrid
from zarr.core.chunk_key_encodings import DefaultChunkKeyEncoding
from zarr.core.metadata.v3 import ArrayV3Metadata
# separator = draw(st.sampled_from(['/', '\\']))
shape = draw(array_shapes)
ndim = len(shape)
chunk_shape = draw(npst.array_shapes(min_dims=ndim, max_dims=ndim))
dtype = draw(zrst.v3_dtypes())
fill_value = draw(npst.from_dtype(dtype))
dimension_names = draw(
st.none() | st.lists(st.none() | simple_text, min_size=ndim, max_size=ndim)
)
metadata = ArrayV3Metadata(
shape=shape,
data_type=dtype,
chunk_grid=RegularChunkGrid(chunk_shape=chunk_shape),
fill_value=fill_value,
attributes=draw(simple_attrs),
dimension_names=dimension_names,
chunk_key_encoding=DefaultChunkKeyEncoding(separator="/"), # FIXME
codecs=[BytesCodec()],
storage_transformers=(),
)
return metadata.to_buffer_dict(prototype=default_buffer_prototype())["zarr.json"] What do you think of a |
I love the idea. I was thinking the other day that the obvious path out of the bugs that are currently popping up would be property-based testing, so I was pretty pleased to see that there's already some work in that direction Out of curiosity, do we have something like json schema that we could apply against the outputs to at least verify structure? We'd still need to define all the rules that exist in terms of value/type dependencies etc. but that's an easy win if it exists somewhere |
I'm not aware of a JSON schema definition for the array metadata. If one existed, it would necessarily only support partial validation, because JSON schema can't express certain invariants in the metadata document, like the requirement that dimensional attributes (shape, chunk_shape, etc) be consistent. |
A fairly easy alternative way to handle this would be to simply write a test that takes the I still think a generic metadata strategy is probably useful. |
Yeah, you'd definitely need json-schema and custom validation rules that encode relationships among different fields |
Did some refactoring and organization of property-based testing code and added round trip testing for |
Hye @moradology I added an zarr-python/src/zarr/testing/strategies.py Line 110 in e8bfb64
|
Definitely - do you mind my pulling the generators up into their own submodule as done here, given their similar and unique functionality? |
I think testing/strategies.py is right place. Everything in there already handles the v2 vs v3 complexity |
@dcherian Maybe I got out ahead of myself. Do you think the small amount of extra organization I added is undesirable? Basically, I broke My thought was just that continuing to add property-based tests is probably a good idea and the single |
It feels a bit premature to me. In any case, I find it nice to keep such refactoring PRs separate so that we can review (much) smaller diffs. In my experience, keeping the diffs small is the best way to keep open-source contributions easy to merge. How about just keeping them in strategies for now and we can refactor later as needed. |
Sounds like a plan. As a relatively new contributor to this library, I appreciate the firm opinions! |
Thank you for considering the opinions in a constructive manner! |
9cd0ccd
to
251253e
Compare
OK, so to get the property-based tests working I had to make some decisions about what serialization strategies looked like. I tried to follow the instructions available here, but other issues that have come to the surface in the history of |
A bit confused about the code coverage going down. The tested constraints are definitely a bit tighter than they were, if anything |
tests/test_properties.py
Outdated
|
||
|
||
@given(npst.from_dtype(dtype=np.dtype("float64"), allow_nan=True, allow_infinity=True)) | ||
def test_v2meta_nan_and_infinity(fill_value): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could look more like https://github.com/zarr-developers/zarr-python/pull/2847/files#diff-d318cba7c9e4a6983338cf21df1db66aab796137a2fb4a76ce48c0afa17de2f9
We could abstract out a assert_valid_v2_json_dict
and similarly for v3.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense to me - would you like me to wait for that branch to go in and then add these validations to the broader test you have in mind?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's just copy the test over and you can expand them here. I can handle the merge conflicts. It may very well be that this one goes in first :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🫡
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, so I'm curious about something. See this line: https://github.com/zarr-developers/zarr-python/blob/main/src/zarr/core/metadata/v3.py#L413C1-L414C1
Is that type correct? If so, it seems that numpy types should be serialized after to_dict
. As of now, they're not (this PR changes that for v2
but not yet for v3
). So what's the desired behavior?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't know. ping @d-v-b
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, that type is wrong and it bugs me! for v3 metadata to_dict
returns an instance of DataType
for the data_type
key, and I think we registered that type with a custom JSON encoder.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't remember why this decision was made but we should definitely fix it. fwiw, dict[str, JSON]
is also sub-optimal, given that the keys of the metadata document are (almost) entirely static, and so we could do typeddicts here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, so the immediately obvious thing is that the serialization logic I've added to to_dict
needs to be pushed down into to_buffer_dict
and to_dict
should retain its (potentially) not-directly-serializable python types. I assumed serialization should happen in to_dict
because of JSON
in the type sig but also because there's some serialization happening in that function already: https://github.com/zarr-developers/zarr-python/blob/main/src/zarr/core/metadata/v2.py#L199-L204. I will address that, too
From there, a typed-dict can be implemented for to_dict
output.
404213a
to
935ac71
Compare
807470f
to
6301b15
Compare
* main: don't serialize empty tuples for v2 filters, and warn when reading such metadata (zarr-developers#2847)
as promised, I fixed the conflict :) |
The current serialization of
fill_value
inArrayV2Metadata
does not fully conform to the spec, particularly for:Changes
allow_nan
toFalse
onjson.dumps
Resolves: #2741
TODO:
docs/user-guide/*.rst
changes/