-
-
Notifications
You must be signed in to change notification settings - Fork 308
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
implement group.members
#1726
implement group.members
#1726
Conversation
Seems like
Implementing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for taking this on @d-v-b. Left you some open-ended questions but this seems like its off to a good start.
src/zarr/v3/group.py
Outdated
# and scoped to specific zarr versions | ||
subkeys_filtered = filter(lambda v: v not in ("zarr.json", ".zgroup", ".zattrs"), subkeys) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# and scoped to specific zarr versions | |
subkeys_filtered = filter(lambda v: v not in ("zarr.json", ".zgroup", ".zattrs"), subkeys) | |
# and scoped to specific zarr versions | |
subkeys_filtered = filter(lambda v: v not in ("zarr.json", ".zgroup", ".zarray"), subkeys) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when will .zarray
be an immediate subkey of a group? By contrast, we do expect to find .zattrs
under the prefix of any v2 group with attributes.
src/zarr/v3/group.py
Outdated
) | ||
|
||
raise ValueError(msg) | ||
subkeys = await self.store_path.store.list_dir(self.store_path.path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Following #1686 (comment), this method should also return an async generator. Just calling that out as a change that will come here soon, what you've done is fine for now.
What I would love to see is something like:
async for subkey in self.store_path.store.list_regex(self.store_path.path, pattern=pattern):
yield await self.getitem(subkey)
src/zarr/v3/group.py
Outdated
except KeyError: | ||
# keyerror is raised when `subkey``names an object in the store | ||
# in which case `subkey` cannot be the name of a sub-array or sub-group. | ||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might be missing something here but if subkey
was just found in the store, why would we be getting a KeyError
? This pass
seems like it could silently ignore something that we care about.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If "foo"
is the name of a random non-zarr-related file in the store, group.getitem("foo")
raises KeyError
, because "foo"
is not the name of a sub-array or sub-group. So in this case, a KeyError
just means there's a random file in there, that we want to ignore. I'm not sure that there are sources of KeyError
(or any other exception) that we would want to handle here?
I think this is answered (in the negative) by the need to support asynchronous iteration over keys, and the fact that cloud backends (e.g., s3) will reasonably paginate requests for lists of keys. |
…(name, AsyncArray / AsyncGroup) pairs; Group.members repackages these into a dict.
group.children
group.children
--- group.members
group.children
--- group.members
group.members
the name of this method was discussed in an offline discussion in the biweekly refactoring meeting, and the conclusion was that we will switch the name from |
src/zarr/v3/sync.py
Outdated
async_iterator = await coroutine | ||
return [item async for item in async_iterator] | ||
# async_iterator = await coroutine | ||
return [item async for item in coroutine()] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@DahnJ I had to revert some of your changes from #1692 here in order to get the code to run. The type signatures are currently incorrect, but the code works, so we should either adjust the annotations to match the actual types, or adjust the implementation (and the actual types) to match the annotation.
@jhamman, I know you have some ideas about the direction for this
a design question: in the latest changes to this PR, |
…nto group_children_fix
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's get this in! It unlocks a bunch of work on the hierarchy. I left a few comments around future optimizations / directions but this is a good first step.
async def group_keys(self) -> AsyncGenerator[str, None]: | ||
async for key, value in self.members(): | ||
if isinstance(value, AsyncGroup): | ||
yield key |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A future optimization here would be to avoid constructing the AsyncGroup
and instead just generate the key based on filtering the metadata doc. As it is currently written, we construct all the members as AsyncGroup | AsyncArray
, only to return their name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additionally, we could allow Group.members
to filter to only='groups'
so that we never need construct arrays in these cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, i think checking the metadata should be much cheaper than instantiating AsyncArray
/ AsyncGroup
. I will have to think about what those changes would mean for the AsyncGroup.members
API, since that's what group_keys
relies on.
# self._async_group.children() | ||
# ) | ||
# return [Array(obj) if isinstance(obj, AsyncArray) else Group(obj) for obj in _children] | ||
def members(self) -> tuple[tuple[str, Array | Group], ...]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's continue to think about the return type here. I could see a (frozen) dict being a more ergonomic return type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm also a fan of something immutable and dict-like here. I think we can use #1787 to continue brainstorming
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Xarray has a FrozenDict
type which we could borrow: https://github.com/pydata/xarray/blob/b0036749542145794244dee4c4869f3750ff2dee/xarray/core/utils.py#L415-L443
Implements
Group.children
for v3.Tests are passing with the
LocalStore
. I still need to mock up a remote store for testingRemoteStore
, which will be easy to do with the pre-existing v2 tests fixtures (when this is done I will remove thein progress
label).I would appreciate a critical eye on how I implemented
children
, because I don't have a lot of experience with async iteration in python and maybe I did something that could be easily improved on.A few questions, that don't necessarily need answers in this PR:
children
? For a variety of reasons I prefermembers
, and that's what I settled on over in the zom zep.children
? I think natural sort order on the keys of the subarrays / subgroups is good. This requires that we get all the sub-array / subgroup keys at once, but I think that's how the storage APIs work anyway.TODO:
closes #1753