-
-
Notifications
You must be signed in to change notification settings - Fork 305
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Getitems: support meta_array
#1131
Conversation
This pull request introduces 1 alert when merging 26b7d2d into b677db3 - view on LGTM.com new alerts:
|
Codecov Report
@@ Coverage Diff @@
## main #1131 +/- ##
=========================================
Coverage 100.00% 100.00%
=========================================
Files 36 37 +1
Lines 14799 14855 +56
=========================================
+ Hits 14799 14855 +56
|
Is this the last PR needed by kvik for full Zarr support ? |
Yes with this PR and rapidsai/kvikio#131, all building blocks should be in place to use Zarr backed by CuPy arrays! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Mads! This is very helpful 🙏
Asked one question above about API design. Curious to hear yours and others thoughts 🙂
reads of multiple keys and/or to utilize the meta_array argument. | ||
""" | ||
|
||
return {k: self[k] for k in keys if k in self} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Know fsspec-based storage layers are using a dict
return value, but am wondering if we should be doing something different (like returning an iterable). Asking since this would make the read blocking vs. a bit more lazy. The latter can be useful when working with operations that take a bit longer (like reading from the cloud or parallelizing several reads)
cc @martindurant (who also may have thoughts)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The point of it being blocking, is that many keys may be being fetched concurrently. If you make it an iterator, you lose that, unless you have something that can wait on an async iterator, which in turn has first-completed working.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with both of you.
The API specifies a return type of Mapping[str, Any]
thus it is possible to return a lazy mapping that only reads self[k]
when accessed. But let's do that in a follow up PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Understood, and will be interested in how that looks.
By the way, I mentioned elsewhere the possibility of passing key-specific metadata to the get function; that would happen in this same signature. I wonder if you have any use for an array where only some of it is destined for the GPU (perhaps because that data already exists there and doesn't need loading at all).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have an example of what they would look like, @martindurant?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As noted above, think this discussion gets hairy enough it should be broken out into an issue (likely two) and discussed separately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As noted above, think this discussion gets hairy enough it should be broken out into an issue (likely two) and discussed separately.
If we want to go the contexts
route, we don't need a meta_array
argument
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this also imply that getitem()
should have a contexts
parameter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this also imply that
getitem()
should have acontexts
parameter?
Good point, yes getitem()
should also take a contexts
parameter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait a second, there is no getitem()
and _chunk_getitem()
anymore :)
Just to clarify, this is a quite useful optimization (and much appreciated!). Though is it a requirement (in that an error was hit somewhere)? Is there a bit more context about how this came up (particularly if an issue was encountered)? |
It is required in the sense that a backend doesn’t know the memory destination of a read without this PR. In KvikIO’s backend, we must rely on key names to infer if the data goes to host or device memory. E.g. the following code check for keys that should go to host memory always: if os.path.basename(fn) in [
zarr.storage.array_meta_key,
zarr.storage.group_meta_key,
zarr.storage.attrs_key,
]: This is a problem since a |
Thanks for clarifying. Do we need |
I don't think it is as important as the |
@joshmoore do you have thoughts on this one? 🙂 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some minor comments, @jakirkham. I do get the feeling that after the refactoring that enabled V3, we likely do need some storage-extension-specific documentation notes.
reads of multiple keys and/or to utilize the meta_array argument. | ||
""" | ||
|
||
return {k: self[k] for k in keys if k in self} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have an example of what they would look like, @martindurant?
@@ -1930,76 +1923,44 @@ def _process_chunk( | |||
# store selected data in output | |||
out[out_selection] = tmp | |||
|
|||
def _chunk_getitem(self, chunk_coords, chunk_selection, out, out_selection, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this so private that no one could have made use of it in a subclass?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, AFAICT.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that's my understanding as well. Or at least by prefixing with _
we have warned users this is an implementation detail (subject to change) that they shouldn't rely on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jakirkham: this pre-dates me, so I defer. But if it's not documented somewhere we might want to review if that holds across the board. For me, _x
is typically valid for subclassing by developers, otherwise it would be a __x
. (i.e. public, protected, private in Java-parlance)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a Python thing generally (not specific to Zarr).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Single underscore if by far the most common thing to do, to show intent rather than enforce any privateness (which double underscore doesn't do either).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
zarr/core.py
Outdated
@@ -1257,20 +1257,13 @@ def _get_selection(self, indexer, out=None, fields=None): | |||
else: | |||
check_array_shape('out', out, out_shape) | |||
|
|||
# iterate over chunks | |||
if not hasattr(self.chunk_store, "getitems") or \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️ for removing hasattr
hacks in general. 👍
Resolved conflict. |
@joshmoore & @jakirkham - any objections to merging this as-is and deferring the broader discussions for future work? |
Thanks for the ping, @rabernat. I have to say I'm liking the typed |
Notice the link to FITS reading: another example of per-chunk parameters that could be fixed by passing around a context, although it's not the only way to do it. I'd say that there are a number of efforts (some more and some less well developed) waiting to see what zarr will do. |
@joshmoore, how can I help progress here? Maybe an online meeting? |
Sorry, @madsbk. My last comment was shortly before leaving on several weeks of travel and I lost of track of this. It wasn’t my intention to hold you up, and big 👍 for the ping. A few thoughts (from smallest to biggest) though I get the feeling that you are being ping-ponged back and forth between design issues:
|
Agree, fixed.
I have added
I don't think any of us have a clear answer to this. Therefore, I suggest that we merge this PR and let people explore its possibilities. As long as |
Thanks all! |
Thank you for your patience Mads. |
By using the new API in zarr-developers/zarr-python#1131, we do not have to guess whether to read into host or device memory. That is, no more filtering of specify keys like: ```python if os.path.basename(fn) in [ zarr.storage.array_meta_key, zarr.storage.group_meta_key, zarr.storage.attrs_key, ]: ``` Notice, this PR is on hold until Zarr v2.15 is released Closes #119 UPDATE: Zarr v2.15 has been released Authors: - Mads R. B. Kristensen (https://github.com/madsbk) Approvers: - Lawrence Mitchell (https://github.com/wence-) - Jordan Jacobelli (https://github.com/jjacobelli) URL: #131
BaseStore
now implementsgetitems()
, which takes ameta_array
argument. Besides simplifying the code, this enable stores to read directly to GPU memory: rapidsai/kvikio#131TODO: