-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Manifest/Functional Store #427
base: main
Are you sure you want to change the base?
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #427 +/- ##
==========================================
- Coverage 88.98% 85.31% -3.67%
==========================================
Files 28 28
Lines 1706 1770 +64
==========================================
- Hits 1518 1510 -8
- Misses 188 260 +72
|
Sweet! Thanks @ayushnag. Note this requires #424 because of #424 (comment). |
@@ -54,6 +65,110 @@ def with_validation( | |||
return ChunkEntry(path=path, offset=offset, length=length) | |||
|
|||
|
|||
class ManifestStore(Store): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this class should be defined in it's own file, and potentially even made public (developer) API.
def __init__(self, fs, vds, read_only=True): | ||
super().__init__(read_only=read_only) | ||
self.fs = fs | ||
self.vds = vds |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if some of the variables in the dataset are not virtual? Do we have any need to support that case?
self.vds = None | ||
|
||
def __str__(self) -> str: | ||
return f"manifest://{id(self.vds)}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might want to build a custom repr using the xarray.Dataset
repr here. Low priority suggestion though.
supports_listing: bool = True | ||
|
||
fs: AsyncFileSystem | ||
vds: xr.Dataset |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically this could also be a DataTree
, or maybe a dict of Datasets
. But we might never need that.
# TODO: is this the best way? | ||
url, offset, length = self.vds[array_name].data.manifest.dict()[chunk_key] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a reason to improve the API of ChunkManifest
to make it more dict-like (separately from this PR). Although I guess we already have the entire numpy array containing all references already in memory.
async def exists(self, key: str) -> bool: | ||
array_name, _, chunk_key = key.split("/") | ||
url, _, _ = self.vds[array_name].data.manifest.dict()[chunk_key] | ||
return await self.fs._exists(url) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should instead look at the ._paths
array in the manifest and return True
if the path string is non-empty.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But I think we need to check the path for that specific chunk. The _paths
array cannot retrieve the specific path for that key
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I mean is that there is a convention in the ChunkManifest
class that a zarr store with missing chunks has those chunks represented using an empty path string in the manifest. (The numpy array implementation prevented me from just omitting those keys entirely.)
It's the readers' job to find out if a chunk doesn't actually exist, so by the time we get here we should already know which chunks exist in storage and which don't without having to reach out to the storage to find out.
docs/releases.rst
api.rst
Also discussed in depth here: #238 (comment)