Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Manifest/Functional Store #427

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

ayushnag
Copy link
Contributor

@ayushnag ayushnag commented Feb 5, 2025

Also discussed in depth here: #238 (comment)

Copy link

codecov bot commented Feb 5, 2025

Codecov Report

Attention: Patch coverage is 40.00000% with 39 lines in your changes missing coverage. Please review.

Project coverage is 85.31%. Comparing base (1782cee) to head (a5d9314).

Files with missing lines Patch % Lines
virtualizarr/manifests/manifest.py 40.00% 39 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #427      +/-   ##
==========================================
- Coverage   88.98%   85.31%   -3.67%     
==========================================
  Files          28       28              
  Lines        1706     1770      +64     
==========================================
- Hits         1518     1510       -8     
- Misses        188      260      +72     
Files with missing lines Coverage Δ
virtualizarr/manifests/manifest.py 75.19% <40.00%> (-12.96%) ⬇️

... and 5 files with indirect coverage changes

@TomNicholas
Copy link
Member

Sweet! Thanks @ayushnag.

Note this requires #424 because of #424 (comment).

@TomNicholas TomNicholas added the enhancement New feature or request label Feb 5, 2025
@@ -54,6 +65,110 @@ def with_validation(
return ChunkEntry(path=path, offset=offset, length=length)


class ManifestStore(Store):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this class should be defined in it's own file, and potentially even made public (developer) API.

def __init__(self, fs, vds, read_only=True):
super().__init__(read_only=read_only)
self.fs = fs
self.vds = vds
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if some of the variables in the dataset are not virtual? Do we have any need to support that case?

self.vds = None

def __str__(self) -> str:
return f"manifest://{id(self.vds)}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might want to build a custom repr using the xarray.Dataset repr here. Low priority suggestion though.

supports_listing: bool = True

fs: AsyncFileSystem
vds: xr.Dataset
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically this could also be a DataTree, or maybe a dict of Datasets. But we might never need that.

Comment on lines +110 to +111
# TODO: is this the best way?
url, offset, length = self.vds[array_name].data.manifest.dict()[chunk_key]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a reason to improve the API of ChunkManifest to make it more dict-like (separately from this PR). Although I guess we already have the entire numpy array containing all references already in memory.

Comment on lines +142 to +145
async def exists(self, key: str) -> bool:
array_name, _, chunk_key = key.split("/")
url, _, _ = self.vds[array_name].data.manifest.dict()[chunk_key]
return await self.fs._exists(url)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should instead look at the ._paths array in the manifest and return True if the path string is non-empty.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I think we need to check the path for that specific chunk. The _paths array cannot retrieve the specific path for that key

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I mean is that there is a convention in the ChunkManifest class that a zarr store with missing chunks has those chunks represented using an empty path string in the manifest. (The numpy array implementation prevented me from just omitting those keys entirely.)

It's the readers' job to find out if a chunk doesn't actually exist, so by the time we get here we should already know which chunks exist in storage and which don't without having to reach out to the storage to find out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Loading data from ManifestArrays without saving references to disk first
2 participants