
Getitems: support meta_array #1131

Merged: 33 commits, merged Apr 13, 2023

Commits
c268c63
Use _chunk_getitems() always
madsbk Sep 12, 2022
c7023f9
Implement getitems() always
madsbk Sep 12, 2022
89fa599
FSStore.getitems(): accept meta_array and on_error
madsbk Sep 12, 2022
f05ee3a
getitems(): handle on_error="omit"
madsbk Sep 12, 2022
0eed377
Removed the `on_error` argument
madsbk Sep 13, 2022
e578051
remove redundant check
madsbk Sep 13, 2022
03c97a8
getitems(): use Sequence instead of Iterable
madsbk Sep 13, 2022
8ba463c
Merge branch 'main' of github.com:zarr-developers/zarr-python into ge…
madsbk Sep 15, 2022
dfa731b
Merge branch 'main' into getitems
madsbk Sep 22, 2022
8e753d6
Merge branch 'main' into getitems
jakirkham Sep 23, 2022
f05edb1
Merge branch 'main' into getitems
jakirkham Sep 28, 2022
1d2b6ea
Typo
madsbk Oct 7, 2022
f87cb60
Merge branch 'main' into getitems
jakirkham Oct 7, 2022
eea7466
Merge branch 'main' of github.com:zarr-developers/zarr-python into ge…
madsbk Oct 11, 2022
05be1d4
Introduce a contexts argument
madsbk Oct 12, 2022
af54f7e
Merge branch 'main' of github.com:zarr-developers/zarr-python into ge…
madsbk Oct 12, 2022
5513d6f
CountingDict: impl. getitems()
madsbk Oct 12, 2022
ad46ccc
Merge branch 'main' of github.com:zarr-developers/zarr-python into ge…
madsbk Oct 12, 2022
40ecba0
added test_getitems()
madsbk Oct 12, 2022
0f224d4
Merge branch 'main' of github.com:zarr-developers/zarr-python into ge…
madsbk Oct 24, 2022
61d7a03
Merge branch 'main' of github.com:zarr-developers/zarr-python into ge…
madsbk Nov 4, 2022
81549f5
Introduce Context
madsbk Nov 4, 2022
2bfe68a
doc
madsbk Nov 4, 2022
4e9c39e
Merge branch 'main' of github.com:zarr-developers/zarr-python into ge…
madsbk Dec 6, 2022
a250f64
Merge branch 'main' of github.com:zarr-developers/zarr-python into ge…
madsbk Mar 13, 2023
02fc80d
support the new get_partial_values() method
madsbk Mar 13, 2023
61981f2
Merge branch 'main' of github.com:zarr-developers/zarr-python into ge…
madsbk Mar 13, 2023
d0afcde
Resolve conflict with get_partial_values()
madsbk Mar 13, 2023
7e01831
Merge branch 'main' of github.com:zarr-developers/zarr-python into ge…
madsbk Apr 11, 2023
d9838ef
make contexts keyword-only
madsbk Apr 11, 2023
c3ee95f
Introduce ConstantMap
madsbk Apr 11, 2023
ec5f396
use typing.Mapping
madsbk Apr 11, 2023
a1d3520
test_constant_map
madsbk Apr 11, 2023
31 changes: 30 additions & 1 deletion zarr/_storage/store.py
@@ -2,10 +2,11 @@
import os
from collections.abc import MutableMapping
from string import ascii_letters, digits
from typing import Any, List, Mapping, Optional, Union
from typing import Any, Sequence, List, Mapping, Optional, Union

from zarr.meta import Metadata2, Metadata3
from zarr.util import normalize_storage_path
from zarr.context import Context

# v2 store keys
array_meta_key = '.zarray'
@@ -129,6 +130,34 @@ def _ensure_store(store: Any):
f"wrap it in Zarr.storage.KVStore. Got {store}"
)

def getitems(
self, keys: Sequence[str], contexts: Mapping[str, Context] = {}
) -> Mapping[str, Any]:
"""Retrieve data from multiple keys.

Parameters
----------
keys : Sequence[str]
The keys to retrieve
contexts : Mapping[str, Context]
A mapping of keys to their context. Each context is a mapping of store
specific information. E.g. a context could be a dict telling the store
the preferred output array type: `{"meta_array": cupy.empty(())}`
Comment on lines +144 to +147
Contributor:

I don't understand why the context needs to be key specific. If a user wants to load data into the GPU, surely they want to do that for the entire read, not just some chunks. It seems like we could simplify this a lot by just providing one single context to getitems, rather than a mapping.

There is also likely a performance cost to doing it this way, no? If you're reading 1000s of chunks at a time, it seems silly to be carrying around all these big mappings to the same context.

Contributor Author (@madsbk, Mar 10, 2023):

Agree, this is to support the generic context approach. Although, it is possible to limit the performance impact by using a custom Mapping that returns the same value for all keys.
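
For illustration, such a constant mapping might look like the sketch below (hypothetical code, not the PR's; commit c3ee95f later introduces a `ConstantMap` along these lines, whose actual implementation may differ):

```python
from collections.abc import Mapping


class ConstantMap(Mapping):
    """Read-only mapping that returns the same value for every key.

    Avoids materializing one dict entry per chunk key when all chunks
    share a single context.
    """

    def __init__(self, keys, constant):
        self._keys = frozenset(keys)
        self._constant = constant

    def __getitem__(self, key):
        if key not in self._keys:
            raise KeyError(key)
        return self._constant

    def __iter__(self):
        return iter(self._keys)

    def __len__(self):
        return len(self._keys)


contexts = ConstantMap(["0", "1", "2"], {"meta_array": "device-array"})
assert contexts["1"] == {"meta_array": "device-array"}
```

Every lookup returns the one shared context object, so the cost is independent of how many chunk keys are read.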

Also notice, the current code implements the context approach where the meta_array argument is embedded within the context argument. We could rollback to my original approach where getitems() takes the meta_array as an argument directly:

    def getitems(
        self, keys: Sequence[str], meta_array: NDArrayLike
    ) -> Mapping[str, Any]:

Contributor:

I think we should not try to cover generic context behavior for hypothetical use cases right now. We should strive to make the API and implementation as simple as possible to achieve the specific goal.

At the same time, users of this API should recognize that it is experimental and may change in the future. I fully support developing the context concept more generically...I just don't think this PR should be blocked on that.


Returns
-------
Mapping
A collection mapping the input keys to their results.

Notes
-----
This default implementation uses __getitem__() to read each key sequentially and
ignores contexts. Override this method to implement concurrent reads of multiple
keys and/or to utilize the contexts.
"""

return {k: self[k] for k in keys if k in self}
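
As a sketch of how a store might act on these per-key contexts when overriding getitems(), consider this toy store (`DeviceHintStore` and its attributes are made-up illustrations, not zarr classes; it only records the hint rather than decoding into the requested array type):

```python
from typing import Any, Mapping, Sequence


class DeviceHintStore:
    """Toy dict-backed store whose getitems() inspects per-key contexts.

    Purely illustrative: a real zarr store would subclass BaseStore and
    decode each value into the array type hinted by ``meta_array``;
    here we only record which keys carried a hint.
    """

    def __init__(self, data):
        self._data = dict(data)
        self.hinted_keys = []

    def getitems(
        self, keys: Sequence[str], contexts: Mapping[str, Any] = {}
    ) -> Mapping[str, Any]:
        for k in keys:
            # Contexts are optional and per-key; fall back to an empty one.
            if "meta_array" in contexts.get(k, {}):
                self.hinted_keys.append(k)
        # Same shape as the default implementation: missing keys are omitted.
        return {k: self._data[k] for k in keys if k in self._data}


store = DeviceHintStore({"0": b"\x00", "1": b"\x01"})
out = store.getitems(["0", "1", "2"], contexts={"1": {"meta_array": object()}})
assert sorted(out) == ["0", "1"]
```
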
Member:

Know fsspec-based storage layers are using a dict return value, but am wondering if we should be doing something different (like returning an iterable). Asking since this would make the read blocking vs. a bit more lazy. The latter can be useful when working with operations that take a bit longer (like reading from the cloud or parallelizing several reads)

cc @martindurant (who also may have thoughts)

Member:

The point of it being blocking, is that many keys may be being fetched concurrently. If you make it an iterator, you lose that, unless you have something that can wait on an async iterator, which in turn has first-completed working.

Contributor Author:

I agree with both of you.
The API specifies a return type of Mapping[str, Any], so it is possible to return a lazy mapping that only reads self[k] when accessed. But let's do that in a follow-up PR?
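
A lazy mapping along those lines might defer each read until the key is accessed (a sketch only; the follow-up PR mentioned here may look quite different):

```python
from collections.abc import Mapping


class LazyGetitems(Mapping):
    """Map each key to store[key], deferring the read until access."""

    def __init__(self, store, keys):
        self._store = store
        # Membership is resolved up front so ``k in result`` stays cheap.
        self._keys = [k for k in keys if k in store]

    def __getitem__(self, key):
        if key not in self._keys:
            raise KeyError(key)
        return self._store[key]  # the actual (possibly slow) read happens here

    def __iter__(self):
        return iter(self._keys)

    def __len__(self):
        return len(self._keys)


backing = {"a": b"1", "b": b"2"}
lazy = LazyGetitems(backing, ["a", "b", "missing"])
assert list(lazy) == ["a", "b"]
assert lazy["b"] == b"2"
```

Note the caveat raised above still applies: deferring reads this way gives up the chance to fetch all keys concurrently in one batch.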

Member:

Understood, and will be interested in how that looks.

By the way, I mentioned elsewhere the possibility of passing key-specific metadata to the get function; that would happen in this same signature. I wonder if you have any use for an array where only some of it is destined for the GPU (perhaps because that data already exists there and doesn't need loading at all).

Member:

Do you have an example of what they would look like, @martindurant?

Member:

As noted above, think this discussion gets hairy enough it should be broken out into an issue (likely two) and discussed separately.

Contributor Author:

> As noted above, think this discussion gets hairy enough it should be broken out into an issue (likely two) and discussed separately.

If we want to go the contexts route, we don't need a meta_array argument

Member:

Does this also imply that getitem() should have a contexts parameter?

Contributor Author:

> Does this also imply that getitem() should have a contexts parameter?

Good point, yes getitem() should also take a contexts parameter

Contributor Author:

Wait a second, there is no getitem() and _chunk_getitem() anymore :)



class Store(BaseStore):
"""Abstract store class used by implementations following the Zarr v2 spec.
19 changes: 19 additions & 0 deletions zarr/context.py
@@ -0,0 +1,19 @@

from typing import TypedDict

from numcodecs.compat import NDArrayLike


class Context(TypedDict, total=False):
""" A context for component-specific information

All keys are optional. Any component reading the context must provide
a default implementation in the case a key cannot be found.

Items
-----
meta_array : array-like, optional
An array-like instance to use for determining the preferred output
array type.
"""
meta_array: NDArrayLike
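
To illustrate the fallback contract the docstring describes, a component reading this context might do something like the following (a sketch using numpy as the stand-in array-like; `Context` and `preferred_empty` are re-declared locally so the snippet is self-contained):

```python
from typing import TypedDict

import numpy as np


class Context(TypedDict, total=False):  # mirrors zarr.context.Context
    meta_array: np.ndarray


def preferred_empty(context: Context, shape):
    # The reader supplies its own default when the key is absent,
    # as the docstring above requires.
    meta = context.get("meta_array", np.empty(()))
    return np.empty_like(meta, shape=shape)


# With an empty context the fallback (a plain numpy array) is used;
# passing e.g. a cupy array as meta_array would allocate on the device instead.
out = preferred_empty({}, (2, 3))
assert out.shape == (2, 3)
```
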
84 changes: 24 additions & 60 deletions zarr/core.py
@@ -1257,20 +1257,13 @@ def _get_selection(self, indexer, out=None, fields=None):
else:
check_array_shape('out', out, out_shape)

# iterate over chunks
if not hasattr(self.chunk_store, "getitems") or \
Member:

❤️ for removing hasattr hacks in general. 👍

any(map(lambda x: x == 0, self.shape)):
# sequentially get one key at a time from storage
for chunk_coords, chunk_selection, out_selection in indexer:

# load chunk selection into output array
self._chunk_getitem(chunk_coords, chunk_selection, out, out_selection,
drop_axes=indexer.drop_axes, fields=fields)
else:
# allow storage to get multiple items at once
if math.prod(out_shape) > 0:
# get chunks
lchunk_coords, lchunk_selection, lout_selection = zip(*indexer)
self._chunk_getitems(lchunk_coords, lchunk_selection, out, lout_selection,
drop_axes=indexer.drop_axes, fields=fields)
self._chunk_getitems(
lchunk_coords, lchunk_selection, out, lout_selection,
drop_axes=indexer.drop_axes, fields=fields
)

if out.shape:
return out
@@ -1930,76 +1923,44 @@ def _process_chunk(
# store selected data in output
out[out_selection] = tmp

def _chunk_getitem(self, chunk_coords, chunk_selection, out, out_selection,
Member:

Is this so private that no one could have made use of it in a subclass?

Contributor Author:

Yes, AFAICT.

Member:

Yeah that's my understanding as well. Or at least by prefixing with _ we have warned users this is an implementation detail (subject to change) that they shouldn't rely on.

Member:

@jakirkham: this pre-dates me, so I defer. But if it's not documented somewhere we might want to review if that holds across the board. For me, _x is typically valid for subclassing by developers, otherwise it would be a __x. (i.e. public, protected, private in Java-parlance)

Member:

It's a Python thing generally (not specific to Zarr).

Member:

Single underscore is by far the most common thing to do, to show intent rather than enforce any privateness (which double underscore doesn't do either).

drop_axes=None, fields=None):
"""Obtain part or whole of a chunk.
def _chunk_getitems(self, lchunk_coords, lchunk_selection, out, lout_selection,
drop_axes=None, fields=None):
"""Obtain part or whole of chunks.

Parameters
----------
chunk_coords : tuple of ints
Indices of the chunk.
chunk_selection : selection
Location of region within the chunk to extract.
chunk_coords : list of tuple of ints
Indices of the chunks.
chunk_selection : list of selections
Location of region within the chunks to extract.
out : ndarray
Array to store result in.
out_selection : selection
Location of region within output array to store results in.
out_selection : list of selections
Location of regions within output array to store results in.
drop_axes : tuple of ints
Axes to squeeze out of the chunk.
fields
TODO

"""
out_is_ndarray = True
try:
out = ensure_ndarray_like(out)
except TypeError:
out_is_ndarray = False

assert len(chunk_coords) == len(self._cdata_shape)

# obtain key for chunk
ckey = self._chunk_key(chunk_coords)

try:
# obtain compressed data for chunk
cdata = self.chunk_store[ckey]

except KeyError:
# chunk not initialized
if self._fill_value is not None:
if fields:
fill_value = self._fill_value[fields]
else:
fill_value = self._fill_value
out[out_selection] = fill_value

else:
self._process_chunk(out, cdata, chunk_selection, drop_axes,
out_is_ndarray, fields, out_selection)

def _chunk_getitems(self, lchunk_coords, lchunk_selection, out, lout_selection,
drop_axes=None, fields=None):
"""As _chunk_getitem, but for lists of chunks

This gets called where the storage supports ``getitems``, so that
it can decide how to fetch the keys, allowing concurrency.
"""
out_is_ndarray = True
try:
out = ensure_ndarray_like(out)
except TypeError: # pragma: no cover
out_is_ndarray = False

# Keys to retrieve
ckeys = [self._chunk_key(ch) for ch in lchunk_coords]

partial_read_decode = False
# Check if we can do a partial read
if (
self._partial_decompress
and self._compressor
and self._compressor.codec_id == "blosc"
and hasattr(self._compressor, "decode_partial")
and not fields
and self.dtype != object
and hasattr(self.chunk_store, "getitems")
):
partial_read_decode = True
cdatas = {
@@ -2008,8 +1969,11 @@ def _chunk_getitems(self, lchunk_coords, lchunk_selection, out, lout_selection,
if ckey in self.chunk_store
}
else:
partial_read_decode = False
cdatas = self.chunk_store.getitems(ckeys, on_error="omit")
contexts = {}
if not isinstance(self._meta_array, np.ndarray):
contexts = {k: {"meta_array": self._meta_array} for k in ckeys}
cdatas = self.chunk_store.getitems(ckeys, contexts=contexts)

for ckey, chunk_select, out_select in zip(ckeys, lchunk_selection, lout_selection):
if ckey in cdatas:
self._process_chunk(
8 changes: 6 additions & 2 deletions zarr/storage.py
@@ -31,7 +31,7 @@
from os import scandir
from pickle import PicklingError
from threading import Lock, RLock
from typing import Optional, Union, List, Tuple, Dict, Any
from typing import Sequence, Mapping, Optional, Union, List, Tuple, Dict, Any
import uuid
import time

@@ -42,6 +42,7 @@
ensure_contiguous_ndarray_like
)
from numcodecs.registry import codec_registry
from zarr.context import Context

from zarr.errors import (
MetadataError,
@@ -1361,7 +1362,10 @@ def _normalize_key(self, key):

return key.lower() if self.normalize_keys else key

def getitems(self, keys, **kwargs):
def getitems(
self, keys: Sequence[str], contexts: Mapping[str, Context] = {}
) -> Mapping[str, Any]:

keys_transformed = [self._normalize_key(key) for key in keys]
results = self.map.getitems(keys_transformed, on_error="omit")
# The function calling this method may not recognize the transformed keys
28 changes: 28 additions & 0 deletions zarr/tests/test_storage.py
@@ -9,6 +9,7 @@
import tempfile
from contextlib import contextmanager
from pickle import PicklingError
from typing import Any, Mapping, Sequence
from zipfile import ZipFile

import numpy as np
@@ -20,6 +21,7 @@
import zarr
from zarr._storage.store import _get_hierarchy_metadata
from zarr.codecs import BZ2, AsType, Blosc, Zlib
from zarr.context import Context
from zarr.convenience import consolidate_metadata
from zarr.errors import ContainsArrayError, ContainsGroupError, MetadataError
from zarr.hierarchy import group
@@ -2572,3 +2574,29 @@ def test_meta_prefix_6853():

fixtures = group(store=DirectoryStore(str(fixture)))
assert list(fixtures.arrays())


def test_getitems_contexts():

class MyStore(CountingDict):
def __init__(self):
super().__init__()
self.last_contexts = None

def getitems(
self, keys: Sequence[str], contexts: Mapping[str, Context] = {}
) -> Mapping[str, Any]:
self.last_contexts = contexts
return {k: self.wrapped[k] for k in keys if k in self.wrapped}

store = MyStore()
z = zarr.create(shape=(10,), store=store)

# By default, no contexts are given to the store's getitems()
z[0]
assert len(store.last_contexts) == 0

# Setting a non-default meta_array will create contexts for the store's getitems()
z._meta_array = "my_meta_array"
z[0]
assert store.last_contexts == {'0': {'meta_array': 'my_meta_array'}}
9 changes: 9 additions & 0 deletions zarr/tests/util.py
@@ -1,6 +1,8 @@
import collections
import os
import tempfile
from typing import Any, Mapping, Sequence
from zarr.context import Context

from zarr.storage import Store
from zarr._storage.v3 import StoreV3
@@ -42,6 +44,13 @@ def __delitem__(self, key):
self.counter['__delitem__', key] += 1
del self.wrapped[key]

def getitems(
self, keys: Sequence[str], contexts: Mapping[str, Context] = {}
) -> Mapping[str, Any]:
for key in keys:
self.counter['__getitem__', key] += 1
return {k: self.wrapped[k] for k in keys if k in self.wrapped}


class CountingDictV3(CountingDict, StoreV3):
pass