
Scene.__getitem__ not "greedy" enough #2331

Open
BENR0 opened this issue Dec 20, 2022 · 17 comments

BENR0 commented Dec 20, 2022

Describe the bug
Currently it is possible to get a dataset that is identified only by its name using a more specific DataID that additionally contains, for example, a resolution. The same applies to `DataID in Scene` membership tests and to dataset deletion (see the code examples below).
A __setitem__ operation with a more specific DataID does overwrite the dataset with the less specific DataID (at least with satpy main as of 04.12.2024), but the DataID added to the dependency_tree is wrong.

To Reproduce

import numpy as np
import xarray as xr

from satpy import Scene
from satpy.tests.utils import make_dataid

scene = Scene()
scene['1'] = ds1 = xr.DataArray(np.arange(5))
dataid = make_dataid(name='1', resolution=2000)
print(dataid in scene) # => True, but should be False

scene[dataid] # => returns ds1, but should raise a KeyError

del scene[dataid]
print(scene) # => empty, but should not be


scene['1'] = ds1 = xr.DataArray(np.arange(5))
print(scene._dependency_tree)

scene[dataid] = ds2 = xr.DataArray(np.arange(5, 10))
print(scene._dependency_tree)

Expected behavior
See code example.

Environment Info:

  • OS: Linux
  • Satpy Version: main branch
  • PyResample Version: 1.26.0

Additional context
As far as I can see this behaviour is due to the get_key method in DatasetDict not being "greedy" enough:

def get_key(self, match_key, num_results=1, best=True, **dfilter):

djhoese commented Dec 3, 2024

Related: #1807 and the bug report it references #1806

djhoese commented Dec 3, 2024

Here's the current test I've made to work on this:

def test_dataset_dict_contains_inexact_match():
    """Test that DatasetDict does not match inexact keys to existing keys.

    Specifically, check that additional DataID properties aren't ignored
    when querying the DatasetDict.

    See https://github.com/pytroll/satpy/issues/2331.
    """
    import xarray as xr

    from satpy.dataset.data_dict import DatasetDict
    from satpy.tests.utils import make_dataid

    dd = DatasetDict()
    name = "1"
    item = xr.DataArray(())
    dd[name] = item
    exact_id = make_dataid(name=name)
    assert exact_id in dd

    inexact_id = make_dataid(name=name, resolution=2000)
    assert inexact_id not in dd

Let me know if this doesn't match the initial failing case in your example code.

djhoese commented Dec 3, 2024

@BENR0 If we dive deeper into get_key it hits a point where it takes all the possible DataIDs in the DatasetDict and filters out those that don't match the DataQuery (your DataID with resolution in this case). It hits this point in the filtering:

def _match_dataid(self, dataid):
    """Match the dataid with the current query."""
    if self._shares_required_keys(dataid):
        keys_to_check = set(dataid.keys()) & set(self._fields)
    else:
        keys_to_check = set(dataid._id_keys.keys()) & set(self._fields)
    if not keys_to_check:
        return False
    return all(self._match_query_value(key, dataid.get(key)) for key in keys_to_check)

where keys_to_check is {"name"} and does NOT include resolution. This is because the contained DataID is just DataID(name="1"), so the code sees no point in also searching the other properties. Why it does it this way...I'm not sure. I hope to look later tonight or tomorrow.
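The effect of that intersection can be reproduced with plain dictionaries (hypothetical stand-ins for the stored DataID and the query fields, not Satpy's real classes):

```python
# Hypothetical stand-ins: the stored DataID only has "name",
# while the query carries both "name" and "resolution".
dataid = {"name": "1"}                     # DataID(name="1")
query = {"name": "1", "resolution": 2000}  # DataQuery-like fields

# Mirrors keys_to_check = set(dataid.keys()) & set(self._fields):
keys_to_check = set(dataid) & set(query)
print(keys_to_check)  # {'name'} -- "resolution" is silently dropped

# So the match only compares "name" and reports success:
matches = all(dataid.get(k) == query[k] for k in keys_to_check)
print(matches)  # True, even though the DataID has no resolution at all
```

The intersection can only ever shrink the set of compared keys, which is exactly the "not greedy enough" behaviour reported above.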

djhoese commented Dec 3, 2024

Another way I see of looking at the above linked method is that keys_to_check and the set operations that create it are trying to say "only check keys that the DataQuery has, don't check those that only the DataID has". This makes sense because we wouldn't want to check if the DataID's calibration matched the DataQuery's calibration if the DataQuery didn't have calibration as a parameter. But what these set operations are also doing by accident (I think) is saying "don't check keys that the DataID doesn't have". Maybe that can help me nail down a solution.

Edit: Another interesting find: the ID keys used when assigning to the Scene with a single string ("1" in your example) only include "name" and "resolution":

In [5]: scene = Scene()

In [6]: scene['1'] = ds1 = xr.DataArray(np.arange(5))

In [7]: did = list(scene._datasets.keys())[0]

In [8]: did._id_keys
Out[8]: {'name': {'required': True}, 'resolution': {'transitive': True}}

Now why would resolution be among those ID keys?

Edit 2: Oh those are the minimal default ID keys set in dataid.py.

Edit 3:

Here's another small example of the difficulties of making this work:

  1. When you do scene["1"] = xr.DataArray(..., attrs={}) the underlying DataID that gets created will have the minimal set of ID keys which are "name" and "resolution".
  2. If you do did = make_dataid(name="1") (which is a testing utility function), the resulting DataID actually includes modifiers=().
  3. So when we get inside the DatasetDict for did in dataset_dict (same as did in scene), we need to say "does the DataID which has no modifiers, have exactly modifiers equal to '()'". The answer is no and that's why some of the fuzziness is going on here about what matches and what doesn't and what keys to look at.
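The mismatch in point 3 can be illustrated with plain dicts (a sketch of the comparison problem, not the real DataID equality code):

```python
# Hypothetical stand-ins for the two identifiers in points 1 and 2:
scene_id = {"name": "1"}                   # from scene["1"] = ...
query_id = {"name": "1", "modifiers": ()}  # from make_dataid(name="1")

# An exact comparison fails, because "has no modifiers key" is not
# the same thing as "modifiers == ()":
exact_equal = scene_id == query_id
print(exact_equal)  # False
```

Any lookup therefore has to decide how much "fuzziness" to allow between an absent key and an empty value, which is where the current ambiguity comes from.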

BENR0 commented Dec 4, 2024

@djhoese Thanks for looking into this. I already suspected this would not be an easy fix. I will have to take a closer look before I can comment on your points, since I am not that familiar with that part of the code.

djhoese commented Dec 4, 2024

No worries. My comments are mostly for my own record keeping.

djhoese commented Dec 4, 2024

@BENR0 @mraspaud Ok I found a test that fails when I make the case in this PR pass. The test is exactly the case we have in this issue and are trying to fix:

        dataid_container = [DataID(minimal_default_keys_config,
                                   name="natural_color")]
        dq = DataQuery(name="natural_color", resolution=250)
        assert len(dq.filter_dataids(dataid_container)) == 1

So in this case, why does this data query, which specifies resolution=250, match a DataID that doesn't have resolution specified? There must be some use case here that I'm forgetting.

djhoese commented Dec 4, 2024

Ok this issue is going to just be my dumping ground for things as I learn them. So the current (existing) test that I'm playing with is:

        scene = Scene(filenames=["fake1_1.txt"], reader="fake1")
        comp25 = make_cid(name="comp25", resolution=1000)
        scene[comp25] = xr.DataArray([], attrs={"name": "comp25", "resolution": 1000})
        scene.load(["comp25"], resolution=500)

        loaded_ids = list(scene._datasets.keys())
        assert len(loaded_ids) == 2
        assert loaded_ids[0]["name"] == "comp25"
        assert loaded_ids[0]["resolution"] == 500
        assert loaded_ids[1]["name"] == "comp25"
        assert loaded_ids[1]["resolution"] == 1000

Which with the below changes:

Subject: [PATCH] Search all DataID keys for query matching
===================================================================
diff --git a/satpy/dataset/dataid.py b/satpy/dataset/dataid.py
--- a/satpy/dataset/dataid.py	(revision 452c1f6fccdaa513458b19890343c2a334b30557)
+++ b/satpy/dataset/dataid.py	(date 1733326086455)
@@ -583,10 +583,11 @@
 
     def _match_dataid(self, dataid):
         """Match the dataid with the current query."""
-        if self._shares_required_keys(dataid):
-            keys_to_check = set(dataid.keys()) & set(self._fields)
-        else:
-            keys_to_check = set(dataid._id_keys.keys()) & set(self._fields)
+        keys_to_check = set(dataid._id_keys.keys()) & set(self._fields)
+        # if self._shares_required_keys(dataid):
+        #     keys_to_check = set(dataid.keys()) & set(self._fields)
+        # else:
+        #     keys_to_check = set(dataid._id_keys.keys()) & set(self._fields)
         if not keys_to_check:
             return False
         return all(self._match_query_value(key, dataid.get(key)) for key in keys_to_check)

Fails on the scene.load because, from what I can tell, the loading tries to say "of the composites we loaded from the YAML, find DataID(name='comp25', resolution=500)". This request makes sense because we gave the global request for resolution=500, but in this patched version of Satpy it fails because that DataID does not exist in the DatasetDict of composites we know about. We of course know what "comp25" is, but we never try to search for it because we were given resolution=500. So...is it correct for Satpy to assume that composites can and will be made at the correct resolution by the proper dependencies being loaded, and that it is fine to drop all non-required ID keys/parameters when searching for a composite? I think, from a generic DatasetDict point of view, the current Satpy implementation is the wrong approach: you ask "is X in dataid_container" and get a positive result even though X does not exist in that container. I'll see if I can come up with a reasonable solution.

djhoese commented Dec 4, 2024

Man, this is really turning into a Satpy 1.0 type of thing. So the biggest competing functionalities currently:

  1. Compositors (the class that creates a composite) are typically only referred to by "name", but have the ability to modify that information in their __init__ and by whatever the user specifies in the YAML. It ends up getting used in the CompositeBase.id property, which creates a DataID used by the dependency tree. Note I don't know how often a compositor modifies its DataID in __init__ in the current builtin composite definitions.
  2. Passing query parameters to Scene.load. For example scene.load(["some_composite"], resolution=500). Should the compositor instance "have" resolution=500? Or should only the DataArray it produces have that and have that based on the inputs?

Possible (partial) solutions rolling around in my brain:

  1. Don't instantiate compositor classes until they are used. For example, wrapping them in some wrapper class so the .id stuff is easier to manipulate before the compositor instance needs it.
  2. Modify how compositors are identified in the dependency tree so the dep tree ID for a compositor might differ from compositor.id. This is already true for prereqs and optional prereqs which have separate instances in the dependency tree nodes.
  3. Allow copying of a Compositor class...no...I don't like this.

Edit: Turns out this idea of not creating the compositor instance until it is needed/used is actually how modifiers work. The dictionary of modifiers loaded from YAML is a tuple of (modifier_class, modifier_options).

djhoese commented Dec 5, 2024

Ugh this stuff is all terrible. So I started working on the solution where compositors/composites would now have all of the query information in their identifiers. So if you asked for resolution=500 then the compositor would be identified in the dependency tree by DataID(name="comp", resolution=500). However, this solution only worked because resolution is in the minimal ID keys (which are name and resolution only). When it comes to other properties like calibration=, my solution ignores them because they aren't part of the compositor's satpy ID keys (minimal ID keys - name and resolution). This then fails later because it still tries to load the version of the ID with the calibration=. Wait...can I change that? Let's see...

Edit: Actually, hold up, I was wrong about why my solution wasn't working I think. The main issue is that DataID(name="comp3") == DataQuery(name="comp3", calibration="reflectance").

Edit 2: @mraspaud I'm not sure how I feel about DataQuery equality only checking shared keys. At least, it should be stricter for DataIDs shouldn't it?

def __eq__(self, other):
    """Compare the DataQuerys.

    A DataQuery is considered equal to another DataQuery or DataID
    if they have common keys that have equal values.
    """
    sdict = self._asdict()
    try:
        odict = other._asdict()
    except AttributeError:
        return False
    common_keys = False
    for key, val in sdict.items():
        if key in odict:
            common_keys = True
            if odict[key] != val and val is not None:
                return False
    return common_keys
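For comparison, here is how the current shared-keys rule differs from a stricter one, sketched with plain dicts (hypothetical stand-ins, not a drop-in patch for DataQuery.__eq__):

```python
def loose_eq(query: dict, other: dict) -> bool:
    """Mimics the current behaviour: equal if the *shared* keys agree."""
    common = False
    for key, val in query.items():
        if key in other:
            common = True
            if other[key] != val and val is not None:
                return False
    return common

def strict_eq(query: dict, other: dict) -> bool:
    """Hypothetical stricter rule: every query key must exist on `other`."""
    return bool(query) and all(
        key in other and other[key] == val for key, val in query.items()
    )

q = {"name": "comp3", "calibration": "reflectance"}
did = {"name": "comp3"}
print(loose_eq(q, did))   # True -- only the shared "name" key is compared
print(strict_eq(q, did))  # False -- "calibration" is missing on the DataID
```

The loose version is what makes DataID(name="comp3") == DataQuery(name="comp3", calibration="reflectance") evaluate as equal in the edit above.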

BENR0 commented Dec 5, 2024

1. Compositors (the class that creates a composite) are typically only referred to by "name", but have the ability to modify that information in their `__init__` and by whatever the user specifies in the YAML. It ends up getting used in the `CompositeBase.id` property which creates a `DataID` used by the dependency tree. Note I don't know how often a compositor modifies its DataID in `__init__` in the current builtin composite definitions.

I think #2333 is somehow connected to this point as well. While it is a different use case, it is also related to how/when `DataID`s are generated.

I probably don't have a complete overview of how the loading mechanism, DataID, and DependencyTree work together, but I think compositors or modifiers shouldn't be in charge of generating/changing DataIDs themselves. Instead they should put appropriate metadata into the DataArrays they return, and Scene.__setitem__ should handle the generation of the DataID from those attributes. That way the generation would happen in a central place.

BENR0 commented Dec 5, 2024

Ok this issue is going to just be my dumping ground for things as I learn them. So the current (existing) test that I'm playing with is:

        scene = Scene(filenames=["fake1_1.txt"], reader="fake1")
        comp25 = make_cid(name="comp25", resolution=1000)
        scene[comp25] = xr.DataArray([], attrs={"name": "comp25", "resolution": 1000})
        scene.load(["comp25"], resolution=500)

        loaded_ids = list(scene._datasets.keys())
        assert len(loaded_ids) == 2
        assert loaded_ids[0]["name"] == "comp25"
        assert loaded_ids[0]["resolution"] == 500
        assert loaded_ids[1]["name"] == "comp25"
        assert loaded_ids[1]["resolution"] == 1000

For the DataID matching when looking for datasets in the Scene I think it is necessary to think about what different scenarios should do (in #2332 I added a couple of tests for different cases how I think it should be but I think there are still some cases missing).

Assuming:

Scene:
|-DataID(name=1)
+-DataID(name=1, resolution=1000)

I think
Scene[DataID(name=1, resolution=1000)] should clearly get the second dataset

In the case

Scene:
+-DataID(name=1)

Scene[DataID(name=1, resolution=1000)] should give a KeyError

In the case

Scene:
+-DataID(name=1, resolution=1000)

What should Scene[DataID(name=1)] return? Also a KeyError or should the dataset be returned because it is the only one with that name?

And in the case

Scene:
|-DataID(name=1)
+-DataID(name=1, resolution=1000)

What should Scene[DataID(name=1)] return? Both datasets or only the first one with the exact match?
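For the DataID (as opposed to DataQuery) side of these cases, a strict exact-match reading would behave like this (a sketch with dict stand-ins for DataIDs; how a DataQuery lookup should behave is a separate question):

```python
def strict_lookup(scene_ids, key):
    """Hypothetical exact-match lookup: a DataID key must equal a
    stored id exactly, otherwise KeyError (dict stand-ins for DataIDs)."""
    for did in scene_ids:
        if did == key:
            return did
    raise KeyError(key)

ids = [{"name": "1"}, {"name": "1", "resolution": 1000}]

# Fully specified id -> unambiguously the second dataset:
hit = strict_lookup(ids, {"name": "1", "resolution": 1000})

# Scene[DataID(name=1)] with both present -> only the exact match:
exact = strict_lookup(ids, {"name": "1"})

# An id more specific than anything stored -> KeyError:
try:
    strict_lookup([{"name": "1"}], {"name": "1", "resolution": 1000})
    missed = False
except KeyError:
    missed = True
print(missed)  # True
```

Under this reading, cases 2 and 3 above both raise KeyError and case 4 returns only the exact match; whether that is too strict for interactive use is exactly the open question.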

BENR0 commented Dec 5, 2024

Also I have been thinking about being able to get multiple datasets with a list like it is possible in xarray which is not directly related to the problem at hand but might be worth thinking about while working on a solution for this?
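A list-based __getitem__ along the lines of xarray's ds[["a", "b"]] could be sketched like this (a hypothetical toy on top of dict, not a proposed Scene implementation):

```python
class MultiGetDict(dict):
    """Hypothetical sketch: a list key returns a list of values,
    mirroring xarray's ds[["a", "b"]] multi-variable selection."""

    def __getitem__(self, key):
        if isinstance(key, list):
            # Look up each entry individually, preserving order.
            return [dict.__getitem__(self, k) for k in key]
        return dict.__getitem__(self, key)

d = MultiGetDict(a=1, b=2)
print(d[["a", "b"]])  # [1, 2]
print(d["a"])         # 1
```

For a Scene, each element of the list could itself be a name, DataID, or DataQuery, so the single-key resolution rules discussed in this issue would still apply per element.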

djhoese commented Dec 5, 2024

Instead I think they should put appropriate metadata into the DataArrays they return and Scene.__setitem__ should handle the generation of the DataID from the attributes. That way the generation would be in a central place.

@BENR0 you mentioned this in #2333 and while I agree this would be nice, I don't think it can happen given the tight coupling of .attrs["_satpy_id"] and the rest of the metadata in .attrs. Additionally, we can't depend on the Scene or any __setitem__ stuff to update/use DataIDs as compositor classes can be used outside of the Scene ("manual" composite generation).

@mraspaud What do you think about an xarray accessor for data_arr.satpy.id? Or data_arr.satpy.data_id? Then _satpy_id would only be expected in .attrs during CF NetCDF writing or reading, but only as a temporary measure.

In the case

Scene:
+-DataID(name=1, resolution=1000)

What should Scene[DataID(name=1)] return? Also a KeyError or should the dataset be returned because it is the only one with that name?

@BENR0 Be careful of your use of DataID versus DataQuery. In the above case I'd say this should be a KeyError only because it is a DataID being passed to __getitem__. If it were a DataQuery then it would return the single DataArray that's in there. In the latter case you defined, with a non-resolution "1" and a resolution=1000 "1", a DataQuery(name="1") would return whichever one follows some defined rules...but I think currently this is not defined. There are rules defined for DataIDs of different resolutions/calibrations/etc and vague queries (ex. always the finer/better resolution), but I'm not sure those explicitly say what would happen in the undefined versus defined case.

The more I think about this the more I think I'm going to have to change the things that "smell", see what breaks, and then fix things as I go. The alternative is to define all the use cases that we know we want to support (possibly a lot) and re-write tests for them and make them work and then go back and fix or remove tests that fail. I think I can do the former one piece at a time for the most part. The latter seems cleaner, but also much harder to get right given all the historic code here.

@mraspaud I also think I need to start adding type annotations since there are a lot of places in the dependency tree where it is just "yeah, it's a thing that identifies what we want" but isn't clearly a DataID or DataQuery or dict.

mraspaud commented Dec 6, 2024

@mraspaud What do you think about an xarray accessor for data_arr.satpy.id? Or data_arr.satpy.data_id? Then _satpy_id would only be expected in .attrs during CF NetCDF writing or reading, but only as a temporary measure.

That works for me, I’ve been wanting to have an accessor for a while now anyway…

The more I think about this the more I think I'm going to have to change the things that "smell", see what breaks, and then fix things as I go. The alternative is to define all the use cases that we know we want to support (possibly a lot) and re-write tests for them and make them work and then go back and fix or remove tests that fail. I think I can do the former one piece at a time for the most part. The latter seems cleaner, but also much harder to get right given all the historic code here.

How about a middle ground: while removing smells, see what tests they apply to and refactor them into clear use cases?

@mraspaud I also think I need to start adding type annotations since there are a lot of places in the dependency tree where it is just "yeah, it's a thing that identifies what we want" but isn't clearly a DataID or DataQuery or dict.

That makes sense if it helps clarify the code.

djhoese commented Dec 7, 2024

So the accessor turns out to be impossible because the DataID id_keys need to stick around. That information would have to be kept somewhere, and it can't live in the ephemeral accessor object, so it would end up in .attrs anyway.

And yeah I like that middle ground. I think I was initially concerned about it because the first smell I tried to remove revealed a lot of code smells. I'll try to remove some of the minor ones first and see what kind of test failures I get.

djhoese commented Dec 9, 2024

I think the main thing I'm finding that I disagree with in the current implementation is that there is a general idea of a DataQuery and a DataID being equal if their shared keys are equal. But that's not what a user means when they use a DataQuery, and it is not very useful otherwise. If I specify something in my DataQuery then any matching DataIDs better have that key and the value should be equal (the exception being the special "*").
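That rule can be written down directly (a sketch with dict stand-ins, not a proposed patch to DataQuery):

```python
def query_matches(query: dict, dataid: dict) -> bool:
    """Hypothetical strict matcher: every queried key must exist on the
    DataID with an equal value; "*" is the only wildcard and matches
    anything, including an absent key."""
    return all(
        val == "*" or (key in dataid and dataid[key] == val)
        for key, val in query.items()
    )

print(query_matches({"name": "1"}, {"name": "1"}))                      # True
print(query_matches({"name": "1", "resolution": 2000}, {"name": "1"}))  # False
print(query_matches({"name": "1", "resolution": "*"}, {"name": "1"}))   # True
```

Compared to the shared-keys equality above, the only behavioural change is the middle case: a queried key the DataID lacks now fails the match instead of being silently ignored.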
