Scene.__getitem__ not "greedy" enough #2331
Comments
Here's the current test I've made to work on this:

    def test_dataset_dict_contains_inexact_match():
        """Test that DatasetDict does not match inexact keys to existing keys.

        Specifically, check that additional DataID properties aren't ignored
        when querying the DatasetDict.

        See https://github.com/pytroll/satpy/issues/2331.
        """
        from satpy.dataset.data_dict import DatasetDict
        dd = DatasetDict()
        name = "1"
        item = xr.DataArray(())
        dd[name] = item
        exact_id = make_dataid(name=name)
        assert exact_id in dd
        inexact_id = make_dataid(name=name, resolution=2000)
        assert inexact_id not in dd

Let me know if this doesn't match the initial failing case in your example code. |
@BENR0 If we dive deeper into Lines 584 to 592 in 86079ed
where |
Another way I see of looking at the above linked method is that
Edit: Another interesting find: the ID keys used when assigning to the Scene with a single string ("1" in your example) only have the "name" and the "resolution" keys:
Now why would the
Edit 2: Oh, those are the minimal default ID keys set in
Edit 3: Here's another small example of the difficulties of making this work:
|
@djhoese Thanks for looking into this. I already thought that this was not an easy solve. I will have to have a closer look at this to comment on your points since I am not that familiar with that part of the code. |
No worries. My comments are mostly for my own record keeping. |
@BENR0 @mraspaud Ok I found a test that fails when I make the case in this PR pass. The test is exactly the case we have in this issue and are trying to fix:

    dataid_container = [DataID(minimal_default_keys_config,
                               name="natural_color")]
    dq = DataQuery(name="natural_color", resolution=250)
    assert len(dq.filter_dataids(dataid_container)) == 1

So in this case, why does this data query which specifies |
Ok this issue is going to just be my dumping ground for things as I learn them. So the current (existing) test that I'm playing with is:

    scene = Scene(filenames=["fake1_1.txt"], reader="fake1")
    comp25 = make_cid(name="comp25", resolution=1000)
    scene[comp25] = xr.DataArray([], attrs={"name": "comp25", "resolution": 1000})
    scene.load(["comp25"], resolution=500)
    loaded_ids = list(scene._datasets.keys())
    assert len(loaded_ids) == 2
    assert loaded_ids[0]["name"] == "comp25"
    assert loaded_ids[0]["resolution"] == 500
    assert loaded_ids[1]["name"] == "comp25"
    assert loaded_ids[1]["resolution"] == 1000

Which with the below changes:

    Subject: [PATCH] Search all DataID keys for query matching
    ===================================================================
    diff --git a/satpy/dataset/dataid.py b/satpy/dataset/dataid.py
    --- a/satpy/dataset/dataid.py (revision 452c1f6fccdaa513458b19890343c2a334b30557)
    +++ b/satpy/dataset/dataid.py (date 1733326086455)
    @@ -583,10 +583,11 @@
         def _match_dataid(self, dataid):
             """Match the dataid with the current query."""
    -        if self._shares_required_keys(dataid):
    -            keys_to_check = set(dataid.keys()) & set(self._fields)
    -        else:
    -            keys_to_check = set(dataid._id_keys.keys()) & set(self._fields)
    +        keys_to_check = set(dataid._id_keys.keys()) & set(self._fields)
    +        # if self._shares_required_keys(dataid):
    +        #     keys_to_check = set(dataid.keys()) & set(self._fields)
    +        # else:
    +        #     keys_to_check = set(dataid._id_keys.keys()) & set(self._fields)
             if not keys_to_check:
                 return False
             return all(self._match_query_value(key, dataid.get(key)) for key in keys_to_check)

Fails on the |
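To make the effect of the patch above a bit more concrete, here is a minimal pure-Python sketch. It is not satpy code: ToyDataID, shared_keys, all_id_keys, and matches are hypothetical names, and it assumes that a DataID's own keys() only contain the keys that were actually set, while _id_keys lists every key the ID type is configured with. Value comparison is reduced to plain equality.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDataID:
    """Hypothetical stand-in for a DataID (not the satpy class)."""

    id_keys: tuple  # every key the ID type is configured with
    values: dict = field(default_factory=dict)  # only the keys that were set


def shared_keys(query, dataid):
    """Pre-patch selection (required keys shared): only keys set on the ID."""
    return set(dataid.values) & set(query)


def all_id_keys(query, dataid):
    """Patched selection: every configured ID key that the query mentions."""
    return set(dataid.id_keys) & set(query)


def matches(query, dataid, select_keys):
    """Compare the query against the ID on the selected keys only."""
    keys_to_check = select_keys(query, dataid)
    if not keys_to_check:
        return False
    # A key missing from the ID yields None, which never equals a real value.
    return all(dataid.values.get(key) == query[key] for key in keys_to_check)


name_only = ToyDataID(id_keys=("name", "resolution"),
                      values={"name": "natural_color"})
query = {"name": "natural_color", "resolution": 250}

print(matches(query, name_only, shared_keys))  # True: resolution is never checked
print(matches(query, name_only, all_id_keys))  # False: resolution is None on the ID
```

Under these assumptions, the query from the natural_color test matches the name-only ID before the patch and stops matching after it, which is the tension between the two tests discussed in this thread.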
Man, this is really turning into a Satpy 1.0 type of thing. So the biggest competing functionalities currently:
Possible (partial) solutions rolling around in my brain:
Edit: Turns out this idea of not creating the compositor instance until it is needed/used is actually how modifiers work. The dictionary of modifiers loaded from YAML is a tuple of |
Ugh this stuff is all terrible. So I started working on the solution where compositors/composites would now have all of the query information in their identifiers. So if you asked for
Edit: Actually, hold up, I was wrong about why my solution wasn't working I think. The main issue is that
Edit 2: @mraspaud I'm not sure how I feel about DataQuery equality only checking shared keys. At least, it should be stricter for DataIDs shouldn't it? Lines 515 to 532 in 86079ed
|
I think #2333 is somehow connected to this point as well. While it is a different use case, it is also related to how/when DataIDs are generated. I probably don't have enough overview over this whole thing especially on how the loading mechanisms and |
For the Assuming:
I think In the case
In the case
What should And in the case
What should |
Also, I have been thinking about being able to get multiple datasets with a list, as is possible in xarray. This is not directly related to the problem at hand, but it might be worth thinking about while working on a solution for this? |
@BENR0 you mentioned this in #2333 and while I agree this would be nice, I don't think it can happen given the tight coupling of
@mraspaud What do you think about an xarray accessor for
@BENR0 Be careful of your use of
The more I think about this the more I think I'm going to have to change the things that "smell", see what breaks, and then fix things as I go. The alternative is to define all the use cases that we know we want to support (possibly a lot) and re-write tests for them and make them work and then go back and fix or remove tests that fail. I think I can do the former one piece at a time for the most part. The latter seems cleaner, but also much harder to get right given all the historic code here.
@mraspaud I also think I need to start adding type annotations since there are a lot of places in the dependency tree where it is just "yeah, it's a thing that identifies what we want" but isn't clearly a DataID or DataQuery or dict. |
That works for me, I’ve been wanting to have an accessor for a while now anyway…
How about a middle ground: while removing smells, see what tests they apply to and refactor them into clear use cases?
That makes sense if it helps clarify the code. |
So the accessor turns out to be impossible due to DataID
And yeah, I like that middle ground. I think I was initially concerned about it because the first smell I tried to remove revealed a lot of code smells. I'll try to remove some of the minor ones first and see what kind of test failures I get. |
I think the main thing I'm finding that I disagree with in the current implementation is that there is a general idea of a DataQuery and DataID being equal if their shared keys are equal. But that's not what a user means when they use a DataQuery and is not very useful otherwise. If I specify something in my DataQuery then any matching DataIDs better have that key and the value should be equal (the exception being the special "*"). |
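The rule described above could be sketched roughly like this. It uses plain dicts instead of real DataQuery/DataID objects, strict_match and WILDCARD are hypothetical names, and letting "*" match even when the key is missing from the ID is an assumption on my part, not something settled in this thread.

```python
WILDCARD = "*"


def strict_match(query, dataid):
    """Return True only if every key the query specifies is satisfied by the ID.

    Keys that exist only on the ID are ignored; keys that exist only on the
    query make the match fail, unless the query value is the wildcard.
    """
    for key, wanted in query.items():
        if wanted == WILDCARD:
            # Assumption: the wildcard accepts any value, including a missing key.
            continue
        if key not in dataid or dataid[key] != wanted:
            return False
    return True


# A less specific query still finds a more specific ID ...
print(strict_match({"name": "1"}, {"name": "1", "resolution": 2000}))  # True
# ... but a more specific query no longer finds a less specific ID.
print(strict_match({"name": "1", "resolution": 2000}, {"name": "1"}))  # False
# The wildcard keeps the old lenient behaviour available on request.
print(strict_match({"name": "1", "resolution": "*"}, {"name": "1"}))  # True
```

The asymmetry is the point of the sketch: matching is driven by what the query asks for, not by whichever keys the two objects happen to share.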
Describe the bug
Currently it is possible to get a dataset identified only by name with a more specific DataID with name and, for example, resolution. This is also relevant for DataID in Scene testing and dataset deletion (see code examples below).
A __setitem__ operation with a more specific DataID does not overwrite the dataset with a less specific DataID, which is good. Turns out that the dataset with the less specific DataID is overwritten (at least with satpy main as of 04.12.2024), but the DataID added to the dependency_tree is wrong.
To Reproduce
Expected behavior
See code example.
Environment Info:
Additional context
As far as I can see, this behaviour is due to the get_key method in DatasetDict (satpy/satpy/dataset/data_dict.py, Line 142 in cfe4fa9).