-
Notifications
You must be signed in to change notification settings - Fork 303
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix DataID overwriting attributes during Scene assignment and DataID not reflecting DataArray attributes #2332
base: main
Are you sure you want to change the base?
Conversation
Does this mean this is the behavior added by this PR or this is the behavior existing in |
This behavior is added by the PR. |
We should maybe adopt the same logic as what xarray uses (as much as we can) when you assign a DataArray from one Dataset to a new Dataset and with a different name. I'm not sure off the top of my head what happens but I know that Xarray has some more flexibility because it primarily deals with Variable objects and use the DataArrays as a wrapper around them.
The way you have this implemented essentially throws out the What do we think about making a copy of the passed DataArray and then modifying the attributes? Regardless of what is done now, I don't think we want to overwrite what the user passed. That causes a lot of confusion if someone does:
If this worked before...I don't like it. Glad it gets removed.
I think this is good, but as mentioned above I think it should involve some level of copying (either just the attrs or the entire DataArray).
I think this is a good idea as a best practice kind of thing. Especially if this is what xarray does when you assign a DataArray to a Dataset. I think they need to be unique if they are part of the same Dataset object which is really only relevant in the |
satpy/dataset/data_dict.py
Outdated
new_info = key.to_dict() | ||
except AttributeError: | ||
new_info = key | ||
new_info = key.to_dict() | ||
if isinstance(value_info, dict): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When if value_info
not a dict? You also use it in the above DataID.from_dataarray
as if it was a dict without checking if it is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think value_info should always be a dict since it's the xr.DataArrays
attributes... unless the DataArray has no attributes. Mmh need to check that. Should probably then be covered by tests too...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So either the user didn't provide a DataArray (numpy, dask, etc), in which case there would be no .attrs
attribute (an exception above). Or we require a DataArray as input and we know it has .attrs
and this isinstance
check can be removed. Right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes that seems right. Maybe this check is a leftover from the implementation I had before making sure it always was a DataArray. I will check and clean it up.
satpy/scene.py
Outdated
if isinstance(value, np.ndarray): | ||
value = xr.DataArray(value) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could also check for a dask array. There is also a good chance that many parts of Satpy won't work without the sub-array being a dask array so maybe wrap it with da.from_array
? Should there be a warning that this conversion is happening? Or...should non-DataArrays just not work at all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True maybe also checking for dask array is a good idea. If currently it is allowed to assign plain numpy arrays I guess assigning dask arrays is allowed to so they should also be converted to a xr.DataArray
. I also thought about not only converting to a xr.DataArray
if a user tries to assing a numpy array but also converting it to a dask array but can't remember why I didn't. I guess it makes sense. I will change it then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Forgot: I think non-DataArrays should not work at all. But I think there is no need for a warning because the conversion is kind of for the "benefit" of the user since many things might not work if the dataset is only a numpy array so the conversion, albeit intransparently, makes sure that only DataArrays are assinged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok so you're saying you think users should be able to provide a numpy array, or a dask array, or a DataArray (with numpy or dask underneath), and this will convert whatever is given to a DataArray with dask underneath?
If so, then I'm OK with that. If not, let's discuss.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yepp. I mean this would leave it backwards compatible if somebody is really providing numpy arrays in his scripts. I would also be ok with throwing a warning or exception and therefore really forbidding the assignment of numpy or dask arrays. That would be more transparent I guess. The question is if that is wanted. Basically the question is if we treat being able to provide numpy or dask arrays as a feature or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'm leaning towards not allowing non-DataArrays and if the DataArray has a numpy array underneath it should maybe be turned into a dask array. Here's what happens if you assign a numpy array to an xarray Dataset:
In [1]: import xarray as xr
In [2]: import numpy as np
In [3]: ds = xr.Dataset()
In [4]: ds["a"] = np.zeros((10, 10))
---------------------------------------------------------------------------
MissingDimensionsError Traceback (most recent call last)
Input In [4], in <cell line: 1>()
----> 1 ds["a"] = np.zeros((10, 10))
...
MissingDimensionsError: cannot set variable 'a' with 2-dimensional data without explicit dimension names. Pass a tuple of (dims, data) instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok then I will change that.
satpy/scene.py
Outdated
if value.attrs.get('name', None) is None or value.attrs['name'] != name: | ||
value.attrs['name'] = name | ||
|
||
value.attrs.pop('_satpy_id', None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this not be popped so that DatasetDict.__setitem__
can use the id keys config from it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hope I remember this correctly; At first I tried that but then found out that the DataID.from_dataarray
class method always returned the _satpy_id
directly if present so it wouldn't get updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤔 But after the self._datasets[key] = value
shouldn't the new _satpy_id
be set?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes but I call DataID.from_dataarray
in the datasets.__setitem__
method and therefore I need to remove it because otherwise DataID.from_dataarray
will return the old id. This could probably be done different though?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, but then I think this pop should go inside DatasetDict.__setitem__
, right?
And actually, instead of from_dataarray
you could use data_id.from_dict
if you have the original DataID of the DataArray...but yeah if the provided DataArray didn't already have a DataID you'd need to create one. Actually that might not be that bad if you used the default id config (satpy.dataset.dataid.default_id_keys_config
).
🤔 Ok that might actually be cleaner. If in the DatasetDict.__setitem__
you create a dictionary of all the metadata, then if there was a DataID in that metadata, pop it. If DataID get the .id_keys
else use default_id_keys_config
. Then create a new DataID(id_keys, **attrs)
and use that. Maybe?
I agree. I only remember having a look at xarray how it is done there but I can't remember if and when how much this mimics xarray :-/.
You mean throwing it out of the
Yes that would not be helpful indeed. It's defenitely not covered by unit tests I guess :-D. I am not sure if I checked this manually.
While I agree that currently the I think unique names should be based on the In general I am not really "happy" with the implementation here. It feels kind of hacky. But it turned out that this was not that easy because of how the and when the |
Arrg it seems that the merge of main brought in too many changes or at least git can't really build an accurate diff of the actual changes I made :-/. Will try to fix it and push again. |
This PR fixes the overwriting of attributes of the
DataArray
to be assigned based on the DataID when writing to an existing id as well as theDataID
not reflecting array attributes (see corresponding issues below).The behaviour makes some assumptions which I think make sense but should be debated:
name
is not in the attributes it gets added based on the choosen name in the key to which theDataArray
is going to be assignedScene
and instead converts it to a xarrayDataArray
(I think this makes sense because many features of the Scene are based on attributes and as far as I can see not even the documentation mentions that this is possible).DataID
is generated from theDataArray
attributes which also updates the_satpy_id
attribute so they are in sync.name
property of theDataArray
is set to the name in theDataID
. This is something that needs to be discussed since it would be good if thename
property also gets set by the readers and also, I think, they need to be unique which is not handled currently (e.g. what happens with two datasets in theScene
withDataID(name='1')
andDataID(name='1', resolution=2000)
).Known problems