Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reset of modifier attribute of xr.DataArray when assigning to existing scene DataID #1490

Open
BENR0 opened this issue Dec 22, 2020 · 9 comments · May be fixed by #2332
Open

Reset of modifier attribute of xr.DataArray when assigning to existing scene DataID #1490

BENR0 opened this issue Dec 22, 2020 · 9 comments · May be fixed by #2332
Labels

Comments

@BENR0
Copy link
Collaborator

BENR0 commented Dec 22, 2020

Describe the bug
When "manually" using the SunZenithCorrector modifier the "modifier" attribute of the xr.DataArray is stripped after it is assigned to the scene with the same key of the dataset which was supplied to the corrector.

To Reproduce

from satpy import Scene
from satpy.dataset import DataID
from satpy.modifiers import SunZenithCorrector

scn = Scene(reader="seviri_l1b_nc", filenames=["PATH_TO_SEVIRI_NC_FILE"])
scn.load(["VIS008"])

szen_corrector = SunZenithCorrector(name="szen_corrector", modifiers=("sunz_corrected",)
a = szen_corrector([mscn["VIS008"]])
scn["VIS008"] = a

While the xr.DataArray assigned to "a" has the attribute "modifier" = ("sunz_corrected",) after the assignment to the scene it is stripped away.
Obviously the DataID would not reflect the modifier attribute unless the DataID would be updated during assignment to a scene (which as far as I see it isn't).
Maybe this is unconventional usage but regardless of the DataID not reflecting the modifier it seems weird that only the modifier attribute of the xr.DataArray gets reset.

My "work around" was to delete the DataID from the scene and create a new one from the dataarray like so:

del scn["VIS008"]
new_id = DataID.new_id_from_dataarray(a)
scn[new_id] = a

Expected behavior
The expectation when assigning a xr.DataArray to an existing key in a scene would be that the old xr.DataArray belonging to that key gets overwritten even if the DataID does not reflect the state (e.g. a modifier was applied as in this case) of the data.

Environment Info:

  • OS: Linux
  • Satpy Version: 0.24.0
  • PyResample Version: 1.17.0
  • Readers and writers dependencies (when relevant): [run from satpy.config import check_satpy; check_satpy()]

Additional context
Add any other context about the problem here.

@djhoese djhoese added the bug label Dec 22, 2020
@djhoese
Copy link
Member

djhoese commented Dec 22, 2020

This isn't a common usage, but this is definitely a bug. There is a little uncertainty right now on what the defined behavior should be when assigning to a Scene with a new name. Obviously it shouldn't reset all of the DataID properties except the name. That's why I'd consider this a bug.

I'll let @mraspaud let me know what he thinks and who should try to fix it as he's worked with this stuff the most recently. @mraspaud If you'd rather I try to fix it let me know. It may be better to wait for some of the other refactoring PRs to be merged first.

@mraspaud
Copy link
Member

I agree this seems buggy. But maybe we should use this opportunity to come up with some specifications on what assigning a dataarray to a scene should do, metadata-wise?

@djhoese
Copy link
Member

djhoese commented Dec 22, 2020

Agreed. I think this is very similar to @ninahakansson's #1450. She had a different understanding of how assigning to the Scene would work (different from how I expect it) so it would be good to get other's ideas. My assumption, being the person that wrote the original code that did this, is that the .attrs['name'] metadata would get updated when doing:

scn = Scene()
scn['test-array'] = xr.DataArray(
    [1, 2, 3],
    attrs={'name': 'old name'})
scn['new-name'] = scn['test-array']

So in this case, scn['new-name'] would now have .attrs['name'] == 'new-name' and scn['test-array'] would have .attrs['name'] == 'test-array'. To not break things, this would require copying the .attrs of the DataArray being assigned though. Otherwise, test-array and new-name would point to the same DataArray but with conflicting names.

Alternatively, we could go with a hard refactor where we utilize the .name attribute in the underlying xarray Variable object. Similar to what you see when you load something from a NetCDF file with xr.open_dataset and ds['my_var'].name is 'my_var' initially. Trouble is I'm 90% sure this gets updated as you perform dask operations on the DataArray so... 🤷‍♂️

@BENR0
Copy link
Collaborator Author

BENR0 commented Dec 23, 2020

Also agree. I think this will be a huge benefit to the user. Ideally the solution would also alleviate the need to use combine_metadata most of the time when doing calculations between datasets already in a scene. Obviously not all cases can be covered but it would be nice to have some basics which are needed for scene functionality like keeping the area etc. covered.

The assignment of a new attribute name in #1450 is interesting would also not have expected this kind of usage but I can see why a user might expect this to work.
I guess the question is what should be the specification how DataID attributes get mapped to the corresponding xr.DataArray attributes and vice versa during assigment?

When I assign a xr.DataArray to a scene I naturally use the same name I used in the attributes but there are obviously other usages and scenarios (e.g. #1450). Therefore I think it would make sense to update the "name" attribute if it is different from the key name to keep them "aligned" so to speak.
But for the other default items in the DataID I think it would be better to check the attributes of the xr.DataArray which is to be assigned and update the DataID because that seems to be the problem I described in this issue.

Regarding the utilization of the .name attribute of xarray I don't know how this behaves during dask operations but maybe it would still be nice to also assign it even if it is not directly needed for Scene functionality. I think xr.Datasets rely on the the name being set in DataArrays that's why the to_xarray_dataset function renames all DataArrays.

@djhoese
Copy link
Member

djhoese commented Dec 23, 2020

I think xr.Datasets rely on the the name being set in DataArrays that's why the to_xarray_dataset function renames all DataArrays.

Oh good point! Yeah it would be nice if we could reuse this.

For everything else you said, I wonder if we could look closer at how xarray Dataset does things and treats things. Obviously they don't have an idea of a DataID, but making sure we have similar behavior (implicit copying of attributes or not, implicit renaming of a DataArray's .name on assignment, etc) would be good. We will always depend on .attrs to store additional information used by the DataID as there is no persistent storage in a DataArray object otherwise (ex. accessors are recreated after ever modification and lose internal state).

I see a few options:

  1. Assigning to the Scene with a different name than .attrs['name'] modifies the .attrs['name'] to match what was used in the __setitem__. This will likely lead to users doing scn['new-name'] = scn['old-name'] as it is an implicit change which means we also have to do an implicit copy of the entire DataArray object. This is a little too magical, but can be useful.
  2. Raise an exception when the .attrs['name'] differs from the assigned name. This is the less magic/behind-the-scenes option. This would require users to do their own copying of the DataArray and their own setting of .attrs['name'] to the new name. This has to be done already if users want to modify DataID properties other than name so it isn't that big of a deal. However, this will likely lead to users not copying the DataArray first and "corrupting" the original DataArray stored in the Scene.

The thing not handled or discussed here yet is what if someone does scn['old-name'].attrs['name'] = new-name'? There isn't really a way to protect against that. Only way would be to not store DataArrays in the Scene in a dict structure and instead store them in a list and look up the matching DataArray for every __getitem__ and __setitem__. Not great for performance but not that expensive either. Also not a similar behavior to xarray Dataset objects.

@ninahakansson
Copy link
Contributor

The thing not handled or discussed here yet is what if someone does scn['old-name'].attrs['name'] = new-name'? There isn't really a way to protect against that.

@djhoese thanks for taking notice of this! Unfortunately this is exactly what I am currently doing in level1c4pps to be able to choose the name of the netcdf variables (attrs["name"]) but otherwise changing the variables as little as possible to avoid loosing attributes etc. Maybe this can be solved by introducing a new attribute "netcdf_name" and use that to decide names of the netcdf variables in cf-writer?

@djhoese
Copy link
Member

djhoese commented Jan 13, 2021

@ninahakansson I think the right solution for your use case is just to be careful how you are renaming the DataArrays. Something like:

old = scn['old']
del scn['old']
scn['new'] = old

Would that work for you?

@ninahakansson
Copy link
Contributor

@djhoese I never tried to actually delete a data set from scene before adding it with a new name. I will test that. And as long as we solve the problem with attributes, modifiers getting lost I think it will be fine renaming them in level1c4pps.

@BENR0
Copy link
Collaborator Author

BENR0 commented Apr 22, 2021

@djhoese I am in favor of option 1. This "imitates" the behavior of xarray datasets where assignment of an existing dataarray in the dataset to a different name also makes a deep copy as far as I see. The only difference in Satpy would be that the __setitem__ would also imply a change of the attrs["name"] to match the key in the Scene. I think this not very dangerous behavior and could also be documented so users are potentially aware.

The "problematic" case with users trying to set a new name with scn['old-name'].attrs['name'] = new-name' should not be "fixed" by using list instead of dicts. For one as you pointed out it is not similar to xarray behavior and I think it adds ambiguity. Instead renaming should either be documented with the delete and assign way you pointed out above or maybe introduce a new method rename similar to xarray?

While not directly related; it would also be nice (again similar to xarray) if multiple datasets could be selected from a scene with a list.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
4 participants