Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/issue-142: duplicated dimension error with TEMPO ozone profile #141

Conversation

danielfromearth
Copy link
Collaborator

@danielfromearth danielfromearth commented Feb 6, 2023

Github Issue: #142

Description

Modified the remove_duplicate_dims() function so that the subsetter pipeline does not fail when encountering the duplicated "layer" dimension in TEMPO ozone profile data (variables: support_data/ozone_averaging_kernel and support_data/ozone_noise_correlation_matrix)

Overview of work done

Added checks in remove_duplicate_dims() for whether the Dimension and Variable corresponding to the duplicated dimension already exist in the NetCDF. The function directly writes the new variable with no duplicated dimensions rather than keeping it with an altered name that needs to be renamed later in the subsetting procedure.

Overview of verification done

Added a new unit test for TEMPO ozone profile data. Checked that all automated tests passed successfully.

Overview of integration done

Explain how this change was integration tested. Provide screenshots or logs if appropriate. An example of this would be a local Harmony deployment.

PR checklist:

  • Linted
  • Updated unit tests
  • Updated changelog
  • Integration testing

See Pull Request Review Checklist for pointers on reviewing this pull request

@danielfromearth danielfromearth changed the title Feature/currently no issue duplicated layer dim Feature/currently no issue - duplicated dimension for TEMPO ozone Feb 6, 2023
@danielfromearth danielfromearth changed the title Feature/currently no issue - duplicated dimension for TEMPO ozone Feature/issue-142 duplicated dimension for TEMPO ozone Feb 6, 2023
@danielfromearth danielfromearth changed the title Feature/issue-142 duplicated dimension for TEMPO ozone Feature/issue-142: duplicated dimension error with TEMPO ozone profile Feb 6, 2023
@danielfromearth danielfromearth marked this pull request as ready for review February 7, 2023 16:00
@danielfromearth
Copy link
Collaborator Author

danielfromearth commented Feb 7, 2023

@nlenssen2013, @frankinspace, I've made some edits to the remove_duplicate_dims() function in this branch in order to allow the subsetter to work with TEMPO Ozone proxy data files, which have duplicated dimensions in two of the variables. This change seems to be working well without the need to hold off variable renaming (with rename_dup_vars() using xarray) later in the subset pipeline. I'm a bit unsure whether this will work for all cases though. Are there additional files we can do regression tests with, besides the ones used by the automated tests in the tests/data directory?

@danielfromearth danielfromearth changed the base branch from develop to feature/issue-142-duplicated-dimensions-in-tempo-ozone-profile February 7, 2023 16:29
@nlenssen2013
Copy link
Contributor

@nlenssen2013, @frankinspace, I've made some edits to the remove_duplicate_dims() function in this branch in order to allow the subsetter to work with TEMPO Ozone proxy data files, which have duplicated dimensions in two of the variables. This change seems to be working well without the need to hold off variable renaming (with rename_dup_vars() using xarray) later in the subset pipeline. I'm a bit unsure whether this will work for all cases though. Are there additional files we can do regression tests with, besides the ones used by the automated tests in the tests/data directory?

To my knowledge, GESDISC only has two collections with this duplicate dimension issue in SNDR and Tropomi. Both of those files get tested in unit test.

# return the variables that will need to be renamed: Rename method is still an issue per https://github.com/Unidata/netcdf-c/issues/1672
return nc_dataset, dup_new_varnames
return nc_dataset, []
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the new variable names with duplicate dimensions are the same as the original name (without the '_1' at the end) then a blank list doesn't need to be returned and the function below should be removed. Will need to check with our products.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep! I've updated the branch now to remove that return value and the use of the other renaming function.

Copy link
Member

@frankinspace frankinspace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. I think if the pytests all succeeded we are reasonably confident it hasn't broken any of the other datasets. Nice work!

@danielfromearth
Copy link
Collaborator Author

Hey all, just checking back on this...
@nlenssen2013, @jamesfwood, are you still planning to do additional checks with GESDISC or other data products before we move this from the fork into the base repository? (or perhaps you've already done additional checks?)

@nlenssen2013
Copy link
Contributor

Hey all, just checking back on this... @nlenssen2013, @jamesfwood, are you still planning to do additional checks with GESDISC or other data products before we move this from the fork into the base repository? (or perhaps you've already done additional checks?)

Yeah, it checks out with our datasets fine, the second function is still needed for one of our collections so the merged branch into develop is solid

@danielfromearth
Copy link
Collaborator Author

danielfromearth commented Feb 16, 2023

Hey all, just checking back on this... @nlenssen2013, @jamesfwood, are you still planning to do additional checks with GESDISC or other data products before we move this from the fork into the base repository? (or perhaps you've already done additional checks?)

Yeah, it checks out with our datasets fine, the second function is still needed for one of our collections so the merged branch into develop is solid

@nlenssen2013, you mean rename_dup_vars() is still needed? Just double-checking with you, since the latest version of this branch no longer uses that function at all (actually, I was thinking the function could now be deleted too) — is that okay?

If it's okay, I'll just go ahead and squash+merge this. If not, we might need to make a few more tweaks to use that rename_dup_vars() function again.

@nlenssen2013
Copy link
Contributor

Hey all, just checking back on this... @nlenssen2013, @jamesfwood, are you still planning to do additional checks with GESDISC or other data products before we move this from the fork into the base repository? (or perhaps you've already done additional checks?)

Yeah, it checks out with our datasets fine, the second function is still needed for one of our collections so the merged branch into develop is solid

@nlenssen2013, you mean rename_dup_vars() is still needed? Just double-checking with you, since the latest version of this branch no longer uses that function at all (actually, I was thinking the function could now be deleted too) — is that okay?

If it's okay, I'll just go ahead and squash+merge this. If not, we might need to make a few more tweaks to use that rename_dup_vars() function again.

I'm not getting all the pytest tests/ to pass without it currently, not sure if you are. Not a problem for me if it gets merged though - was able to run an individual subset on that collection and it looks good.

@danielfromearth
Copy link
Collaborator Author

danielfromearth commented Feb 16, 2023

@nlenssen2013. hmmm, I just reran using pytest, and all of the tests in tests/ passed on my laptop from this branch (danielfromearth:feature/currently-no-issue-duplicated-layer-dim). If they aren't passing for you, are you definitely using the same branch and pulled the latest?

Perhaps we merge this into the podaac:feature branch, and then let the automated test suite — run via GitHub actions — serve as the more authoritative check before merging it into develop.

@nlenssen2013
Copy link
Contributor

@nlenssen2013. hmmm, I just reran using pytest, and all of the tests in tests/ passed on my laptop from this branch (danielfromearth:feature/currently-no-issue-duplicated-layer-dim). If they aren't passing for you, are you definitely using the same branch and pulled the latest?

Perhaps we merge this into the podaac:feature branch, and then let the automated test suite — run via GitHub actions — serve as the more authoritative check before merging it into develop.

Yep, I'm fine with that

@danielfromearth danielfromearth merged commit 2bf1a2d into podaac:feature/issue-142-duplicated-dimensions-in-tempo-ozone-profile Feb 17, 2023
danielfromearth added a commit that referenced this pull request Feb 22, 2023
#148)

* Feature/issue-142: duplicated dimension error with TEMPO ozone profile (#141)

* add test for ozone profile proxy data

* rework duplicate dimension removal to work with TEMPO ozone profile data

* pylint update

* simplify return of `remove_duplicate_dims() and `open_as_nc_dataset()`

* remove unused import per pylint

* add test data files for TEMPO NO2 and O3PROF (contains duplicate dimension)

* clean up comments

* Revert "simplify return of `remove_duplicate_dims() and `open_as_nc_dataset()`"

This reverts commit e7b7096.

* include Tuple import

* update CHANGELOG.md

* Revert "Revert "simplify return of `remove_duplicate_dims() and `open_as_nc_dataset()`""

This reverts commit 3fe9c2a.

* remove unused import for flake8

* remove now-unused `rename_dup_vars()` function
@danielfromearth danielfromearth deleted the feature/currently-no-issue-duplicated-layer-dim branch March 1, 2023 14:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Duplicated dimension in TEMPO ozone profile proxy data raises error
3 participants