Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optionally cache small data variables and file handles #981

Merged
merged 20 commits into from
Dec 9, 2019

Conversation

gerritholl
Copy link
Member

@gerritholl gerritholl commented Nov 26, 2019

In the netcdf utility reader, cache small data variables to prevent
needlessly often opening and closing the data files.

In the netcdf utility reader, cache small data variables to prevent
needlessly often opening and closing the data files.
In the FCI reader, use the data variable caching implemented in the
previous commit.  This should address pytroll#972.
For strings, I cannot measure their size because their .dtype is a type,
not a dtype.  Therefore I can't get the itemsize so I don't know how
large they will be (they're also variable length).  Don't cache those
for now, I'm not using them anyway.
@coveralls
Copy link

coveralls commented Nov 26, 2019

Coverage Status

Coverage increased (+0.04%) to 87.006% when pulling 2b75d17 on gerritholl:nc-utils-caching into 2413f74 on pytroll:master.

@codecov
Copy link

codecov bot commented Nov 26, 2019

Codecov Report

Merging #981 into master will increase coverage by 0.03%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master    #981      +/-   ##
=========================================
+ Coverage   86.96%     87%   +0.03%     
=========================================
  Files         181     181              
  Lines       27531   27581      +50     
=========================================
+ Hits        23943   23997      +54     
+ Misses       3588    3584       -4
Impacted Files Coverage Δ
satpy/readers/fci_l1c_fdhsi.py 96.12% <ø> (ø) ⬆️
satpy/readers/netcdf_utils.py 100% <100%> (+5.97%) ⬆️
satpy/tests/reader_tests/test_netcdf_utils.py 94.68% <100%> (+1.43%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2413f74...2b75d17. Read the comment docs.

@gerritholl
Copy link
Member Author

There's a lot of work to be done on this still. It only works mutually with #845.

Fix a bug in the small variable caching, where I was overwriting rather
than adding a key to the cache dictionary.
Fix a small bug in the ncutils small var caching, wrong variable named.
Downstream, we need at least the attributes for some of the cached
variables.  Therefore we do need to make them into xarray dataaarrays
again.
Fix bug in small var caching method, should be xr not xarray
In netcdf_utils, add an option to avoid the slow xarray.open_dataset
completely.  Instead, this option allows to keep the fileformat open as
long as the filehandler objects is, and create xarray.dataarray objects
manually.  The coordinates are missing for now.
The FCI reader nowm uses the new option (introduced in the previous
commit) to bypass xarray.open_dataset completely, this should further
imporve performance.
Fix a bug introduced a couple of commits ago, where a return statement
went AWOL for cases where __getitem__ on the NetCDF4FileHandler is
retrieving an attribute or shape.
Fix a bug where an import statement for dask was missing in the
netcdf-utils.
The previous commit cannot possibly have been running at all.
Add a test case to cover the newly implemented caching feature in
netcdf-utils
PEP8/flake8 fixes in netcdf_utils and test_netcdf_utils
@gerritholl gerritholl marked this pull request as ready for review November 28, 2019 09:29
@gerritholl
Copy link
Member Author

This speeds up the FCI reading-but-not-reading from 40 minutes / 10 GB RAM to 80 seconds / 2 GB RAM.

@gerritholl
Copy link
Member Author

The earlier comment that it only works mutually with #845 is not true. Although #845 depends on this PR to be merged, the reverse is not true. This is ready for review.

Improve test coverage for netcdf_utils.  Test coverage for this module
is now 100% according to my local pytest.
Fix PEP8 / flake8 complaints
@gerritholl gerritholl changed the title WIP: Try to cache small data variables Optionally cache small data variables and file handles Nov 28, 2019
A few cosmetic changes to the netcdf utils caching.  Improve the API
documentation, change an argument name to better reflect its role, and
point out in additional places that we're not doing coordinates when
caching variables.
@mraspaud mraspaud added component:readers enhancement code enhancements, features, improvements labels Dec 6, 2019
Copy link
Member

@mraspaud mraspaud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. @gerritholl is this ready to merge ?

Copy link
Member

@djhoese djhoese left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice job. Thanks for all the documentation on the changes.

I have one request that shouldn't stop this from being merge but would be nice to have: The documentation of the class mentions uncached datasets before caching has been discussed at all. Do you think it would be possible to talk about how the loading/caching works before talking about uncached variables or other caching related things?

In the docstring for the optimised netcdf_utils, clarify the first
reference to caching.
@djhoese djhoese merged commit 0a0a4a8 into pytroll:master Dec 9, 2019
@gerritholl gerritholl deleted the nc-utils-caching branch October 22, 2021 14:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component:readers enhancement code enhancements, features, improvements
Projects
None yet
Development

Successfully merging this pull request may close these issues.

MTG-FCI-FDHSI reader is slow, apparently not actually dask-aware
4 participants