Optionally cache small data variables and file handles #981

gerritholl · 2019-11-26T15:43:23Z

In the netcdf utility reader, cache small data variables to avoid
opening and closing the data files more often than necessary.

Closes MTG-FCI-FDHSI reader is slow, apparently not actually dask-aware #972
Tests added and test suite added to parent suite
Tests passed
Passes flake8 satpy
Fully documented
Add your name to AUTHORS.md if not there already

coveralls · 2019-11-26T16:10:27Z

Coverage increased (+0.04%) to 87.006% when pulling 2b75d17 on gerritholl:nc-utils-caching into 2413f74 on pytroll:master.

codecov · 2019-11-26T16:10:35Z

Codecov Report

Merging #981 into master will increase coverage by 0.03%.
The diff coverage is 100%.

@@            Coverage Diff            @@
##           master    #981      +/-   ##
=========================================
+ Coverage   86.96%     87%   +0.03%     
=========================================
  Files         181     181              
  Lines       27531   27581      +50     
=========================================
+ Hits        23943   23997      +54     
+ Misses       3588    3584       -4

Impacted Files	Coverage Δ
satpy/readers/fci_l1c_fdhsi.py	`96.12% <ø> (ø)`	⬆️
satpy/readers/netcdf_utils.py	`100% <100%> (+5.97%)`	⬆️
satpy/tests/reader_tests/test_netcdf_utils.py	`94.68% <100%> (+1.43%)`	⬆️

Last update 2413f74...2b75d17.

gerritholl · 2019-11-26T16:19:06Z

There's a lot of work to be done on this still. It only works mutually with #845.

In netcdf_utils, add an option to avoid the slow xarray.open_dataset completely. Instead, this option keeps the file open for as long as the file handler object exists, and creates xarray.DataArray objects manually. The coordinates are missing for now.
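
The effect of keeping the file open for the lifetime of the handler can be illustrated with a minimal sketch. This is not satpy's implementation: `FakeDataset`, `ReopeningHandler`, `PersistentHandler` and `count_opens` are hypothetical names, and `FakeDataset` is a pure-Python stand-in for an open netCDF file that only counts how often it is opened.

```python
class FakeDataset:
    """Hypothetical stand-in for an open netCDF file; counts opens."""

    open_count = 0

    def __init__(self, path):
        type(self).open_count += 1
        self.path = path
        self.closed = False
        self.variables = {"x": [1, 2, 3], "y": [4, 5, 6]}

    def close(self):
        self.closed = True


class ReopeningHandler:
    """Opens and closes the file again for every variable access."""

    def __init__(self, path):
        self.path = path

    def __getitem__(self, name):
        ds = FakeDataset(self.path)
        try:
            return ds.variables[name]
        finally:
            ds.close()


class PersistentHandler:
    """Keeps a single file handle open as long as the handler exists."""

    def __init__(self, path):
        self._ds = FakeDataset(path)

    def __getitem__(self, name):
        return self._ds.variables[name]

    def close(self):
        self._ds.close()


def count_opens(handler_cls, names):
    """Return how many times the file was opened while reading `names`."""
    FakeDataset.open_count = 0
    handler = handler_cls("example.nc")
    for name in names:
        handler[name]
    return FakeDataset.open_count


print(count_opens(ReopeningHandler, ["x", "y", "x"]))
print(count_opens(PersistentHandler, ["x", "y", "x"]))
```

With the reopening handler the number of opens grows with the number of accesses, while the persistent handler opens the file once, at the cost of holding the handle until the handler is closed or discarded.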

gerritholl · 2019-11-28T09:32:05Z

This speeds up the FCI reading-but-not-reading from 40 minutes / 10 GB RAM to 80 seconds / 2 GB RAM.

gerritholl · 2019-11-28T09:33:05Z

The earlier comment that it only works mutually with #845 is not true. Although #845 depends on this PR being merged, the reverse does not hold. This is ready for review.

mraspaud

LGTM. @gerritholl, is this ready to merge?

djhoese

Nice job. Thanks for all the documentation on the changes.

I have one request that shouldn't stop this from being merged but would be nice to have: the documentation of the class mentions uncached datasets before caching has been discussed at all. Do you think it would be possible to explain how the loading/caching works before talking about uncached variables or other caching-related things?

gerritholl added 3 commits November 26, 2019 16:39

Try to cache small data variables

00a92d3

In the netcdf utility reader, cache small data variables to avoid opening and closing the data files more often than necessary.

In FCI reader, use data variable caching

ab136a5

In the FCI reader, use the data variable caching implemented in the previous commit. This should address pytroll#972.

Don't try to cache strings

d34af0c

For strings, I cannot measure their size because their .dtype is a type, not a dtype. Therefore I can't get the itemsize so I don't know how large they will be (they're also variable length). Don't cache those for now, I'm not using them anyway.
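
The size check described in this commit message can be sketched as follows. This is a minimal illustration assuming NumPy is available, not the code from the commit: the function names `estimated_nbytes` and `should_cache` and the `max_bytes` threshold are hypothetical. A variable whose dtype is the Python type `str` rather than a `numpy.dtype` instance has no `itemsize`, so its size is treated as unknown and it is not cached.

```python
import math

import numpy as np


def estimated_nbytes(shape, dtype):
    """Return the size of a variable in bytes, or None if it is unknown.

    Variable-length strings report the Python type ``str`` as their dtype
    instead of a ``numpy.dtype`` instance, so there is no itemsize to use.
    """
    if not isinstance(dtype, np.dtype):
        return None
    return math.prod(shape) * dtype.itemsize


def should_cache(shape, dtype, max_bytes=1000):
    """Decide whether a variable is small enough to cache (hypothetical rule)."""
    nbytes = estimated_nbytes(shape, dtype)
    return nbytes is not None and nbytes <= max_bytes


print(should_cache((10,), np.dtype("float32")))        # small numeric variable
print(should_cache((5000, 5000), np.dtype("int16")))   # large numeric variable
print(should_cache((3,), str))                         # string: size unknown
```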

Fix typo in previous commit

f90d525

gerritholl mentioned this pull request Nov 26, 2019

MTG: get projection and extent information from file #845

Merged

6 tasks

gerritholl added 4 commits November 27, 2019 09:05

Caching bugfix

f6f9f80

Fix a bug in the small variable caching, where I was overwriting rather than adding a key to the cache dictionary.
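
The commit's diff is not shown here, but one common way this kind of bug arises is rebinding the dictionary inside a loop instead of assigning to a key. The sketch below is an illustration under that assumption; the function names are hypothetical.

```python
def cache_overwriting(names):
    """Buggy pattern: rebinding the dict each time keeps only the last key."""
    cache = {}
    for name in names:
        cache = {name: f"data for {name}"}
    return cache


def cache_adding(names):
    """Fixed pattern: assigning by key keeps every entry."""
    cache = {}
    for name in names:
        cache[name] = f"data for {name}"
    return cache


names = ["a", "b", "c"]
print(sorted(cache_overwriting(names)))  # only the last key survives
print(sorted(cache_adding(names)))       # all keys are present
```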

Bugfix in nc utils small var caching

da1cdf3

Fix a small bug in the ncutils small var caching: the wrong variable was named.

Make xarray objects when caching

f06a6ab

Downstream, we need at least the attributes for some of the cached variables. Therefore we do need to turn them into xarray DataArrays again.

bug in small var caching method

f3ab504

Fix bug in small var caching method, should be xr not xarray

gerritholl mentioned this pull request Nov 27, 2019

MTG-FCI-FDHSI reader is slow, apparently not actually dask-aware #972

Closed

gerritholl added 7 commits November 27, 2019 14:44

FCI reader now uses new nc-utils file handling

7565be3

The FCI reader now uses the new option (introduced in the previous commit) to bypass xarray.open_dataset completely. This should further improve performance.

Bugfix missing return in __getitem__

f668575

Fix a bug introduced a couple of commits ago, where a return statement went AWOL for cases where __getitem__ on the NetCDF4FileHandler is retrieving an attribute or shape.
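
A minimal sketch of the kind of branch this fix concerns, assuming keys that address either a variable, one of its attributes, or its shape. The class `SketchFileHandler`, the `"/attr/"` and `"/shape"` key layout, and the `load_variable` callback are assumptions made for illustration and are not copied from the NetCDF4FileHandler source.

```python
class SketchFileHandler:
    """Hypothetical handler whose keys address variables, attributes, or shapes."""

    def __init__(self, file_content, load_variable):
        self.file_content = file_content
        self._load_variable = load_variable

    def __getitem__(self, key):
        if "/attr/" in key or key.endswith("/shape"):
            # Metadata lookup: the value must be returned explicitly.
            # Dropping this `return` would discard the looked-up value.
            return self.file_content[key]
        # Anything else is treated as a data variable and loaded on demand.
        return self._load_variable(key)


handler = SketchFileHandler(
    {"x/shape": (2, 3), "x/attr/units": "K"},
    lambda key: f"loaded {key}",
)
print(handler["x/shape"])
print(handler["x/attr/units"])
print(handler["x"])
```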

Bugfix: add missing import in netcdf-utils

c3c2c80

Fix a bug where an import statement for dask was missing in the netcdf-utils.

Fix bad return statement

b747f0f

The previous commit cannot possibly have been running at all.

TST: Add test case for nc utils caching

40d3ee3

Add a test case to cover the newly implemented caching feature in netcdf-utils

PEP8 fixes in netcdf_utils

d0cc1f1

PEP8/flake8 fixes in netcdf_utils and test_netcdf_utils

gerritholl marked this pull request as ready for review November 28, 2019 09:29

gerritholl requested review from djhoese and mraspaud as code owners November 28, 2019 09:29

gerritholl added 2 commits November 28, 2019 11:37

TST: Improve test coverage for netcdf-utils

8f2442d

Improve test coverage for netcdf_utils. Test coverage for this module is now 100% according to my local pytest.

PEP8 / flake8 fixes

a379c16

Fix PEP8 / flake8 complaints

gerritholl changed the title ~~WIP: Try to cache small data variables~~ Optionally cache small data variables and file handles Nov 28, 2019

Cosmetic fixes in netcdf utils caching

88d22e6

A few cosmetic changes to the netcdf utils caching. Improve the API documentation, change an argument name to better reflect its role, and point out in additional places that we're not doing coordinates when caching variables.

mraspaud assigned gerritholl Dec 6, 2019

mraspaud added the component:readers and enhancement (code enhancements, features, improvements) labels Dec 6, 2019

mraspaud approved these changes Dec 6, 2019

Merge branch 'master' into nc-utils-caching

b68dbdc

djhoese approved these changes Dec 9, 2019

In optimised nc-utils, clarify caching

2b75d17

In the docstring for the optimised netcdf_utils, clarify the first reference to caching.

djhoese merged commit 0a0a4a8 into pytroll:master Dec 9, 2019

gerritholl deleted the nc-utils-caching branch October 22, 2021 14:47

gerritholl mentioned this pull request Oct 22, 2021

Computing FCI datasets on resampled scene may fail if original scene gone #1861

Open