Tweak to speed up dask wrapping of netcdf variables (#4135)
* Use 'meta' in da.from_array to stop it sampling netcdf variables, which is quite slow.

* Fix PR number.

* Fix test.

* Review changes.

* Update docs/iris/src/whatsnew/3.0.2.rst

Co-authored-by: lbdreyer <lbdreyer@users.noreply.github.com>
pp-mo and lbdreyer authored May 26, 2021
1 parent 38bfb6c commit 09018b0
Showing 3 changed files with 13 additions and 7 deletions.
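
For context on the change itself: in dask 2+, dask.array.from_array normally performs a zero-length "sample" read of the wrapped object (via dask.array.utils.meta_from_array) to work out what kind of array slicing it will return. For a netcdf variable each such read touches the file, which adds up when a file contains many variables; passing meta=np.ndarray declares the result type up front so the sample read is skipped. The sketch below is illustrative only and not part of the commit: the CountingArray wrapper is hypothetical, and the exact access counts may vary with dask version.

    import dask.array as da
    import numpy as np


    class CountingArray:
        # Stands in for a netcdf variable: every __getitem__ is a (slow) file access.
        def __init__(self, data):
            self._data = np.asarray(data)
            self.shape = self._data.shape
            self.dtype = self._data.dtype
            self.ndim = self._data.ndim
            self.access_count = 0

        def __getitem__(self, keys):
            self.access_count += 1
            return self._data[keys]


    src = CountingArray(np.arange(6.0))

    # Without 'meta', dask samples the wrapped object at graph-build time.
    da.from_array(src, chunks=3)
    print(src.access_count)  # expected: 1, before any data has been computed

    src.access_count = 0

    # With meta=np.ndarray, nothing is read until the data is actually computed.
    da.from_array(src, chunks=3, meta=np.ndarray)
    print(src.access_count)  # expected: 0
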
docs/iris/src/whatsnew/3.0.2.rst (4 changes: 4 additions & 0 deletions)
@@ -42,6 +42,10 @@ This document explains the changes made to Iris for this release
   developers to easily disable `cirrus-ci`_ tasks. See
   :ref:`skipping Cirrus-CI tasks`. (:pull:`4019`) [``pre-v3.1.0``]

+#. `@pp-mo`_ adjusted the use of :func:`dask.array.from_array` in :func:`iris._lazy_data.as_lazy_data`, to avoid
+   the dask 'test access'. This makes loading of netcdf files with a large number of variables significantly faster.
+   (:pull:`4135`)

Note that, the contributions labelled ``pre-v3.1.0`` are part of the forthcoming
Iris v3.1.0 release, but require to be included in this patch release.

lib/iris/_lazy_data.py (4 changes: 3 additions & 1 deletion)
@@ -192,7 +192,9 @@ def as_lazy_data(data, chunks=None, asarray=False):
    if isinstance(data, ma.core.MaskedConstant):
        data = ma.masked_array(data.data, mask=data.mask)
    if not is_lazy_data(data):
-        data = da.from_array(data, chunks=chunks, asarray=asarray)
+        data = da.from_array(
+            data, chunks=chunks, asarray=asarray, meta=np.ndarray
+        )
    return data


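For reference, the observable behaviour of as_lazy_data is otherwise unchanged: it still returns a dask array and only reads the wrapped values at compute time. A minimal usage sketch, assuming an Iris installation that includes this change (iris._lazy_data is a private module, so its signature may differ in other versions):

    import numpy as np
    from iris._lazy_data import as_lazy_data

    real = np.arange(12.0).reshape(3, 4)
    lazy = as_lazy_data(real)                     # wraps the data as a dask array
    print(lazy.shape, lazy.dtype)                 # same shape and dtype as the input
    print(np.array_equal(lazy.compute(), real))   # values are only read here: True
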
lib/iris/tests/unit/lazy_data/test_co_realise_cubes.py (12 changes: 6 additions & 6 deletions)
@@ -71,12 +71,12 @@ def test_combined_access(self):
        cube_e = Cube(derived_e)
        co_realise_cubes(cube_a, cube_b, cube_c, cube_d, cube_e)
        # Though used more than once, the source data should only get fetched
-        # twice by dask. Once when dask performs an initial data access with
-        # no data payload to ascertain the metadata associated with the
-        # dask.array (this access is specific to dask 2+, see
-        # dask.array.utils.meta_from_array), and again when the whole data is
-        # accessed.
-        self.assertEqual(wrapped_array.access_count, 2)
+        # once by dask, when the whole data is accessed.
+        # This also ensures that dask does *not* perform an initial data
+        # access with no data payload to ascertain the metadata associated with
+        # the dask.array (this access is specific to dask 2+,
+        # see dask.array.utils.meta_from_array).
+        self.assertEqual(wrapped_array.access_count, 1)


if __name__ == "__main__":
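As a quick cross-check outside this unit test, the metadata carried by an array wrapped this way can be inspected directly. Note that _meta is a private dask attribute, so this is illustrative only and may behave differently across dask versions:

    import dask.array as da
    import numpy as np

    lazy = da.from_array(np.arange(4.0), chunks=2, meta=np.ndarray)
    # The meta is an empty plain-numpy array, obtained without sampling the source.
    print(type(lazy._meta), lazy._meta.shape)  # expected: <class 'numpy.ndarray'> (0,)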
