Chunk control #3361
Changes from 10 commits
@@ -1,4 +1,4 @@
# (C) British Crown Copyright 2017 - 2018, Met Office
# (C) British Crown Copyright 2017 - 2019, Met Office
#
# This file is part of Iris.
#
@@ -23,12 +23,14 @@
from __future__ import (absolute_import, division, print_function)
from six.moves import (filter, input, map, range, zip)  # noqa

from collections import Iterable
from functools import wraps

import dask
import dask.array as da
import dask.context
from dask.local import get_sync as dget_sync
import dask.array.core
import dask.config

import numpy as np
import numpy.ma as ma

@@ -58,26 +60,90 @@ def is_lazy_data(data):
    return result


# A magic value, chosen to minimise chunk creation time and chunk processing
# time within dask.
_MAX_CHUNK_SIZE = 8 * 1024 * 1024 * 2
def _optimum_chunksize(chunks, shape,
                       limit=None,
                       dtype=np.dtype('f4')):
    """
    Reduce or increase an initial chunk shape to get close to a chosen ideal
    size, while prioritising the splitting of the earlier (outer) dimensions
    and keeping intact the later (inner) ones.

    Args:

    * chunks (tuple of int, or None):
        Pre-existing chunk shape of the target data : None if unknown.
    * shape (tuple of int):
        The full array shape of the target data.
    * limit (int):
        The 'ideal' target chunk size, in bytes.  Default from dask.config.
    * dtype (np.dtype):
        Numpy dtype of target data.

    Returns:
    * chunk (tuple of int):
        The proposed shape of one full chunk.

    .. note::
        The purpose of this is very similar to
        `dask.array.core.normalize_chunks`, when called as
        `(chunks='auto', shape, dtype=dtype, previous_chunks=chunks, ...)`.
        Except, the operation here is optimised specifically for a 'c-like'
        dimension order, i.e. outer dimensions first, as for netcdf variables.
        So if, in future, this policy can be implemented in dask, then we would
        prefer to replace this function with a call to that one.
        Accordingly, the arguments roughly match 'normalize_chunks', except
        that we don't support the alternative argument forms of that routine.
        The return value, however, is a single 'full chunk', rather than a
        complete chunking scheme : so an equivalent code usage could be
        "chunks = [c[0] for c in normalize_chunks('auto', ...)]".

    """
    # Return chunks unchanged, for types of invocation we don't comprehend.
    if (any(elem <= 0 for elem in shape) or
            not isinstance(chunks, Iterable) or
            len(chunks) != len(shape)):
Review comment: You don't have explicit tests for this check.
Reply: This was an attempt to allow alternative, dask-type chunks arguments in 'as_lazy_data'. I now see this is wrong anyway, as that initial test clause assumes that shape is iterable!
Reply: ... after some thought, I have removed this to the caller + documented the behaviour there.
        # Don't modify chunks for special values like -1, (0,), 'auto',
        # or if shape contains 0 or -1 (like raw landsea-mask data proxies).
        return chunks

    # Set the chunksize limit.
    if limit is None:
        # Fetch the default 'optimal' chunksize from the dask config.
        limit = dask.config.get('array.chunk-size')
        # Convert to bytes
        limit = da.core.parse_bytes(limit)
Review comment: Why did you choose to get …
Reply: I think I copied this from the dask source code somewhere.
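An aside for readers of this thread, not part of the diff: the dask config value is a human-readable size string, so it must be parsed into a byte count before it can be compared with array sizes. A minimal sketch of what these two lines do, using `dask.utils.parse_bytes` (assumed here to be the same helper the diff reaches via `da.core.parse_bytes`); the '128MiB' value is only dask's usual default, not something this PR sets:

```python
import dask.config
from dask.utils import parse_bytes

limit = dask.config.get('array.chunk-size')   # a string, typically '128MiB'
limit_bytes = parse_bytes(limit)              # '128MiB' -> 134217728 bytes
point_size_limit = limit_bytes / 4            # points per chunk, for 4-byte (float32) data
print(limit, limit_bytes, point_size_limit)
```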

    point_size_limit = limit / dtype.itemsize

    # Create result chunks, starting with a copy of the input.
    result = list(chunks)
    if shape is None:
Review comment: When would shape be `None`?
Reply: I think I was trying to mimic the API of …
        shape = result[:]

    if np.prod(result) < point_size_limit:
        # If size is less than maximum, expand the chunks, multiplying later
        # (i.e. inner) dims first.
        i_expand = len(shape) - 1
        while np.prod(result) < point_size_limit and i_expand >= 0:
            factor = np.floor(point_size_limit * 1.0 / np.prod(result))
            new_dim = result[i_expand] * int(factor)
            # Clip to dim size : N.B. means it cannot exceed the original dims.
            if new_dim > shape[i_expand]:
                new_dim = shape[i_expand]
            result[i_expand] = new_dim
            i_expand -= 1

    else:
        # Similarly, reduce if too big, reducing earlier (outer) dims first.
        i_reduce = 0
        while np.prod(result) > point_size_limit:
            factor = np.ceil(np.prod(result) / point_size_limit)
            new_dim = int(result[i_reduce] / factor)
            if new_dim < 1:
                new_dim = 1
            result[i_reduce] = new_dim
            i_reduce += 1

def _limited_shape(shape):
    # Reduce a shape to less than a default overall number-of-points, reducing
    # earlier dimensions preferentially.
    # Note: this is only a heuristic, assuming that earlier dimensions are
    # 'outer' storage dimensions -- not *always* true, even for NetCDF data.
    shape = list(shape)
    i_reduce = 0
    while np.prod(shape) > _MAX_CHUNK_SIZE:
        factor = np.ceil(np.prod(shape) / _MAX_CHUNK_SIZE)
        new_dim = int(shape[i_reduce] / factor)
        if new_dim < 1:
            new_dim = 1
        shape[i_reduce] = new_dim
        i_reduce += 1
    return tuple(shape)
    return tuple(result)

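A rough illustration of the new helper's behaviour (not part of the diff; the shapes and the explicit 1 MiB limit are made up for the example). Over-large chunks are cut down along the leading (outer) dimensions, under-sized chunks are grown from the trailing (inner) dimensions outwards and clipped to the full shape, and chunk arguments the function does not recognise pass straight through, per the guard discussed in the thread above:

```python
import numpy as np

one_mib = 1024 * 1024   # an artificially small limit, to keep the numbers readable

# Too big: the outer dimension is split, the inner ones are kept whole.
print(_optimum_chunksize((1000, 500, 500), shape=(1000, 500, 500),
                         limit=one_mib, dtype=np.dtype('f4')))
# expected: (1, 500, 500)

# Too small: inner dimensions are expanded first, clipped to the full shape.
print(_optimum_chunksize((1, 100, 100), shape=(1000, 500, 500),
                         limit=one_mib, dtype=np.dtype('f4')))
# expected: (1, 500, 500)

# Unrecognised chunk specs (e.g. -1 or 'auto') are returned unchanged,
# as the guard stands in this revision.
print(_optimum_chunksize(-1, shape=(1000, 500, 500), limit=one_mib))
# expected: -1
```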
def as_lazy_data(data, chunks=None, asarray=False):

@@ -106,9 +172,12 @@ def as_lazy_data(data, chunks=None, asarray=False):

    """
    if chunks is None:
        # Default to the shape of the wrapped array-like,
        # but reduce it if larger than a default maximum size.
        chunks = _limited_shape(data.shape)
        # No existing chunks : Make a chunk the shape of the entire input array
        # (but we will subdivide it if too big).
        chunks = list(data.shape)

    # Expand or reduce the basic chunk shape to an optimum size.
    chunks = _optimum_chunksize(chunks, shape=data.shape, dtype=data.dtype)

    if isinstance(data, ma.core.MaskedConstant):
        data = ma.masked_array(data.data, mask=data.mask)
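A short sketch of the net effect on `as_lazy_data` (not part of the diff; the module path `iris._lazy_data`, the array sizes and the quoted chunk shapes are illustrative assumptions). With no `chunks` argument, the wrapped array's full shape is the starting point, which `_optimum_chunksize` then subdivides only if it exceeds the configured limit:

```python
import numpy as np
from iris._lazy_data import as_lazy_data

# A small array stays as a single chunk covering the whole shape.
small = as_lazy_data(np.zeros((10, 20), dtype='f4'))
print(small.chunks)      # ((10,), (20,))

# A large array (~240 MB of float32) is split along its leading dimension,
# assuming the default 128 MiB 'array.chunk-size' limit.
big = as_lazy_data(np.zeros((500, 400, 300), dtype='f4'))
print(big.chunks)        # e.g. ((250, 250), (400,), (300,))
```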
@@ -1,4 +1,4 @@
# (C) British Crown Copyright 2010 - 2018, Met Office
# (C) British Crown Copyright 2010 - 2019, Met Office
#
# This file is part of Iris.
#
@@ -511,7 +511,7 @@ def _get_cf_var_data(cf_var, filename):
    proxy = NetCDFDataProxy(cf_var.shape, dtype, filename, cf_var.cf_name,
                            fill_value)
    chunks = cf_var.cf_data.chunking()
    # Chunks can be an iterable, None, or `'contiguous'`.
    # Chunks can be an iterable, or `'contiguous'`.
Review comment: I don't understand this change. You have removed …
Reply: There are two different sets of conventions here. … I will try to clarify in the comments.
    if chunks == 'contiguous':
        chunks = None
    return as_lazy_data(proxy, chunks=chunks)
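For reference on the two conventions discussed in the thread above: netCDF4's `Variable.chunking()` returns either the string `'contiguous'` or a list of chunk lengths, while `as_lazy_data` treats `chunks=None` as "no pre-existing chunking, derive chunks from the full shape". A minimal sketch of that mapping outside Iris (the file and variable names are hypothetical):

```python
import netCDF4

ds = netCDF4.Dataset('example.nc')           # hypothetical file
var = ds.variables['air_temperature']        # hypothetical variable

chunks = var.chunking()
# netCDF4 convention: 'contiguous' means the variable is stored unchunked.
if chunks == 'contiguous':
    # Convention used here: None means "let as_lazy_data choose the chunks".
    chunks = None
print(chunks)
```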
Review comment: @pp-mo See #3320 for context. Currently in Python 3.7, you get the following DeprecationWarning: … Could you adopt the following pattern: …
Reply: Looks good, will do!
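The reviewer's suggested snippet is not preserved in this view. A common way to satisfy both Python 2 and Python 3.7+ (an assumption about the intent, not necessarily the reviewer's exact code) is to try `collections.abc` first and fall back to `collections`:

```python
try:
    # Python 3.3+ : the abstract base classes live in collections.abc.
    from collections.abc import Iterable
except ImportError:
    # Python 2 fallback.
    from collections import Iterable
```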