Reading netcdf files is slow if there are unlimited dimensions #3357
Thanks for taking the time to report this @tkarna. I'll see if I can replicate your timings on my side. What version are you using?
I'm using version ...
For what it's worth I see the same performance with version ...
Could even be a candidate for user-controlled chunking? See #3333.
@tkarna I can recreate the behaviour on my side, so thanks for the repeatable example. That said, after digging a little, it appears that iris adopts the chunking defined in the netCDF file itself when it builds the lazy (dask) array. For the non-unlimited case the variable is stored contiguously, so iris picks its own chunking; with an unlimited dimension the file is written with very small chunks, and those become the dask chunks. That said, I simply saved the netcdf file with the netcdf chunksizes set explicitly:

```python
import iris
import numpy
import datetime

ntime = 5000
nz = 50
dt = 600.
time = numpy.arange(ntime, dtype=float)*dt
date_zero = datetime.datetime(2000, 1, 1)
date_epoch = datetime.datetime.utcfromtimestamp(0)
time_epoch = time + (date_zero - date_epoch).total_seconds()
z = numpy.linspace(0, 10, nz)
values = 5*numpy.sin(time/(14*24*3600.))
values = numpy.tile(values, (nz, 1)).T
time_dim = iris.coords.DimCoord(time_epoch, standard_name='time',
                                units='seconds since 1970-01-01 00:00:00-00')
z_dim = iris.coords.DimCoord(z, standard_name='depth', units='m')
cube = iris.cube.Cube(values)
cube.standard_name = 'sea_water_practical_salinity'
cube.units = '1'
cube.add_dim_coord(time_dim, 0)
cube.add_dim_coord(z_dim, 1)
iris.fileformats.netcdf.save(cube, 'example_dataset_unlimited_chunks.nc',
                             unlimited_dimensions=['time'], chunksizes=[5000, 50])
```

This resulted in a fast load time, comparable to the non-unlimited case. So either:

- Iris will always set the chunksize on loading within dask to align with the chunksize in netcdf, but obviously this can be sub-optimal for dask (and iris) for small netcdf chunksizes
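For reference, one way to see what chunking a file actually ends up with is netCDF4's `Variable.chunking()`. A minimal sketch, assuming the two example files discussed above and that iris names the data variable after its standard name:

```python
import netCDF4

# chunking() returns 'contiguous' for an unchunked variable, or a list
# of per-dimension chunk sizes for a chunked one.
for path in ('example_dataset_unlimited.nc',          # assumed name of the original, unlimited-dim file
             'example_dataset_unlimited_chunks.nc'):  # file saved above with explicit chunksizes
    with netCDF4.Dataset(path) as nc:
        var = nc.variables['sea_water_practical_salinity']
        print(path, var.chunking())
```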
@tkarna As an extension, I customised a development version of iris so that the chunking used when wrapping a netCDF variable in dask can be overridden. With this custom chunking I get the same good loading/touching performance on the unlimited-dimension file if I set the chunks to span the whole variable. This suggests to me that it would be useful to give control to the user to specify the dask chunking at load time. @tkarna Does this help?
Ping @pp-mo and @TomekTrzeciak
Thanks @bjlittle for looking into this. I confirm that the performance hit is caused by chunking. Setting the chunksizes explicitly when saving the file avoids the slow-down. Modifying the input file chunking is not an option for me, however, as I get these files from an external model and I do not wish to convert them. Iris should be able to read any netCDF file with reasonable performance. Having chunk options in the load call would be good.
Another option, perhaps, would be to define a better default chunking scheme. For example, if a file chunk only contains fewer than N values, one would revert to a larger dask chunk spanning many file chunks.
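A rough sketch of that kind of heuristic, alongside the divide-when-too-large logic iris already has; the target size and the outer-dimensions-first policy are illustrative assumptions, not an iris implementation:

```python
import numpy as np

_TARGET_POINTS = 8 * 1024 * 1024 * 2  # same magic value as iris's _MAX_CHUNK_SIZE

def adjust_chunks(file_chunks, shape, target=_TARGET_POINTS):
    # Illustrative heuristic only: grow or shrink per-dimension chunk sizes
    # towards a target number of points per chunk, outer dimensions first.
    chunks = list(file_chunks)
    for i in range(len(chunks)):
        # Too small: merge neighbouring file chunks along this dimension.
        while np.prod(chunks) < target and chunks[i] < shape[i]:
            chunks[i] = min(chunks[i] * 2, shape[i])
        # Too large: split along this dimension, as _limited_shape does.
        while np.prod(chunks) > target and chunks[i] > 1:
            chunks[i] = max(chunks[i] // 2, 1)
    return tuple(chunks)

print(adjust_chunks((1, 50), (5000, 50)))         # tiny unlimited-dim chunks get merged
print(adjust_chunks((5000, 5000), (5000, 5000)))  # an oversized chunk gets split
```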
@tkarna I think that there should be a happy compromise between respecting the chunking defined in the file and choosing dask chunks that are a sensible size to process.
@bjlittle Yes, that sounds very good indeed.
FYI I am also currently investigating a slow load case, where the file has very small chunks specified, and that shows similar gross inefficiencies!
@pp-mo Dask automatic chunking is relevant here.
@pp-mo To separate concerns here, could you please open a separate issue (cross-referencing this one) that focuses on the performance overhead of opening and closing files? Thanks.
@pp-mo Just to be clear, setting a dimension to unlimited means the variable is stored chunked in the file.
@bjlittle, personally I would vote to do the dumb thing here and either load each file in one chunk or rely on dask to make the chunking choice unless overridden by an explicit option. This leads to much more predictable behaviour in the long run. You can then allow users to explicitly set the chunking if they need finer control. Also, I think it's preferable to address any deficiencies in chunking policy upstream in dask, so that iris doesn't have to interfere too much. But based on the dask.array.from_array docs, dask already respects chunk boundaries if the underlying object has a `chunks` attribute.
@bjlittle @pp-mo I do agree that the performance hit is partially in ...
Unfortunately, loading a variable as a single chunk is the one thing we must avoid if the chunk is simply unmanageably large. That is the one thing that the current strategy was designed to solve. If we currently decide that a chunk is too large and divide it up, I think in future we can reasonably decide that a chunk is too small and combine a few. The logic is essentially the same.
Here's how this looks in practice:

```python
import dask.array as da

class PseudoArray:
    def __init__(self, shape, dtype, chunks):
        self.shape = shape
        self.dtype = dtype
        self.chunks = chunks

shape = (120, 12, 1000, 1000)
netcdf_chunks = (16, 2, 32, 32)
a_proxy = PseudoArray(shape, dtype='float64', chunks=netcdf_chunks)
result = da.from_array(a_proxy, chunks='50 MiB')
print('netcdf chunks:', da.core.normalize_chunks(a_proxy.chunks, a_proxy.shape, a_proxy.dtype))
print('dask chunks:', result.chunks)
```

Output:

Edit: with dask's default chunk size (128 MiB):

```python
print('dask default chunks (128 MiB):', da.from_array(a_proxy).chunks)
```

Output:

Edit2: And this is what happens if there's no `chunks` attribute to guide dask:

```python
print('dask bad chunks:', da.from_array(PseudoArray(shape, 'float64', None)).chunks)
```

Output:
Actually, dask already comes with its own context manager for that purpose, so one could (should?) do something like this:

```python
import dask
import iris

with dask.config.set({'array.chunk-size': '20 MiB'}):
    mycube = iris.load_cube(filepath)
```

The above should work after the following change:

```diff
diff --git a/lib/iris/_lazy_data.py b/lib/iris/_lazy_data.py
index f5312b7d..00eef321 100644
--- a/lib/iris/_lazy_data.py
+++ b/lib/iris/_lazy_data.py
@@ -58,28 +58,6 @@ def is_lazy_data(data):
     return result
 
 
-# A magic value, chosen to minimise chunk creation time and chunk processing
-# time within dask.
-_MAX_CHUNK_SIZE = 8 * 1024 * 1024 * 2
-
-
-def _limited_shape(shape):
-    # Reduce a shape to less than a default overall number-of-points, reducing
-    # earlier dimensions preferentially.
-    # Note: this is only a heuristic, assuming that earlier dimensions are
-    # 'outer' storage dimensions -- not *always* true, even for NetCDF data.
-    shape = list(shape)
-    i_reduce = 0
-    while np.prod(shape) > _MAX_CHUNK_SIZE:
-        factor = np.ceil(np.prod(shape) / _MAX_CHUNK_SIZE)
-        new_dim = int(shape[i_reduce] / factor)
-        if new_dim < 1:
-            new_dim = 1
-        shape[i_reduce] = new_dim
-        i_reduce += 1
-    return tuple(shape)
-
-
 def as_lazy_data(data, chunks=None, asarray=False):
     """
     Convert the input array `data` to a dask array.
@@ -92,8 +70,7 @@ def as_lazy_data(data, chunks=None, asarray=False):
     Kwargs:
 
     * chunks:
-        Describes how the created dask array should be split up. Defaults to a
-        value first defined in biggus (being `8 * 1024 * 1024 * 2`).
+        Describes how the created dask array should be split up.
         For more information see
         http://dask.pydata.org/en/latest/array-creation.html#chunks.
 
@@ -105,11 +82,6 @@ def as_lazy_data(data, chunks=None, asarray=False):
         The input array converted to a dask array.
 
     """
-    if chunks is None:
-        # Default to the shape of the wrapped array-like,
-        # but reduce it if larger than a default maximum size.
-        chunks = _limited_shape(data.shape)
-
     if isinstance(data, ma.core.MaskedConstant):
         data = ma.masked_array(data.data, mask=data.mask)
     if not is_lazy_data(data):
```
I do like the idea of exporting a 'chunks' attribute on the data wrapper object. This makes me hope that delegating chunking decisions to Dask could work. However, the problem with the above approach is that the main data variable is not the only variable that may be lazy-wrapped within an iris.load call. If lazy aux-coords are created, they won't map to the same dimensions as the main data variable. Likewise, 'iris.load_xxx' in general deals with multiple input files and multiple output cubes, so it just needs a bit more finesse, I think. I detailed this problem in #3333.
I notice Dask have now published more info + advice on chunking.
Not sure if I follow. Do you mean that aux-coords and variables should have the same chunking along the corresponding dimensions? If that's the case, then I guess the same problem applies to regular variables too (e.g. an xy var broadcast against an xyt var with misaligned chunks will work, but will be suboptimal for computation).
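A small illustration of that misalignment effect with plain dask arrays (the shapes and chunk sizes are made up for the example):

```python
import dask.array as da

# A 2D 'xy' field and a 3D 'xyt' variable whose chunk boundaries don't line up.
xy = da.ones((1000, 1000), chunks=(300, 300))
xyt = da.ones((10, 1000, 1000), chunks=(1, 256, 256))

# Broadcasting them together still works, but dask first unifies the two
# chunk grids, so the result carries more (and smaller) chunks than either input.
result = xyt + xy
print(xyt.chunks)
print(result.chunks)
```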
Well that would seem to make sense, but it isn't how it is actually controlled.
So this, especially the last bit, may explain why having an unlimited dimension seems to 'force' the variable to be chunked: a 'chunked' form is required for extensibility along the unlimited dimension. Finally, what I thought still wasn't clear is whether you can have user-specified chunks when some dimension(s) are unlimited.

And in fact, even ...

But who knows what that actually means???
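On that last question, the netCDF4-python API does at least accept explicit chunksizes for a variable with an unlimited dimension; a minimal sketch (the file and variable names are made up):

```python
import numpy as np
import netCDF4

with netCDF4.Dataset('chunked_unlimited.nc', 'w') as nc:
    nc.createDimension('time', None)   # unlimited dimension
    nc.createDimension('depth', 50)
    var = nc.createVariable('salinity', 'f8', ('time', 'depth'),
                            chunksizes=(1000, 50))
    # Writing past the current extent grows the unlimited dimension.
    var[:5000, :] = np.zeros((5000, 50))
    print(var.chunking())   # reports the user-specified chunk shape
```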
I made a new POC extension of the _limited_shape call: https://github.com/SciTools/iris/compare/master...pp-mo:chunk_control?expand=1 That does fix a testcase I have with very small chunks, ~= 5000-odd * (1, 128, 128). I will try to post that as a testcase when it comes to a PR.
I've just tried to make a summary of the outstanding known issues.

Requests:

Problems:

Possible Solution Techniques:

Proposed Key Testcases:

Of course, this only covers the issues known to me (!)
That's a nice write-up 👍, thanks for collecting it all in one place.
Definitely a non-trivial problem, especially if you consider use cases involving pickling and such. There is a fair bit of code in xarray's CachingFileManager for that very purpose.
4+5 would be my vote. I've taken a stab at 4 in the code I'm working on ATM. I think 5 is still desirable for cases where more control is needed in order to avoid manual hacks as in the link above. Xarray has a nice idea there to accept chunks given as a mapping of dimension name to chunk size. This could work sufficiently well for the cases with multiple variables in a single file that you were concerned about in #3333.
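For comparison, the xarray pattern being referred to looks roughly like this (file, variable, and dimension names are illustrative):

```python
import xarray as xr

# Chunk sizes are keyed by dimension name, so they apply consistently to
# every variable that uses those dimensions, whatever its dimension order.
ds = xr.open_dataset('example_dataset.nc', chunks={'time': 1000, 'depth': 50})
print(ds['sea_water_practical_salinity'].data.chunks)
```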
Thanks for your feedback @TomekTrzeciak
I'm tending that way too. But I do still have some doubts. Experimenting with the key function ...
Agreed!
Well, I'm still wondering if we could persuade you otherwise?!?
A really nice spot 💐, but it still has the same problem of being tied to the low-level file encoding, which must then be both known (to the user) and stable. In this case we need dimension names, which don't actually appear anywhere in the iris/CF data representation. Also, technically, as the occurrence and order of dimensions will differ between variables, you might still want dimensions divided differently for different variables. But I agree that is probably an obscure case: in principle you could even have an ...
Aside: I think that @stephenworsley has recently shown significant improvements in saving a 'stack' of 2D fields, by setting the chunking of the whole stack larger than the 2D field sizes that contribute to it.
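That kind of saver setting can be expressed with the same `chunksizes` keyword used earlier in this thread; a sketch with made-up sizes (not @stephenworsley's actual figures):

```python
import numpy as np
import iris

# A 'stack' of 100 2D fields of 500 x 500: chunk the file so that each
# on-disk chunk spans 10 complete 2D fields rather than a fraction of one.
cube = iris.cube.Cube(np.zeros((100, 500, 500), dtype=np.float32),
                      long_name='stacked_fields')
iris.fileformats.netcdf.save(cube, 'stacked_fields.nc',
                             chunksizes=[10, 500, 500])
```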
I think ...
I don't think there is a perfect choice here; it all depends on the use case.
Perhaps dask could have an option like, say, ...
In iris 2.2, a netCDF file with an unlimited dimension (here time) is read one slice at a time. This is slow for long time series. See iris issue SciTools/iris#3357.
Reading arrays from NetCDF files is slow if one dimension is unlimited.

An example: reading an array of shape (5000, 50) takes ~7 s if the first dimension is unlimited. If both dimensions are fixed, it takes ~0.02 s. This is a major bottleneck if many (100s) of such files need to be processed. The time dimension is often declared unlimited in files generated by circulation models.

Test case: the input NetCDF file can be generated with a script like the one posted earlier in the thread, saving with `unlimited_dimensions=['time']` and no explicit `chunksizes`; a sketch of the timing check follows below.

Profiling suggests that in the unlimited case, each time slice is being read separately, i.e. `NetCDFDataProxy.__getitem__` is being called 5000 times.

Tested with: iris version 2.2.0, Anaconda3 2019.03
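A minimal sketch of the timing check described above, assuming a file generated as just noted (the file name is an assumption, and exact timings will vary):

```python
import time
import iris

tic = time.perf_counter()
# Loading is lazy; touching .data forces the actual netCDF reads.
cube = iris.load_cube('example_dataset_unlimited.nc')
cube.data
print('load + realise took %.2f s' % (time.perf_counter() - tic))
```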