Reading netcdf files is slow if there are unlimited dimensions #3357

tkarna · 2019-07-12T14:14:35Z

Reading arrays from NetCDF files is slow if one dimension is unlimited.

An example: reading an array of shape (5000, 50) takes ~7 s if the first dimension is unlimited. If both dimensions are fixed, it takes ~0.02 s. This is a major bottleneck if many (100s) of such files need to be processed. Time dimension is often declared unlimited in files generated by circulation models.

Test case:

import iris
import time

f = 'example_dataset.nc'
var = 'sea_water_practical_salinity'

tic = time.process_time()
cube = iris.load_cube(f, var)
cube.data
duration = time.process_time() - tic
print('Duration {:.3f} s'.format(duration))

The input NetCDF file can be generated with:

import iris
import numpy
import datetime

ntime = 5000
nz = 50
dt = 600.
time = numpy.arange(ntime, dtype=float)*dt
date_zero = datetime.datetime(2000, 1, 1)
date_epoch = datetime.datetime.utcfromtimestamp(0)
time_epoch = time + (date_zero - date_epoch).total_seconds()
z = numpy.linspace(0, 10, nz)
values = 5*numpy.sin(time/(14*24*3600.))
values = numpy.tile(values, (nz, 1)).T

time_dim = iris.coords.DimCoord(time_epoch, standard_name='time',
                                units='seconds since 1970-01-01 00:00:00-00')
z_dim = iris.coords.DimCoord(z, standard_name='depth', units='m')
cube = iris.cube.Cube(values)
cube.standard_name = 'sea_water_practical_salinity'
cube.units = '1'
cube.add_dim_coord(time_dim, 0)
cube.add_dim_coord(z_dim, 1)
iris.fileformats.netcdf.save(cube, 'example_dataset.nc',
                             unlimited_dimensions=['time'])

Profiling suggest that in the unlimited case, each time slice is being read separately, i.e. NetCDFDataProxy.__getitem__ is being called 5000 times.

Tested with: iris version 2.2.0, Anaconda3 2019.03

The text was updated successfully, but these errors were encountered:

bjlittle · 2019-07-12T21:38:09Z

Thanks for taking the time to report this @tkarna.

I'll see if I can replicate your timings on my side. What version of iris are you using? I'm assuming 2.2.1

tkarna · 2019-07-13T03:41:30Z

I'm using version 2.2.0 installed with conda.

tkarna · 2019-07-13T04:06:40Z

For what it's worth I see the same performance with version 2.2.1 and the master branch as well.

pp-mo · 2019-07-15T14:08:59Z

Could even be a candidate for user-controlled chunking ? see #3333

bjlittle · 2019-07-15T15:26:47Z

@tkarna I can recreate the behaviour on my side, so thanks for the repeatable example.

That said, after digging a little, it appears that dask is blocking in a sleep...

I'll dig a little further...

For the non unlimited case, the netcdf variable sea_water_practical_salinity has chunking of contiguous i.e. no chunking. However, for the unlimited case, the netcdf variable sea_water_practical_salinity has a default chunking of [1, 50].... hence why it is being read 5000 times. This explains why the load time is chronically slow - note that dask prefers much larger chunk sizes, otherwise the overhead of using dask outweighs the processing of small chunks of data payload, see https://docs.dask.org/en/latest/array-chunks.html#specifying-chunk-shapes

That said, I simply saved the netcdf file with the netcdf chunksizes set specifically to [5000, 50] e.g.

import iris
import numpy
import datetime

ntime = 5000
nz = 50
dt = 600.
time = numpy.arange(ntime, dtype=float)*dt
date_zero = datetime.datetime(2000, 1, 1)
date_epoch = datetime.datetime.utcfromtimestamp(0)
time_epoch = time + (date_zero - date_epoch).total_seconds()
z = numpy.linspace(0, 10, nz)
values = 5*numpy.sin(time/(14*24*3600.))
values = numpy.tile(values, (nz, 1)).T

time_dim = iris.coords.DimCoord(time_epoch, standard_name='time',
                                units='seconds since 1970-01-01 00:00:00-00')
z_dim = iris.coords.DimCoord(z, standard_name='depth', units='m')
cube = iris.cube.Cube(values)
cube.standard_name = 'sea_water_practical_salinity'
cube.units = '1'
cube.add_dim_coord(time_dim, 0)
cube.add_dim_coord(z_dim, 1)
iris.fileformats.netcdf.save(cube, 'example_dataset_unlimited_chunks.nc',
                             unlimited_dimensions=['time'], chunksizes=[5000, 50])

This resulted in a load time of ~0.057s

So either:

set the chunksizes explicitly to iris.fileformats.netcdf.save
or align the dask reading chunks with the default netcdf variable chunksizes

Iris will always set the chunksize on loading within dask to align with the chunksize in netcdf, but obviously this can be sub-optimal to dask (and iris) for small netcdf chunksizes

bjlittle · 2019-07-16T06:49:46Z

@tkarna As an extension, I customised a development version of iris to hardwire the chunks to be [5000, 50] on loading a cube i.e. the custom chunks = [5000, 50] overrides the netcdf chunking specified in the file for the variable sea_water_practical_salinity of [1, 50].

This custom chunks aligns with the underlying [1, 50] default chunking of the netcdf variable with UNLIMITED 1st dimension. This results in a load and touching of the data of ~0.15s.

I get the same good loading/touching performance if I set the custom chunks to -1 or auto.

This suggests to me that it would be useful to give control to the user to specify the chunks and override any chunking that is specified in the netcdf file at the iris API level.

@tkarna Does this help?

bjlittle · 2019-07-16T06:50:39Z

Ping @pp-mo and @TomekTrzeciak

tkarna · 2019-07-16T07:16:00Z

Thanks @bjlittle for looking into this. I confirm that the performance hit is caused by chunking. Setting chunks=None in netcdf._get_cf_var_data fixes the issue.

Modifying the input file chunking is not an option for me, as I get these files from an external model and I do not wish to convert them. Iris should be able to read any netCDF file with reasonable performance.

Having the chunk options in the iris load_cube API would be useful. This would fix my issue.

tkarna · 2019-07-16T07:22:36Z

Another option, perhaps, would be to define better default chunking scheme. For example, if the file chunk only contains < N values, one would revert to chunks=None as (I'd imagine) reading in tiny chunks is never a good idea. This would also be easier for the user as one wouldn't need to tweak with the chunk options.

bjlittle · 2019-07-16T07:36:48Z

@tkarna I think that there should be a happy compromise between:

iris honouring the chunking specified in the netcdf file
iris overriding the chunking specified in the netcdf file that is considered too small and will result in a performance hit
allowing the user to override iris by specifying the chunks at the API level

iris already performs step 1. above, as we know. Extending iris to support steps 2. and 3. should cover most other use cases, and give enough wiggle room for users to circumvent most chunking issues. Like you said, you're a third-party to the data and can't change the native netcdf chunking of variables in the file, so iris needs to help you easily work around that.

tkarna · 2019-07-16T07:39:35Z

@bjlittle Yes, that sounds very good indeed.

pp-mo · 2019-07-16T09:59:56Z

FYI I am also currently investigating a slow load case, where the file has very small chunks specified, and that shows similar gross inefficiencies!
It also appears, from trials with "just dask", that Iris itself is responsible for a large part of the performance overhead, because (if I understand right) it opens + closes the file every time.
So maybe chunking control isn't the only thing that needs improving.
I hope to distil this into a separate issue, and maybe we can use it as a performance testcase. Watch this space ...

bjlittle · 2019-07-16T10:14:04Z

@pp-mo Dask automatic chunking is relevant here.

bjlittle · 2019-07-16T10:16:08Z

@pp-mo To separate concerns here, could you please open a separate issue (cross-referencing this one) that focuses on the performance overhead of opening and closing files, thanks.

bjlittle · 2019-07-16T10:39:36Z

@pp-mo Just to be clear, setting a dimension to UNLIMITED causes netCDF4 to inject a chunking strategy automatically for a netcdf variable in the netcdf file. The choice of chunking used can be unfavourable to dask, making loading sub-optimal, as the example in this issue highlights.

TomekTrzeciak · 2019-07-16T10:39:48Z

@tkarna I think that there should be a happy compromise between:

iris honouring the chunking specified in the netcdf file

iris overriding the chunking specified in the netcdf file that is considered too small and will result in a performance hit

allowing the user to override iris by specifying the chunks at the API level

iris already performs step 1. above, as we know. Extending iris to support steps 2. and 3. should cover most other use cases, and give enough wiggle room for users to circumvent most chunking issues. Like you said, you're a third-party to the data and can't change the native netcdf chunking of variables in the file, so iris needs to help you easily work around that.

@bjlittle, personally I would vote to do the dumb thing here and either load each file in one chunk or rely on dask to make the chunking choice unless overridden by an explicit option. This leads to much more predictable behaviour in the long run. You can then allow users to explicitly set chunks=... in iris on per call basis or with context manager or globally.

Also, I think it's preferable to address any deficiencies in chunking policy upstream in dask, so that iris doesn't have to interfere too much. But based on dask.array.from_array docs, dask already respects chunk boundaries if the underlying object has chunks attribute. Iris could add this to its NetCDFDataProxy object and get the right behaviour for free.

tkarna · 2019-07-16T10:58:03Z

@bjlittle @pp-mo I do agree that the performance hit is partially in iris itself, and related to the opening/closing of the netCDF file. This happens in NetCDFDataProxy.__getitem__ which in my worst-case example is called 5000 times. As demonstrated, the chunking choice does affect the reading pattern though.

pp-mo · 2019-07-16T11:33:50Z

@TomekTrzeciak personally I would vote to do the dumb thing ... load each file in one chunk

Unfortunately loading a variable as a single chunk is the one thing we must avoid, if the chunk is simply unmanageably large. That is the one thing that the current strategy was designed to solve.

If we currently decide that a chunk is too large + divide it up, I think in future we can reasonably decide that a chunk is too small and combine a few. The logic is essentially the same.
However, the undesirable effect of unlimited dimensions, in that it mimics user-selected variable chunking, is a bit of a shock. Maybe Dask itself has already engineered ways around that problem. I'm not sure yet how much we can handily delegate decisions to dask.

TomekTrzeciak · 2019-07-16T12:55:06Z

But based on dask.array.from_array docs, dask already respects chunk boundaries if the underlying object has chunks attribute. Iris could add this to its NetCDFDataProxy object and get the right behaviour for free.

Here's how this looks in practice:

class PseudoArray:
    def __init__(self, shape, dtype, chunks):
        self.shape = shape
        self.dtype = dtype
        self.chunks = chunks
shape = (120, 12, 1000, 1000)
netcdf_chunks = (16, 2, 32, 32)
a_proxy = PseudoArray(shape, dtype='float64', chunks=netcdf_chunks)
result = da.from_array(a_proxy, chunks='50 MiB')
print('netcdf chunks:', da.core.normalize_chunks(a_proxy.chunks, a_proxy.shape, a_proxy.dtype))
print('dask chunks:', result.chunks)

Output:
netcdf chunks: ((16, 16, 16, 16, 16, 16, 16, 8), (2, 2, 2, 2, 2, 2), (32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 8), (32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 8))
dask chunks: ((48, 48, 24), (6, 6), (96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 40), (96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 40))

Edit:

print('dask default chunks (128 MiB):', da.from_array(a_proxy).chunks)

Output:
dask default chunks (128 MiB): ((64, 56), (8, 4), (128, 128, 128, 128, 128, 128, 128, 104), (128, 128, 128, 128, 128, 128, 128, 104))

Edit2:

And this is what happens if there's no chunks exposed (note how this will trash the original chunk boundaries):

print('dask bad chunks:', da.from_array(PseudoArray(shape, 'float64', None)).chunks)

Output:
dask bad chunks: ((60, 60), (12,), (100, 100, 100, 100, 100, 100, 100, 100, 100, 100), (100, 100, 100, 100, 100, 100, 100, 100, 100, 100))

TomekTrzeciak · 2019-07-16T14:13:44Z

You can then allow users to explicitly set chunks=... in iris on per call basis or with context manager or globally.

Actually, dask already comes with its own context manger for that purpose, so one could (should?) do something like this:

with dask.config.set({'array.chunk-size': '20 MiB'}):
    mycube = iris.load_cube(filepath)

The above should work after the following change:

diff --git a/lib/iris/_lazy_data.py b/lib/iris/_lazy_data.py
index f5312b7d..00eef321 100644
--- a/lib/iris/_lazy_data.py
+++ b/lib/iris/_lazy_data.py
@@ -58,28 +58,6 @@ def is_lazy_data(data):
     return result
 
 
-# A magic value, chosen to minimise chunk creation time and chunk processing
-# time within dask.
-_MAX_CHUNK_SIZE = 8 * 1024 * 1024 * 2
-
-
-def _limited_shape(shape):
-    # Reduce a shape to less than a default overall number-of-points, reducing
-    # earlier dimensions preferentially.
-    # Note: this is only a heuristic, assuming that earlier dimensions are
-    # 'outer' storage dimensions -- not *always* true, even for NetCDF data.
-    shape = list(shape)
-    i_reduce = 0
-    while np.prod(shape) > _MAX_CHUNK_SIZE:
-        factor = np.ceil(np.prod(shape) / _MAX_CHUNK_SIZE)
-        new_dim = int(shape[i_reduce] / factor)
-        if new_dim < 1:
-            new_dim = 1
-        shape[i_reduce] = new_dim
-        i_reduce += 1
-    return tuple(shape)
-
-
 def as_lazy_data(data, chunks=None, asarray=False):
     """
     Convert the input array `data` to a dask array.
@@ -92,8 +70,7 @@ def as_lazy_data(data, chunks=None, asarray=False):
     Kwargs:
 
     * chunks:
-        Describes how the created dask array should be split up. Defaults to a
-        value first defined in biggus (being `8 * 1024 * 1024 * 2`).
+        Describes how the created dask array should be split up.
         For more information see
         http://dask.pydata.org/en/latest/array-creation.html#chunks.
 
@@ -105,11 +82,6 @@ def as_lazy_data(data, chunks=None, asarray=False):
         The input array converted to a dask array.
 
     """
-    if chunks is None:
-        # Default to the shape of the wrapped array-like,
-        # but reduce it if larger than a default maximum size.
-        chunks = _limited_shape(data.shape)
-
     if isinstance(data, ma.core.MaskedConstant):
         data = ma.masked_array(data.data, mask=data.mask)
     if not is_lazy_data(data):

pp-mo · 2019-07-16T14:17:54Z

I do like the idea of exporting a 'chunks' on the data wrapper object. This makes me hope that delegating chunking decisions to Dask could work.

However, the problem with the above approach is that the main data variable is not the only variable that may be lazy-wrapped within an iris.load call. If lazy aux-coords are created, they won't map the same dimensions as the main data variable.
So it is ok for 'general choices' like array.chunk-size, but we can't use it to control chunk sizes directly.

Likewise, as 'iris.load_xxx' in general deals with multiple input files and multiple output cubes, it just needs a bit more finesse, I think.

I detailed this problem in #3333

pp-mo · 2019-07-16T14:23:59Z

I notice Dask have now published more info + advice on chunking.
In case you didn't see, https://docs.dask.org/en/latest/array-best-practices.html#select-a-good-chunk-size and https://docs.dask.org/en/latest/array-chunks.html#automatic-chunking

TomekTrzeciak · 2019-07-16T14:51:57Z

However, the problem with the above approach is that the main data variable is not the only variable that may be lazy-wrapped within an iris.load call : If lazy aux-coords are created, they won't map the same dimensions as the main data variable.

Not sure if I follow. Do you mean that aux-coords and variables should have the same chunking along the corresponding dimensions? If that's the case, then I guess the same problem applies to regular variables too (e.g. xy var broadcasted against xyt var with misaligned chunks will work but will be suboptimal for computation).

pp-mo · 2019-07-16T16:37:49Z

the same chunking along the corresponding dimensions?

Well that would seem to make sense, but it isn't how it is actually controlled.
I looked into the setting of chunking in netcdf, and AFAICT it goes like :

chunking is an HDF5 concept, therefore only applies to netcdf-4 (not earlier) format
it is specified per-variable, fixed at variable creation time (python Dataset.createVariable call)
- can't subsequently be changed
- is not tied to dimensions
- is a property of the variable, stored in the file and can be read from a re-opened file
it affects the actual storage layout (notably, allowing for unlimited dimensions)
it is the opposite of 'contiguous'
- python Variable.chunksizes() returns either "contiguous" or a tuple of ndims * chunks

So this, especially the last bit, may explain why having an unlimited dimension seems to 'force' the variable to be chunked : A 'chunked' form is required for extensibility along the unlimited dimension.

Finally, what I thought still wasn't clear, is whether you can have user-specified chunks when some dimension(s) are unlimited.
So, I tried it + I think you can ...

>>> ds = nc.Dataset('tmp.nc', 'w')
>>> ds.createDimension('t', 0)
>>> ds.createDimension('y', 3)
>>> ds.createDimension('x', 4)
>>> ds.createVariable('v1', np.int32, ['t', 'x', 'y'], chunksizes=(1, 1, 2))
>>> ds.close()
>>> ds = nc.Dataset('tmp.nc', 'r')
>>> ds.variables['v1'].chunking()
[1, 1, 2]
>>>

And in fact, even ...

>>> ds.createVariable('v1', np.int32, ['t', 'x', 'y'], chunksizes=(5, 1, 2))
>>> ds.close()
>>> ds = nc.Dataset('tmp.nc', 'r')
>>> ds.variables['v1'].chunking()
[5, 1, 2]
>>>

But who knows what that actually means ???
More answers --> more questions as usual.

pp-mo · 2019-07-16T18:05:19Z

I made a new POC extension of the _limited_shape call : https://github.com/SciTools/iris/compare/master...pp-mo:chunk_control?expand=1

That does fix a testcase I have with very small chunks ~= 5000-odd * (1, 128, 128) :
Result is much faster.

I will try to post that as a testcase when it comes to a PR.
However, I'm not sure I have the options right yet.
At present, it is happily modifying any chunking specified in the file that it "doesn't like".
It also still needs all the extensions for the user control. And, we should explore using dask 'auto' to replace this Iris code.

pp-mo · 2019-07-17T09:33:59Z

However, I'm not sure I have the options right yet.

I've just tried to make a summary of the outstanding known issues.
I hope this is encouragingly concise !! ...

Requests :

More control over lazy data creation (chunking) #3333 : user control of dask chunking in Iris
Reading netcdf files is slow if there are unlimited dimensions #3357 : unlimited dimensions cause gross chunking inefficiencies in Iris

Problems:

files with awkwardly small chunks set [like STAGE files, here in MetOffice]
files with unlimited dimensions
( files with large contiguous variables : this one is currently fixed, in Iris )
Iris imposes high per-chunk overhead
Iris target (i.e. default) chunk size is too small
Iris implements own chunk handling, unnecessarily duplicating Dask 'auto' option
Iris should allow user control for special cases
Dask cannot see chunk options in an Iris NetCDFDataProxy

Possible Solution Techniques:

(problem#4) reduce Iris per-chunk overhead by keeping files open (somewhat, somehow)
~~(problem#5) increase Iris target chunksize~~ (UPDATE: now configurable Chunk control #3361)
~~(problem#1) extend Iris chunk 'guessing' to expand too-small chunks (as in the POC mentioned above)~~ (UPDATE: done Chunk control #3361)
replace Iris chunk 'guessing' logic with use of Dask 'chunks="auto"'
a. give Iris NetCDFDataProxy a 'chunks' property, to enable dask 'auto' to work properly
enable user control of chunks in Iris loading (i.e. More control over lazy data creation (chunking) #3333)
a. add keys to netcdf.load_cubes
a. add **kwargs (and myabe *args) handling into iris.load_XXX calls

Proposed Key Testcases:

file with huge variable
file with explicit, too-large chunks
file with explicit, too-small chunks
file with unlimited dimension
file with unlimited dimension AND creator-selected chunking

Of course, this only covers the issues known to me (!)
@bjlittle @tkarna @TomekTrzeciak Any comments / additions / suggestions ?

TomekTrzeciak · 2019-07-17T16:16:13Z

I've just tried to make a summary of the outstanding known issues.
I hope this is encouragingly concise !! ...

That's a nice write-up 👍, thanks for collecting it all in one place.

Possible Solution Techniques:

reduce Iris per-chunk overhead by keeping files open (somewhat, somehow)

Definitely non-trivial problem, especially if you consider use cases involving pickling and such. There is a fair bit of code in xarray CachingFileManager for that very purpose.

increase Iris target chunksize

extend Iris chunk 'guessing' to expand too-small chunks (as in the POC mentioned above)

replace Iris chunk 'guessing' logic with use of Dask 'chunks="auto"'
a. give Iris NetCDFDataProxy a 'chunks' property, to enable dask 'auto' to work properly

enable user control of chunks in Iris loading (i.e. More control over lazy data creation (chunking) #3333)
a. add keys to netcdf.load_cubes
a. add **kwargs (and myabe *args) handling into iris.load_XXX calls

4+5 would be my vote. I've taken a stab at 4 in the code I'm working on ATM:
https://github.com/metoppv/improver/pull/876/files/2cb1d0875e6e9dc99b539243532b277e22e58910..4c016001e6878699effdc732068ef3c47d589453

I think 5 is still desirable for cases where more control is need in order to avoid manual hacks as in the link above. Xarray has a nice idea there to accept chunks given as a mapping of dimension name to chunk size. This could work sufficiently well for cases with multiple variables in a single file that you were concerned about in #3333.

pp-mo · 2019-07-17T17:23:43Z

Thanks for your feedback @TomekTrzeciak

4+5 would be my vote

I'm tending that way too. But I do still have some doubts. Experimenting with the key function dask.array.core.normalize_chunks, I find it does a pretty good job except I wish it would divide earlier dimensions preferentially over later ones (like the existing Iris code). Instead, it aims to divide all dimension equally,
e.g. normalise_chunks((50, 1500, 2000), limit=10.e6, existing_chunks=(50, 1500, 2000))
results in (10, 375, 500), but we would really much prefer (3, 1500, 2000).
I think we can possibly influence this if we need to, by 'pinning' the last N dimensions with -1. But it's still a shame to maintain our own strategy, when Dask already has one.
I might feedback a proposal to dask on this, if I can see a clear purpose for the 'better' way. I think it arises because we "know" that for netcdf data, the dimensions follow C- rather than FORTRAN-order.

keeping files open ... Definitely non-trivial problem

Agreed!
From our experiments, I believe that simply using larger chunks will deliver big benefits, and is much easier to achieve.

I think 5 is still desirable for cases where more control is need

Well I'm still wondering if we could persuade you otherwise ?!?
For the cases I'm looking at, I think the equivalent of (3.=extend iris strategy) or (4.=replace iris with dask) do a good job,.
It might be good enough ??

Xarray has a nice idea there to accept chunks given as a mapping of dimension name to chunk size.

A really nice spot 💐, but it still has the same problem of being tied to the low-level file encoding, which must then be both known (to the user) and stable : In this case we need dimension names, which don't actually appear anywhere in the iris/CF data representation.
If we could use names of dimension coordinates, that would help loads with this. But I think we can only do that if we first "pre-load", then "re-load" the file. 😢

Also, technically, as the occurrence and order of dimensions will differ between variables, you might still want dimensions divided differently for different variables. But I agree that is probably an obscure case : In principle you could even have an X(r:74, t:20000, y:1500, x:2000) and a Y(x:2000, t:20000, r:74), but I think it would be really rare -- maybe a usecase in ocean data ??

pp-mo · 2019-07-17T17:32:26Z

using larger chunks will deliver big benefits

Aside: I think that @stephenworsley has recently shown significant improvements in saving a 'stack' of 2D fields, by setting the chunking of the whole stack larger than the 2D field sizes that contribute to it.

TomekTrzeciak · 2019-07-19T15:18:12Z

I wish it would divide earlier dimensions preferentially over later ones (like the existing Iris code). Instead, it aims to divide all dimension equally,

I think dask.array.core.normalize_chunks is intended as a general utility, not just for chunking files or serialised formats (most significantly, it is also used in rechunking dask arrays). I think one could easily argue for several different behaviours:

maintain original chunk proportions - sensible if you expect the access patterns that led to the current array chunks to continue in the future and just want to reduce their fragmentation below certain threshold
try to keep the chunks roughly square - sensible if access patterns along different dimensions are unpredictable
try to keep inner dimensions continuous in preference to outer ones - sensible if you want to optimise for sequential reading, like from disk or tape

I don't think there is a perfect choice here, it all depends on the use case.

I might feedback a proposal to dask on this, if I can see a clear purpose for the 'better' way.

Perhaps dask could have an option like, say, chunking_strategy='propotional' or 'equal' or 'serialized' or custom_function? That way one-fits-all solution would not be required.

pp-mo · 2019-07-26T11:58:07Z

Evidence for #3361 fixing the original problem here : see this

lbdreyer · 2019-08-23T13:43:27Z

@tkarna
The fix @pp-mo mentioned (#3361) has now been merged into master, and will be released as part of Iris 2.3

In iris 2.2 netcdf file with unlimited dimension (here time) is read one slice at a time. This is slow for long time series. See: iris issue SciTools/iris#3357

bjlittle self-assigned this Jul 12, 2019

bjlittle added the Type: Performance label Jul 12, 2019

bjlittle added Status: Decision Required Votable Feature labels Jul 16, 2019

bjlittle added Release: Minor and removed Status: Decision Required Votable Feature labels Jul 16, 2019

bjlittle added this to the v2.3.0 milestone Jul 16, 2019

pp-mo mentioned this issue Jul 16, 2019

More control over lazy data creation (chunking) #3333

Closed

znicholls mentioned this issue Jul 18, 2019

Monkey patch iris and speed up crunching znicholls/netcdf-scm#72

Merged

4 tasks

pp-mo mentioned this issue Jul 25, 2019

Chunk control #3361

Merged

This was referenced Jul 29, 2019

Use dask whenever possible in preprocessor to keep memory intake low ESMValGroup/ESMValCore#32

Closed

Iris save taking very long (hanging) with some netcdf files #3362

Closed

lbdreyer closed this as completed in #3361 Aug 23, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reading netcdf files is slow if there are unlimited dimensions #3357

Reading netcdf files is slow if there are unlimited dimensions #3357

tkarna commented Jul 12, 2019

bjlittle commented Jul 12, 2019

tkarna commented Jul 13, 2019

tkarna commented Jul 13, 2019

pp-mo commented Jul 15, 2019

bjlittle commented Jul 15, 2019 •

edited

Loading

bjlittle commented Jul 16, 2019 •

edited

Loading

bjlittle commented Jul 16, 2019

tkarna commented Jul 16, 2019

tkarna commented Jul 16, 2019

bjlittle commented Jul 16, 2019

tkarna commented Jul 16, 2019

pp-mo commented Jul 16, 2019

bjlittle commented Jul 16, 2019

bjlittle commented Jul 16, 2019

bjlittle commented Jul 16, 2019

TomekTrzeciak commented Jul 16, 2019

tkarna commented Jul 16, 2019

pp-mo commented Jul 16, 2019 •

edited

Loading

TomekTrzeciak commented Jul 16, 2019 •

edited

Loading

TomekTrzeciak commented Jul 16, 2019 •

edited

Loading

pp-mo commented Jul 16, 2019 •

edited

Loading

pp-mo commented Jul 16, 2019

TomekTrzeciak commented Jul 16, 2019

pp-mo commented Jul 16, 2019

pp-mo commented Jul 16, 2019 •

edited

Loading

pp-mo commented Jul 17, 2019 •

edited

Loading

TomekTrzeciak commented Jul 17, 2019 •

edited

Loading

Possible Solution Techniques:

pp-mo commented Jul 17, 2019 •

edited

Loading

pp-mo commented Jul 17, 2019 •

edited

Loading

TomekTrzeciak commented Jul 19, 2019

pp-mo commented Jul 26, 2019

lbdreyer commented Aug 23, 2019

Reading netcdf files is slow if there are unlimited dimensions #3357

Reading netcdf files is slow if there are unlimited dimensions #3357

Comments

tkarna commented Jul 12, 2019

bjlittle commented Jul 12, 2019

tkarna commented Jul 13, 2019

tkarna commented Jul 13, 2019

pp-mo commented Jul 15, 2019

bjlittle commented Jul 15, 2019 • edited Loading

bjlittle commented Jul 16, 2019 • edited Loading

bjlittle commented Jul 16, 2019

tkarna commented Jul 16, 2019

tkarna commented Jul 16, 2019

bjlittle commented Jul 16, 2019

tkarna commented Jul 16, 2019

pp-mo commented Jul 16, 2019

bjlittle commented Jul 16, 2019

bjlittle commented Jul 16, 2019

bjlittle commented Jul 16, 2019

TomekTrzeciak commented Jul 16, 2019

tkarna commented Jul 16, 2019

pp-mo commented Jul 16, 2019 • edited Loading

TomekTrzeciak commented Jul 16, 2019 • edited Loading

TomekTrzeciak commented Jul 16, 2019 • edited Loading

pp-mo commented Jul 16, 2019 • edited Loading

pp-mo commented Jul 16, 2019

TomekTrzeciak commented Jul 16, 2019

pp-mo commented Jul 16, 2019

pp-mo commented Jul 16, 2019 • edited Loading

pp-mo commented Jul 17, 2019 • edited Loading

Requests :

Problems:

Possible Solution Techniques:

Proposed Key Testcases:

TomekTrzeciak commented Jul 17, 2019 • edited Loading

Possible Solution Techniques:

pp-mo commented Jul 17, 2019 • edited Loading

pp-mo commented Jul 17, 2019 • edited Loading

TomekTrzeciak commented Jul 19, 2019

pp-mo commented Jul 26, 2019

lbdreyer commented Aug 23, 2019

bjlittle commented Jul 15, 2019 •

edited

Loading

bjlittle commented Jul 16, 2019 •

edited

Loading

pp-mo commented Jul 16, 2019 •

edited

Loading

TomekTrzeciak commented Jul 16, 2019 •

edited

Loading

TomekTrzeciak commented Jul 16, 2019 •

edited

Loading

pp-mo commented Jul 16, 2019 •

edited

Loading

pp-mo commented Jul 16, 2019 •

edited

Loading

pp-mo commented Jul 17, 2019 •

edited

Loading

TomekTrzeciak commented Jul 17, 2019 •

edited

Loading

pp-mo commented Jul 17, 2019 •

edited

Loading

pp-mo commented Jul 17, 2019 •

edited

Loading