Reading netcdf files is slow if there are unlimited dimensions #3357

Closed
tkarna opened this issue Jul 12, 2019 · 32 comments · Fixed by #3361

Comments

@tkarna

tkarna commented Jul 12, 2019

Reading arrays from NetCDF files is slow if one dimension is unlimited.

An example: reading an array of shape (5000, 50) takes ~7 s if the first dimension is unlimited. If both dimensions are fixed, it takes ~0.02 s. This is a major bottleneck if many (100s of) such files need to be processed. The time dimension is often declared unlimited in files generated by circulation models.

Test case:

import iris
import time

f = 'example_dataset.nc'
var = 'sea_water_practical_salinity'

tic = time.process_time()
cube = iris.load_cube(f, var)
cube.data
duration = time.process_time() - tic
print('Duration {:.3f} s'.format(duration))

The input NetCDF file can be generated with:

import iris
import numpy
import datetime

ntime = 5000
nz = 50
dt = 600.
time = numpy.arange(ntime, dtype=float)*dt
date_zero = datetime.datetime(2000, 1, 1)
date_epoch = datetime.datetime.utcfromtimestamp(0)
time_epoch = time + (date_zero - date_epoch).total_seconds()
z = numpy.linspace(0, 10, nz)
values = 5*numpy.sin(time/(14*24*3600.))
values = numpy.tile(values, (nz, 1)).T

time_dim = iris.coords.DimCoord(time_epoch, standard_name='time',
                                units='seconds since 1970-01-01 00:00:00-00')
z_dim = iris.coords.DimCoord(z, standard_name='depth', units='m')
cube = iris.cube.Cube(values)
cube.standard_name = 'sea_water_practical_salinity'
cube.units = '1'
cube.add_dim_coord(time_dim, 0)
cube.add_dim_coord(z_dim, 1)
iris.fileformats.netcdf.save(cube, 'example_dataset.nc',
                             unlimited_dimensions=['time'])

Profiling suggests that in the unlimited case, each time slice is being read separately, i.e. NetCDFDataProxy.__getitem__ is being called 5000 times.
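
For reference, a minimal profiling sketch (reusing the test case above; the file and variable names are the ones from this issue) that exposes the repeated NetCDFDataProxy.__getitem__ calls:

import cProfile
import pstats

import iris

cube = iris.load_cube('example_dataset.nc', 'sea_water_practical_salinity')
cProfile.run('cube.data', 'load_profile')   # realise the lazy data under the profiler
stats = pstats.Stats('load_profile')
stats.sort_stats('ncalls').print_stats('__getitem__')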

Tested with: iris version 2.2.0, Anaconda3 2019.03

@bjlittle
Member

Thanks for taking the time to report this @tkarna.

I'll see if I can replicate your timings on my side. What version of iris are you using? I'm assuming 2.2.1

@tkarna
Author

tkarna commented Jul 13, 2019

I'm using version 2.2.0 installed with conda.

@tkarna
Author

tkarna commented Jul 13, 2019

For what it's worth I see the same performance with version 2.2.1 and the master branch as well.

@pp-mo
Member

pp-mo commented Jul 15, 2019

Could even be a candidate for user-controlled chunking ? see #3333

@bjlittle
Member

bjlittle commented Jul 15, 2019

@tkarna I can recreate the behaviour on my side, so thanks for the repeatable example.

That said, after digging a little, it appears that dask is blocking in a sleep...
[Screenshot from 2019-07-15 16-11-47]
I'll dig a little further...

For the non-unlimited case, the netcdf variable sea_water_practical_salinity has a chunking of contiguous, i.e. no chunking. However, for the unlimited case, the netcdf variable has a default chunking of [1, 50], which is why it is being read 5000 times. This explains why the load time is chronically slow: dask prefers much larger chunk sizes, otherwise the overhead of using dask outweighs the processing of the small chunks of data, see https://docs.dask.org/en/latest/array-chunks.html#specifying-chunk-shapes
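
For reference, a quick way to check the per-variable chunking that netCDF4 reports (a sketch, assuming the example_dataset.nc file generated above; the unlimited-dimension file should report [1, 50], a fixed-dimension file 'contiguous'):

import netCDF4 as nc

ds = nc.Dataset('example_dataset.nc')
# Returns 'contiguous' for an unchunked variable, otherwise a list of chunk sizes.
print(ds.variables['sea_water_practical_salinity'].chunking())
ds.close()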

That said, I simply saved the netcdf file with the netcdf chunksizes set specifically to [5000, 50] e.g.

import iris
import numpy
import datetime

ntime = 5000
nz = 50
dt = 600.
time = numpy.arange(ntime, dtype=float)*dt
date_zero = datetime.datetime(2000, 1, 1)
date_epoch = datetime.datetime.utcfromtimestamp(0)
time_epoch = time + (date_zero - date_epoch).total_seconds()
z = numpy.linspace(0, 10, nz)
values = 5*numpy.sin(time/(14*24*3600.))
values = numpy.tile(values, (nz, 1)).T

time_dim = iris.coords.DimCoord(time_epoch, standard_name='time',
                                units='seconds since 1970-01-01 00:00:00-00')
z_dim = iris.coords.DimCoord(z, standard_name='depth', units='m')
cube = iris.cube.Cube(values)
cube.standard_name = 'sea_water_practical_salinity'
cube.units = '1'
cube.add_dim_coord(time_dim, 0)
cube.add_dim_coord(z_dim, 1)
iris.fileformats.netcdf.save(cube, 'example_dataset_unlimited_chunks.nc',
                             unlimited_dimensions=['time'], chunksizes=[5000, 50])

This resulted in a load time of ~0.057s

So either:

  • pass the chunksizes explicitly to iris.fileformats.netcdf.save when writing the file
  • or align the dask reading chunks with the default netcdf variable chunksizes

Iris always sets the dask chunk size on loading to align with the chunk size in the netcdf file, but this can obviously be sub-optimal for dask (and iris) when the netcdf chunks are small.

@bjlittle
Member

bjlittle commented Jul 16, 2019

@tkarna As an extension, I customised a development version of iris to hardwire the chunks to [5000, 50] on loading a cube, i.e. the custom chunks = [5000, 50] override the netcdf chunking of [1, 50] specified in the file for the variable sea_water_practical_salinity.

This custom chunking aligns with the underlying [1, 50] default chunking of the netcdf variable with the UNLIMITED 1st dimension. Loading and touching the data then takes ~0.15 s.

I get the same good loading/touching performance if I set the custom chunks to -1 or auto.

This suggests to me that it would be useful to give control to the user to specify the chunks and override any chunking that is specified in the netcdf file at the iris API level.
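
At the dask level, the override amounts to something like this (a sketch; the in-memory array is just a stand-in for iris's lazy NetCDFDataProxy, so only the resulting chunk shapes are meaningful):

import dask.array as da
import numpy as np

proxy = np.zeros((5000, 50))                            # stand-in for the lazy proxy

lazy_file = da.from_array(proxy, chunks=(1, 50))        # mirrors the file's [1, 50] chunking
lazy_fixed = da.from_array(proxy, chunks=(5000, 50))    # the hardwired override
lazy_auto = da.from_array(proxy, chunks='auto')         # let dask choose

print(lazy_file.chunks)
print(lazy_fixed.chunks)
print(lazy_auto.chunks)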

@tkarna Does this help?

@bjlittle
Member

Ping @pp-mo and @TomekTrzeciak

@tkarna
Author

tkarna commented Jul 16, 2019

Thanks @bjlittle for looking into this. I confirm that the performance hit is caused by chunking. Setting chunks=None in netcdf._get_cf_var_data fixes the issue.

Modifying the input file chunking is not an option for me, as I get these files from an external model and I do not wish to convert them. Iris should be able to read any netCDF file with reasonable performance.

Having the chunk options in the iris load_cube API would be useful. This would fix my issue.

@tkarna
Author

tkarna commented Jul 16, 2019

Another option, perhaps, would be to define a better default chunking scheme. For example, if the file chunks contain fewer than N values each, one would revert to chunks=None, as (I'd imagine) reading in tiny chunks is never a good idea. This would also be easier for the user, as one wouldn't need to tweak the chunk options.
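
A rough sketch of that heuristic (the function name and the threshold N are illustrative only, not iris API):

import numpy as np

def choose_chunks(file_chunks, min_chunk_points=1024 * 1024):
    # If the chunking recorded in the file is tiny, ignore it and fall back to
    # the shape-based default (signalled here by returning None).
    if file_chunks is None or file_chunks == 'contiguous':
        return None
    if np.prod(file_chunks) < min_chunk_points:
        return None
    return tuple(file_chunks)

print(choose_chunks([1, 50]))   # the unlimited-dimension default -> None (rejected)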

@bjlittle
Member

@tkarna I think that there should be a happy compromise between:

  1. iris honouring the chunking specified in the netcdf file
  2. iris overriding the chunking specified in the netcdf file that is considered too small and will result in a performance hit
  3. allowing the user to override iris by specifying the chunks at the API level

iris already performs step 1. above, as we know. Extending iris to support steps 2. and 3. should cover most other use cases, and give enough wiggle room for users to circumvent most chunking issues. Like you said, you're a third-party to the data and can't change the native netcdf chunking of variables in the file, so iris needs to help you easily work around that.

@tkarna
Author

tkarna commented Jul 16, 2019

@bjlittle Yes, that sounds very good indeed.

@pp-mo
Member

pp-mo commented Jul 16, 2019

FYI I am also currently investigating a slow load case, where the file has very small chunks specified, and that shows similar gross inefficiencies!
It also appears, from trials with "just dask", that Iris itself is responsible for a large part of the performance overhead, because (if I understand right) it opens + closes the file every time.
So maybe chunking control isn't the only thing that needs improving.
I hope to distil this into a separate issue, and maybe we can use it as a performance testcase. Watch this space ...

@bjlittle
Member

@pp-mo Dask automatic chunking is relevant here.

@bjlittle
Member

@pp-mo To separate concerns here, could you please open a separate issue (cross-referencing this one) that focuses on the performance overhead of opening and closing files, thanks.

@bjlittle
Member

@pp-mo Just to be clear, setting a dimension to UNLIMITED causes netCDF4 to inject a chunking strategy automatically for a netcdf variable in the netcdf file. The choice of chunking used can be unfavourable to dask, making loading sub-optimal, as the example in this issue highlights.

@TomekTrzeciak

@tkarna I think that there should be a happy compromise between:

  1. iris honouring the chunking specified in the netcdf file
  2. iris overriding the chunking specified in the netcdf file that is considered too small and will result in a performance hit
  3. allowing the user to override iris by specifying the chunks at the API level

iris already performs step 1. above, as we know. Extending iris to support steps 2. and 3. should cover most other use cases, and give enough wiggle room for users to circumvent most chunking issues. Like you said, you're a third-party to the data and can't change the native netcdf chunking of variables in the file, so iris needs to help you easily work around that.

@bjlittle, personally I would vote to do the dumb thing here and either load each file in one chunk or rely on dask to make the chunking choice unless overridden by an explicit option. This leads to much more predictable behaviour in the long run. You can then allow users to explicitly set chunks=... in iris on a per-call basis, or with a context manager, or globally.

Also, I think it's preferable to address any deficiencies in chunking policy upstream in dask, so that iris doesn't have to interfere too much. But based on the dask.array.from_array docs, dask already respects chunk boundaries if the underlying object has a chunks attribute. Iris could add this to its NetCDFDataProxy object and get the right behaviour for free.

@tkarna
Author

tkarna commented Jul 16, 2019

@bjlittle @pp-mo I do agree that the performance hit is partially in iris itself, and related to the opening/closing of the netCDF file. This happens in NetCDFDataProxy.__getitem__ which in my worst-case example is called 5000 times. As demonstrated, the chunking choice does affect the reading pattern though.

@pp-mo
Member

pp-mo commented Jul 16, 2019

@TomekTrzeciak personally I would vote to do the dumb thing ... load each file in one chunk

Unfortunately loading a variable as a single chunk is the one thing we must avoid, if the chunk is simply unmanageably large. That is the one thing that the current strategy was designed to solve.

If we currently decide that a chunk is too large + divide it up, I think in future we can reasonably decide that a chunk is too small and combine a few. The logic is essentially the same.
However, the undesirable effect of unlimited dimensions, whereby the automatic chunking mimics user-selected variable chunking, is a bit of a shock. Maybe Dask itself has already engineered ways around that problem. I'm not sure yet how much we can handily delegate decisions to dask.
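
A sketch of the 'combine a few' idea (illustrative only, not the iris implementation): grow the chunk along the outermost dimension until it reaches a target number of points, mirroring the existing divide-earlier-dimensions-first logic in reverse.

import numpy as np

def expand_small_chunks(chunks, shape, target_points=8 * 1024 * 1024 * 2):
    # Double the leading chunk dimension until the chunk is big enough, or it
    # spans the whole dimension.
    chunks = list(chunks)
    while np.prod(chunks) < target_points and chunks[0] < shape[0]:
        chunks[0] = min(chunks[0] * 2, shape[0])
    return tuple(chunks)

print(expand_small_chunks((1, 50), (5000, 50)))   # the [1, 50] example becomes (5000, 50)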

@TomekTrzeciak

TomekTrzeciak commented Jul 16, 2019

But based on dask.array.from_array docs, dask already respects chunk boundaries if the underlying object has chunks attribute. Iris could add this to its NetCDFDataProxy object and get the right behaviour for free.

Here's how this looks in practice:

import dask.array as da

class PseudoArray:
    # Minimal array-like exposing shape, dtype and chunks, to inspect the
    # chunking that dask derives from it.
    def __init__(self, shape, dtype, chunks):
        self.shape = shape
        self.dtype = dtype
        self.chunks = chunks

shape = (120, 12, 1000, 1000)
netcdf_chunks = (16, 2, 32, 32)
a_proxy = PseudoArray(shape, dtype='float64', chunks=netcdf_chunks)
result = da.from_array(a_proxy, chunks='50 MiB')
print('netcdf chunks:', da.core.normalize_chunks(a_proxy.chunks, a_proxy.shape, a_proxy.dtype))
print('dask chunks:', result.chunks)

Output:
netcdf chunks: ((16, 16, 16, 16, 16, 16, 16, 8), (2, 2, 2, 2, 2, 2), (32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 8), (32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 8))
dask chunks: ((48, 48, 24), (6, 6), (96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 40), (96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 40))

Edit:

print('dask default chunks (128 MiB):', da.from_array(a_proxy).chunks)

Output:
dask default chunks (128 MiB): ((64, 56), (8, 4), (128, 128, 128, 128, 128, 128, 128, 104), (128, 128, 128, 128, 128, 128, 128, 104))

Edit2:

And this is what happens if there's no chunks exposed (note how this will trash the original chunk boundaries):

print('dask bad chunks:', da.from_array(PseudoArray(shape, 'float64', None)).chunks)

Output:
dask bad chunks: ((60, 60), (12,), (100, 100, 100, 100, 100, 100, 100, 100, 100, 100), (100, 100, 100, 100, 100, 100, 100, 100, 100, 100))

@TomekTrzeciak

TomekTrzeciak commented Jul 16, 2019

You can then allow users to explicitly set chunks=... in iris on per call basis or with context manager or globally.

Actually, dask already comes with its own context manager for that purpose, so one could (should?) do something like this:

import dask
import iris

with dask.config.set({'array.chunk-size': '20 MiB'}):
    mycube = iris.load_cube(filepath)

The above should work after the following change:

diff --git a/lib/iris/_lazy_data.py b/lib/iris/_lazy_data.py
index f5312b7d..00eef321 100644
--- a/lib/iris/_lazy_data.py
+++ b/lib/iris/_lazy_data.py
@@ -58,28 +58,6 @@ def is_lazy_data(data):
     return result
 
 
-# A magic value, chosen to minimise chunk creation time and chunk processing
-# time within dask.
-_MAX_CHUNK_SIZE = 8 * 1024 * 1024 * 2
-
-
-def _limited_shape(shape):
-    # Reduce a shape to less than a default overall number-of-points, reducing
-    # earlier dimensions preferentially.
-    # Note: this is only a heuristic, assuming that earlier dimensions are
-    # 'outer' storage dimensions -- not *always* true, even for NetCDF data.
-    shape = list(shape)
-    i_reduce = 0
-    while np.prod(shape) > _MAX_CHUNK_SIZE:
-        factor = np.ceil(np.prod(shape) / _MAX_CHUNK_SIZE)
-        new_dim = int(shape[i_reduce] / factor)
-        if new_dim < 1:
-            new_dim = 1
-        shape[i_reduce] = new_dim
-        i_reduce += 1
-    return tuple(shape)
-
-
 def as_lazy_data(data, chunks=None, asarray=False):
     """
     Convert the input array `data` to a dask array.
@@ -92,8 +70,7 @@ def as_lazy_data(data, chunks=None, asarray=False):
     Kwargs:
 
     * chunks:
-        Describes how the created dask array should be split up. Defaults to a
-        value first defined in biggus (being `8 * 1024 * 1024 * 2`).
+        Describes how the created dask array should be split up.
         For more information see
         http://dask.pydata.org/en/latest/array-creation.html#chunks.
 
@@ -105,11 +82,6 @@ def as_lazy_data(data, chunks=None, asarray=False):
         The input array converted to a dask array.
 
     """
-    if chunks is None:
-        # Default to the shape of the wrapped array-like,
-        # but reduce it if larger than a default maximum size.
-        chunks = _limited_shape(data.shape)
-
     if isinstance(data, ma.core.MaskedConstant):
         data = ma.masked_array(data.data, mask=data.mask)
     if not is_lazy_data(data):

@pp-mo
Member

pp-mo commented Jul 16, 2019

I do like the idea of exporting a 'chunks' on the data wrapper object. This makes me hope that delegating chunking decisions to Dask could work.

However, the problem with the above approach is that the main data variable is not the only variable that may be lazy-wrapped within an iris.load call. If lazy aux-coords are created, they won't map to the same dimensions as the main data variable.
So it is ok for 'general choices' like array.chunk-size, but we can't use it to control chunk sizes directly.

Likewise, as 'iris.load_xxx' in general deals with multiple input files and multiple output cubes, it just needs a bit more finesse, I think.

I detailed this problem in #3333

@pp-mo
Member

pp-mo commented Jul 16, 2019

I notice Dask have now published more info + advice on chunking.
In case you didn't see, https://docs.dask.org/en/latest/array-best-practices.html#select-a-good-chunk-size and https://docs.dask.org/en/latest/array-chunks.html#automatic-chunking

@TomekTrzeciak

However, the problem with the above approach is that the main data variable is not the only variable that may be lazy-wrapped within an iris.load call : If lazy aux-coords are created, they won't map the same dimensions as the main data variable.

Not sure if I follow. Do you mean that aux-coords and variables should have the same chunking along the corresponding dimensions? If that's the case, then I guess the same problem applies to regular variables too (e.g. xy var broadcasted against xyt var with misaligned chunks will work but will be suboptimal for computation).

@pp-mo
Member

pp-mo commented Jul 16, 2019

the same chunking along the corresponding dimensions?

Well that would seem to make sense, but it isn't how it is actually controlled.
I looked into the setting of chunking in netcdf, and AFAICT it goes like :

  • chunking is an HDF5 concept, therefore only applies to netcdf-4 (not earlier) format
  • it is specified per-variable, fixed at variable creation time (python Dataset.createVariable call)
    • can't subsequently be changed
    • is not tied to dimensions
    • is a property of the variable, stored in the file and can be read from a re-opened file
  • it affects the actual storage layout (notably, allowing for unlimited dimensions)
  • it is the opposite of 'contiguous'
    • python Variable.chunking() returns either "contiguous" or a list of ndims chunk sizes

So this, especially the last bit, may explain why having an unlimited dimension seems to 'force' the variable to be chunked : A 'chunked' form is required for extensibility along the unlimited dimension.

Finally, what I thought still wasn't clear, is whether you can have user-specified chunks when some dimension(s) are unlimited.
So, I tried it + I think you can ...

>>> import netCDF4 as nc
>>> import numpy as np
>>> ds = nc.Dataset('tmp.nc', 'w')
>>> ds.createDimension('t', 0)
>>> ds.createDimension('y', 3)
>>> ds.createDimension('x', 4)
>>> ds.createVariable('v1', np.int32, ['t', 'x', 'y'], chunksizes=(1, 1, 2))
>>> ds.close()
>>> ds = nc.Dataset('tmp.nc', 'r')
>>> ds.variables['v1'].chunking()
[1, 1, 2]
>>> 

And in fact, you can even specify a chunk size along the unlimited dimension that is larger than its current length (re-creating the file and dimensions as before) ...

>>> ds.createVariable('v1', np.int32, ['t', 'x', 'y'], chunksizes=(5, 1, 2))
>>> ds.close()
>>> ds = nc.Dataset('tmp.nc', 'r')
>>> ds.variables['v1'].chunking()
[5, 1, 2]
>>> 

But who knows what that actually means ???
More answers --> more questions as usual.

@pp-mo
Member

pp-mo commented Jul 16, 2019

I made a new POC extension of the _limited_shape call : https://github.com/SciTools/iris/compare/master...pp-mo:chunk_control?expand=1

That does fix a testcase I have with very small chunks ~= 5000-odd * (1, 128, 128) :
Result is much faster.

I will try to post that as a testcase when it comes to a PR.
However, I'm not sure I have the options right yet.
At present, it is happily modifying any chunking specified in the file that it "doesn't like".
It also still needs all the extensions for the user control. And, we should explore using dask 'auto' to replace this Iris code.

@pp-mo
Member

pp-mo commented Jul 17, 2019

However, I'm not sure I have the options right yet.

I've just tried to make a summary of the outstanding known issues.
I hope this is encouragingly concise !! ...

Requests :

Problems:

  1. files with awkwardly small chunks set [like STAGE files, here in MetOffice]
  2. files with unlimited dimensions
  3. ( files with large contiguous variables : this one is currently fixed, in Iris )
  4. Iris imposes high per-chunk overhead
  5. Iris target (i.e. default) chunk size is too small
  6. Iris implements own chunk handling, unnecessarily duplicating Dask 'auto' option
  7. Iris should allow user control for special cases
  8. Dask cannot see chunk options in an Iris NetCDFDataProxy

Possible Solution Techniques:

  1. (problem#4) reduce Iris per-chunk overhead by keeping files open (somewhat, somehow)
  2. (problem#5) increase Iris target chunksize (UPDATE: now configurable Chunk control #3361)
  3. (problem#1) extend Iris chunk 'guessing' to expand too-small chunks (as in the POC mentioned above) (UPDATE: done Chunk control #3361)
  4. replace Iris chunk 'guessing' logic with use of Dask 'chunks="auto"'
    a. give Iris NetCDFDataProxy a 'chunks' property, to enable dask 'auto' to work properly
  5. enable user control of chunks in Iris loading (i.e. More control over lazy data creation (chunking) #3333)
    a. add keys to netcdf.load_cubes
    b. add **kwargs (and maybe *args) handling into iris.load_XXX calls

Proposed Key Testcases:

  1. file with huge variable
  2. file with explicit, too-large chunks
  3. file with explicit, too-small chunks
  4. file with unlimited dimension
  5. file with unlimited dimension AND creator-selected chunking

Of course, this only covers the issues known to me (!)
@bjlittle @tkarna @TomekTrzeciak Any comments / additions / suggestions ?

@TomekTrzeciak

TomekTrzeciak commented Jul 17, 2019

I've just tried to make a summary of the outstanding known issues.
I hope this is encouragingly concise !! ...

That's a nice write-up 👍, thanks for collecting it all in one place.

Possible Solution Techniques:

  1. reduce Iris per-chunk overhead by keeping files open (somewhat, somehow)

Definitely non-trivial problem, especially if you consider use cases involving pickling and such. There is a fair bit of code in xarray CachingFileManager for that very purpose.

  2. increase Iris target chunksize
  3. extend Iris chunk 'guessing' to expand too-small chunks (as in the POC mentioned above)
  4. replace Iris chunk 'guessing' logic with use of Dask 'chunks="auto"'
    a. give Iris NetCDFDataProxy a 'chunks' property, to enable dask 'auto' to work properly
  5. enable user control of chunks in Iris loading (i.e. More control over lazy data creation (chunking) #3333)
    a. add keys to netcdf.load_cubes
    b. add **kwargs (and maybe *args) handling into iris.load_XXX calls

4+5 would be my vote. I've taken a stab at 4 in the code I'm working on ATM:
https://github.com/metoppv/improver/pull/876/files/2cb1d0875e6e9dc99b539243532b277e22e58910..4c016001e6878699effdc732068ef3c47d589453

I think 5 is still desirable for cases where more control is needed in order to avoid manual hacks as in the link above. Xarray has a nice idea there to accept chunks given as a mapping of dimension name to chunk size. This could work sufficiently well for cases with multiple variables in a single file that you were concerned about in #3333.
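
For comparison, the xarray pattern referred to looks roughly like this (a sketch; the file and dimension names are taken from the example in this issue):

import xarray as xr

# Chunks keyed by dimension name, so the request is independent of the order in
# which each variable happens to use the dimensions.
ds = xr.open_dataset('example_dataset.nc', chunks={'time': 1000})
print(ds['sea_water_practical_salinity'].data.chunks)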

@pp-mo
Member

pp-mo commented Jul 17, 2019

Thanks for your feedback @TomekTrzeciak

4+5 would be my vote

I'm tending that way too. But I do still have some doubts. Experimenting with the key function dask.array.core.normalize_chunks, I find it does a pretty good job, except I wish it would divide earlier dimensions preferentially over later ones (like the existing Iris code). Instead, it aims to divide all dimensions equally: e.g. for a shape of (50, 1500, 2000) and a limit of 10e6, it produces chunks of roughly (10, 375, 500), where we would really much prefer (3, 1500, 2000).
I think we can possibly influence this if we need to, by 'pinning' the last N dimensions with -1. But it's still a shame to maintain our own strategy, when Dask already has one.
I might feed back a proposal to dask on this, if I can see a clear purpose for the 'better' way. I think it arises because we "know" that for netcdf data, the dimensions follow C- rather than FORTRAN-order.
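
Illustrating the 'pinning' idea (a sketch; the exact chunk sizes dask picks may vary between versions):

import dask.array as da

# Keep the trailing dimensions whole with -1 and let dask size the first one
# within the byte limit.
chunks = da.core.normalize_chunks(
    ('auto', -1, -1), shape=(50, 1500, 2000), limit=10e6, dtype='float64')
print(chunks)   # the first dimension is split; (1500,) and (2000,) stay intact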

keeping files open ... Definitely non-trivial problem

Agreed!
From our experiments, I believe that simply using larger chunks will deliver big benefits, and is much easier to achieve.

I think 5 is still desirable for cases where more control is need

Well I'm still wondering if we could persuade you otherwise ?!?
For the cases I'm looking at, I think the equivalent of (3 = extend iris strategy) or (4 = replace iris with dask) does a good job.
It might be good enough ??

Xarray has a nice idea there to accept chunks given as a mapping of dimension name to chunk size.

A really nice spot 💐, but it still has the same problem of being tied to the low-level file encoding, which must then be both known (to the user) and stable : In this case we need dimension names, which don't actually appear anywhere in the iris/CF data representation.
If we could use names of dimension coordinates, that would help loads with this. But I think we can only do that if we first "pre-load", then "re-load" the file. 😢

Also, technically, as the occurrence and order of dimensions will differ between variables, you might still want dimensions divided differently for different variables. But I agree that is probably an obscure case : In principle you could even have an X(r:74, t:20000, y:1500, x:2000) and a Y(x:2000, t:20000, r:74), but I think it would be really rare -- maybe a usecase in ocean data ??

@pp-mo
Member

pp-mo commented Jul 17, 2019

using larger chunks will deliver big benefits

Aside: I think that @stephenworsley has recently shown significant improvements in saving a 'stack' of 2D fields, by setting the chunking of the whole stack larger than the 2D field sizes that contribute to it.

@TomekTrzeciak

I wish it would divide earlier dimensions preferentially over later ones (like the existing Iris code). Instead, it aims to divide all dimension equally,

I think dask.array.core.normalize_chunks is intended as a general utility, not just for chunking files or serialised formats (most significantly, it is also used in rechunking dask arrays). I think one could easily argue for several different behaviours:

  • maintain original chunk proportions - sensible if you expect the access patterns that led to the current array chunks to continue in the future and you just want to reduce their fragmentation below a certain threshold
  • try to keep the chunks roughly square - sensible if access patterns along different dimensions are unpredictable
  • try to keep inner dimensions contiguous in preference to outer ones - sensible if you want to optimise for sequential reading, like from disk or tape

I don't think there is a perfect choice here, it all depends on the use case.

I might feedback a proposal to dask on this, if I can see a clear purpose for the 'better' way.

Perhaps dask could have an option like, say, chunking_strategy='proportional' or 'equal' or 'serialized' or a custom function? That way a one-size-fits-all solution would not be required.

@pp-mo pp-mo mentioned this issue Jul 25, 2019
@pp-mo
Member

pp-mo commented Jul 26, 2019

Evidence for #3361 fixing the original problem here : see this

@lbdreyer
Member

@tkarna
The fix @pp-mo mentioned (#3361) has now been merged into master, and will be released as part of Iris 2.3

tkarna added a commit to tkarna/galene that referenced this issue Feb 21, 2022
In iris 2.2, a netcdf file with an unlimited dimension (here, time) is
read one slice at a time. This is slow for long time series.

See: iris issue SciTools/iris#3357