Dask integration #1
Varying the chunk size might indeed be the only way to scale this to long sequences. The problem with that is that you might perform too much IO. An example: say I have a 100 fps movie and I need to do a heavy per-frame analysis. Chunking with (1, ...) is not an option here because there are too many frames, but I only need to perform the analysis on every 100th frame (the equivalent of `[::100]`). So either you construct a very big dict in memory or you perform far too much IO. Is there a way out of this? Most of the current PIMS readers allow this and only load the relevant frames from disk. There is even code that selects the most efficient reading function based on the queried slice (https://github.com/soft-matter/pims/blob/master/pims/base_frames.py#L310). This might be an extreme example, but it stands for a whole bunch of use cases in which a dataset is only partially processed.
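To make the scaling concern concrete, here is a minimal sketch of the one-task-per-frame construction being discussed; read_frame, the frame count, and the shapes are made up for illustration rather than taken from any PIMS reader:

```python
import dask
import dask.array as da
import numpy as np

N_FRAMES = 100_000
FRAME_SHAPE = (256, 256)

def read_frame(i):
    # Stand-in for a per-frame disk read in a real reader.
    return np.zeros(FRAME_SHAPE, dtype="uint8")

# One dask task per frame: the task graph alone holds ~100000 entries,
# and every subsequent operation adds roughly another 100000.
lazy_frames = [
    da.from_delayed(dask.delayed(read_frame)(i), shape=FRAME_SHAPE, dtype="uint8")
    for i in range(N_FRAMES)
]
movie = da.stack(lazy_frames)

# Even keeping only every 100th frame still requires building the full
# per-frame graph up front.
subset = movie[::100]
```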
I see what you mean. I don't immediately see a way to do efficient strided slicing with dask without building the large dict up front. Brainstorming possibilities to explore:
I'd be curious to know if @GenevieveBuckley, @jakirkham, or @jni see a clear way forward here.
One option is a two-step procedure. This is what I implemented in https://github.com/nens/dask-geomodeling, where we faced a similar issue. You stack lazy operations together, and only when you call .get_data(bbox=...) is the graph constructed and computed with dask. This approach works well for us, but we have to implement every array/dataframe operation ourselves, which is a pity because there is so much more available in dask.Array. An advanced optimization approach might also be feasible: you could initialize a single chunk of size (100000, 256, 256) and optimize the graph so that it reduces to only the slice(s) you are interested in.
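A very small sketch of that two-step pattern, with illustrative names rather than the actual dask-geomodeling API (the source is assumed to expose a get_frame(i) method):

```python
import dask

class LazyBlock:
    """Stack operations lazily; build a dask graph only when data is requested."""

    def __init__(self, source, func=None):
        self.source = source  # assumed to have a get_frame(i) method
        self.func = func      # optional per-frame operation

    def get_data(self, indices):
        # The graph is constructed here, only for the requested indices,
        # and handed to dask in one go.
        tasks = [dask.delayed(self.source.get_frame)(i) for i in indices]
        if self.func is not None:
            tasks = [dask.delayed(self.func)(t) for t in tasks]
        return dask.compute(*tasks)
```

With this shape, LazyBlock(reader, func).get_data(range(0, 100_000, 100)) only ever creates a couple of thousand tasks, at the cost of re-implementing each operation.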
I found a related discussion at dask: dask/dask#3514. The lazy dict-creation approach seems to require significant work on the dask core. For an optimization approach, we could make:

This would at least allow strided reads combined with a large subset of dask.Array operations.
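For reference, dask already ships a graph-culling optimization that drops tasks not needed for the requested keys; the pruning imagined above would effectively have to happen before the full per-frame graph is ever materialized. A toy illustration of the existing culling machinery:

```python
from operator import add
from dask.optimization import cull

# Only "c" is requested, so the "unused" branch is pruned from the graph.
graph = {
    "a": 1,
    "b": (add, "a", 10),
    "c": (add, "b", 100),
    "unused": (add, "a", -1),
}
culled, dependencies = cull(graph, keys=["c"])
assert "unused" not in culled
```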
Thanks, @danielballan for getting this started and @caspervdw for your focus on this exciting possibility! Sorry I haven't had much chance to contribute, especially because I have experience using dask for my particle-tracking workflow. It sounds like dask is currently not well-suited to processing large movies at all, which is also my experience. I know little of dask's internals, so I'm not sure how difficult @caspervdw's optimization rule ideas would be in practice.

One complementary way to make progress would be to make it easier for users to develop their own workarounds to this limitation. pims2 could be something that reliably plugs into generic parallel execution tools like concurrent.futures, ipyparallel, or dask itself. For example:

```python
raw_seq = pims2.open(...)
seq = pipeline_func_2(pipeline_func_1(raw_seq))[::100]

pool_exec = concurrent.futures.ProcessPoolExecutor(...)
# (or equivalent object from ipyparallel, dask, etc.)

future_frame_12 = pool_exec.submit(seq.get_frame, 12)  # obvious but inelegant

results = []
# Iterate through the sequence as an iterable of futures objects.
# If the user is not careful, results will quickly pile up in memory
# or on disk.
for fut in seq.read_futures(concurrent=pool_exec):
    results.append(analyze(fut.result()))

results = []
# Iterate through the sequence as actual frames, but with streaming.
# pims2 deals with the pool and yields frames as they become available.
# "buffer" limits the number of frames that can be processed concurrently,
# which limits memory use.
for im in seq.read(concurrent=pool_exec, buffer=50):
    results.append(analyze(im))
```

This would force pims2 to be as serializable and/or thread-safe as reasonably possible, which would hopefully make any eventual dask integration much less painful.

(In case you're curious, my own solution to particle tracking in dask has been to implement streaming, whereby each track-linking task in turn talks to the cluster scheduler and submits feature-identification tasks on several frames concurrently. This approach limits the number of frame-level tasks that can be on the graph at once. But it has the drawback of opening a fresh pims reader for each frame! That happens to be inexpensive for the movie format I use.)
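For what it's worth, the buffer-limited streaming read in the last loop could be prototyped with nothing more than concurrent.futures. The sketch below assumes a picklable seq object with __len__ and get_frame, which is not an existing pims2 API:

```python
import collections

def read_streaming(seq, pool, buffer=50):
    """Yield frames in order while keeping at most `buffer` reads in flight."""
    pending = collections.deque()
    for index in range(len(seq)):
        pending.append(pool.submit(seq.get_frame, index))
        if len(pending) >= buffer:
            yield pending.popleft().result()
    while pending:
        yield pending.popleft().result()
```

This is also where the serialization requirement bites: with a ProcessPoolExecutor, seq has to survive being pickled into the worker processes.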
Based on the discussion so far, it seems like what you want is lazy support for things like strided slicing (`[::100]`).

My naive suggestion would be to look into making a very simplistic NumPy-like array object that supports this behavior. Objects would track their start and stop indices along with their step size. Dask could still go about slicing them, assuming them to be NumPy-like arrays, though the underlying behavior would delay the reads as expected. Users or pims2 would need to be mindful of this when passing the object into dask.

Since Dask will also try to call other NumPy operations on this object, you may want to implement things like `__array__` and the usual shape/dtype/ndim attributes (and possibly the NEP-18 `__array_function__` protocol).

As serialization is relevant for any data transmission or spilling, the NumPy-like array would also need some kind of pickle support.

My hope is that these would be relatively lightweight/simple things to implement (and similar in some ways to how pims worked), but I could be missing context on how pims2 would/should work.
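A rough sketch of what such a minimal NumPy-like object might look like; every name here (the reader interface with get_frame, frame_shape, and dtype, and the LazyFrames class itself) is hypothetical rather than an existing pims or pims2 API:

```python
import numpy as np

class LazyFrames:
    """Tracks start/stop/step over a frame reader; reads only on coercion."""

    def __init__(self, reader, start=0, stop=None, step=1):
        self.reader = reader
        self.start = start
        self.stop = len(reader) if stop is None else stop
        self.step = step

    @property
    def shape(self):
        n_frames = max(0, (self.stop - self.start + self.step - 1) // self.step)
        return (n_frames,) + tuple(self.reader.frame_shape)

    @property
    def dtype(self):
        return self.reader.dtype

    @property
    def ndim(self):
        return len(self.shape)

    def __getitem__(self, key):
        # Slicing along the frame axis only updates start/stop/step;
        # no frames are read here. Positive steps only in this sketch.
        if isinstance(key, slice):
            start, stop, step = key.indices(self.shape[0])
            return LazyFrames(
                self.reader,
                start=self.start + start * self.step,
                stop=self.start + stop * self.step,
                step=self.step * step,
            )
        raise NotImplementedError("only 1-d slicing in this sketch")

    def __array__(self, dtype=None):
        # Reads happen only when NumPy (or dask) asks for actual data.
        frames = [self.reader.get_frame(i)
                  for i in range(self.start, self.stop, self.step)]
        out = np.stack(frames)
        return out if dtype is None else out.astype(dtype)
```

With a bit more indexing support, an object like this could be handed to dask.array.from_array, which mostly relies on shape, dtype, and numpy-style slicing.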
Yes. It sounds like we could do this by making slicerator the "very simplistic NumPy-like array object." It hasn't been revised to account for NEP-18 yet, but I think it easily could be.
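For reference, the NEP-18 hook itself is small; a generic sketch (not slicerator's actual code, and LazySequence is a made-up name) looks like:

```python
import numpy as np

HANDLED_FUNCTIONS = {}

def implements(np_function):
    """Register a LazySequence implementation of a NumPy function."""
    def decorator(func):
        HANDLED_FUNCTIONS[np_function] = func
        return func
    return decorator

class LazySequence:
    """Toy NEP-18 participant: np.* calls dispatch to registered handlers."""

    def __init__(self, frames):
        self.frames = list(frames)

    def __array_function__(self, func, types, args, kwargs):
        if func not in HANDLED_FUNCTIONS:
            return NotImplemented
        return HANDLED_FUNCTIONS[func](*args, **kwargs)

@implements(np.concatenate)
def _concatenate(sequences, axis=0):
    if axis != 0:
        raise NotImplementedError("this sketch only concatenates along the frame axis")
    # Stay lazy: joining sequences just chains the underlying frame lists.
    return LazySequence(f for seq in sequences for f in seq.frames)
```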
Yeah, that makes sense. One other thing I forgot to mention is that Dask has support for meta, a small placeholder array it uses to track what type of array the chunks hold.

1 seems pretty easy, though that may box us in a little if we want to use different arrays; it could be made overridable.

2 likely mostly falls out of what we have already described above. We may just want to make sure it corresponds to some small in-memory value under the hood: perhaps reading a single pixel from the image and packing that back into an array of the right type.

3 is the best of both worlds. It's a tiny amount more work than 1 while being more user friendly. It has all the benefits of 2, but is conceptually (and likely actually) simpler to implement.

Admittedly this is a minor point once everything else is in place. Going with 1 should get this off the ground with no additional work. I don't have a sense offhand of how much work 2 would be. 3 would be pretty easy to do and just as easy to use.
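Assuming the feature in question is dask's meta hook (a guess on my part, not confirmed in this thread), a minimal sketch of supplying a plain NumPy meta looks like this; the frames array is a stand-in for a lazy reader:

```python
import numpy as np
import dask.array as da

# Stand-in for a lazy frame source; a real one would read from disk.
frames = np.zeros((1000, 256, 256), dtype="uint8")

# A zero-sized array describing the chunk type and dtype without holding data.
meta = np.empty((0, 0, 0), dtype="uint8")

movie = da.from_array(frames, chunks=(100, 256, 256), meta=meta)
print(movie.dtype, movie.chunksize)  # uint8 (100, 256, 256)
```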
Just want to +1 that, from the perspective of a downstream user, having the NumPy-like array interface here would be very desirable. Exciting discussion!!
Short summary of an email discussion: it was proposed to replace our lazy FramesSequence with dask.

I actually tried to combine dask with PIMS a while ago. Back then I concluded that dask.Array is not suitable for image sequences. This has to do with the sheer number of chunks that you need to initialize on loading: dask keeps track of every chunk in a dict, so 100000 frames would give a dict of that size, and any operation would add another 100000 entries.

@danielballan proposed not giving each frame its own dask task, i.e. not using chunks of shape (1, ...). The internal dask array chunks should instead have shape (N, ...), so that each task in the graph (the dict kept by dask) represents multiple frames, keeping the total number of tasks in the graph manageable. Dask has some useful tooling where you can set the chunk size based on a desired bytes-per-chunk; the relevant docs say that anything from 10 MB to 1 GB is common. Each PIMS reader can choose a suitable default value for the chunk size and make it tunable in init.
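As a concrete illustration of the bytes-per-chunk tooling mentioned above (the shapes and the 64 MiB target are arbitrary choices for the example):

```python
import numpy as np
from dask.array.core import normalize_chunks

# Ask dask to pick the frames-per-chunk so each chunk is roughly 64 MiB,
# while keeping whole frames together along the last two axes.
chunks = normalize_chunks(
    ("auto", -1, -1),
    shape=(100_000, 256, 256),
    limit="64MiB",
    dtype=np.uint8,
)
print(len(chunks[0]))  # number of tasks along the frame axis, far fewer than 100000
```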