
Stateless run in kedro #162

Closed
AlexandreOuellet opened this issue Nov 13, 2019 · 6 comments
Labels
Issue: Feature Request New feature or improvement to existing feature

Comments

@AlexandreOuellet

Description

I would like Kedro to be able to run in parallel, without runs interfering with each other. Since each node uses inputs and outputs as absolute paths, two runs in parallel would overwrite each other's outputs.

Context

This is very useful when we want to run multiple Kedro batches at the same time (say, for parallelising large predictions, or even for streaming single predictions).

Possible Implementation

I am not certain that this is feasible in Kedro right now (or ever), given the scope of this tool. My best guess at a proper implementation would be an option for temporary paths in AbstractDataset, implemented by each concrete dataset class.
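To make the idea concrete, here is a minimal sketch of a dataset that resolves its filepath under a per-run temporary directory (the class name and its save/load behaviour are hypothetical, not part of the Kedro API):

```python
import os
import pickle
import tempfile


class TempPathDataset:
    """Hypothetical dataset that scopes its filepath to a per-run temp dir.

    Each run gets its own temporary directory, so two concurrent runs
    writing the same logical dataset name never overwrite each other.
    """

    def __init__(self, filename, run_dir=None):
        # Each run gets an isolated directory unless one is supplied.
        self._run_dir = run_dir or tempfile.mkdtemp(prefix="kedro_run_")
        self._filepath = os.path.join(self._run_dir, filename)

    def save(self, data):
        with open(self._filepath, "wb") as f:
            pickle.dump(data, f)

    def load(self):
        with open(self._filepath, "rb") as f:
            return pickle.load(f)


# Two "runs" using the same logical filename do not interfere:
run_a = TempPathDataset("model_input.pkl")
run_b = TempPathDataset("model_input.pkl")
run_a.save([1, 2, 3])
run_b.save([4, 5, 6])
```

A real implementation would hook this path resolution into each concrete AbstractDataset subclass, but the isolation mechanism would be the same.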

Possible Alternatives

My best guess right now is to simply use local catalogs with relative paths (except for the very first inputs), and copy/paste the Kedro project for each run I want to do.

@AlexandreOuellet AlexandreOuellet added the Issue: Feature Request New feature or improvement to existing feature label Nov 13, 2019
@WaylonWalker
Contributor

Do versioned datasets solve your problem? Results from both runners would exist, but only one would remain the latest.

https://kedro.readthedocs.io/en/latest/04_user_guide/04_data_catalog.html#versioning-datasets-and-ml-models
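For reference, versioning is enabled per catalog entry with `versioned: true`; the entry below is illustrative (dataset name, type and path are placeholders, and type names vary across Kedro versions):

```yaml
# conf/base/catalog.yml -- illustrative entry
model_output:
  type: CSVLocalDataSet   # type name depends on your Kedro version
  filepath: data/07_model_output/predictions.csv
  versioned: true
```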

@AlexandreOuellet
Author

Hmm... I don't think it would work in our case. The data is essentially volatile: we enrich the input data from various sources, join it with other datasets, create intermediary data, pass it to a model, and output an answer, but after that we no longer care about any intermediary data, only the model's answer.

@dasturge

I've been considering some related issues myself.

If you have, say, a fixed number of shards between 5 and 20, I would keep it in the current Kedro workflow:

from kedro.pipeline import Pipeline, node


def create_pipeline():
    pipelines = []
    preprocessing_endpoints = []
    for idx in range(20):
        output_name = "out_preprocessed_shard_%s" % idx
        pipelines.append(Pipeline([
            node(preprocess,
                 ["input_dataset_shard_%s" % idx],
                 [output_name])
        ]))
        preprocessing_endpoints.append(output_name)

    pipeline = Pipeline(
        pipelines + [node(merge, preprocessing_endpoints, "output")]
    )

    return pipeline

Using this pattern you can quickly generate a parallel workflow as one graph. It may require a large catalog, but you can easily use pyyaml to construct it. If the number is really large, or if it needs to be generated dynamically, you might move towards dask; since you say the intermediates are "volatile", you can rely on dask internally for parallelisation and out-of-memory computation, and keep intermediates as MemoryDataSets if you want.
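Generating the catalog programmatically can look like this (dataset type and paths are illustrative; in practice you would write the resulting dict out with pyyaml's `yaml.safe_dump` into `conf/base/catalog.yml`):

```python
# Sketch: programmatically generate catalog entries for N input shards.
def build_shard_catalog(n_shards):
    catalog = {}
    for idx in range(n_shards):
        catalog["input_dataset_shard_%s" % idx] = {
            "type": "pandas.CSVDataSet",               # illustrative type
            "filepath": "data/01_raw/shard_%s.csv" % idx,
        }
        # Preprocessed shards are deliberately left out of the catalog:
        # unlisted datasets stay in memory (MemoryDataSet).
    return catalog


catalog = build_shard_catalog(20)
print(len(catalog))  # 20 input entries
```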

@lorenabalan
Contributor

Coming back here to check if this could be a modular pipelines use case.
To be honest, I can't think of a way to do this without versioning, or without using MemoryDataSets for intermediary datasets so there is no overwriting. If you leave them in memory, it becomes an exercise in composing a DAG with embarrassingly parallel branches, and @dasturge's suggestion above is a great guide! I believe this has been simplified slightly with modular pipelines (particularly if you have more complex pipelines you need to reuse), so I've included an example below.

from kedro.pipeline import Pipeline, pipeline   # `pipeline` available from kedro 0.16.*


prediction_pipeline = Pipeline([...])  # input to the pipeline is some "input_dataset"

prediction_pipeline_1 = pipeline(prediction_pipeline, inputs={"input_dataset": "input_dataset"}, namespace="run1")
prediction_pipeline_2 = pipeline(prediction_pipeline, inputs={"input_dataset": "input_dataset"}, namespace="run2")
...
prediction_pipeline_x = pipeline(prediction_pipeline, inputs={"input_dataset": "input_dataset"}, namespace="runx")

^ your parallel branches, which you can bundle together like:

master_pipeline = prediction_pipeline_1 + prediction_pipeline_2 + ...

OR

# potentially even with a node to join all outputs together
master_pipeline = Pipeline([prediction_pipeline_1, prediction_pipeline_2, ..., node(...)... ])

Obviously the above can be restructured into a list comprehension, but the idea is:

  • you have a pipeline you need to run x times, preferably in parallel, based on the same input
  • namespacing them means dataset names that are not specified under inputs or outputs will be prefixed (run1.original_name) so the pipeline won't complain that you're writing to the same catalog entry
  • if you don't specify the new prefixed names in the catalog, they'll reside in memory -> happy days for intermediary datasets that you don't care about!
  • however if you want to persist each individual output, you would need run1.output_dataset, run2.output_dataset etc. in the catalog.yml.
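For the last point, the persisted entries would look something like this (dataset type and paths are illustrative):

```yaml
# conf/base/catalog.yml -- illustrative entries for persisted per-run outputs
run1.output_dataset:
  type: pandas.CSVDataSet
  filepath: data/07_model_output/run1_output.csv

run2.output_dataset:
  type: pandas.CSVDataSet
  filepath: data/07_model_output/run2_output.csv
```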

@limdauto
Contributor

limdauto commented Jun 1, 2020

Another way to look at this is related to #382

If users can write extensions so that they can easily add a new parameter such as --batch to the run command:

$ kedro run --batch=<batch_id>

then they can launch multiple scripts in the background, each running one particular batch.
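A minimal sketch of what such an extension could do with the batch id (all names here are hypothetical, not Kedro API): rewrite dataset filepaths so each background run writes under its own per-batch directory.

```python
import argparse


def scope_catalog(catalog, batch_id):
    # Prefix every filepath with a per-batch directory so concurrent
    # runs never write to the same location.
    scoped = {}
    for name, entry in catalog.items():
        entry = dict(entry)
        if "filepath" in entry:
            entry["filepath"] = "data/batches/%s/%s" % (batch_id, entry["filepath"])
        scoped[name] = entry
    return scoped


parser = argparse.ArgumentParser()
parser.add_argument("--batch", required=True)
args = parser.parse_args(["--batch", "7"])  # simulating `--batch=7`

catalog = {"predictions": {"type": "pandas.CSVDataSet",
                           "filepath": "predictions.csv"}}
scoped = scope_catalog(catalog, args.batch)
print(scoped["predictions"]["filepath"])
# data/batches/7/predictions.csv
```

Each background process would then run with its own scoped catalog, isolating the batches from one another.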

@lorenabalan
Contributor

Closing this as answered / stale. Please feel free to raise a new issue if you encounter further problems.

5 participants