
Stateless run in kedro #162

Closed
AlexandreOuellet opened this issue Nov 13, 2019 · 6 comments
Labels
Issue: Feature Request New feature or improvement to existing feature

Comments

@AlexandreOuellet

Description

I would like Kedro to be able to run in parallel, without runs interfering with each other. Since each node uses inputs and outputs as absolute paths, two runs in parallel would overwrite each other's outputs.

Context

This is very useful when we want to run multiple Kedro batches at the same time (say, for parallelising large predictions, or even for streaming single predictions).

Possible Implementation

I am not certain that this is feasible in Kedro right now (or ever), given the scope of this tool. My best guess at a proper implementation would be an option for temporary paths in AbstractDataset, implemented by each concrete dataset class.
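To make the idea concrete, here is a minimal sketch of a dataset that resolves its filepath under a per-run temporary directory (the class name and its save/load behaviour are hypothetical, not part of the Kedro API):

```python
import os
import pickle
import tempfile


class TempPathDataset:
    """Hypothetical dataset that scopes its filepath to a per-run temp dir.

    Each run gets its own temporary directory, so two concurrent runs
    writing the same logical dataset name never overwrite each other.
    """

    def __init__(self, filename, run_dir=None):
        # Each run gets an isolated directory unless one is supplied.
        self._run_dir = run_dir or tempfile.mkdtemp(prefix="kedro_run_")
        self._filepath = os.path.join(self._run_dir, filename)

    def save(self, data):
        with open(self._filepath, "wb") as f:
            pickle.dump(data, f)

    def load(self):
        with open(self._filepath, "rb") as f:
            return pickle.load(f)


# Two "runs" using the same logical filename do not interfere:
run_a = TempPathDataset("model_input.pkl")
run_b = TempPathDataset("model_input.pkl")
run_a.save([1, 2, 3])
run_b.save([4, 5, 6])
```

A real implementation would hook this path resolution into each concrete AbstractDataset subclass, but the isolation mechanism would be the same.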

Possible Alternatives

My best guess right now is to simply use local catalogs with relative paths (except for the very first inputs), and copy/paste the Kedro project for each run I want to do.

@AlexandreOuellet AlexandreOuellet added the Issue: Feature Request New feature or improvement to existing feature label Nov 13, 2019
@WaylonWalker
Contributor

Do versioned datasets solve your problem? Results from both runners would exist, but only one would remain the latest.

https://kedro.readthedocs.io/en/latest/04_user_guide/04_data_catalog.html#versioning-datasets-and-ml-models
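For reference, versioning is enabled per catalog entry with `versioned: true`; the entry below is illustrative (dataset name, type and path are placeholders, and type names vary across Kedro versions):

```yaml
# conf/base/catalog.yml -- illustrative entry
model_output:
  type: CSVLocalDataSet   # type name depends on your Kedro version
  filepath: data/07_model_output/predictions.csv
  versioned: true
```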

@AlexandreOuellet
Author

Hmm... I don't think it would work in our case. The data is essentially volatile: we enrich the input data from various sources, join it with other datasets, create intermediary data, pass it to a model, and output an answer, but after that we no longer care about any intermediary data, only the model's answer.

@dasturge

I've been considering some related issues myself.

If you have, say, a fixed number of shards between 5 and 20, I would keep it in the current Kedro workflow:

from kedro.pipeline import Pipeline, node


def create_pipeline():
    pipelines = []
    preprocessing_endpoints = []
    for idx in range(20):
        output_name = "out_preprocessed_shard_%s" % idx
        pipelines.append(Pipeline([
            node(preprocess,
                 ["input_dataset_shard_%s" % idx],
                 [output_name])
        ]))
        preprocessing_endpoints.append(output_name)

    pipeline = Pipeline(
        pipelines + [node(merge, preprocessing_endpoints, "output")]
    )

    return pipeline

Using this pattern you can quickly generate a parallel workflow as one graph. It may require a large catalog, but you can easily use pyyaml to construct it. If the number is really large, or if it needs to be generated dynamically, you might move towards dask; since you say the intermediates are "volatile", you can rely on dask internally for parallelisation and out-of-memory computation, and keep intermediates as MemoryDataSets if you want.
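Generating the catalog programmatically can look like this (dataset type and paths are illustrative; in practice you would write the resulting dict out with pyyaml's `yaml.safe_dump` into `conf/base/catalog.yml`):

```python
# Sketch: programmatically generate catalog entries for N input shards.
def build_shard_catalog(n_shards):
    catalog = {}
    for idx in range(n_shards):
        catalog["input_dataset_shard_%s" % idx] = {
            "type": "pandas.CSVDataSet",               # illustrative type
            "filepath": "data/01_raw/shard_%s.csv" % idx,
        }
        # Preprocessed shards are deliberately left out of the catalog:
        # unlisted datasets stay in memory (MemoryDataSet).
    return catalog


catalog = build_shard_catalog(20)
print(len(catalog))  # 20 input entries
```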

@lorenabalan
Contributor

Coming back here to check if this could be a modular pipelines use case.
To be honest, I can't think of a way to do this without versioning, or without using MemoryDataSets for intermediary datasets so there is no overwriting. If you leave them in memory, it becomes an exercise in composing a DAG with embarrassingly parallel branches, and @dasturge's suggestion above is a great guide! I believe this has been simplified slightly with modular pipelines (particularly if you have more complex pipelines you need to reuse), so I've included an example below.

from kedro.pipeline import Pipeline, pipeline   # `pipeline` available from kedro 0.16.*


prediction_pipeline = Pipeline([...])  # input to the pipeline is some "input_dataset"

prediction_pipeline_1 = pipeline(prediction_pipeline, inputs={"input_dataset": "input_dataset"}, namespace="run1")
prediction_pipeline_2 = pipeline(prediction_pipeline, inputs={"input_dataset": "input_dataset"}, namespace="run2")
...
prediction_pipeline_x = pipeline(prediction_pipeline, inputs={"input_dataset": "input_dataset"}, namespace="runx")

^ your parallel branches, which you can bundle together like:

master_pipeline = prediction_pipeline_1 + prediction_pipeline_2 + ...

OR

# potentially even with a node to join all outputs together
master_pipeline = Pipeline([prediction_pipeline_1, prediction_pipeline_2, ..., node(...)... ])

Obviously the above can be restructured into a list comprehension, but the idea is:

  • you have a pipeline you need to run x times, preferably in parallel, based on the same input
  • namespacing them means dataset names that are not specified under inputs or outputs will be prefixed (run1.original_name) so the pipeline won't complain that you're writing to the same catalog entry
  • if you don't specify the new prefixed names in the catalog, they'll reside in memory -> happy days for intermediary datasets that you don't care about!
  • however if you want to persist each individual output, you would need run1.output_dataset, run2.output_dataset etc. in the catalog.yml.
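For the last point, the persisted entries would look something like this (dataset type and paths are illustrative):

```yaml
# conf/base/catalog.yml -- illustrative entries for persisted per-run outputs
run1.output_dataset:
  type: pandas.CSVDataSet
  filepath: data/07_model_output/run1_output.csv

run2.output_dataset:
  type: pandas.CSVDataSet
  filepath: data/07_model_output/run2_output.csv
```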

@limdauto
Contributor

limdauto commented Jun 1, 2020

Another way to look at this is related to #382

If users can write extensions so that they can easily add a new parameter such as --batch to the run command:

$ kedro run --batch=<batch_id>

then they can launch multiple scripts in the background, each running one particular batch.
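A minimal sketch of what such an extension could do with the batch id (all names here are hypothetical, not Kedro API): rewrite dataset filepaths so each background run writes under its own per-batch directory.

```python
import argparse


def scope_catalog(catalog, batch_id):
    # Prefix every filepath with a per-batch directory so concurrent
    # runs never write to the same location.
    scoped = {}
    for name, entry in catalog.items():
        entry = dict(entry)
        if "filepath" in entry:
            entry["filepath"] = "data/batches/%s/%s" % (batch_id, entry["filepath"])
        scoped[name] = entry
    return scoped


parser = argparse.ArgumentParser()
parser.add_argument("--batch", required=True)
args = parser.parse_args(["--batch", "7"])  # simulating `--batch=7`

catalog = {"predictions": {"type": "pandas.CSVDataSet",
                           "filepath": "predictions.csv"}}
scoped = scope_catalog(catalog, args.batch)
print(scoped["predictions"]["filepath"])
# data/batches/7/predictions.csv
```

Each background process would then run with its own scoped catalog, isolating the batches from one another.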

@lorenabalan
Contributor

Closing this as answered / stale. Please feel free to raise a new issue if you encounter further problems.

5 participants