Consistent nodes execution order with `SequentialRunner` #1350

noklam · 2022-03-15T17:34:23Z

Description

I'm always frustrated when I execute an identical pipeline it can yield different results. The root cause of this issue is due to the fact that multiple combinations of nodes exists. Kedro only try to solve the DAGs by finding 1 possible solution, but it is not guaranteed to be the same.

One workaround is to specify input/output of nodes to make sure there is only 1 possible solution, but this is not ideal as users has to maintain arbitrary dummy variables.

What is missing in the DAGs?

Seed of random number generator. Consider a simple pipeline with 3 nodes:

In this pipeline, there are 2 possible execution order with SequentialRunner, 1. A->B->C, 2. B->A->C. Although there are no strong preference whether 1/2 is better, it is better to stick with one of them, as the output can be changed.

Context

In data science/machine learning pipeline, setting a seed to ensure reproducible result are very common, and currently there are no easy way to achieve this.

Possible Implementation

Ensure the resolved nodes are sorted so it always run in the same order with SequentialRunner.

Possible Alternatives

(Optional) Describe any alternative solutions or features you've considered.

The text was updated successfully, but these errors were encountered:

datajoely · 2022-03-15T17:46:52Z

So on this - we use the toposort~=1.5 external library, since Kedro was released there is now a stdlib https://docs.python.org/3/library/graphlib.html module that does the same thing. If we do look into making this deterministic, it might be a good opportunity to adopt this too.

noklam · 2022-03-15T21:17:57Z

Note: currently we are using a function called toposort, but actually there is a function call toposort_flatten will guarantee the node order

datajoely · 2022-03-16T08:42:12Z

@noklam awesome!

antonymilne · 2022-03-16T09:26:42Z

This is interesting - I'd never thought that individual node ordering would make any difference in practice so long as things are toposorted, even if that sort isn't deterministic.

@noklam just for my understanding, can you give a concrete example of something you might do where changing node execution order can alter the final output? Your diagram of nodes didn't come out correctly so it's a bit hard to understand.

datajoely · 2022-03-16T09:34:13Z

I'll just add users do ask for this maybe once every two months, I've even seen people introduce fake nodes to force ordering.

noklam · 2022-03-16T10:04:46Z

@AntonyMilneQB
This has to deal with the fact that the random number generator do have an specific order.

Imagine a simple pipeline like this

train_test_split_data-- \    
                         \   
                         train_model   
                         /   
                        /   
Model initialization--

Model initialization and train_test_split is 2 independent node, where it doesn't really matter who goes first. However, it is a common practice to set random seed to ensure the results are identical.

If we try to do kedro run and unfold the execution order, these are 2 possible scenraio.
Case 1

np.random.seed(2022)
model = model_initialization()
x_train, x_val, y_train, y_val = train_test_split(data)
train_model(x_train, x_val, y_train, y_val, model)

Case 2

np.random.seed(2022)
x_train, x_val, y_train, y_val = train_test_split(data)
model = model_initialization()
train_model(x_train, x_val, y_train, y_val, model)

Neither Case 1 / Case 2 is preferred, but one would preferred it is always Case1 or always Case 2 instead of bouncing between the 2.

antonymilne · 2022-03-16T11:01:45Z

@noklam Got it, thanks! I guess I usually pass random_seed into each function explicitly so didn't notice this before, but what you say makes perfect sense - it would be better if node execution order was completely deterministic.

deepyaman · 2022-03-30T18:56:10Z

Just seeing this. I'm not sure I like the idea of making any guarantees of deterministic node execution order, if people will start relying on it for anything beyond tests.

Guaranteed execution order only makes sense for SequentialRunner.
In a distributed/parallelized setting, any decision as to which node you run first could be made by a smart scheduler rather than user definition.

edit: Just to be clear, I'm fine with it being deterministic, but question making a guarantee that it would be deterministic. No code should depend on this. Trust me when I say, some users will find a way to depend on guaranteed execution order--I'm pretty sure I've seen requests for this sort of functionality--so I would go so far as to put a big warning to not depend on it lol.

noklam · 2022-03-31T15:58:52Z

The deterministic pipeline will work for SequentialRunner only indeed.

I would argue it is useful for the reproducible pipeline. My question would be what are the benefit of having random execution order with SequentialRunner?

Currently, the only way to achieve this is either

dummy input/output
Reset seed on every node

kingslee28 · 2022-06-09T07:55:54Z

I had exactly the same problem, that my machine learning pipeline returned different results with runs having different orders of nodes, some of which involved random seed settings.

A workaround solution is to pass a dummy I/O object across nodes, however, that is not a good solution in terms of coding design. (And what if there are many nodes invloved?)

I wonder if there is any follow up on this issue. Thank you!

noklam · 2022-06-09T09:14:29Z

@kingslee28 As mentioned above, there are 2 ways to go around this if this is a pressing issue.

dummy input/output as you have mentioned.
resetting seed on every node.
You can either set it in the node function directly or add a before_node_run hook so it always resets the seed.

Hopefully, the PR can fix this problem so we don't need these workarounds in the future. I am happy to pick up this one later in the sprint, I pretty much have the solution already but will need to add a test for it.

(p.s. I don't know how to properly rephrase the issue😅, I will keep use @deepyaman's word as Guaranteed execution order only makes sense for SequentialRunner since deterministic kedro pipeline is a bit over-promising.)

noklam · 2022-07-08T10:58:40Z

#Close by #1604
@kingslee28

noklam added the Issue: Feature Request New feature or improvement to existing feature label Mar 15, 2022

noklam changed the title ~~Deterministic kedro pipeline~~ Consistent nodes execution order with SequentialRunner Jun 9, 2022

noklam mentioned this issue Jun 9, 2022

Consistent node execution order by sorting node with Sequentialrunner #1604

Merged

5 tasks

noklam closed this as completed Jul 8, 2022

noklam linked a pull request Jul 8, 2022 that will close this issue

Consistent node execution order by sorting node with Sequentialrunner #1604

Merged

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Consistent nodes execution order with `SequentialRunner` #1350

Consistent nodes execution order with `SequentialRunner` #1350

noklam commented Mar 15, 2022 •

edited

Loading

datajoely commented Mar 15, 2022 •

edited

Loading

noklam commented Mar 15, 2022

datajoely commented Mar 16, 2022

antonymilne commented Mar 16, 2022

datajoely commented Mar 16, 2022

noklam commented Mar 16, 2022 •

edited

Loading

antonymilne commented Mar 16, 2022 •

edited

Loading

deepyaman commented Mar 30, 2022 •

edited

Loading

noklam commented Mar 31, 2022

kingslee28 commented Jun 9, 2022

noklam commented Jun 9, 2022

noklam commented Jul 8, 2022

Consistent nodes execution order with SequentialRunner #1350

Consistent nodes execution order with SequentialRunner #1350

Comments

noklam commented Mar 15, 2022 • edited Loading

Description

What is missing in the DAGs?

Context

Possible Implementation

Possible Alternatives

datajoely commented Mar 15, 2022 • edited Loading

noklam commented Mar 15, 2022

datajoely commented Mar 16, 2022

antonymilne commented Mar 16, 2022

datajoely commented Mar 16, 2022

noklam commented Mar 16, 2022 • edited Loading

antonymilne commented Mar 16, 2022 • edited Loading

deepyaman commented Mar 30, 2022 • edited Loading

noklam commented Mar 31, 2022

kingslee28 commented Jun 9, 2022

noklam commented Jun 9, 2022

noklam commented Jul 8, 2022

Consistent nodes execution order with `SequentialRunner` #1350

Consistent nodes execution order with `SequentialRunner` #1350

noklam commented Mar 15, 2022 •

edited

Loading

datajoely commented Mar 15, 2022 •

edited

Loading

noklam commented Mar 16, 2022 •

edited

Loading

antonymilne commented Mar 16, 2022 •

edited

Loading

deepyaman commented Mar 30, 2022 •

edited

Loading