Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consistent nodes execution order with SequentialRunner #1350

Closed
noklam opened this issue Mar 15, 2022 · 12 comments · Fixed by #1604
Closed

Consistent nodes execution order with SequentialRunner #1350

noklam opened this issue Mar 15, 2022 · 12 comments · Fixed by #1604
Labels
Issue: Feature Request New feature or improvement to existing feature

Comments

@noklam
Copy link
Contributor

noklam commented Mar 15, 2022

Description

I'm always frustrated when I execute an identical pipeline it can yield different results. The root cause of this issue is due to the fact that multiple combinations of nodes exists. Kedro only try to solve the DAGs by finding 1 possible solution, but it is not guaranteed to be the same.

One workaround is to specify input/output of nodes to make sure there is only 1 possible solution, but this is not ideal as users has to maintain arbitrary dummy variables.

What is missing in the DAGs?

Seed of random number generator. Consider a simple pipeline with 3 nodes:

A-- \    
      \   
       C   
      /   
    /   
B--

In this pipeline, there are 2 possible execution order with SequentialRunner, 1. A->B->C, 2. B->A->C. Although there are no strong preference whether 1/2 is better, it is better to stick with one of them, as the output can be changed.

Context

In data science/machine learning pipeline, setting a seed to ensure reproducible result are very common, and currently there are no easy way to achieve this.

Possible Implementation

Ensure the resolved nodes are sorted so it always run in the same order with SequentialRunner.

Possible Alternatives

(Optional) Describe any alternative solutions or features you've considered.

@noklam noklam added the Issue: Feature Request New feature or improvement to existing feature label Mar 15, 2022
@datajoely
Copy link
Contributor

datajoely commented Mar 15, 2022

So on this - we use the toposort~=1.5 external library, since Kedro was released there is now a stdlib https://docs.python.org/3/library/graphlib.html module that does the same thing. If we do look into making this deterministic, it might be a good opportunity to adopt this too.

@noklam
Copy link
Contributor Author

noklam commented Mar 15, 2022

Note: currently we are using a function called toposort, but actually there is a function call toposort_flatten will guarantee the node order

@datajoely
Copy link
Contributor

@noklam awesome!

@antonymilne
Copy link
Contributor

This is interesting - I'd never thought that individual node ordering would make any difference in practice so long as things are toposorted, even if that sort isn't deterministic.

@noklam just for my understanding, can you give a concrete example of something you might do where changing node execution order can alter the final output? Your diagram of nodes didn't come out correctly so it's a bit hard to understand.

@datajoely
Copy link
Contributor

I'll just add users do ask for this maybe once every two months, I've even seen people introduce fake nodes to force ordering.

@noklam
Copy link
Contributor Author

noklam commented Mar 16, 2022

@AntonyMilneQB
This has to deal with the fact that the random number generator do have an specific order.

Imagine a simple pipeline like this

train_test_split_data-- \    
                         \   
                         train_model   
                         /   
                        /   
Model initialization--

Model initialization and train_test_split is 2 independent node, where it doesn't really matter who goes first. However, it is a common practice to set random seed to ensure the results are identical.

If we try to do kedro run and unfold the execution order, these are 2 possible scenraio.
Case 1

np.random.seed(2022)
model = model_initialization()
x_train, x_val, y_train, y_val = train_test_split(data)
train_model(x_train, x_val, y_train, y_val, model)

Case 2

np.random.seed(2022)
x_train, x_val, y_train, y_val = train_test_split(data)
model = model_initialization()
train_model(x_train, x_val, y_train, y_val, model)

Neither Case 1 / Case 2 is preferred, but one would preferred it is always Case1 or always Case 2 instead of bouncing between the 2.

@antonymilne
Copy link
Contributor

antonymilne commented Mar 16, 2022

@noklam Got it, thanks! I guess I usually pass random_seed into each function explicitly so didn't notice this before, but what you say makes perfect sense - it would be better if node execution order was completely deterministic.

@deepyaman
Copy link
Member

deepyaman commented Mar 30, 2022

Just seeing this. I'm not sure I like the idea of making any guarantees of deterministic node execution order, if people will start relying on it for anything beyond tests.

  1. Guaranteed execution order only makes sense for SequentialRunner.
  2. In a distributed/parallelized setting, any decision as to which node you run first could be made by a smart scheduler rather than user definition.

edit: Just to be clear, I'm fine with it being deterministic, but question making a guarantee that it would be deterministic. No code should depend on this. Trust me when I say, some users will find a way to depend on guaranteed execution order--I'm pretty sure I've seen requests for this sort of functionality--so I would go so far as to put a big warning to not depend on it lol.

@noklam
Copy link
Contributor Author

noklam commented Mar 31, 2022

The deterministic pipeline will work for SequentialRunner only indeed.

I would argue it is useful for the reproducible pipeline. My question would be what are the benefit of having random execution order with SequentialRunner?

Currently, the only way to achieve this is either

  1. dummy input/output
  2. Reset seed on every node

@kingslee28
Copy link

I had exactly the same problem, that my machine learning pipeline returned different results with runs having different orders of nodes, some of which involved random seed settings.

A workaround solution is to pass a dummy I/O object across nodes, however, that is not a good solution in terms of coding design. (And what if there are many nodes invloved?)

I wonder if there is any follow up on this issue. Thank you!

@noklam
Copy link
Contributor Author

noklam commented Jun 9, 2022

@kingslee28 As mentioned above, there are 2 ways to go around this if this is a pressing issue.

  1. dummy input/output as you have mentioned.
  2. resetting seed on every node.
    You can either set it in the node function directly or add a before_node_run hook so it always resets the seed.

Hopefully, the PR can fix this problem so we don't need these workarounds in the future. I am happy to pick up this one later in the sprint, I pretty much have the solution already but will need to add a test for it.

(p.s. I don't know how to properly rephrase the issue😅, I will keep use @deepyaman's word as Guaranteed execution order only makes sense for SequentialRunner since deterministic kedro pipeline is a bit over-promising.)

@noklam noklam changed the title Deterministic kedro pipeline Consistent nodes execution order with SequentialRunner Jun 9, 2022
@noklam
Copy link
Contributor Author

noklam commented Jul 8, 2022

#Close by #1604
@kingslee28

@noklam noklam closed this as completed Jul 8, 2022
@noklam noklam linked a pull request Jul 8, 2022 that will close this issue
5 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Issue: Feature Request New feature or improvement to existing feature
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants