-
Notifications
You must be signed in to change notification settings - Fork 925
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Consistent nodes execution order with SequentialRunner
#1350
Comments
So on this - we use the |
Note: currently we are using a function called |
@noklam awesome! |
This is interesting - I'd never thought that individual node ordering would make any difference in practice so long as things are toposorted, even if that sort isn't deterministic. @noklam just for my understanding, can you give a concrete example of something you might do where changing node execution order can alter the final output? Your diagram of nodes didn't come out correctly so it's a bit hard to understand. |
I'll just add users do ask for this maybe once every two months, I've even seen people introduce fake nodes to force ordering. |
@AntonyMilneQB Imagine a simple pipeline like this
Model initialization and train_test_split is 2 independent node, where it doesn't really matter who goes first. However, it is a common practice to set random seed to ensure the results are identical. If we try to do kedro run and unfold the execution order, these are 2 possible scenraio.
Case 2
Neither Case 1 / Case 2 is preferred, but one would preferred it is always Case1 or always Case 2 instead of bouncing between the 2. |
@noklam Got it, thanks! I guess I usually pass |
Just seeing this. I'm not sure I like the idea of making any guarantees of deterministic node execution order, if people will start relying on it for anything beyond tests.
edit: Just to be clear, I'm fine with it being deterministic, but question making a guarantee that it would be deterministic. No code should depend on this. Trust me when I say, some users will find a way to depend on guaranteed execution order--I'm pretty sure I've seen requests for this sort of functionality--so I would go so far as to put a big warning to not depend on it lol. |
The deterministic pipeline will work for I would argue it is useful for the reproducible pipeline. My question would be what are the benefit of having random execution order with Currently, the only way to achieve this is either
|
I had exactly the same problem, that my machine learning pipeline returned different results with runs having different orders of nodes, some of which involved random seed settings. A workaround solution is to pass a dummy I/O object across nodes, however, that is not a good solution in terms of coding design. (And what if there are many nodes invloved?) I wonder if there is any follow up on this issue. Thank you! |
@kingslee28 As mentioned above, there are 2 ways to go around this if this is a pressing issue.
Hopefully, the PR can fix this problem so we don't need these workarounds in the future. I am happy to pick up this one later in the sprint, I pretty much have the solution already but will need to add a test for it. (p.s. I don't know how to properly rephrase the issue😅, I will keep use @deepyaman's word as |
SequentialRunner
#Close by #1604 |
Description
I'm always frustrated when I execute an identical pipeline it can yield different results. The root cause of this issue is due to the fact that multiple combinations of nodes exists. Kedro only try to solve the DAGs by finding 1 possible solution, but it is not guaranteed to be the same.
One workaround is to specify input/output of nodes to make sure there is only 1 possible solution, but this is not ideal as users has to maintain arbitrary dummy variables.
What is missing in the DAGs?
Seed of random number generator. Consider a simple pipeline with 3 nodes:
In this pipeline, there are 2 possible execution order with SequentialRunner, 1. A->B->C, 2. B->A->C. Although there are no strong preference whether 1/2 is better, it is better to stick with one of them, as the output can be changed.
Context
In data science/machine learning pipeline, setting a seed to ensure reproducible result are very common, and currently there are no easy way to achieve this.
Possible Implementation
Ensure the resolved nodes are sorted so it always run in the same order with SequentialRunner.
Possible Alternatives
(Optional) Describe any alternative solutions or features you've considered.
The text was updated successfully, but these errors were encountered: