optimize cycle detection logic in dag #3539
The following is the coverage report on the affected files.
30a4f39 to 5966ab2
Some comments to address but I think the change is good!
Tasks: v1beta1.PipelineTaskList{a, bDependsOnA, cRunsAfterA, dDependsOnBAndC, eRunsAfterD, fRunsAfterD, gDependsOnF},
},
}
if _, err := dag.Build(v1beta1.PipelineTaskList(p.Spec.Tasks)); err == nil {
I think we might want to check that the error message includes the cycle we're expecting? Otherwise we could get into a situation like the one above, where the error is of a different type than we want.
_, err := dag.Build(v1beta1.PipelineTaskList(p.Spec.Tasks))
if err == nil || !strings.Contains(err.Error(), "e -> a -> b -> d -> e") {
t.Errorf("expected cycle error but got %v", err)
}
Hey @sbwsg, there are multiple possible paths leading to a cycle. It all depends on the order in which the dependency map is processed.
This test can fail with one of these paths:
if err == nil || !(strings.Contains(err.Error(), "e -> a -> b -> d -> e") ||
strings.Contains(err.Error(), "a -> c -> b -> d -> e -> a") ||
strings.Contains(err.Error(), "d -> e -> a -> b -> d")) {
t.Errorf("expected cycle error but got %v", err)
}
I can replace the existing error checking with the above check if we decide to keep this test.
Running the unit test multiple times with -count=1
leads to many different combinations of paths, including:
"a -> c -> b -> d -> e -> a",
"a -> b -> c -> d -> e -> a",
"a -> b -> d -> e -> a",
"a -> c -> d -> e -> a",
"b -> d -> e -> a -> b",
"c -> d -> e -> a -> c",
I walked through the graph one more time and there are many more combinations possible. I think it's fine to leave it as just "cycle detected" with e -> a or a -> e
instead of looking for a specific path in the unit test.
Great point, the order of traversal isn't stable across runs because the nodes are stored in a hash map. Checking for "cycle detected"
makes sense 👍
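To illustrate the point about traversal order: Go does not specify map iteration order (and the runtime deliberately varies it), so a depth-first search that ranges over a map directly can report a different cycle path on each run. The sketch below uses a hypothetical simplified model (a map from each task name to the names of the tasks it runs after, not Tekton's dag.Graph) and sorts names before visiting them, so the same graph always yields the same path. It is one possible way to make the message stable, not what this PR does; the thread settles on checking for "cycle detected" instead.

```go
package main

import (
	"sort"
	"strings"
)

// sortedKeys returns the map's keys in lexical order, so callers do not
// depend on Go's unspecified map iteration order.
func sortedKeys(m map[string][]string) []string {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

// findCycle runs a depth-first search over a hypothetical dependency map
// (task -> tasks it runs after), visiting nodes and parents in sorted order.
// It returns the first cycle found as "x -> y -> ... -> x", where each arrow
// points from a task to a task it runs after, or "" if the graph is acyclic.
func findCycle(deps map[string][]string) string {
	const (
		unvisited = iota
		inProgress
		done
	)
	state := map[string]int{}
	var stack []string

	var dfs func(n string) []string
	dfs = func(n string) []string {
		state[n] = inProgress
		stack = append(stack, n)
		parents := append([]string(nil), deps[n]...)
		sort.Strings(parents)
		for _, p := range parents {
			switch state[p] {
			case inProgress:
				// p is still on the DFS stack, so stack[i:] + p closes a loop.
				for i, s := range stack {
					if s == p {
						cycle := append([]string(nil), stack[i:]...)
						return append(cycle, p)
					}
				}
			case unvisited:
				if c := dfs(p); c != nil {
					return c
				}
			}
		}
		stack = stack[:len(stack)-1]
		state[n] = done
		return nil
	}

	for _, n := range sortedKeys(deps) {
		if state[n] == unvisited {
			if c := dfs(n); c != nil {
				return strings.Join(c, " -> ")
			}
		}
	}
	return ""
}
```

For example, with a running after e, b and c after a, d after b and c, and e after d, every call reports "a -> e -> d -> b -> a".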
5966ab2 to 5b2c3c9
Thanks for this! I didn't manage to review it yet; I'll do it tomorrow 🙏 I wonder if in the long term we could consider using something like https://godoc.org/github.com/twmb/algoimpl/go/graph for the DAG logic? It already detects cycles via strongly connected components, at the cost of an extra dependency of course :)
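For context on the suggestion above: a strongly connected component (SCC) is a maximal set of nodes that can all reach each other, so a directed graph is acyclic exactly when every SCC is a single node with no self-loop. The following is a minimal hand-written sketch of Tarjan's SCC algorithm to show the idea; it is not the algoimpl API, and the function names are hypothetical.

```go
package main

// stronglyConnected returns the strongly connected components of a directed
// graph using Tarjan's algorithm. edges[v] lists the nodes that v points to.
func stronglyConnected(nodes []string, edges map[string][]string) [][]string {
	index := map[string]int{} // discovery order of each node
	low := map[string]int{}   // smallest index reachable that is still on the stack
	onStack := map[string]bool{}
	var stack []string
	var sccs [][]string
	next := 0

	var strongConnect func(v string)
	strongConnect = func(v string) {
		index[v], low[v] = next, next
		next++
		stack = append(stack, v)
		onStack[v] = true
		for _, w := range edges[v] {
			if _, seen := index[w]; !seen {
				strongConnect(w)
				low[v] = minInt(low[v], low[w])
			} else if onStack[w] {
				low[v] = minInt(low[v], index[w])
			}
		}
		// v is the root of an SCC: pop the whole component off the stack.
		if low[v] == index[v] {
			var comp []string
			for {
				w := stack[len(stack)-1]
				stack = stack[:len(stack)-1]
				onStack[w] = false
				comp = append(comp, w)
				if w == v {
					break
				}
			}
			sccs = append(sccs, comp)
		}
	}

	for _, v := range nodes {
		if _, seen := index[v]; !seen {
			strongConnect(v)
		}
	}
	return sccs
}

func minInt(a, b int) int {
	if a < b {
		return a
	}
	return b
}

// hasCycle reports whether some SCC has more than one node, or a node
// has an edge to itself.
func hasCycle(nodes []string, edges map[string][]string) bool {
	for _, comp := range stronglyConnected(nodes, edges) {
		if len(comp) > 1 {
			return true
		}
		for _, w := range edges[comp[0]] {
			if w == comp[0] {
				return true
			}
		}
	}
	return false
}
```

A side benefit of this approach is that each multi-node SCC names every task involved in a cycle, independent of traversal order.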
Hey @pritidesai!! Nice catch, it looks like this one has been around for a while :D It took me a while to wrap my head around the changes here, and I had to do some investigation to understand them, so I want to try to explain them; can you confirm whether this is how you are seeing it as well? To understand what's happening, I had to look back at a1f2733, when the visited list was updated to do that concatenation. Here is my understanding of the problem that was being fixed at that time:
While (2) and (3) were happening, a "visited" list was being built up. So a1f2733 attempted to fix this by changing the keys in visited.
BUT! As @pritidesai pointed out, we may ADD the concatenated keys, but WE DON'T USE THEM WHEN CHECKING KEYS. I think this means that a1f2733 effectively made us ignore all the entries we were adding to the list. And it works! Which is why in this PR @pritidesai can remove these additions to the visited list: we didn't need them in the first place. And the REASON we don't need them is that the actual question we are trying to ask when we call
So in fact I think we could simplify
Something like:
I don't think we actually need this new test (maybe not the other one either). TestBuild_InvalidDAGWithCycle passes even without the changes (the error message is different, but a cycle is still detected), and it looks like TestBuild_Invalid is already testing this case: there are 3 different versions of cycles (a has to run after z, but z runs after x, which runs after a, I think?). It could definitely use a comment depicting the graph, and maybe a better name. That's because, as I understand it, this PR is not actually changing any functionality but optimizing it (we're no longer storing the concatenated keys, which were being ignored anyway).
I have mixed feelings! On the one hand, this DAG logic is at the heart of our Pipeline execution, so it feels weird to entrust it to an external lib; on the other hand, we had this inefficiency for nearly 2 years and didn't notice (on the other other hand, it didn't actually cause any bugs). Might be interesting to try swapping it out and see what it's like?
Yes @bobcatfish, you absolutely nailed it.
I agree, it is a bit obscured: before adding any new link, it's actually checking if this node (
Yup, since there is no functionality change 😄 This test validates a slightly more complex graph than the existing invalid test:
TestBuild_Invalid validates one single cycle in three different forms (combinations of runAfter and from), but from the
I generally avoid third-party dependencies for the same reasons you mentioned, especially for the heart of the pipeline execution.
5b2c3c9 to 56b6174
56b6174 to a730024
That's an interesting thought, because I would have gone the other way. The DAG is not something specific to Tekton; the way we build it and use it is. So I would think using an external library that does the "dag" well would be good enough for us and would allow us to add value where we need to. But I don't have a strong opinion on this 🙃
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: sbwsg
In order to make it clear to future maintainers which tests cover what, what do you think about renaming these tests to make the line between them clearer? And/or maybe even combining them into one table-driven test?
Speaking of tests, this is a bit of scope creep, but why does dagv1beta1_test.go duplicate all the tests in dag_test.go? Is there some way we can avoid having to maintain this duplication?
a730024 to 511c92c
I have renamed the test.
1f87a89 to 55c66f8
While creating a separate cleanup PR, I will rearrange the test functions further, e.g. putting local functions towards the end of the file, rearrange
@bobcatfish if the build passes, it's ready for review 🙏
/test pull-tekton-pipeline-integration-tests
Yeah I think so 😉
I think we are almost there! Thanks for your patience!
TestBuild_Invalid tests the DAG for other failures as well, e.g. duplicate task names, an invalid task name in a dependency, etc. Have renamed the test to TestBuild_Invalid.
I'd like to give one last plug for having one test for all of these cases, or for removing the invalid DAG cases from the existing function, i.e. one place to find all the invalid graphs. It doesn't seem to me like we're gaining much from the 2 separate tests at the moment, since the cases they cover overlap. So I'd much prefer to see one test that covers all the cases, or a clear line between which cases each of the 2 tests covers.
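As a rough illustration of the single table-driven test suggested here, the sketch below puts duplicate names, an unknown dependency, and two cycles into one table. Everything in it is hypothetical: task and build are simplified stand-ins for v1beta1.PipelineTask and dag.Build, and the error strings are invented apart from the "cycle detected" substring discussed earlier. In a real _test.go file, the loop in checkInvalidCases would use t.Run and t.Errorf instead of collecting strings.

```go
package main

import (
	"fmt"
	"strings"
)

// task is a hypothetical stand-in for a pipeline task: a name plus the
// names of the tasks it runs after.
type task struct {
	name  string
	after []string
}

// build is a hypothetical validator for the invalid-DAG cases named in the
// review: duplicate task names, unknown dependencies, and cycles.
func build(tasks []task) error {
	deps := map[string][]string{}
	for _, t := range tasks {
		if _, dup := deps[t.name]; dup {
			return fmt.Errorf("duplicate task name %q", t.name)
		}
		deps[t.name] = t.after
	}
	for _, t := range tasks {
		for _, p := range t.after {
			if _, ok := deps[p]; !ok {
				return fmt.Errorf("task %q depends on unknown task %q", t.name, p)
			}
		}
	}
	// Depth-first search: reaching a node that is still in progress means a cycle.
	state := map[string]int{} // 0 = unvisited, 1 = in progress, 2 = done
	var visit func(n string) bool
	visit = func(n string) bool {
		state[n] = 1
		for _, p := range deps[n] {
			if state[p] == 1 || (state[p] == 0 && visit(p)) {
				return true
			}
		}
		state[n] = 2
		return false
	}
	for _, t := range tasks {
		if state[t.name] == 0 && visit(t.name) {
			return fmt.Errorf("cycle detected involving %q", t.name)
		}
	}
	return nil
}

// invalidCases is one table holding every invalid graph.
var invalidCases = []struct {
	desc    string
	tasks   []task
	wantErr string
}{
	{"duplicate task names", []task{{name: "a"}, {name: "a"}}, "duplicate task name"},
	{"unknown dependency", []task{{name: "a", after: []string{"z"}}}, "unknown task"},
	{"self cycle", []task{{name: "a", after: []string{"a"}}}, "cycle detected"},
	{"longer cycle", []task{
		{name: "a", after: []string{"e"}},
		{name: "b", after: []string{"a"}},
		{name: "d", after: []string{"b"}},
		{name: "e", after: []string{"d"}},
	}, "cycle detected"},
}

// checkInvalidCases returns one message per case whose error did not match.
func checkInvalidCases() []string {
	var failures []string
	for _, tc := range invalidCases {
		err := build(tc.tasks)
		if err == nil || !strings.Contains(err.Error(), tc.wantErr) {
			failures = append(failures, fmt.Sprintf("%s: got %v", tc.desc, err))
		}
	}
	return failures
}
```

Keeping every invalid graph in one table gives the "one place to find all the invalid graphs" asked for above, and a new case is a single extra row.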
55c66f8 to 17f6b48
Definitely, have merged them all into one. Also, if cleanup PR #3556 is merged before this, I will update this PR to drop changes to
Hey, yeah, that was my thinking exactly, but I don't have a strong feeling about this either
Thanks @pritidesai! Looks good to me :D Even though you'll need it again for the conflict, a symbolic lgtm: /lgtm
The visited map holds the names of the nodes being visited as keys and "true" as values. This map is updated with currentName.HashKey on every iteration, which results in keys such as a.d and b.d, where a, b, and d are nodes of a DAG. But such keys are not used anywhere in the visit function: visit checks for the existence of a node just by its name, without any string concatenation. This extra addition to the map causes a severe delay for graphs with more than 60 nodes.
17f6b48 to 0b4864b
/lgtm
/test pull-tekton-pipeline-integration-tests
Changes
The visited map holds the names of the nodes being visited as keys and true as values. This map is updated with currentName.HashKey on every iteration, which results in keys such as a.d and b.d, where a, b, and d are nodes of a dag. But such keys are not used anywhere in the same visit function: visit checks for the existence of a node just by its name, without any string concatenation.
This extra addition to the map causes a severe delay for a graph with more than 60 nodes with distinct dependencies.
The visit function is called on line 131 with all the parents of the existing pipelineTask to check whether any of the parents exist in the visited map. visit is then called recursively on each parent's parents, and so on, to see whether any of them exist in the visited map, which is how a cycle is detected.
/kind bug
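The recursive check described above can be sketched with a hypothetical simplified model: parents maps each task name to the tasks it runs after, and reachesAncestor walks parent edges from start to see whether target is among its ancestors. If it is, then making target run after start would close a cycle. Lookups in visited use plain names only. The addConcatKeys flag reproduces the extra "node.parent" writes that the PR removes; assuming task names never contain ".", those keys can never match a lookup, so they leave the result unchanged and only make the map larger. The names here are illustrative, not Tekton's actual code.

```go
package main

// reachesAncestor reports whether target is start itself or one of its
// ancestors, walking parent edges depth-first. visited is keyed by plain
// node names. When addConcatKeys is true, it also writes "node.parent" keys,
// mimicking the pre-fix behaviour; those keys are never read.
func reachesAncestor(start, target string, parents map[string][]string,
	visited map[string]bool, addConcatKeys bool) bool {
	if start == target {
		return true
	}
	if visited[start] {
		return false // already explored from this node in this query
	}
	visited[start] = true
	for _, p := range parents[start] {
		if addConcatKeys {
			visited[start+"."+p] = true // written, but no lookup ever uses it
		}
		if reachesAncestor(p, target, parents, visited, addConcatKeys) {
			return true
		}
	}
	return false
}
```

On a diamond where b and c run after a and d runs after b and c, asking whether "x" is an ancestor of "d" returns false either way, but leaves 4 entries in visited without the extra keys and 8 with them. On larger graphs with many distinct dependencies, that extra growth is the overhead the PR removes.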
Submitter Checklist
These are the criteria that every PR should meet, please check them off as you
review them:
See the contribution guide for more details.
Double check this list of stuff that's easy to miss:
cmd dir, please update the release Task to build and release this image.
Reviewer Notes
If API changes are included, additive changes must be approved by at least two OWNERS and backwards incompatible changes must be approved by more than 50% of the OWNERS, and they must first be added in a backwards compatible way.
Release Notes