Fix recursion issue on Skip #3524
547fb27
to
6603bd2
Compare
/lgtm
Just a couple style questions, in case they help debugging this code in the future. Feel free to ignore them.
Thanks for tracking this down!
6603bd2
to
cdf533c
Compare
/cc @vdemeester @imjasonh @skaegi @jerop @pritidesai I added this to the v0.18 milestone; I think it's probably worth a minor release.
cdf533c
to
17b7469
Compare
17b7469
to
72455ba
Compare
/lgtm |
/cc @jerop |
Awesome, thanks for tackling this Andrea!
	facts.SkipCache = make(map[string]bool)
}
if _, cached := facts.SkipCache[t.PipelineTask.Name]; !cached {
	facts.SkipCache[t.PipelineTask.Name] = t.skip(facts) // t.skip() is same as our existing t.Skip()
Suggest removing the comment now that the original t.Skip is renamed t.skip
fixed
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: sbwsg
Oops, one more thing: |
// needed, via the `Skip` method in pipelinerunresolution.go
// The skip data is sensitive to changes in the state. The ResetSkippedCache method
// can be used to clean the cache and force re-computation when needed.
SkipCache map[string]bool
We might be able to instantiate this cache based on pr.Status.SkippedTasks, which holds the list of skippedTasks.
I will not block this PR for skippedTasks; we can revisit this in the next iteration. @jerop?
It looks like CI is slower than my laptop:
I'll reduce the size of the pipeline. |
246df4d
to
aaec640
Compare
(Copying my comment thread from Slack for posterity): I've checked out the branch and I am noticing some slightly odd behaviour. Every test run varies dramatically in the amount of time large_deps,_not_started takes. Some runs are 7 seconds, some are blowing through the 30s timeout cap. I'm poking around with pprof to see if I can figure out where all the time is being spent when the test run gets slow. When I hit the 30s timeout I am seeing the following top from pprof:
So all the time is being spent in Edit: here are the commands I'm running.
|
@sbwsg I ran the commands you pasted 🆙 and it's the same
|
@afrittoli I ran the unit test you have added here on
I am wondering why these changes have so much time spent on |
The pipelinerun state is made of resolved pipelinerun tasks (rprt), which are built from the actual status of the associated taskruns. It is computationally easy to know if a taskrun started, or completed successfully or unsuccessfully; however, determining whether a taskrun has been skipped or will be skipped in the pipeline run execution requires evaluating the entire pipeline run status and associated dag.

The Skip method used to apply to a single rprt, evaluate the entire pipeline run status and dag, return whether the specific rprt was going to be skipped, and throw away the rest. We used to invoke the Skip method on every rprt in the pipeline state to calculate candidate tasks for execution. To make things worse, we also invoked the "Skip" method as part of "isDone", defined as a logical OR between "isSuccessful", "isFailed" and "Skip".

With this change we compute the list of tasks to be skipped once, incrementally, by caching the results of each invocation of 'Skip'. We store the result in a map added to the pipelinerun facts, along with pipelinerun state and associated dags. We introduce a new method on the pipelinerun facts called "ResetSkippedCache".

This solution manages to hide some of the details of the skip logic from the core reconciler logic, but it still requires the cache to be manually reset in a couple of places. I believe further refactoring could help, but I wanted to keep this PR as small as possible. I will further pursue this work by reviving tektoncd#2821.

This change adds a unit test that reproduces the issue in tektoncd#3521, which used to fail (with timeout 30s) and now succeeds for pipelines roughly up to 120 tasks / 120 links. On my laptop, going beyond 120 tasks/links takes longer than 30s, so I left the unit test at 80 to avoid introducing a flaky test in CI. There is still work to do to improve this further; some profiling / tracing work might help.

Breaking large pipelines into logical groups (branches or pipelines in pipelines) would help reduce the complexity and computational cost for very large pipelines.

Fixes tektoncd#3521
Co-authored-by: Scott <sbws@google.com>
Signed-off-by: Andrea Frittoli <andrea.frittoli@uk.ibm.com>
aaec640
to
7aa35f0
Compare
The following is the coverage report on the affected files.
Only consider link B -> A once. If A depends on B in N different ways, consider that a single link in the dag. Considering them as different links won't help us detect cycles or schedule tasks. In the cycle detection logic, use a string builder instead of '+' to concatenate strings, since it is more efficient.
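A minimal sketch of the two ideas in this commit message, under stated assumptions: `uniqueDeps` and `pathKey` are hypothetical helper names, and a plain map stands in for whatever set type the real code uses. It only illustrates collapsing repeated links into one and building a dotted key with `strings.Builder`; it is not the code from this PR.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// uniqueDeps (hypothetical helper) collapses repeated dependencies so that
// N references from one task to the same parent count as a single link.
func uniqueDeps(deps []string) []string {
	seen := make(map[string]struct{}, len(deps))
	for _, d := range deps {
		seen[d] = struct{}{}
	}
	out := make([]string, 0, len(seen))
	for d := range seen {
		out = append(out, d)
	}
	sort.Strings(out) // sorted so the result is deterministic
	return out
}

// pathKey (hypothetical helper) joins names with "." using a strings.Builder
// rather than repeated '+' concatenation.
func pathKey(names ...string) string {
	var b strings.Builder
	for i, n := range names {
		if i > 0 {
			b.WriteByte('.')
		}
		b.WriteString(n)
	}
	return b.String()
}

func main() {
	// A depends on B three times and on C once: only two links remain.
	fmt.Println(uniqueDeps([]string{"B", "B", "C", "B"}))
	fmt.Println(pathKey("C", "A"))
}
```

With duplicates removed before traversal, each parent is visited once per task instead of once per reference, which is the reduction the commit message describes.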
4c385b9
to
eda511b
Compare
	}

	return uniqueDeps.List()
This significantly reduces the number of calls to dag.visit and makes it possible for the unit test in this PR to complete in under five seconds on average. The mysterious dot notation in dag.visit is still unsolved though:
pipeline/pkg/reconciler/pipeline/dag/dag.go
Line 145 in 747f4ba
visited[currentName+"."+n.Task.HashKey()] = true
It's creating a link between two nodes which are indirectly connected (C.A in A -> B -> C), but where and how this data is being utilized is a mystery.
/lgtm @skaegi let us know if you still run into scaling issues, the examples you have specified must be fixed with these changes. Thanks @afrittoli and @sbwsg for fixes and help troubleshoot this 🙏 |
Cherry-picked with PR #3534
Changes
Fix recursion issue on Skip
The pipelinerun state is made of resolved pipelinerun tasks (rprt),
which are built from the actual status of the associated taskruns.
It is computationally easy to know if a taskrun started, or completed
successfully or unsuccessfully; however, determining whether a taskrun
has been skipped or will be skipped in the pipeline run execution
requires evaluating the entire pipeline run status and associated dag.
The Skip method used to apply to a single rprt, evaluate the entire
pipeline run status and dag, return whether the specific rprt was
going to be skipped, and throw away the rest.
We used to invoke the Skip method on every rprt in the pipeline state
to calculate candidate tasks for execution. To make things worse,
we also invoked the "Skip" method as part of the "isDone", defined as
a logical OR between "isSuccessful", "isFailed" and "Skip".
With this change we compute the list of tasks to be skipped once,
incrementally, by caching the results of each invocation of 'Skip'.
We store the result in a map added to the pipelinerun facts, along with
pipelinerun state and associated dags. We introduce a new method on the
pipelinerun facts called "IsTaskSkipped".
This solution manages to hide some of the details of the skip logic from
the core reconciler logic, but it still requires the cache to be
manually reset in a couple of places. I believe further refactoring could
help, but I wanted to keep this PR as small as possible.
I will further pursue this work by reviving #2821.
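The caching approach described above could be sketched roughly as follows. The `task` and `facts` types, the `skipSelf` flag, and the rule "skipped if any parent is skipped" are simplified assumptions made for this example; only the shape (a lazily created `map[string]bool`, a cached lookup, and an explicit reset) mirrors what the description and the review snippets mention.

```go
package main

import "fmt"

// task is a hypothetical stand-in for a resolved pipelinerun task (rprt).
type task struct {
	parents  []string
	skipSelf bool // assumed: the task's own guard says "skip"
}

// facts is a hypothetical stand-in for the pipelinerun facts.
type facts struct {
	tasks     map[string]*task
	skipCache map[string]bool
	evals     int // counts uncached evaluations, for illustration only
}

// skip is the uncached evaluation: skipped if the task's own guard says so
// or if any parent is skipped (a simplified rule for this sketch).
func (f *facts) skip(name string) bool {
	f.evals++
	t := f.tasks[name]
	if t.skipSelf {
		return true
	}
	for _, p := range t.parents {
		if f.isTaskSkipped(p) {
			return true
		}
	}
	return false
}

// isTaskSkipped returns the cached answer, computing it on first use.
func (f *facts) isTaskSkipped(name string) bool {
	if f.skipCache == nil {
		f.skipCache = make(map[string]bool)
	}
	if _, cached := f.skipCache[name]; !cached {
		f.skipCache[name] = f.skip(name)
	}
	return f.skipCache[name]
}

// resetSkippedCache drops all cached answers; call it after the state changes.
func (f *facts) resetSkippedCache() {
	f.skipCache = make(map[string]bool)
}

func main() {
	f := &facts{tasks: map[string]*task{
		"a": {skipSelf: true},
		"b": {parents: []string{"a"}},
		"c": {parents: []string{"b"}},
	}}
	fmt.Println(f.isTaskSkipped("c"), f.evals) // true 3
	fmt.Println(f.isTaskSkipped("c"), f.evals) // true 3 (served from the cache)

	f.tasks["a"].skipSelf = false // the state changed, so the cache is stale
	f.resetSkippedCache()
	fmt.Println(f.isTaskSkipped("c"), f.evals) // false 6
}
```

The last step shows why the manual reset matters: without it, the second state would keep returning the stale `true` for "c".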
This change adds a unit test that reproduces the issue in #3521, which
used to fail (with timeout 30s) and now succeeds for pipelines roughly
up to 120 tasks / 120 links. On my laptop, going beyond 120 tasks/links
takes longer than 30s, so I left the unit test at 80 to avoid
introducing a flaky test in CI. There is still work to do to improve
this further; some profiling / tracing work might help.
Breaking large pipelines in logical groups (branches or pipelines in
pipelines) would help reduce the complexity and computational cost for
very large pipelines.
Fixes #3521
Co-authored-by: Scott sbws@google.com
Signed-off-by: Andrea Frittoli andrea.frittoli@uk.ibm.com
/kind bug