Add taskrun/pipelinerun gauge metrics around resolving respective tasks/pipelines #7094
Conversation
pkg/taskrunmetrics/metrics.go
@@ -219,6 +224,11 @@ func viewRegister(cfg *config.Metrics) error {
		Measure:     runningTRsThrottledByNodeCount,
		Aggregation: view.LastValue(),
	}
	runningTRsWaitingOnBundleResolutionCountView = &view.View{
Where is this being used?
see below at https://github.com/tektoncd/pipeline/pull/7094/files#diff-7bb37a1fbfb2780494fcbb3f55f273ac112b0503172188595cf81c4fa8854848R245
the associated metric is ultimately incremented at https://github.com/tektoncd/pipeline/pull/7094/files#diff-7bb37a1fbfb2780494fcbb3f55f273ac112b0503172188595cf81c4fa8854848R399
Got it. I got confused due to the naming. Can you please rename it to runningTRsWaitingOnTaskResolutionCountView instead of runningTRsWaitingOnBundleResolutionCountView?
ah yeah I renamed the metric but forgot to rename the view ... thanks
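For illustration, a minimal sketch of what the renamed measure/view pair could look like, mirroring the snippet above (the metric name, description, and wiring are assumptions for the sketch, not the PR's actual code; assumes go.opencensus.io/stats and go.opencensus.io/stats/view are imported):

var runningTRsWaitingOnTaskResolutionCount = stats.Float64(
	"running_taskruns_waiting_on_task_resolution_count", // illustrative metric name
	"Number of running TaskRuns waiting on resolution of their referenced Tasks",
	stats.UnitDimensionless)

var runningTRsWaitingOnTaskResolutionCountView = &view.View{
	Description: runningTRsWaitingOnTaskResolutionCount.Description(),
	Measure:     runningTRsWaitingOnTaskResolutionCount,
	Aggregation: view.LastValue(),
}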
update pushed @khrm
Branch updated from 423b81d to 50dff3d.
/approve
/assign @vdemeester @lbernick
		status      corev1.ConditionStatus
		reason      string
		prWaitCount float64
		trWaitCount float64
can you add a test case for multiple taskruns/pipelineruns?
will do, though as I type today is EOB Friday, so it will be early next week hopefully.
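For what it's worth, a multiple-run entry in the existing test table might look roughly like this (field names come from the struct snippet above; the values and surrounding table structure are assumptions, not the actual test change):

{
	status:      corev1.ConditionUnknown,
	reason:      v1.TaskRunReasonResolvingTaskRef,
	prWaitCount: 3,
	trWaitCount: 3,
},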
	succeedCondition := pr.Status.GetCondition(apis.ConditionSucceeded)
	if succeedCondition != nil && succeedCondition.Status == corev1.ConditionUnknown {
		switch succeedCondition.Reason {
		case v1.TaskRunReasonResolvingTaskRef:
Does this occur if the pipelinerun has status v1.TaskRunReasonResolvingTaskRef? I'm a bit confused how the PR status is updated if it's waiting on multiple taskrun resolutions. Wouldn't the TR metric be inaccurate in this case, because it's incremented once per pipelinerun rather than once per taskrun in each pipelinerun?
So the answer to your first question is "yes".
Specifically, the code at pipeline/pkg/reconciler/pipelinerun/pipelinerun.go, lines 538 to 547 in c887347:
	pipelineRunState, err := c.resolvePipelineState(ctx, tasks, pipelineMeta.ObjectMeta, pr)
	switch {
	case errors.Is(err, remote.ErrRequestInProgress):
		message := fmt.Sprintf("PipelineRun %s/%s awaiting remote resource", pr.Namespace, pr.Name)
		pr.Status.MarkRunning(v1.TaskRunReasonResolvingTaskRef, message)
		return nil
	case err != nil:
		return err
	default:
	}
Now, looking at the implementation of resolvePipelineState, it errors out of that method on the first instance of remote.ErrRequestInProgress, so it does not bother to see how many taskruns it is waiting on.
From at least my team's operational perspective, knowing that a pipelinerun is blocked by the resolution of any of its taskruns is sufficient, i.e. the number of pipelineruns blocked by taskruns does not have to equal the number of blocked taskruns.
Lastly, I looked at my description for the metric, and I'll admit it is not precisely clear on this nuance (i.e. I need to throw the word "any" in there in the right spot).
Hopefully that helps with the bit you said you were confused about and answers the second question, assuming you agree with my logic/thinking here. And if you think I need to put some form of all this ^^ in comments in the code, or in the metrics.md file, let me know.
Thanks
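To make that concrete, here is a rough sketch (not the PR's actual code) of counting PipelineRuns waiting on task resolution, incrementing at most once per PipelineRun regardless of how many of its TaskRuns are unresolved; it assumes the usual imports (corev1 "k8s.io/api/core/v1", "knative.dev/pkg/apis", v1 "github.com/tektoncd/pipeline/pkg/apis/pipeline/v1"):

// countPRsWaitingOnTaskResolution counts each PipelineRun at most once while its
// Succeeded condition carries the resolving-task-ref reason, no matter how many
// of its TaskRuns are still being resolved.
func countPRsWaitingOnTaskResolution(prs []*v1.PipelineRun) float64 {
	var waiting float64
	for _, pr := range prs {
		cond := pr.Status.GetCondition(apis.ConditionSucceeded)
		if cond == nil || cond.Status != corev1.ConditionUnknown {
			continue
		}
		if cond.Reason == v1.TaskRunReasonResolvingTaskRef {
			waiting++
		}
	}
	return waiting
}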
Ok I think this makes more sense now, thank you! I was thinking that it was supposed to reflect the number of blocked taskruns in a pipelinerun, but re-reading the metric names/descriptions this is more clear. I'm wondering if it's useful to separate out waiting on pipeline resolution vs waiting on task resolution, vs just the number of PRs waiting on resolution of any kind? Genuine question, curious to hear what is useful for you.
Yeah that sort of consolidation notion occurred to me too @lbernick
I landed on a finer grained distinction because in our deployment we are already considering/moving to a situation where there will be more than one container registry employed for our various bundles. Short term, the current level of granularity is OK for us for distinguishing between container registries, and the minimal labeling helps with the cardinality of the metric within the metric subsystem. But longer term, we might even pursue a separate PR to add say the container registry (minus image/tag) as a label. But I wanted to wait until we had some vetting of the current metric before adding to it.
Probably another metric for the number of PRs waiting on resolution of any kind is useful. The other three are also fine.
Tags related to resolution, like registry/repo, can be helpful, but this should be configurable via configmap. We don't want to increase the cardinality.
Ack on the resolution/configmap note @khrm ... I'll make sure to include the configuration / opt-in approach for those labels if and when I come back with a new PR that contributes that.
On the waiting for resolution of any other kind, I think the reason label we check for is for any of the resolver types. Though I've referenced image/bundle resolution in some of my comments here, the other types are not precluded. And I was careful in the metric description to not reference a specific resolver. Certainly though if we were to add configurable labels in a follow up PR, specific configurable labels for each of the types could make sense. Not adding the more detailed labels with this initial PR bypasses all that additional code delta.
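If such opt-in labels were added later, a rough opencensus sketch could look like the following (the tag name, flag, and helper are purely illustrative, not anything in this PR; assumes "context", "go.opencensus.io/stats", and "go.opencensus.io/tag" are imported):

// registryKey is an illustrative tag key; the real name would be decided in a follow-up PR.
var registryKey = tag.MustNewKey("registry")

// recordWaitingGauge records the gauge, optionally adding a registry tag when a
// hypothetical configmap-driven flag enables it, keeping cardinality down by default.
func recordWaitingGauge(ctx context.Context, m *stats.Float64Measure, count float64, registry string, addRegistryTag bool) error {
	if addRegistryTag {
		var err error
		ctx, err = tag.New(ctx, tag.Upsert(registryKey, registry))
		if err != nil {
			return err
		}
	}
	stats.Record(ctx, m.M(count))
	return nil
}

The corresponding view would also need registryKey listed in its TagKeys for the tag to show up when the option is enabled.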
@khrm @lbernick - in addition to the response to @lbernick 's comments from earlier today, I realized while processing them today that the PR title and commit text is not entirely accurate. These metrics are not "wait times". Rather, they are a counter/gauge of how many pipelineruns/taskruns are waiting on resolving pipelines/tasks. I'll update those when I get back to making updates to this PR after the weekend. Now, some tl;dr ... actual wait times, i.e. how long the "waiting on resolution" reasons were set in the ConditionSucceeded condition, is something my team is interested in as well. I refrained from adding that metric in this PR for 2 basic reasons.
As it turns out, I can implement such "wait time" metrics more cleanly in other controllers outside of tekton that watch pipelineruns/taskruns as well (in the interest of brevity, I'll refrain on the why/how they are easier/more clean, but can elaborate if desired). That said, if you all are interested in seeing how a "wait time" metric could look in tekton, I could submit another PR when we are done with this one, and we can go from there. Thanks.
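For context only, one rough way such a wait time could be derived inside tekton (purely a sketch under assumed imports: "time", corev1, apis, and the pipeline v1 types; not something this PR or a follow-up necessarily implements) is from the last transition time of the Succeeded condition while the resolving reason is set:

// resolutionWaitTime reports how long a PipelineRun has been waiting on resolution,
// measured from when its Succeeded condition last transitioned with the resolving reason.
func resolutionWaitTime(pr *v1.PipelineRun, now time.Time) (time.Duration, bool) {
	cond := pr.Status.GetCondition(apis.ConditionSucceeded)
	if cond == nil || cond.Status != corev1.ConditionUnknown ||
		cond.Reason != v1.TaskRunReasonResolvingTaskRef {
		return 0, false
	}
	return now.Sub(cond.LastTransitionTime.Inner.Time), true
}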
thanks @gabemontero! this looks good to me other than the few comments and PR title/release notes as you mentioned. If it would be helpful for you to add wait time metrics too, please go for it! (Just in a separate PR, like you mentioned :) ) If it would be helpful to have some discussion on how best to implement these wait time metrics before actually creating a PR, I think the best way would be to create a tracking issue with a short description of why the metric would be helpful and how you're thinking of implementing it, and we can discuss there.
Branch updated from 50dff3d to b87813b.
OK @lbernick @khrm I've updated the unit tests to do multiple pipelineruns / taskruns, and I have fixed the PR title / commit message to remove the bit about time waiting. Also responded to, and believe I agreed with, @khrm's comments from today. Unless there are any disconnects on my unit test updates or my response to @khrm's comments today, from my end I am waiting on @lbernick to chime in on the separate issue / separate PR for the pipelinerun reason comment. And yep @lbernick, saw your advice about either a separate PR for the wait time metric or opening the issue first to discuss. Once we are done with this one I'll reset, revisit a little, and see which way to go. Thanks folks
	unregisterMetrics()
	ctx, _ := ttesting.SetupFakeContext(t)
	informer := fakepipelineruninformer.Get(ctx)
	for i := 0; i < 3; i++ {
nit: can you make this magic number a named variable?
will do
now a variable
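For example (purely illustrative; the actual variable name and test body may differ):

// the magic number from the test loop, now named
multiplePipelineRunCount := 3

for i := 0; i < multiplePipelineRunCount; i++ {
	// create a PipelineRun and add it to the fake informer, as the test does
}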
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: khrm, lbernick. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Branch updated from b87813b to 79d70e8.
Add taskrun/pipelinerun gauge metrics around resolving respective tasks/pipelines
This commit adds new experimental gauge metrics that count the number of TaskRuns that are waiting for resolution of any Tasks they reference, as well as count the number of PipelineRuns waiting on Pipeline resolution, and lastly count the number of PipelineRuns waiting on Task resolution for their underlying TaskRuns.
Branch updated from 79d70e8 to 213a8d8.
thanks @gabemontero, would you mind updating the release note to correct what the metric is for? @khrm could you leave lgtm since I've already approved?
ah good catch, yep, just a sec
@lbernick I don't have permission to give lgtm. @vdemeester @afrittoli can you do that?
@khrm: changing LGTM is restricted to collaborators.
release note updated @lbernick thanks
/lgtm
So I've been able to percolate on the wait time metric a bit more @lbernick @khrm, and there are at least 2 implementation paths where I see various pros and cons, depending on how you view the elements of those changes.
fwiw I have an "outside of tekton" implementation for one of those paths, so I have prototyped things to some degree, and have some POC / unit level testing to help validate.
But given all this, I'll open up an issue vs. just providing a PR per your earlier comment @lbernick, where I'll try to describe the 2 choices I see, with my take on the pros/cons. We can then see what you all think about those, if you see other possible paths to go down, etc. Thanks
Changes
/kind feature
This commit adds new experimental gauge metrics that count the number of TaskRuns that are waiting for resolution of any Tasks they reference, as well as count the number of PipelineRuns waiting on Pipeline resolution, and lastly count the number of PipelineRuns waiting on Task resolution for their underlying TaskRuns.
This change's motivation is similar to #6744 and stems from the same deployment of tekton that motivated the #6744 changes I submitted back in May of this year. In particular, questions around "how much time is spent resolving bundles" have arisen with a fair amount of frequency from various stakeholders of our deployment.
Getting some precise data on bundle wait times could also help lend priority to features like #6385
Submitter Checklist
As the author of this PR, please check off the items in this checklist:
Release Notes
@vdemeester @khrm @lbernick PTAL if you all have sufficient time to do so - thanks !!