Add taskrun/pipelinerun gauge metrics around resolving respective tasks/pipelines #7094

Merged · 1 commit into tektoncd:main · Sep 14, 2023

Conversation

@gabemontero (Contributor) commented Sep 6, 2023

Changes

/kind feature

This commit adds new experimental gauge metrics that count the number of TaskRuns that are waiting for resolution of any Tasks they reference, the number of PipelineRuns waiting on Pipeline resolution, and the number of PipelineRuns waiting on Task resolution for their underlying TaskRuns.

The motivation is similar to #6744 and stems from the same Tekton deployment that motivated the #6744 changes I submitted back in May of this year. In particular, questions around "how much time is spent resolving bundles" have come up with a fair amount of frequency from various stakeholders of our deployment.

Getting some precise data on bundle wait times could also help lend priority to features like #6385
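
For context, here is a minimal sketch of how one such gauge is typically declared and registered with the OpenCensus stats/view API that Tekton's metrics packages build on (the names below are illustrative, not necessarily the exact ones in this change):

```go
package taskrunmetrics // illustrative placement

import (
	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
)

// Illustrative measure: the count of TaskRuns currently waiting on
// resolution of the Tasks they reference.
var runningTRsWaitingOnTaskResolutionCount = stats.Float64(
	"running_taskruns_waiting_on_task_resolution_count",
	"Number of running TaskRuns waiting on resolution of the Tasks they reference",
	stats.UnitDimensionless)

// A LastValue aggregation gives gauge semantics: each report reflects the
// most recently recorded count rather than a running sum.
var runningTRsWaitingOnTaskResolutionCountView = &view.View{
	Description: runningTRsWaitingOnTaskResolutionCount.Description(),
	Measure:     runningTRsWaitingOnTaskResolutionCount,
	Aggregation: view.LastValue(),
}

// registerWaitingGauge is a hypothetical helper; in practice the view would be
// registered alongside the existing views in viewRegister.
func registerWaitingGauge() error {
	return view.Register(runningTRsWaitingOnTaskResolutionCountView)
}
```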

Submitter Checklist

As the author of this PR, please check off the items in this checklist:

  • [x] Has Docs if any changes are user facing, including updates to minimum requirements e.g. Kubernetes version bumps
  • [x] Has Tests included if any functionality added or changed
  • [x] Follows the commit message standard
  • [x] Meets the Tekton contributor standards (including functionality, content, code)
  • [x] Has a kind label. You can add one by adding a comment on this PR that contains /kind <type>. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tep
  • [x] Release notes block below has been updated with any user facing changes (API changes, bug fixes, changes requiring upgrade notices or deprecation warnings). See some examples of good release notes.
  • Release notes contains the string "action required" if the change requires additional action from users switching to the new release

Release Notes

New gauge metrics are introduced that count the number of TaskRuns waiting for resolution of any Tasks they reference, the number of PipelineRuns waiting on Pipeline resolution, and the number of PipelineRuns waiting on Task resolution for their underlying TaskRuns.

@vdemeester @khrm @lbernick PTAL if you all have sufficient time to do so - thanks !!

@tekton-robot tekton-robot added kind/feature Categorizes issue or PR as related to a new feature. release-note Denotes a PR that will be considered when it comes time to generate release notes. labels Sep 6, 2023
@tekton-robot tekton-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Sep 6, 2023
@tekton-robot (Collaborator)

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File | Old Coverage | New Coverage | Delta
--- | --- | --- | ---
pkg/pipelinerunmetrics/metrics.go | 80.0% | 82.1% | 2.1
pkg/taskrunmetrics/metrics.go | 83.0% | 83.4% | 0.5

@tekton-robot (Collaborator)

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File | Old Coverage | New Coverage | Delta
--- | --- | --- | ---
pkg/pipelinerunmetrics/metrics.go | 80.0% | 82.1% | 2.1
pkg/taskrunmetrics/metrics.go | 83.0% | 83.4% | 0.5

@@ -219,6 +224,11 @@ func viewRegister(cfg *config.Metrics) error {
		Measure:     runningTRsThrottledByNodeCount,
		Aggregation: view.LastValue(),
	}
	runningTRsWaitingOnBundleResolutionCountView = &view.View{
Contributor

Where is this being used?

Contributor Author

Contributor

Got it. I got confused due to the naming.
Can you please rename it to runningTRsWaitingOnTaskResolutionCountView instead of runningTRsWaitingOnBundleResolutionCountView?

Contributor Author

ah yeah I renamed the metric but forgot to rename the view ... thanks

Contributor Author

update pushed @khrm

@tekton-robot (Collaborator)

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File | Old Coverage | New Coverage | Delta
--- | --- | --- | ---
pkg/pipelinerunmetrics/metrics.go | 80.0% | 82.1% | 2.1
pkg/taskrunmetrics/metrics.go | 83.0% | 83.4% | 0.5

@tekton-robot (Collaborator)

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File | Old Coverage | New Coverage | Delta
--- | --- | --- | ---
pkg/pipelinerunmetrics/metrics.go | 80.0% | 82.1% | 2.1
pkg/taskrunmetrics/metrics.go | 83.0% | 83.4% | 0.5

@khrm (Contributor) left a comment

/approve

/assign @vdemeester @lbernick

pkg/pipelinerunmetrics/metrics.go (outdated review thread, resolved)
pkg/pipelinerunmetrics/metrics.go (review thread, resolved)
status corev1.ConditionStatus
reason string
prWaitCount float64
trWaitCount float64
Member

can you add a test case for multiple taskruns/pipelineruns?

Contributor Author

will do, though as I type today is EOB Friday, so it will be early next week hopefully.

	succeedCondition := pr.Status.GetCondition(apis.ConditionSucceeded)
	if succeedCondition != nil && succeedCondition.Status == corev1.ConditionUnknown {
		switch succeedCondition.Reason {
		case v1.TaskRunReasonResolvingTaskRef:
Member

Does this occur if the pipelinerun has status v1.TaskRunReasonResolvingTaskRef? I'm a bit confused how the PR status is updated if it's waiting on multiple taskrun resolutions. Wouldn't the TR metric be inaccurate in this case, because it's incremented once per pipelinerun rather than once per taskrun in each pipelinerun?

Contributor Author

So the answer to your first question is "yes".

Specifically, the code at

	pipelineRunState, err := c.resolvePipelineState(ctx, tasks, pipelineMeta.ObjectMeta, pr)
	switch {
	case errors.Is(err, remote.ErrRequestInProgress):
		message := fmt.Sprintf("PipelineRun %s/%s awaiting remote resource", pr.Namespace, pr.Name)
		pr.Status.MarkRunning(v1.TaskRunReasonResolvingTaskRef, message)
		return nil
	case err != nil:
		return err
	default:
	}
is what motivated adding this metric.

Now, looking at the implementation of resolvePipelineState, it errors out of that method on the first instance of remote.ErrRequestInProgress, so it does not bother to determine how many taskruns it is waiting on.

From at least my team's operational perspective, knowing that a pipelinerun is blocked by the resolution of any of its taskruns is sufficient; i.e. the number of pipelineruns blocked by taskruns does not have to equal the number of blocked taskruns.
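
To make that counting semantic concrete, here is a rough sketch of the idea (illustrative names and reason strings, not the exact code in this PR): each PipelineRun contributes at most one to the gauge, keyed off the reason on its Succeeded condition.

```go
package pipelinerunmetrics // illustrative placement

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/labels"
	"knative.dev/pkg/apis"

	listers "github.com/tektoncd/pipeline/pkg/client/listers/pipeline/v1"
)

// countPRsWaitingOnResolution walks every PipelineRun in the informer cache
// and counts how many are currently blocked on pipeline resolution vs. task
// resolution. It counts PipelineRuns, not the individual TaskRuns they may
// be waiting on.
func countPRsWaitingOnResolution(lister listers.PipelineRunLister) (waitingOnPipeline, waitingOnTasks float64, err error) {
	prs, err := lister.List(labels.Everything())
	if err != nil {
		return 0, 0, err
	}
	for _, pr := range prs {
		c := pr.Status.GetCondition(apis.ConditionSucceeded)
		if c == nil || c.Status != corev1.ConditionUnknown {
			continue
		}
		// Illustrative reason strings; the real code compares against the
		// v1 reason constants (e.g. v1.TaskRunReasonResolvingTaskRef).
		switch c.Reason {
		case "ResolvingPipelineRef":
			waitingOnPipeline++
		case "ResolvingTaskRef":
			waitingOnTasks++
		}
	}
	return waitingOnPipeline, waitingOnTasks, nil
}
```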

Lastly, I looked at my description for the metric, and I'll admit it is not precisely clear on this nuance (i.e. I need to throw the word "any" in there in the right spot).

Hopefully that clears up the bit you said you were confused about and answers the second question, assuming you agree with my logic/thinking here. If you think I should put some form of all this ^^ in comments in the code, or in the metrics.md file, let me know.

Thanks

Member

Ok I think this makes more sense now, thank you! I was thinking that it was supposed to reflect the number of blocked taskruns in a pipelinerun, but re-reading the metric names/descriptions this is more clear. I'm wondering if it's useful to separate out waiting on pipeline resolution vs waiting on task resolution, vs just the number of PRs waiting on resolution of any kind? Genuine question, curious to hear what is useful for you.

Contributor Author

Yeah that sort of consolidation notion occurred to me too @lbernick

I landed on the finer-grained distinction because in our deployment we are already considering a move to more than one container registry for our various bundles. Short term, the current level of granularity is OK for distinguishing between container registries, and the minimal labeling helps with the cardinality of the metric within the metric subsystem. Longer term, we might pursue a separate PR to add, say, the container registry (minus image/tag) as a label, but I wanted to wait until we had some vetting of the current metric before adding to it.

Contributor

Probably another metric for the number of PRs waiting on resolution of any kind is useful. The other three are also fine.

@khrm (Contributor) Sep 11, 2023

A tag related to resolution, like registry/repo, can be helpful, but it should be configurable via configmap. We don't want to increase the cardinality.

Contributor Author

Ack on the resolution/configmap note @khrm. I'll make sure to include a configuration / opt-in approach for those labels if and when I come back with a new PR that contributes them.

On the "waiting for resolution of any kind" point: the reason we check for applies to any of the resolver types. Though I've referenced image/bundle resolution in some of my comments here, the other types are not precluded, and I was careful not to reference a specific resolver in the metric description. If we add configurable labels in a follow-up PR, specific configurable labels for each of the types could make sense; not adding the more detailed labels in this initial PR avoids all that additional code delta.
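
For what it's worth, if that follow-up happens, recording the opt-in registry label would probably look something like this with the OpenCensus tag API (a purely hypothetical sketch, not part of this PR; the key name and the configmap gate are assumptions):

```go
package taskrunmetrics // illustrative placement

import (
	"context"

	"go.opencensus.io/stats"
	"go.opencensus.io/tag"
)

// registryKey is a hypothetical tag key; every distinct registry value adds a
// new time series, which is why it would be gated behind a configmap opt-in.
// The corresponding view would also need registryKey in its TagKeys.
var registryKey = tag.MustNewKey("registry")

// recordWaitingWithRegistry records a gauge measurement, attaching the
// registry label only when the operator has opted in via config.
func recordWaitingWithRegistry(ctx context.Context, m *stats.Float64Measure, addRegistryLabel bool, registry string, count float64) error {
	if !addRegistryLabel {
		stats.Record(ctx, m.M(count))
		return nil
	}
	return stats.RecordWithTags(ctx,
		[]tag.Mutator{tag.Upsert(registryKey, registry)},
		m.M(count))
}
```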

@gabemontero (Contributor Author)

@khrm @lbernick - in addition to the response to @lbernick's comments from earlier today, I realized while processing them that the PR title and commit text are not entirely accurate.

These metrics are not "wait times". Rather, they are a counter/gauge of how many pipelineruns/taskruns are waiting on resolving pipelines/tasks.

I'll update those when I get back to making updates to this PR after the weekend.

Now, some tl;dr: actual wait times, i.e. how long the "waiting on resolution" reason was set in the ConditionSucceeded condition, are something my team is interested in as well. I refrained from adding that metric in this PR for two basic reasons:

  • keep the breadth / amount of the change more manageable
  • the manipulation of that condition's reason in the tekton code is spread out over a few packages/methods (i.e. it is set to running in pkg/pod, and the waiting-on-resolution handling is done in different methods in the taskrun reconciler), and I was a bit concerned about the fragility of maintaining the metric in the tekton controller

As it turns out, I can implement such "wait time" metrics more cleanly in other controllers outside of tekton that watch pipelineruns/taskruns (in the interest of brevity, I'll refrain from explaining why/how they are easier/cleaner there, but can elaborate if desired).
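
(For the curious, here is a very rough sketch of the general idea behind that external approach, with illustrative names and an illustrative reason string; real condition handling would need to deal with more edge cases than this.)

```go
package waitmetrics // illustrative external-controller package

import (
	"context"

	"go.opencensus.io/stats"
	"knative.dev/pkg/apis"

	v1 "github.com/tektoncd/pipeline/pkg/apis/pipeline/v1"
)

// prResolutionWait is an illustrative measure of how long a PipelineRun spent
// with a "waiting on resolution" reason on its Succeeded condition.
var prResolutionWait = stats.Float64(
	"pipelinerun_resolution_wait_ms",
	"Time a PipelineRun spent waiting on remote resolution",
	stats.UnitMilliseconds)

// recordResolutionWait would be called from an update handler that sees both
// the old and new PipelineRun objects: when the Succeeded condition moves off
// the resolving reason, record how long that reason had been in place.
func recordResolutionWait(ctx context.Context, oldPR, newPR *v1.PipelineRun) {
	oldCond := oldPR.Status.GetCondition(apis.ConditionSucceeded)
	newCond := newPR.Status.GetCondition(apis.ConditionSucceeded)
	if oldCond == nil || newCond == nil {
		return
	}
	const resolving = "ResolvingTaskRef" // illustrative reason string
	if oldCond.Reason == resolving && newCond.Reason != resolving {
		wait := newCond.LastTransitionTime.Inner.Sub(oldCond.LastTransitionTime.Inner.Time)
		stats.Record(ctx, prResolutionWait.M(float64(wait.Milliseconds())))
	}
}
```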

That said, if you all are interested in seeing how a "wait time" metric could look in tekton, I could submit another PR when we are done with this one, and we can go from there.

Thanks.

@lbernick (Member)

thanks @gabemontero! this looks good to me other than the few comments and PR title/release notes as you mentioned. If it would be helpful for you to add wait time metrics too, please go for it! (Just in a separate PR, like you mentioned :) ) If it would be helpful to have some discussion on how best to implement these wait time metrics before actually creating a PR, I think the best way would be to create a tracking issue with a short description of why the metric would be helpful and how you're thinking of implementing it, and we can discuss there.

@gabemontero gabemontero changed the title Add taskrun/pipelinerun gauge metrics for wait times around resolving respective tasks/pipelines Add taskrun/pipelinerun gauge metrics around resolving respective tasks/pipelines Sep 11, 2023
@gabemontero (Contributor Author)

OK @lbernick @khrm I've updated the unit tests to do multiple pipelineruns / taskruns, and I have fixed the PR title / commit message to remove the bit about time waiting.

Also responded to, and I believe agreed with, @khrm's comments from today.

Unless there are any disconnects on my unit test updates or my response to @khrm's comments today, from my end I am waiting on @lbernick to chime in on the separate issue / separate PR for the pipelinerun reason comment.

And yep @lbernick, I saw your advice about either a separate PR for the wait time metric or opening an issue first to discuss. Once we are done with this one I'll reset, revisit a little, and see which way to go.

thanks folks

@tekton-robot (Collaborator)

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File | Old Coverage | New Coverage | Delta
--- | --- | --- | ---
pkg/pipelinerunmetrics/metrics.go | 80.0% | 82.1% | 2.1
pkg/taskrunmetrics/metrics.go | 83.0% | 83.4% | 0.5

@tekton-robot (Collaborator)

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File | Old Coverage | New Coverage | Delta
--- | --- | --- | ---
pkg/pipelinerunmetrics/metrics.go | 80.0% | 82.1% | 2.1
pkg/taskrunmetrics/metrics.go | 83.0% | 83.4% | 0.5

unregisterMetrics()
ctx, _ := ttesting.SetupFakeContext(t)
informer := fakepipelineruninformer.Get(ctx)
for i := 0; i < 3; i++ {
Member

nit: can you make this magic number a named variable?

Contributor Author

will do

Contributor Author

now a variable

@tekton-robot (Collaborator)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: khrm, lbernick

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 12, 2023
@tekton-robot (Collaborator)

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File | Old Coverage | New Coverage | Delta
--- | --- | --- | ---
pkg/pipelinerunmetrics/metrics.go | 80.0% | 82.1% | 2.1
pkg/taskrunmetrics/metrics.go | 83.0% | 83.4% | 0.5

Add taskrun/pipelinerun gauge metrics around resolving respective tasks/pipelines

This commit adds new experimental gauge metrics that count the number of TaskRuns that are waiting for resolution of any Tasks they reference, as well as the number of PipelineRuns waiting on Pipeline resolution, and lastly the number of PipelineRuns waiting on Task resolution for their underlying TaskRuns.
@gabemontero (Contributor Author)

PR updated @lbernick @khrm to leverage the new location of the pipelinerun resolving ref constant

PTAL / thanks !

@tekton-robot (Collaborator)

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File | Old Coverage | New Coverage | Delta
--- | --- | --- | ---
pkg/pipelinerunmetrics/metrics.go | 80.0% | 82.1% | 2.1
pkg/taskrunmetrics/metrics.go | 83.0% | 83.4% | 0.5

@tekton-robot (Collaborator)

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File | Old Coverage | New Coverage | Delta
--- | --- | --- | ---
pkg/pipelinerunmetrics/metrics.go | 80.0% | 82.1% | 2.1
pkg/taskrunmetrics/metrics.go | 83.0% | 83.4% | 0.5

@lbernick (Member)

thanks @gabemontero, would you mind updating the release note to correct what the metric is for? @khrm could you leave lgtm since I've already approved?

@gabemontero (Contributor Author)

thanks @gabemontero, would you mind updating the release note to correct what the metric is for? @khrm could you leave lgtm since I've already approved?

ah good catch yep just a sec

@khrm (Contributor) commented Sep 13, 2023

@lbernick I don't have permission to give lgtm @vdemeester @afrittoli can you do that?
/lgtm

@tekton-robot (Collaborator)

@khrm: changing LGTM is restricted to collaborators

In response to this:

@lbernick I don't have permission to give lgtm @vdemeester @afrittoli can you do that?
/lgtm

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@gabemontero (Contributor Author)

release note updated @lbernick thanks

@lbernick (Member)

/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Sep 14, 2023
@gabemontero (Contributor Author)

thanks @gabemontero! this looks good to me other than the few comments and PR title/release notes as you mentioned. If it would be helpful for you to add wait time metrics too, please go for it! (Just in a separate PR, like you mentioned :) ) If it would be helpful to have some discussion on how best to implement these wait time metrics before actually creating a PR, I think the best way would be to create a tracking issue with a short description of why the metric would be helpful and how you're thinking of implementing it, and we can discuss there.

So I've been able to percolate on the wait time metric a bit more @lbernick @khrm, and there are at least 2 implementation paths with various pros and cons, depending on how you view the elements of those changes.

fwiw I have an "outside of tekton" implementation for one of those paths, so have prototyped things to some degree, and have some POC / unit level testing to help validate

but given all this, I'll open up an issue vs. just providing a PR per your earlier comment @lbernick, where I'll try to describe the 2 choices I see, with my take on the pros/cons

we can then see what you all think about those, if you see other possible paths to go down, etc.

thanks

@tekton-robot tekton-robot merged commit 284874d into tektoncd:main Sep 14, 2023
@gabemontero gabemontero deleted the bundle-wait-metric branch September 14, 2023 19:31
@gabemontero (Contributor Author)

ok @lbernick @khrm I've opened #7116 for discussing how to do the wait time metric

Labels: approved, kind/feature, lgtm, release-note, size/L