Add taskrun/pipelinerun gauge metrics around resolving respective tasks/pipelines #7094

Merged · 1 commit into tektoncd:main · Sep 14, 2023

Conversation

@gabemontero (Contributor) commented Sep 6, 2023

Changes

/kind feature

This commit adds new experimental gauge metrics that count the number of TaskRuns that are waiting for resolution of any Tasks they reference, the number of PipelineRuns waiting on Pipeline resolution, and the number of PipelineRuns waiting on Task resolution for their underlying TaskRuns.

The motivation is similar to #6744 and stems from the same Tekton deployment that motivated the #6744 changes I submitted back in May of this year. In particular, questions around "how much time is spent resolving bundles" have come up with a fair amount of frequency from various stakeholders of our deployment.

Getting some precise data on bundle wait times could also help lend priority to features like #6385
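
For context, here is a minimal sketch of how one such gauge is typically declared and registered with the OpenCensus stats/view API that Tekton's metrics packages build on (the names below are illustrative, not necessarily the exact ones in this change):

```go
package taskrunmetrics // illustrative placement

import (
	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
)

// Illustrative measure: the count of TaskRuns currently waiting on
// resolution of the Tasks they reference.
var runningTRsWaitingOnTaskResolutionCount = stats.Float64(
	"running_taskruns_waiting_on_task_resolution_count",
	"Number of running TaskRuns waiting on resolution of the Tasks they reference",
	stats.UnitDimensionless)

// A LastValue aggregation gives gauge semantics: each report reflects the
// most recently recorded count rather than a running sum.
var runningTRsWaitingOnTaskResolutionCountView = &view.View{
	Description: runningTRsWaitingOnTaskResolutionCount.Description(),
	Measure:     runningTRsWaitingOnTaskResolutionCount,
	Aggregation: view.LastValue(),
}

// registerWaitingGauge is a hypothetical helper; in practice the view would be
// registered alongside the existing views in viewRegister.
func registerWaitingGauge() error {
	return view.Register(runningTRsWaitingOnTaskResolutionCountView)
}
```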

Submitter Checklist

As the author of this PR, please check off the items in this checklist:

  • [x] Has Docs if any changes are user facing, including updates to minimum requirements e.g. Kubernetes version bumps
  • [x] Has Tests included if any functionality added or changed
  • [x] Follows the commit message standard
  • [x] Meets the Tekton contributor standards (including functionality, content, code)
  • [x] Has a kind label. You can add one by adding a comment on this PR that contains /kind <type>. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tep
  • [x] Release notes block below has been updated with any user facing changes (API changes, bug fixes, changes requiring upgrade notices or deprecation warnings). See some examples of good release notes.
  • Release notes contains the string "action required" if the change requires additional action from users switching to the new release

Release Notes

New gauge metrics are introduced that count the number of TaskRuns waiting for resolution of any Tasks they reference, the number of PipelineRuns waiting on Pipeline resolution, and the number of PipelineRuns waiting on Task resolution for their underlying TaskRuns.

@vdemeester @khrm @lbernick PTAL if you all have sufficient time to do so - thanks !!

@tekton-robot tekton-robot added kind/feature Categorizes issue or PR as related to a new feature. release-note Denotes a PR that will be considered when it comes time to generate release notes. labels Sep 6, 2023
@tekton-robot tekton-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Sep 6, 2023
@tekton-robot (Collaborator)

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File | Old Coverage | New Coverage | Delta
--- | --- | --- | ---
pkg/pipelinerunmetrics/metrics.go | 80.0% | 82.1% | 2.1
pkg/taskrunmetrics/metrics.go | 83.0% | 83.4% | 0.5

@tekton-robot (Collaborator)

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File | Old Coverage | New Coverage | Delta
--- | --- | --- | ---
pkg/pipelinerunmetrics/metrics.go | 80.0% | 82.1% | 2.1
pkg/taskrunmetrics/metrics.go | 83.0% | 83.4% | 0.5

@@ -219,6 +224,11 @@ func viewRegister(cfg *config.Metrics) error {
		Measure:     runningTRsThrottledByNodeCount,
		Aggregation: view.LastValue(),
	}
	runningTRsWaitingOnBundleResolutionCountView = &view.View{
Contributor

Where is this being used?

Contributor Author

Contributor

Got it. I got confused due to the naming.
Can you please rename it to runningTRsWaitingOnTaskResolutionCountView instead of runningTRsWaitingOnBundleResolutionCountView?

Contributor Author

ah yeah I renamed the metric but forgot to rename the view ... thanks

Contributor Author

update pushed @khrm

@tekton-robot (Collaborator)

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File | Old Coverage | New Coverage | Delta
--- | --- | --- | ---
pkg/pipelinerunmetrics/metrics.go | 80.0% | 82.1% | 2.1
pkg/taskrunmetrics/metrics.go | 83.0% | 83.4% | 0.5

@tekton-robot (Collaborator)

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File | Old Coverage | New Coverage | Delta
--- | --- | --- | ---
pkg/pipelinerunmetrics/metrics.go | 80.0% | 82.1% | 2.1
pkg/taskrunmetrics/metrics.go | 83.0% | 83.4% | 0.5

@khrm (Contributor) left a comment

/approve

/assign @vdemeester @lbernick

pkg/pipelinerunmetrics/metrics.go (outdated review thread, resolved)
pkg/pipelinerunmetrics/metrics.go (review thread, resolved)
status corev1.ConditionStatus
reason string
prWaitCount float64
trWaitCount float64
Member

can you add a test case for multiple taskruns/pipelineruns?

Contributor Author

will do, though as I type today is EOB Friday, so it will be early next week hopefully.

	succeedCondition := pr.Status.GetCondition(apis.ConditionSucceeded)
	if succeedCondition != nil && succeedCondition.Status == corev1.ConditionUnknown {
		switch succeedCondition.Reason {
		case v1.TaskRunReasonResolvingTaskRef:
Member

Does this occur if the pipelinerun has status v1.TaskRunReasonResolvingTaskRef? I'm a bit confused how the PR status is updated if it's waiting on multiple taskrun resolutions. Wouldn't the TR metric be inaccurate in this case, because it's incremented once per pipelinerun rather than once per taskrun in each pipelinerun?

Contributor Author

So the answer to your first question is "yes".

Specifically, the code at

	pipelineRunState, err := c.resolvePipelineState(ctx, tasks, pipelineMeta.ObjectMeta, pr)
	switch {
	case errors.Is(err, remote.ErrRequestInProgress):
		message := fmt.Sprintf("PipelineRun %s/%s awaiting remote resource", pr.Namespace, pr.Name)
		pr.Status.MarkRunning(v1.TaskRunReasonResolvingTaskRef, message)
		return nil
	case err != nil:
		return err
	default:
	}
is what motivated adding this metric.

Now, looking at the implementation of resolvePipelineState, it errors out of that method on the first instance of remote.ErrRequestInProgress, so it does not bother to determine how many taskruns it is waiting on.

From at least my team's operational perspective, knowing that a pipelinerun is blocked by the resolution of any of its taskruns is sufficient; i.e. the number of pipelineruns blocked by taskruns does not have to equal the number of blocked taskruns.
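
To make that counting semantic concrete, here is a rough sketch of the idea (illustrative names and reason strings, not the exact code in this PR): each PipelineRun contributes at most one to the gauge, keyed off the reason on its Succeeded condition.

```go
package pipelinerunmetrics // illustrative placement

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/labels"
	"knative.dev/pkg/apis"

	listers "github.com/tektoncd/pipeline/pkg/client/listers/pipeline/v1"
)

// countPRsWaitingOnResolution walks every PipelineRun in the informer cache
// and counts how many are currently blocked on pipeline resolution vs. task
// resolution. It counts PipelineRuns, not the individual TaskRuns they may
// be waiting on.
func countPRsWaitingOnResolution(lister listers.PipelineRunLister) (waitingOnPipeline, waitingOnTasks float64, err error) {
	prs, err := lister.List(labels.Everything())
	if err != nil {
		return 0, 0, err
	}
	for _, pr := range prs {
		c := pr.Status.GetCondition(apis.ConditionSucceeded)
		if c == nil || c.Status != corev1.ConditionUnknown {
			continue
		}
		// Illustrative reason strings; the real code compares against the
		// v1 reason constants (e.g. v1.TaskRunReasonResolvingTaskRef).
		switch c.Reason {
		case "ResolvingPipelineRef":
			waitingOnPipeline++
		case "ResolvingTaskRef":
			waitingOnTasks++
		}
	}
	return waitingOnPipeline, waitingOnTasks, nil
}
```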

Lastly, I looked at my description for the metric, and I'll admit it is not precisely clear on this nuance (i.e. I need to throw the word "any" in there in the right spot).

Hopefully that clears up the bit you said you were confused about and answers the second question, assuming you agree with my logic/thinking here. If you think I should put some form of all this ^^ in comments in the code, or in the metrics.md file, let me know.

Thanks

Member

Ok I think this makes more sense now, thank you! I was thinking that it was supposed to reflect the number of blocked taskruns in a pipelinerun, but re-reading the metric names/descriptions this is more clear. I'm wondering if it's useful to separate out waiting on pipeline resolution vs waiting on task resolution, vs just the number of PRs waiting on resolution of any kind? Genuine question, curious to hear what is useful for you.

Contributor Author

Yeah that sort of consolidation notion occurred to me too @lbernick

I landed on the finer-grained distinction because in our deployment we are already considering a move to more than one container registry for our various bundles. Short term, the current level of granularity is OK for distinguishing between container registries, and the minimal labeling helps with the cardinality of the metric within the metric subsystem. Longer term, we might pursue a separate PR to add, say, the container registry (minus image/tag) as a label, but I wanted to wait until we had some vetting of the current metric before adding to it.

Contributor

Probably another metric for the number of PRs waiting on resolution of any kind is useful. The other three are also fine.

@khrm (Contributor) Sep 11, 2023

A tag related to resolution, like registry/repo, can be helpful, but it should be configurable via configmap. We don't want to increase the cardinality.

Contributor Author

Ack on the resolution/configmap note @khrm. I'll make sure to include a configuration / opt-in approach for those labels if and when I come back with a new PR that contributes them.

On the "waiting for resolution of any kind" point: the reason we check for applies to any of the resolver types. Though I've referenced image/bundle resolution in some of my comments here, the other types are not precluded, and I was careful not to reference a specific resolver in the metric description. If we add configurable labels in a follow-up PR, specific configurable labels for each of the types could make sense; not adding the more detailed labels in this initial PR avoids all that additional code delta.
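
For what it's worth, if that follow-up happens, recording the opt-in registry label would probably look something like this with the OpenCensus tag API (a purely hypothetical sketch, not part of this PR; the key name and the configmap gate are assumptions):

```go
package taskrunmetrics // illustrative placement

import (
	"context"

	"go.opencensus.io/stats"
	"go.opencensus.io/tag"
)

// registryKey is a hypothetical tag key; every distinct registry value adds a
// new time series, which is why it would be gated behind a configmap opt-in.
// The corresponding view would also need registryKey in its TagKeys.
var registryKey = tag.MustNewKey("registry")

// recordWaitingWithRegistry records a gauge measurement, attaching the
// registry label only when the operator has opted in via config.
func recordWaitingWithRegistry(ctx context.Context, m *stats.Float64Measure, addRegistryLabel bool, registry string, count float64) error {
	if !addRegistryLabel {
		stats.Record(ctx, m.M(count))
		return nil
	}
	return stats.RecordWithTags(ctx,
		[]tag.Mutator{tag.Upsert(registryKey, registry)},
		m.M(count))
}
```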

@gabemontero (Contributor Author)

@khrm @lbernick - in addition to the response to @lbernick's comments from earlier today, I realized while processing them that the PR title and commit text are not entirely accurate.

These metrics are not "wait times". Rather, they are a counter/gauge of how many pipelineruns/taskruns are waiting on resolving pipelines/tasks.

I'll update those when I get back to making updates to this PR after the weekend.

Now, some tl;dr: actual wait times, i.e. how long the "waiting on resolution" reason was set in the ConditionSucceeded condition, are something my team is interested in as well. I refrained from adding that metric in this PR for two basic reasons:

  • keep the breadth / amount of the change more manageable
  • the manipulation of that condition's reason in the tekton code is spread out over a few packages/methods (i.e. it is set to running in pkg/pod, and the waiting-on-resolution handling is done in different methods in the taskrun reconciler), and I was a bit concerned about the fragility of maintaining the metric in the tekton controller

As it turns out, I can implement such "wait time" metrics more cleanly in other controllers outside of tekton that watch pipelineruns/taskruns (in the interest of brevity, I'll refrain from explaining why/how they are easier/cleaner there, but can elaborate if desired).
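
(For the curious, here is a very rough sketch of the general idea behind that external approach, with illustrative names and an illustrative reason string; real condition handling would need to deal with more edge cases than this.)

```go
package waitmetrics // illustrative external-controller package

import (
	"context"

	"go.opencensus.io/stats"
	"knative.dev/pkg/apis"

	v1 "github.com/tektoncd/pipeline/pkg/apis/pipeline/v1"
)

// prResolutionWait is an illustrative measure of how long a PipelineRun spent
// with a "waiting on resolution" reason on its Succeeded condition.
var prResolutionWait = stats.Float64(
	"pipelinerun_resolution_wait_ms",
	"Time a PipelineRun spent waiting on remote resolution",
	stats.UnitMilliseconds)

// recordResolutionWait would be called from an update handler that sees both
// the old and new PipelineRun objects: when the Succeeded condition moves off
// the resolving reason, record how long that reason had been in place.
func recordResolutionWait(ctx context.Context, oldPR, newPR *v1.PipelineRun) {
	oldCond := oldPR.Status.GetCondition(apis.ConditionSucceeded)
	newCond := newPR.Status.GetCondition(apis.ConditionSucceeded)
	if oldCond == nil || newCond == nil {
		return
	}
	const resolving = "ResolvingTaskRef" // illustrative reason string
	if oldCond.Reason == resolving && newCond.Reason != resolving {
		wait := newCond.LastTransitionTime.Inner.Sub(oldCond.LastTransitionTime.Inner.Time)
		stats.Record(ctx, prResolutionWait.M(float64(wait.Milliseconds())))
	}
}
```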

That said, if you all are interested in seeing how a "wait time" metric could look in tekton, I could submit another PR when we are done with this one, and we can go from there.

Thanks.

@lbernick (Member)

thanks @gabemontero! this looks good to me other than the few comments and PR title/release notes as you mentioned. If it would be helpful for you to add wait time metrics too, please go for it! (Just in a separate PR, like you mentioned :) ) If it would be helpful to have some discussion on how best to implement these wait time metrics before actually creating a PR, I think the best way would be to create a tracking issue with a short description of why the metric would be helpful and how you're thinking of implementing it, and we can discuss there.

@gabemontero gabemontero changed the title Add taskrun/pipelinerun gauge metrics for wait times around resolving respective tasks/pipelines Add taskrun/pipelinerun gauge metrics around resolving respective tasks/pipelines Sep 11, 2023
@gabemontero (Contributor Author)

OK @lbernick @khrm I've updated the unit tests to do multiple pipelineruns / taskruns, and I have fixed the PR title / commit message to remove the bit about time waiting.

Also responded to, and I believe agreed with, @khrm's comments from today.

Unless there are any disconnects on my unit test updates or my response to @khrm's comments today, from my end I am waiting on @lbernick to chime in on the separate issue / separate PR for the pipelinerun reason comment.

And yep @lbernick, I saw your advice about either a separate PR for the wait time metric or opening an issue first to discuss. Once we are done with this one I'll reset, revisit a little, and see which way to go.

thanks folks

@tekton-robot (Collaborator)

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File | Old Coverage | New Coverage | Delta
--- | --- | --- | ---
pkg/pipelinerunmetrics/metrics.go | 80.0% | 82.1% | 2.1
pkg/taskrunmetrics/metrics.go | 83.0% | 83.4% | 0.5

@tekton-robot (Collaborator)

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File | Old Coverage | New Coverage | Delta
--- | --- | --- | ---
pkg/pipelinerunmetrics/metrics.go | 80.0% | 82.1% | 2.1
pkg/taskrunmetrics/metrics.go | 83.0% | 83.4% | 0.5

unregisterMetrics()
ctx, _ := ttesting.SetupFakeContext(t)
informer := fakepipelineruninformer.Get(ctx)
for i := 0; i < 3; i++ {
Member

nit: can you make this magic number a named variable?

Contributor Author

will do

Contributor Author

now a variable

@tekton-robot (Collaborator)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: khrm, lbernick

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 12, 2023
@tekton-robot (Collaborator)

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File | Old Coverage | New Coverage | Delta
--- | --- | --- | ---
pkg/pipelinerunmetrics/metrics.go | 80.0% | 82.1% | 2.1
pkg/taskrunmetrics/metrics.go | 83.0% | 83.4% | 0.5

Add taskrun/pipelinerun gauge metrics around resolving respective tasks/pipelines

This commit adds new experimental gauge metrics that count the number of TaskRuns that are waiting for resolution of any Tasks they reference, as well as the number of PipelineRuns waiting on Pipeline resolution, and lastly the number of PipelineRuns waiting on Task resolution for their underlying TaskRuns.
@gabemontero (Contributor Author)

PR updated @lbernick @khrm to leverage the new location of the pipelinerun resolving ref constant

PTAL / thanks !

@tekton-robot (Collaborator)

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File | Old Coverage | New Coverage | Delta
--- | --- | --- | ---
pkg/pipelinerunmetrics/metrics.go | 80.0% | 82.1% | 2.1
pkg/taskrunmetrics/metrics.go | 83.0% | 83.4% | 0.5

@tekton-robot (Collaborator)

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File | Old Coverage | New Coverage | Delta
--- | --- | --- | ---
pkg/pipelinerunmetrics/metrics.go | 80.0% | 82.1% | 2.1
pkg/taskrunmetrics/metrics.go | 83.0% | 83.4% | 0.5

@lbernick (Member)

thanks @gabemontero, would you mind updating the release note to correct what the metric is for? @khrm could you leave lgtm since I've already approved?

@gabemontero (Contributor Author)

thanks @gabemontero, would you mind updating the release note to correct what the metric is for? @khrm could you leave lgtm since I've already approved?

ah good catch yep just a sec

@khrm (Contributor) commented Sep 13, 2023

@lbernick I don't have permission to give lgtm @vdemeester @afrittoli can you do that?
/lgtm

@tekton-robot (Collaborator)

@khrm: changing LGTM is restricted to collaborators

In response to this:

@lbernick I don't have permission to give lgtm @vdemeester @afrittoli can you do that?
/lgtm

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@gabemontero (Contributor Author)

release note updated @lbernick thanks

@lbernick (Member)

/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Sep 14, 2023
@gabemontero (Contributor Author)

thanks @gabemontero! this looks good to me other than the few comments and PR title/release notes as you mentioned. If it would be helpful for you to add wait time metrics too, please go for it! (Just in a separate PR, like you mentioned :) ) If it would be helpful to have some discussion on how best to implement these wait time metrics before actually creating a PR, I think the best way would be to create a tracking issue with a short description of why the metric would be helpful and how you're thinking of implementing it, and we can discuss there.

So I've been able to percolate on the wait time metric a bit more @lbernick @khrm, and there are at least 2 implementation paths with various pros and cons, depending on how you view the elements of those changes.

fwiw I have an "outside of tekton" implementation for one of those paths, so have prototyped things to some degree, and have some POC / unit level testing to help validate

but given all this, I'll open up an issue vs. just providing a PR per your earlier comment @lbernick, where I'll try to describe the 2 choices I see, with my take on the pros/cons

we can then see what you all think about those, if you see other possible paths to go down, etc.

thanks

@tekton-robot tekton-robot merged commit 284874d into tektoncd:main Sep 14, 2023
@gabemontero gabemontero deleted the bundle-wait-metric branch September 14, 2023 19:31
@gabemontero (Contributor Author)

ok @lbernick @khrm I've opened #7116 for discussing how to do the wait time metric

Labels: approved, kind/feature, lgtm, release-note, size/L