add a metric for reference resolution request time to complete #7116

Open
gabemontero opened this issue Sep 15, 2023 · 0 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Feature request

A new metric that tracks the time needed for any ResolutionRequest created by the TaskRun reconciler, PipelineRun reconciler, etc. to complete and provide the resolved content (for example, the Task or Pipeline resource).
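As a very rough sketch of what is being asked for (not a proposal for final names), such a metric could be defined with the opencensus stats/view API that Tekton's existing metrics build on; the metric name, tag key, and bucket boundaries below are all placeholders:

import (
	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
	"go.opencensus.io/tag"
)

var (
	// Placeholder measure name/description; the real ones would need to follow
	// the naming of Tekton's existing taskrun/pipelinerun metrics.
	resolutionRequestDuration = stats.Float64(
		"resolutionrequest_duration_seconds",
		"time from when resolution of a remote ref begins until the resolved content is available",
		stats.UnitSeconds)

	// Tagging by resolver kind (bundles, cluster, hub, git, ...) would let an
	// administrator break the time down per resolver.
	resolverKindTag = tag.MustNewKey("resolver_kind")

	resolutionRequestDurationView = &view.View{
		Description: resolutionRequestDuration.Description(),
		Measure:     resolutionRequestDuration,
		// Placeholder bucket boundaries, in seconds.
		Aggregation: view.Distribution(0.5, 1, 5, 10, 30, 60, 120),
		TagKeys:     []tag.Key{resolverKindTag},
	}
)

// registerResolutionRequestDurationView would be wired into whichever
// controller's metrics setup ends up owning this metric.
func registerResolutionRequestDurationView() error {
	return view.Register(resolutionRequestDurationView)
}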

Use case

As the administrator of a Tekton deployment, I need to understand the amount of time being spent retrieving various Tekton resources from remote locations supported by the various Tekton Resolvers (bundles, cluster, hub, git, etc.).

Being able to track this time allows for

  • understanding the time cost of storing Tasks, Pipelines, etc. in remote locations vs. embedding them directly in TaskRuns or PipelineRuns (or at least creating those items directly in the local cluster), so we can make informed decisions by weighing that cost against typical execution times and the benefits unrelated to "time spent" that remote storage provides
  • tracking whether the time spent grows beyond historical norms, to assess whether some inefficiency or problem has been introduced into the Tekton deployment's environment

Background

When I provided taskrun/pipelinerun gauge metrics around resolving their respective tasks/pipelines, I discussed with @lbernick and @khrm that while those counts also help with overall monitoring of the resolver portion of our Tekton deployment, we were certainly interested in time spent as well. However, the way to implement a "time spent" metric is not as singular and clear cut as the count/gauge metric, which built upon the existing collection pattern for running metrics. I'm opening this feature issue to dive into the details of why, as well as to list possible implementation options for dealing with the trickiness of it all.

How do we discern when a resolution request is active?

Option 1)

Add metrics to the ResolutionRequest reconciler

Option 2)

Track the time during which the TaskRunReasonResolvingTaskRef and PipelineRunReasonResolvingPipelineRef reasons are active in the ConditionSucceeded condition of a given *Run, i.e. the transition time when those reasons are set vs. the transition time when those reasons change to something else (e.g. Running).

Some more pros/cons with each option

Option 1

pro - tracking time spent based on the initial creation / completion of a ResolutionRequest, or tracking the time needed to hydrate a Task or Pipeline from an existing ResolutionRequest, is straightforward

cons - Currently the ResolutionRequest reconciler has no metrics integration. This feature could be the motivator for starting that up. However, there is a non-trivial amount of overhead in ramping up metrics integration (the ConfigMap based configuration, for example, and making sure metric definitions get passed down to the knative controller logic, among other things).

Also, tracking at the ResolutionRequest layer potentially does not capture the cost of any k8s API server round trips when the TaskRun or PipelineRun reconcilers create/fetch ResolutionRequests. I have not had a chance to dive into the knative client code to confirm/deny how differing underlying caching clients within the same process leverage watch caches. So maybe the k8s gets for ResolutionRequests are minimal in the end, but some due diligence around this could be warranted.
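To make Option 1 a bit more concrete, here is a minimal sketch of what the ResolutionRequest reconciler could record once a request reaches a terminal state, assuming a duration measure like the hypothetical resolutionRequestDuration sketched earlier has been registered; the resolver-kind label key used below is also an assumption:

import (
	"context"

	"go.opencensus.io/stats"
	"go.opencensus.io/tag"
	"knative.dev/pkg/apis"

	v1beta1 "github.com/tektoncd/pipeline/pkg/apis/resolution/v1beta1"
)

// recordResolutionRequestDuration is a hypothetical helper the ResolutionRequest
// reconciler could call: elapsed time is measured from the object's creation to
// the last transition of its Succeeded condition.
func recordResolutionRequestDuration(ctx context.Context, rr *v1beta1.ResolutionRequest) {
	cond := rr.Status.GetCondition(apis.ConditionSucceeded)
	if cond == nil || cond.IsUnknown() {
		return // still resolving; nothing to record yet
	}
	elapsed := cond.LastTransitionTime.Inner.Time.Sub(rr.CreationTimestamp.Time)
	// The label key carrying the resolver kind is assumed here.
	ctx, _ = tag.New(ctx, tag.Insert(resolverKindTag, rr.Labels["resolution.tekton.dev/type"]))
	stats.Record(ctx, resolutionRequestDuration.M(elapsed.Seconds()))
}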

Option 2

pros - leverages the existing metrics integration in pipelinerun/taskrun; adds an "in-line" metric that can be called to observe the condition change

Either
a) follow the existing pattern for other "in-line" metrics and add a metric observation in the task run status method MarkResourceOngoing and the pipeline run status method MarkRunning
or
b) introduce a new pattern where we add a new k8s Informer().AddEventHandler(...) in pipelinerun/controller.go and taskrun/controller.go that adds a filter method giving us access to both the previous and current objects of a given k8s informer event. In that function, we can observe the condition change prior to any machinations in the Reconciler paths (a minimal sketch follows below).

In both cases, you take the transition time when the condition is set to running minus the transition time when it was set to resolving the ref.
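As a rough illustration of option (b) and of that subtraction, the handler below is only a sketch: it assumes the v1 TaskRun injection informer used in taskrun/controller.go and a hypothetical observeResolutionDuration recorder; the PipelineRun side would look analogous with PipelineRunReasonResolvingPipelineRef.

import (
	"context"
	"time"

	"k8s.io/client-go/tools/cache"
	"knative.dev/pkg/apis"

	v1 "github.com/tektoncd/pipeline/pkg/apis/pipeline/v1"
	taskruninformer "github.com/tektoncd/pipeline/pkg/client/injection/informers/pipeline/v1/taskrun"
)

// observeResolutionDuration is a hypothetical recorder, e.g. wrapping
// stats.Record on a duration measure like the one sketched earlier.
func observeResolutionDuration(ctx context.Context, d time.Duration) { /* ... */ }

func setupResolutionDurationHandler(ctx context.Context) {
	taskruninformer.Get(ctx).Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) {
			oldTR, okOld := oldObj.(*v1.TaskRun)
			newTR, okNew := newObj.(*v1.TaskRun)
			if !okOld || !okNew {
				return
			}
			oldCond := oldTR.Status.GetCondition(apis.ConditionSucceeded)
			newCond := newTR.Status.GetCondition(apis.ConditionSucceeded)
			if oldCond == nil || newCond == nil {
				return
			}
			// The old object was still waiting on resolution and the new one has
			// moved on, so the difference in transition times is the time spent
			// resolving the ref.
			if oldCond.Reason == string(v1.TaskRunReasonResolvingTaskRef) &&
				newCond.Reason != oldCond.Reason {
				d := newCond.LastTransitionTime.Inner.Time.Sub(oldCond.LastTransitionTime.Inner.Time)
				observeResolutionDuration(ctx, d)
			}
		},
	})
}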

NOTE: because of Tekton's use of controller.NewRequeueAfter(waitTime), there can be multiple calls to the MarkResourceOngoing or MarkRunning methods while the reason field remains waiting on resolution. However, knative already has logic that does not change the last transition time if the condition is otherwise unchanged, so we can depend on the last transition time for our metric.

// SetCondition sets or updates the Condition on Conditions for Condition.Type.
// If there is an update, Conditions are stored back sorted.
func (r conditionsImpl) SetCondition(cond Condition) {
	if r.accessor == nil {
		return
	}
	t := cond.Type
	var conditions Conditions
	for _, c := range r.accessor.GetConditions() {
		if c.Type != t {
			conditions = append(conditions, c)
		} else {
			// If we'd only update the LastTransitionTime, then return.
			cond.LastTransitionTime = c.LastTransitionTime
			if reflect.DeepEqual(cond, c) {
				return
			}
		}
	}
	// ... remainder of SetCondition elided ...
}

pros a)

  • follows existing patterns

cons a)

  • dependent on never diverging from the use of MarkResourceOngoing or MarkRunning

pros b)

  • ensures we'll properly capture the metric regardless of how the reason field is changed or which method is used

cons b)

  • establishes a new pattern for collecting metrics
  • leverages a lower-level k8s API for the metric, which may not sit well with everyone
  • possibly other "conventions" not occurring to me off the top of my head

What did I miss?

see comments below :-)
