Feature request
A new metric that tracks the time needed for any ResolutionRequest created by the TaskRun reconciler, PipelineRun reconciler, etc. to complete and provide the resolved content, such as the Task or Pipeline resource.
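For concreteness, here is a minimal sketch of one shape this metric could take, using the OpenCensus stats/view APIs our existing metrics are built on. The metric name, tag key, and bucket bounds below are placeholders, not a settled design:

```go
package resolutionmetrics

import (
	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
	"go.opencensus.io/tag"
)

var (
	// resolutionDuration would record how long a ResolutionRequest took to
	// provide its resolved content (a Task, Pipeline, etc.).
	resolutionDuration = stats.Float64(
		"resolutionrequest_duration_milliseconds",
		"Time from ResolutionRequest creation until the resolved content is available",
		stats.UnitMilliseconds,
	)

	// resolverKey would let operators slice the duration per resolver
	// (bundles, cluster, hub, git, ...).
	resolverKey = tag.MustNewKey("resolver")

	// resolutionDurationView exposes the measure as a distribution; the
	// bucket bounds here are arbitrary placeholders.
	resolutionDurationView = &view.View{
		Measure:     resolutionDuration,
		Description: resolutionDuration.Description(),
		Aggregation: view.Distribution(100, 500, 1000, 5000, 10000, 30000, 60000),
		TagKeys:     []tag.Key{resolverKey},
	}
)
```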
Use case
As the administrator of a Tekton deployment, I need to understand the amount of time being spent retrieving Tekton resources from the remote locations supported by the various Tekton Resolvers (bundles, cluster, hub, git, etc.).
Being able to track this time allows for
understanding the time cost of storing Tasks, Pipelines, etc. in remote locations versus embedding them directly in TaskRuns or PipelineRuns (or at least creating those items directly in the local cluster), so we can make informed decisions that weigh that cost against typical execution times and against the benefits unrelated to "time spent" that remote storage provides
tracking over time whether the time spent increases beyond historical norms, to assess whether some inefficiency or problem has been introduced into the Tekton deployment's environment
Background
When I provided the taskrun/pipelinerun gauge metrics around resolving their respective tasks/pipelines, I discussed with @lbernick and @khrm that while those counts help with overall monitoring of the resolver portion of our Tekton deployment, we were certainly interested in time spent as well. However, the way to implement the "time spent" metric is not as singular and clear cut as the count/gauge metric, which built upon the existing collection pattern for running metrics. I'm opening this feature issue to dive into the details as to why, and to list possible implementation options for dealing with the trickiness of it all.
How do we discern when a resolution request is active?
Option 1)
Add metrics to the ResolutionRequest reconciler
Option 2)
Track the time during which the TaskRunReasonResolvingTaskRef and PipelineRunReasonResolvingPipelineRef reasons are active in the ConditionSucceeded condition of a given *Run; that is, compare the transition time when those reasons are set against the transition time when they are changed to something else (e.g. Running).
Some more pros/cons with each option
Option 1
pro - tracking time spent based on the initial creation / completion of a ResolutionRequest, or tracking the time needed to hydrate a Task or Pipeline from an existing ResolutionRequest, is straightforward
cons - Currently the ResolutionRequest reconciler has no metrics integration. This feature could be the motivator for starting that up. However, there is a non-trivial amount of overhead in ramping up metrics integration (the ConfigMap based configuration, for example, and making sure metric definitions get passed down to the Knative controller logic, among others).
Also, tracking at the ResolutionRequest layer potentially does not capture the cost of any k8s API server round trip when the TaskRun or PipelineRun reconcilers create/fetch ResolutionRequests. I have not had a chance to dive into the Knative client code to confirm or deny how differing underlying caching clients within the same process leverage watch caches, so maybe the k8s gets for ResolutionRequests are minimal in the end, but some due diligence around this is warranted.
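To make the creation/completion bookends concrete, here is a minimal sketch of how Option 1 could compute the duration, working only against the generic Knative duck-typed status that ResolutionRequest embeds. resolutionDuration is a hypothetical helper, not existing Tekton code:

```go
package resolutionmetrics

import (
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"knative.dev/pkg/apis"
	duckv1 "knative.dev/pkg/apis/duck/v1"
)

// resolutionDuration returns how long a ResolutionRequest took to resolve,
// measured from its creation to the last transition of its Succeeded
// condition. The boolean is false while the request is still in flight.
func resolutionDuration(created metav1.Time, status *duckv1.Status) (time.Duration, bool) {
	cond := status.GetCondition(apis.ConditionSucceeded)
	if cond == nil || cond.Status == corev1.ConditionUnknown {
		return 0, false // no terminal condition yet; still resolving
	}
	return cond.LastTransitionTime.Inner.Sub(created.Time), true
}
```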
Option 2
pros - leverages the existing metrics integration in pipelinerun/taskrun; adds an "in-line" metric that can be called when the condition changes
Either
a) follow the existing pattern for other "in-line" metrics and add a metric observation in the task run status method MarkResourceOngoing and the pipeline run status method MarkRunning
or
b) introduce a new pattern where we add a new k8s Informer().AddEventHandler(...) in pipelinerun/controller.go and taskrun/controller.go with a handler that gives us access to both the previous and current objects of a given k8s informer event. In that function, we can observe the condition change prior to any machinations in the Reconciler paths.
In both, you take the transition time of when the condition was set to running minus the transition time of when it was set to resolving the ref (see the sketch after the Knative excerpt below).
NOTE: because of Tekton's use of controller.NewRequeueAfter(waitTime), MarkResourceOngoing or MarkRunning can be called multiple times while the reason field remains waiting on resolution; however, Knative already has logic that does not change the last transition time if the condition is otherwise unchanged, so we can depend on the last transition time for our metric:
```go
// SetCondition sets or updates the Condition on Conditions for Condition.Type.
// If there is an update, Conditions are stored back sorted.
func (r conditionsImpl) SetCondition(cond Condition) {
	if r.accessor == nil {
		return
	}
	t := cond.Type
	var conditions Conditions
	for _, c := range r.accessor.GetConditions() {
		if c.Type != t {
			conditions = append(conditions, c)
		} else {
			// If we'd only update the LastTransitionTime, then return.
			cond.LastTransitionTime = c.LastTransitionTime
			if reflect.DeepEqual(cond, c) {
				return
			}
		}
	}
	// (the remainder of the upstream function stamps a new LastTransitionTime,
	// appends the condition, and stores the Conditions back sorted)
}
```
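Here is the sketch referenced above: a hypothetical informer update handler for TaskRuns (option b) that observes the resolution wait by comparing the old and new Succeeded conditions. The observeResolutionWait helper and the exact wiring into taskrun/controller.go are placeholders, and the v1beta1 package path is an assumption; only the client-go ResourceEventHandlerFuncs pattern and the Knative condition accessors are existing APIs:

```go
package taskrunmetrics

import (
	"time"

	"github.com/tektoncd/pipeline/pkg/apis/pipeline/v1beta1"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/cache"
	"knative.dev/pkg/apis"
)

// observeResolutionWait is a stand-in for the real stats observation call.
func observeResolutionWait(d time.Duration) { /* stats.Record(...) would go here */ }

// ResolutionWaitHandler observes how long a TaskRun sat in ResolvingTaskRef by
// comparing the old and new Succeeded conditions on every informer update.
func ResolutionWaitHandler() cache.ResourceEventHandler {
	return cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) {
			oldTR, okOld := oldObj.(*v1beta1.TaskRun)
			newTR, okNew := newObj.(*v1beta1.TaskRun)
			if !okOld || !okNew {
				return
			}
			oldCond := oldTR.Status.GetCondition(apis.ConditionSucceeded)
			newCond := newTR.Status.GetCondition(apis.ConditionSucceeded)
			if oldCond == nil || newCond == nil {
				return
			}
			// The wait ends when the reason moves off ResolvingTaskRef (e.g. to
			// Running), so take the difference of the two LastTransitionTimes.
			if oldCond.Status == corev1.ConditionUnknown &&
				oldCond.Reason == string(v1beta1.TaskRunReasonResolvingTaskRef) &&
				newCond.Reason != string(v1beta1.TaskRunReasonResolvingTaskRef) {
				observeResolutionWait(newCond.LastTransitionTime.Inner.Sub(oldCond.LastTransitionTime.Inner.Time))
			}
		},
	}
}
```

Something like this could then be registered from taskrun/controller.go (and the PipelineRun analog from pipelinerun/controller.go) with an AddEventHandler call on the respective informer.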
pros a)
follow existing patterns
cons a)
dependent on never diverging from use of MarkResourceOngoing or MarkRunning
pros b)
ensures we'll properly capture the metric regardless of how the reason field is changed or which method is used
cons b)
establishes a new pattern for collecting metrics
leverages a lower-level k8s API for the metric, which may not sit well with existing conventions
possibly other "conventions" not occurring to me off the top of my head
What did I miss?
see comments below :-)