
Refactor metrics observe calls to report metrics as soon as possible #456

Merged
merged 4 commits into from
Nov 3, 2020
Conversation

HeavyWombat
Contributor

Based on the suggestion by @SaschaSchwarze0, there is a gap in the metrics reporting. At the moment, the metrics are observed when the BuildRun is done. However, a metric like the BuildRun Established Duration could be reported as soon as the BuildRun is available in the system. Furthermore, in case the BuildRun fails, the metrics for this execution are not reported at all.

Idea for refactoring: move the metrics for build-run-count, build-run-established, and build-run-rampup to a section in Reconcile where the details are already available and where, in theory, the metrics should only be reported once.

Comment on lines 344 to 349
// Report the buildrun established duration (time between the creation of the buildrun and the start of the buildrun)
buildmetrics.BuildRunEstablishObserve(
buildRun.Status.BuildSpec.StrategyRef.Name,
buildRun.Namespace,
buildRun.Status.StartTime.Time.Sub(buildRun.CreationTimestamp.Time),
)
Member

I do not think that this one can work here, because the buildRun does not yet have a StartTime. This is set later, when the TaskRun has started.

Contributor Author

Let me double check.

Contributor Author

I have no idea what I did in my local test with minikube that this did not show up there. Today, I was able to reproduce that this leads to the expected nil pointer de-reference. Therefore, I moved the code section to the place in the code where the required start time is actually set for the first time.

Member

@SaschaSchwarze0 left a comment

Looks good. I think we need to handle nil in one if clause.

@@ -383,35 +393,29 @@ func (r *ReconcileBuildRun) Reconcile(request reconcile.Request) (reconcile.Resu
}

buildRun.Status.LatestTaskRunRef = &lastTaskRun.Name
buildRun.Status.StartTime = lastTaskRun.Status.StartTime

if buildRun.Status.StartTime == nil {
Member

In Tekton's TaskRun, Status.StartTime is defined as *metav1.Time and therefore can be nil. We might only be hitting this case here if a TaskRun fails right away and is never started. If my understanding of the IsZero function is correct, I think we could do:

if buildRun.Status.StartTime == nil && !lastTaskRun.Status.StartTime.IsZero() {

As I do not think that Tekton will ever set a zero timestamp, this will probably work the same way:

if buildRun.Status.StartTime == nil && lastTaskRun.Status.StartTime != nil {

Contributor Author

Good point. I will add the second check.

Order imports using `goimports -w`.
There are a couple of typos in the metrics test package, one even in the
variable name for the known histogram metrics.

Fix typos in the `metrics_test.go` source file.
There are metrics that are observed when the build run is finished, even
though the actual details were available earlier. Therefore, it could
happen that some metrics are *not* reported due to a failing build run
and would be lost. If the metrics were observed right after the creation
of the `TaskRun`, they would be reported in any case and also only
reported once.

Move the `BuildRunCountInc` and `BuildRunRampUpDurationObserve` calls to
right after the `TaskRun` is created, because at that point in time the
required information is already available.

Move `BuildRunEstablishObserve` to the section of the code where the
required time fields are set for the first time.
Since some metrics need to be reported as early as possible, the
respective mocks used in the test cases need to carry start time and
build spec details.

Add a start time to the task run mocks.

Add a build spec to the build run mocks.
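The mock change can be sketched with local stand-in types (the real mocks use the Tekton/Shipwright API types in the controller test suite; names here are illustrative):

```go
package main

import (
	"fmt"
	"time"
)

// Local stand-ins for the Tekton types the real test mocks are built from.
type taskRunStatus struct{ StartTime *time.Time }
type taskRun struct{ Status taskRunStatus }

// newMockTaskRun reflects the commit message above: because metrics are now
// observed early, a mocked TaskRun must already carry a start time, or the
// early observation would hit a nil field.
func newMockTaskRun() *taskRun {
	start := time.Date(2020, 11, 3, 10, 0, 0, 0, time.UTC)
	return &taskRun{Status: taskRunStatus{StartTime: &start}}
}

func main() {
	tr := newMockTaskRun()
	fmt.Println(tr.Status.StartTime != nil) // true: ready for early observation
}
```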
Member

@SaschaSchwarze0 left a comment

/lgtm
/approve

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Nov 3, 2020
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: SaschaSchwarze0

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 3, 2020
@openshift-merge-robot openshift-merge-robot merged commit d34f948 into shipwright-io:master Nov 3, 2020
@HeavyWombat HeavyWombat deleted the refactor/send-metrics branch December 8, 2020 10:17
4 participants