Disentangle metric reporting from the actual reconciler. #2762
Conversation
This PR cannot be merged: expecting exactly one kind/ label
The following is the coverage report on the affected files.
/kind bug
SGTM 👼
/cc @hrishin @afrittoli
// NewRecorder creates a new metrics recorder instance
// to log the PipelineRun related metrics
func NewRecorder() (*Recorder, error) {
	r := &Recorder{
		initialized: true,

		// Default to 30s intervals.
		ReportingPeriod: 30 * time.Second,
It could be nice if we kept this configurable? 🤔
It's a public field, so it can be changed after the Recorder is created, but I see that this is just consumed by NewController without really having a place to hook into. I'm happy to do whatever is conventional for Tekton here.
I'm not sure either (I was away for some time). It would be better if @afrittoli @vdemeester comment more on it.
In my opinion, we can keep it as it is; if required in the future, we can provide the relevant configuration option :)
However, maybe it's nice to document it, so users are aware of the 30-second window for getting updated metrics.
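For concreteness, here is a minimal sketch (not code from this PR) of what tuning the period could look like for a caller. It assumes the Recorder, NewRecorder, and the public ReportingPeriod field from the diff above; startReportingLoop and ctx are hypothetical placeholders for however the periodic reporting goroutine actually gets started.

```go
// Hedged sketch: Recorder, NewRecorder, and ReportingPeriod come from the
// diff above; startReportingLoop is a made-up stand-in, not Tekton's API.
r, err := NewRecorder()
if err != nil {
	return nil, err
}
// ReportingPeriod is a public field, so it can be overridden after
// construction but before the reporting loop starts.
r.ReportingPeriod = 10 * time.Second
go startReportingLoop(ctx, r)
```

If a config-based knob is ever needed, it could simply set this field before the loop is started.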
Overall /LGTM, thank you very much @mattmoor for the findings 🙇. I didn't realize this could happen. I was under the impression that client-go reconciles the state quite effectively by updating the lister's cache before firing the reconcile event, but I missed the resync part.
However, I wonder about two things 🤔
- Could processing O(n) resources every time result in unwanted side effects for the controller deployment? (You're right that, deep down, the metrics implementation may need some redesigning.)
- If resync could be a problem, isn't it a general reconciliation problem? (Forgive me if I'm missing some point.)
O(n) isn't a change from before, but the way this is written now ensures that only one scan is happening at a time. Suppose "n" is huge and it takes 10 minutes to scan the lister cache. This approach would just take 10m30s to expose the metric. The previous approach would leak goroutines every time we update a single resource's status. So in addition to being O(n) in runtime, it also might leak O(n) goroutines[1] if all n pipelines are actively running.
[1] - I'm assuming there are roughly a constant number of status updates, but this could be much higher given that there is no deduplication by key.
It is. For global resyncs we jitter to avoid starving real work for a period of time. The default resync period is also every 10 hours to mitigate this, but that seems overly coarse for metrics resolution to be useful.
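To make the "only one scan at a time" point concrete, here is an illustrative-only sketch of that shape (placeholder names, not this PR's actual code): a single goroutine walks the cached state once per tick, so a slow O(n) scan merely delays the metric instead of accumulating goroutines.

```go
package metricsdemo

import (
	"context"
	"time"
)

// PipelineRunInfo is a stand-in for the real PipelineRun type; only the bit
// this sketch needs (whether the run has finished) is modeled.
type PipelineRunInfo struct{ Done bool }

// reportLoop performs one scan per tick. The next tick isn't handled until
// the current iteration returns, so scans never overlap and no goroutines
// pile up, even when listing is slow.
func reportLoop(ctx context.Context, period time.Duration,
	list func() []PipelineRunInfo, report func(running int)) {
	ticker := time.NewTicker(period)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			running := 0
			for _, pr := range list() {
				if !pr.Done {
					running++
				}
			}
			report(running)
		}
	}
}
```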
/retest
Thank you @mattmoor for the explanation.
O(n) holds true; sorry, to rephrase it, I meant every time vs. only when something changes. :)
True, since there was no serialization in place (frankly, I was thinking of serializing it during the implementation using a queue, but later felt that would be over-engineering), it could create stale metrics. However, the current PR approach deals better with that case. [2]
On a different note, I'm still not able to get comment [2] and am a bit curious; would you like to elaborate on it?
However, I'm under the impression that, either through the list or the watch, the updated copy of the resource is placed into the cache first, and then the informer is notified about the updated resource, which triggers reconciliation. Thus the lister in the reconciler refers to the same copy from the cache, which is up to date with the update. Is that right?
Yes, the informer model uses lists and watches, and simulates events by observing changes on those lists through the watch. The premise of using the informer cache for reads during reconciliation is that the data is never a stale read of the change that motivated the reconciliation (because the event was generated by an update to the informer cache). It could still be stale if subsequent updates came in, but this is why "in-flight" work isn't deduplicated with "queued" work (e.g. even if a key is being processed it can be duplicated in the work queue). However, in the case at HEAD, the metrics logic is keying off of
The sequence of events where this doesn't work is:
In the former case we get really lucky (informers are fast, but not "faster than a local goroutine" fast). There used to be a bug that meant we didn't have to get lucky, but that is fixed at HEAD. Two other notable aspects to this:
Fundamentally, what's at HEAD is "edge triggered" (keying off of deltas), vs. "level based" (changes trigger a more holistic reconsideration). It's not that "level based" is the only correct way to do things, and it can seem wasteful to reconsider everything when we think we know exactly what changed, but it can be much more forgiving of errors and dropped events. I had some slides on this that it seems I never shared publicly, and they are locked up at Google somewhere 😞
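A toy illustration of that distinction (made-up code, not Tekton's):

```go
// Edge-triggered: adjust a gauge per observed delta. A single dropped or
// duplicated event leaves the gauge permanently wrong until something
// corrects it.
func onTransition(runningGauge *int, wasRunning, isRunning bool) {
	switch {
	case !wasRunning && isRunning:
		*runningGauge++
	case wasRunning && !isRunning:
		*runningGauge--
	}
}

// Level-based: periodically recompute the gauge from the full observed
// state. Anything missed earlier is corrected on the next pass.
func recomputeRunning(done []bool) (running int) {
	for _, d := range done {
		if !d {
			running++
		}
	}
	return running
}
```

The periodic Recorder in this PR is the level-based flavor: even if an individual status update is missed, the gauge converges on the next tick.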
Thanks for this! It looks like a move in the right direction.
The more logic we manage to take out of the core reconciler the better, as it's becoming harder to maintain otherwise.
/approve
/lgtm
/meow
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: afrittoli, hrishin, vdemeester.
@mattmoor I'm afraid this will need a rebase :)
Force-pushed from 2f44504 to 79c0664 (the full commit message is quoted at the end of the conversation).
Rebased.
(Will need a fresh /lgtm.) I'll work on rebasing the others now as well.
The following is the coverage report on the affected files.
/lgtm
This changes the metric reporting to happen periodically vs. being
triggered by reconciliation. The previous method was prone to stale
data because it was driven by the informer cache immediately following
writes through the client, and might not fix itself since the status
may not change (even on resync every 10h). With this change it should
exhibit the correct value within 30s + {informer delay}, where the 30s
is configurable on the Recorders.
Fixes: #2729