Disentangle metric reporting from the actual reconciler. #2762
Conversation
This PR cannot be merged: expecting exactly one kind/ label
The following is the coverage report on the affected files.
/kind bug
SGTM 👼
/cc @hrishin @afrittoli
// NewRecorder creates a new metrics recorder instance
// to log the PipelineRun related metrics
func NewRecorder() (*Recorder, error) {
	r := &Recorder{
		initialized: true,

		// Default to 30s intervals.
		ReportingPeriod: 30 * time.Second,
It could be nice if we kept this configurable? 🤔
It's a public field, so it can be changed after the Recorder is created, but I see that this is just consumed by NewController without really having a place to hook into. I'm happy to do whatever is conventional for Tekton here.
I'm not sure either (I was away for some time). It would be better if @afrittoli @vdemeester comment more on it.
In my opinion, we can keep it as it is; if required in the future, we can provide the relevant configuration option :)
However, maybe it's nice to document it, so users are aware of the 30-second window for getting updated metrics.
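For concreteness, here is a minimal sketch (not code from this PR) of what tuning the period could look like for a caller. It assumes the Recorder, NewRecorder, and the public ReportingPeriod field from the diff above; startReportingLoop and ctx are hypothetical placeholders for however the periodic reporting goroutine actually gets started.

```go
// Hedged sketch: Recorder, NewRecorder, and ReportingPeriod come from the
// diff above; startReportingLoop is a made-up stand-in, not Tekton's API.
r, err := NewRecorder()
if err != nil {
	return nil, err
}
// ReportingPeriod is a public field, so it can be overridden after
// construction but before the reporting loop starts.
r.ReportingPeriod = 10 * time.Second
go startReportingLoop(ctx, r)
```

If a config-based knob is ever needed, it could simply set this field before the loop is started.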
Overall /LGTM, thank you very much @mattmoor for the findings 🙇. I didn't realize this could happen. I was under the impression that client-go reconciles the state quite effectively by updating the lister's cache before firing the reconcile event, but I missed the resync part.
However, I wonder about two things 🤔
- Could processing O(n) resources every time result in unwanted side effects for the controller deployment? (You're right that, deep down, the metrics implementation may need some redesigning.)
- If resync could be a problem, isn't it a general reconciliation problem? (Forgive me if I'm missing some point.)
O(n) isn't a change from before, but the way this is written now ensures that only one scan is happening at a time. Suppose "n" is huge and it takes 10 minutes to scan the lister cache. This approach would just take 10m30s to expose the metric. The previous approach would leak goroutines every time we update a single resource's status. So in addition to being O(n) in runtime, it also might leak O(n) goroutines[1] if all n pipelines are actively running.
[1] - I'm assuming there are roughly a constant number of status updates, but this could be much higher given that there is no deduplication by key.
It is. For global resyncs we jitter to avoid starving real work for a period of time. The default resync period is also every 10 hours to mitigate this, but that seems overly coarse for metrics resolution to be useful.
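To make the "only one scan at a time" point concrete, here is an illustrative-only sketch of that shape (placeholder names, not this PR's actual code): a single goroutine walks the cached state once per tick, so a slow O(n) scan merely delays the metric instead of accumulating goroutines.

```go
package metricsdemo

import (
	"context"
	"time"
)

// PipelineRunInfo is a stand-in for the real PipelineRun type; only the bit
// this sketch needs (whether the run has finished) is modeled.
type PipelineRunInfo struct{ Done bool }

// reportLoop performs one scan per tick. The next tick isn't handled until
// the current iteration returns, so scans never overlap and no goroutines
// pile up, even when listing is slow.
func reportLoop(ctx context.Context, period time.Duration,
	list func() []PipelineRunInfo, report func(running int)) {
	ticker := time.NewTicker(period)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			running := 0
			for _, pr := range list() {
				if !pr.Done {
					running++
				}
			}
			report(running)
		}
	}
}
```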
/retest
Thank you @mattmoor for the explanation.
O(n) holds true; sorry, to rephrase it, I meant every time vs. only when something changes. :)
True, since there was no serialization in place (frankly, I was thinking of serializing it during the implementation using a queue, but later felt that would be over-engineering), it could create stale metrics. However, the current PR approach deals better with that case. [2]
On a different note, I'm still not able to get comment [2] and am a bit curious; would you like to elaborate on it?
However, I'm under the impression that, either through the list or the watch, the updated copy of the resource is placed into the cache first, and then the informer is notified about the updated resource, which triggers reconciliation. Thus the lister in the reconciler refers to the same copy from the cache, which is up to date with the update. Is that right?
Yes, the informer model uses lists and watches, and simulates events by observing changes on those lists through the watch. The premise of using the informer cache for reads during reconciliation is that the data is never a stale read of the change that motivated the reconciliation (because the event was generated by an update to the informer cache). It could still be stale if subsequent updates came in, but this is why "in-flight" work isn't deduplicated with "queued" work (e.g. even if a key is being processed it can be duplicated in the work queue). However, in the case at HEAD, the metrics logic is keying off of
The sequence of events where this doesn't work is:
In the former case we get really lucky (informers are fast, but not "faster than a local goroutine" fast). There used to be a bug that meant we didn't have to get lucky, but that is fixed at HEAD. Two other notable aspects to this:
Fundamentally, what's at HEAD is "edge triggered" (keying off of deltas), vs. "level based" (changes trigger a more holistic reconsideration). It's not that "level based" is the only correct way to do things, and it can seem wasteful to reconsider everything when we think we know exactly what changed, but it can be much more forgiving of errors and dropped events. I had some slides on this that it seems I never shared publicly, and they are locked up at Google somewhere 😞
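A toy illustration of that distinction (made-up code, not Tekton's):

```go
// Edge-triggered: adjust a gauge per observed delta. A single dropped or
// duplicated event leaves the gauge permanently wrong until something
// corrects it.
func onTransition(runningGauge *int, wasRunning, isRunning bool) {
	switch {
	case !wasRunning && isRunning:
		*runningGauge++
	case wasRunning && !isRunning:
		*runningGauge--
	}
}

// Level-based: periodically recompute the gauge from the full observed
// state. Anything missed earlier is corrected on the next pass.
func recomputeRunning(done []bool) (running int) {
	for _, d := range done {
		if !d {
			running++
		}
	}
	return running
}
```

The periodic Recorder in this PR is the level-based flavor: even if an individual status update is missed, the gauge converges on the next tick.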
Thanks for this! It looks like a move in the right direction.
The more logic we manage to take out of the core reconciler the better, as it's becoming harder to maintain otherwise.
/approve
/lgtm
/meow
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: afrittoli, hrishin, vdemeester.
@mattmoor I'm afraid this will need a rebase :)
Force-pushed from 2f44504 to 79c0664 (the full commit message is quoted at the end of the conversation).
Rebased.
(Will need a fresh /lgtm.) I'll work on rebasing the others now as well.
The following is the coverage report on the affected files.
/lgtm
This changes the metric reporting to happen periodically vs. being
triggered by reconciliation. The previous method was prone to stale
data because it was driven by the informer cache immediately following
writes through the client, and might not fix itself since the status
may not change (even on resync every 10h). With this change it should
exhibit the correct value within 30s + {informer delay}, where the 30s
is configurable on the Recorders.
Fixes: #2729