-
Notifications
You must be signed in to change notification settings - Fork 306
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add syncer in error state metrics #2156
Add syncer in error state metrics #2156
Conversation
Hi @sawsa307. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/assign @swetharepakula |
pkg/neg/metrics/syncer_metrics.go
Outdated
syncerInErrorState = prometheus.NewGaugeVec( | ||
prometheus.GaugeOpts{ | ||
Subsystem: negControllerSubsystem, | ||
Name: "syncer_in_error_tate", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo
By the way, if the metric name is still up for discussion, I'd like to suggest we improve the name to plural form (since this is counting something)
"syncers_in_error_state"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated. Thanks!
pkg/neg/metrics/syncer_metrics.go
Outdated
@@ -64,6 +67,15 @@ var ( | |||
}, | |||
[]string{"state"}, | |||
) | |||
|
|||
syncerInErrorState = prometheus.NewGaugeVec( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there some context on why we didn't want to use the other metric which ALSO counts the number of syncers?
syncerState
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In syncerState
, the state of a syncer is dependent on the result of last sync, so if a syncer is in error-state, its state will be success after degraded mode intervention. It means the syncer has successfully added endpoints to its NEG, which indicates our degraded mode procedures are working as we expected.
For syncerInErrorState
, we want to track if a syncer is consistently in error-state, and this gives us an indication of whether there is bug in dependent system(Arcus API, kubelet, EPS controller and etc.)
/assign @gauravkghildiyal |
a63097b
to
c5562ee
Compare
@@ -65,6 +69,16 @@ var ( | |||
[]string{"state"}, | |||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we add an error state label key and add it to this metric? It also matches how you are tracking the information.
I agree with Gaurav, it is confusing to have two similar metrics
c5562ee
to
149530c
Compare
149530c
to
f2d8231
Compare
/ok-to-test |
/retest |
1 similar comment
/retest |
@@ -62,7 +62,7 @@ var ( | |||
Name: "syncer_state", | |||
Help: "Current count of syncers in each state", | |||
}, | |||
[]string{"state"}, | |||
[]string{"state", "in_error_state"}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lets change this to degraded mode
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
func PublishSyncerStateMetrics(stateCount syncerStateCount) { | ||
for state, count := range stateCount { | ||
if state.inErrorState { | ||
syncerState.WithLabelValues(string(state.state), "inErrorState").Set(float64(count)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change this to DegradedModeEnabled and DegradedModeDisabled
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed offline. Now we would have inErrorState as label, "true" and "false" as values.
syncerStateMap map[negtypes.NegSyncerKey]negtypes.Reason | ||
// syncerErrorStateMap tracks if each syncer is in error-state | ||
syncerErrorStateMap map[negtypes.NegSyncerKey]bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
combine these into a single map. And instead of storing just the reason or the bool store it with the stateWithErrorState struct
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated. Thanks!
type stateWithErrorState struct { | ||
state negtypes.Reason | ||
inErrorState bool | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rename the struct to syncerState, and possibly change the state field to lastSyncResult?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated. Thanks!
f2d8231
to
d21a4f3
Compare
// syncerState tracks the count of syncer in different states | ||
syncerState = prometheus.NewGaugeVec( | ||
// SyncerCountBySyncResult tracks the count of syncer in different states | ||
SyncerCountBySyncResult = prometheus.NewGaugeVec( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We already have a SyncerCountByEndpointType metrics. Add suffix for clarification.
Add syncer in error state metrics
d21a4f3
to
dbd55aa
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: sawsa307, swetharepakula The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/retest |
1 similar comment
/retest |
Add syncer in error state metrics