Add syncer in error state metrics #2156

Merged

Conversation

sawsa307
Contributor

Add syncer in error state metrics

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label May 30, 2023
@k8s-ci-robot
Contributor

Hi @sawsa307. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label May 30, 2023
@sawsa307
Contributor Author

/assign @swetharepakula

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label May 30, 2023
syncerInErrorState = prometheus.NewGaugeVec(
prometheus.GaugeOpts{
Subsystem: negControllerSubsystem,
Name: "syncer_in_error_tate",
Member

typo

By the way, if the metric name is still up for discussion, I'd like to suggest we improve the name to plural form (since this is counting something)

"syncers_in_error_state"

Contributor Author

Updated. Thanks!

@@ -64,6 +67,15 @@ var (
},
[]string{"state"},
)

syncerInErrorState = prometheus.NewGaugeVec(
Member

Is there some context on why we didn't want to use the other metric which ALSO counts the number of syncers?

syncerState

Contributor Author

@sawsa307 sawsa307 May 30, 2023

In syncerState, a syncer's state depends on the result of its last sync, so a syncer that is in error state will report success after degraded mode intervenes. That means the syncer has successfully added endpoints to its NEG, which indicates our degraded mode procedures are working as expected.
For syncerInErrorState, we want to track whether a syncer is consistently in error state, which gives us an indication of whether there is a bug in a dependent system (Arcus API, kubelet, EPS controller, etc.).
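
To make the distinction concrete, here is a minimal sketch, not the code in this PR. The `syncerStatus` type, the `recordSync` helper, and the result strings are hypothetical names chosen for illustration. It assumes the last sync result is overwritten on every sync while the error-state flag stays set once raised.

```go
package main

import "fmt"

// syncerStatus is a hypothetical type used only for this illustration.
type syncerStatus struct {
	lastSyncResult string // overwritten by every sync
	inErrorState   bool   // assumed to stay set once the syncer enters error state
}

// recordSync stores the latest sync result and, if that sync put the syncer
// into error state, raises the sticky flag. How the flag is cleared is out of
// scope for this sketch.
func (s *syncerStatus) recordSync(result string, enteredErrorState bool) {
	s.lastSyncResult = result
	if enteredErrorState {
		s.inErrorState = true
	}
}

func main() {
	var s syncerStatus
	s.recordSync("error", true)    // a sync fails and the syncer enters error state
	s.recordSync("success", false) // degraded mode lets the next sync succeed
	fmt.Printf("lastSyncResult=%s inErrorState=%t\n", s.lastSyncResult, s.inErrorState)
}
```

After the second sync the last result reads as success while the flag is still true, which is the situation a metric keyed only on the last sync result cannot show.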

@sawsa307 sawsa307 requested a review from gauravkghildiyal May 30, 2023 19:08
@sawsa307
Contributor Author

/assign @gauravkghildiyal

@sawsa307 sawsa307 force-pushed the add-in-error-state-metrics branch 2 times, most recently from a63097b to c5562ee Compare May 30, 2023 22:54
@@ -65,6 +69,16 @@ var (
[]string{"state"},
)
Member

Should we add an error-state label key to this metric instead? It also matches how you are tracking the information.

I agree with Gaurav; it is confusing to have two similar metrics.

@sawsa307 sawsa307 force-pushed the add-in-error-state-metrics branch from c5562ee to 149530c Compare May 30, 2023 23:53
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels May 30, 2023
@sawsa307 sawsa307 requested a review from swetharepakula May 30, 2023 23:57
@sawsa307 sawsa307 force-pushed the add-in-error-state-metrics branch from 149530c to f2d8231 Compare May 31, 2023 00:08
@swetharepakula
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 31, 2023
@sawsa307
Contributor Author

/retest

1 similar comment
@sawsa307
Contributor Author

/retest

@@ -62,7 +62,7 @@ var (
 Name: "syncer_state",
 Help: "Current count of syncers in each state",
 },
-[]string{"state"},
+[]string{"state", "in_error_state"},
Member

Let's change this to degraded mode.

Contributor Author

func PublishSyncerStateMetrics(stateCount syncerStateCount) {
for state, count := range stateCount {
if state.inErrorState {
syncerState.WithLabelValues(string(state.state), "inErrorState").Set(float64(count))
Member

Change this to DegradedModeEnabled and DegradedModeDisabled

Contributor Author

Discussed offline. We will now have inErrorState as the label, with "true" and "false" as its values.
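
A minimal sketch of what that labeling could look like, under stated assumptions: a plain map stands in for the Prometheus gauge vector so the example runs with the standard library alone, `negtypes.Reason` is replaced by `string`, and `labelValues` and `publish` are hypothetical helper names.

```go
package main

import (
	"fmt"
	"strconv"
)

// stateWithErrorState mirrors the struct under review, with string standing
// in for negtypes.Reason (an assumption made to keep the sketch self-contained).
type stateWithErrorState struct {
	state        string
	inErrorState bool
}

// labelValues returns the values for the "state" and "in_error_state" labels.
// strconv.FormatBool yields exactly "true" or "false".
func labelValues(s stateWithErrorState) (state, inErrorState string) {
	return s.state, strconv.FormatBool(s.inErrorState)
}

// publish writes one sample per (state, in_error_state) pair into a plain map
// that stands in for the gauge vector.
func publish(counts map[stateWithErrorState]int) map[[2]string]float64 {
	samples := make(map[[2]string]float64, len(counts))
	for s, n := range counts {
		state, inErr := labelValues(s)
		samples[[2]string{state, inErr}] = float64(n)
	}
	return samples
}

func main() {
	samples := publish(map[stateWithErrorState]int{
		{state: "success", inErrorState: true}:  2,
		{state: "success", inErrorState: false}: 5,
	})
	fmt.Println(samples[[2]string{"success", "true"}], samples[[2]string{"success", "false"}])
}
```

With the real gauge vector, the map write would presumably become a `WithLabelValues(state, inErr).Set(float64(n))` call, as in the snippet under review.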

Comment on lines 64 to 66
syncerStateMap map[negtypes.NegSyncerKey]negtypes.Reason
// syncerErrorStateMap tracks if each syncer is in error-state
syncerErrorStateMap map[negtypes.NegSyncerKey]bool
Member

Combine these into a single map. Instead of storing just the reason or the bool, store the stateWithErrorState struct as the value.

Contributor Author

Updated. Thanks!

Comment on lines 144 to 147
type stateWithErrorState struct {
state negtypes.Reason
inErrorState bool
}
Member

Rename the struct to syncerState, and possibly change the state field to lastSyncResult?

Contributor Author

Updated. Thanks!
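
For readers following the thread, a rough sketch of the combined shape suggested above, not the merged code. `syncerKey` and `reason` are hypothetical stand-ins for `negtypes.NegSyncerKey` and `negtypes.Reason`, and `countByState` is an illustrative helper name.

```go
package main

import "fmt"

// syncerKey and reason are hypothetical stand-ins for negtypes.NegSyncerKey
// and negtypes.Reason.
type syncerKey string
type reason string

// syncerState follows the rename suggested in the review.
type syncerState struct {
	lastSyncResult reason
	inErrorState   bool
}

// countByState folds the single per-syncer map into a count per distinct
// state. A struct whose fields are all comparable can be used as a map key.
func countByState(syncers map[syncerKey]syncerState) map[syncerState]int {
	counts := make(map[syncerState]int)
	for _, st := range syncers {
		counts[st]++
	}
	return counts
}

func main() {
	// One map replaces the separate reason map and bool map.
	syncers := map[syncerKey]syncerState{
		"neg-a": {lastSyncResult: "success", inErrorState: true},
		"neg-b": {lastSyncResult: "success", inErrorState: true},
		"neg-c": {lastSyncResult: "success", inErrorState: false},
	}
	counts := countByState(syncers)
	fmt.Println(counts[syncerState{"success", true}], counts[syncerState{"success", false}])
}
```

Keeping both fields in one value means the two pieces of state cannot drift apart for a given syncer, and the resulting count map lines up directly with the metric's two labels.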

@sawsa307 sawsa307 force-pushed the add-in-error-state-metrics branch from f2d8231 to d21a4f3 Compare May 31, 2023 23:42
-// syncerState tracks the count of syncer in different states
-syncerState = prometheus.NewGaugeVec(
+// SyncerCountBySyncResult tracks the count of syncer in different states
+SyncerCountBySyncResult = prometheus.NewGaugeVec(
Contributor Author

@sawsa307 sawsa307 May 31, 2023

We already have a SyncerCountByEndpointType metric, so I added a suffix here for clarity.

Add syncer in error state metrics
@sawsa307 sawsa307 force-pushed the add-in-error-state-metrics branch from d21a4f3 to dbd55aa Compare May 31, 2023 23:49
Member

@swetharepakula swetharepakula left a comment

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 31, 2023
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sawsa307, swetharepakula

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 31, 2023
@sawsa307
Contributor Author

sawsa307 commented Jun 1, 2023

/retest

1 similar comment
@sawsa307
Contributor Author

sawsa307 commented Jun 1, 2023

/retest

@k8s-ci-robot k8s-ci-robot merged commit 6b516dd into kubernetes:master Jun 1, 2023
@sawsa307 sawsa307 deleted the add-in-error-state-metrics branch September 2, 2023 20:42
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

4 participants