Track GCE/K8s server error #2097

Merged
merged 1 commit into kubernetes:master from track-gce-k8s-errors
May 23, 2023

Conversation

Contributor

@sawsa307 sawsa307 commented May 2, 2023

Define NegControllerErrorCount, which tracks both internal sync errors and API server errors (GCE/K8s).

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes and needs-ok-to-test labels May 2, 2023
@k8s-ci-robot k8s-ci-robot requested review from aojea and code-elinka May 2, 2023 21:46
@k8s-ci-robot
Contributor

Hi @sawsa307. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/L label May 2, 2023
@sawsa307 sawsa307 force-pushed the track-gce-k8s-errors branch from 62bc4e8 to 98f175b Compare May 2, 2023 21:51
@sawsa307
Contributor Author

sawsa307 commented May 2, 2023

/assign @swetharepakula

@bowei
Member

bowei commented May 3, 2023

/ok-to-test

@k8s-ci-robot k8s-ci-robot added the ok-to-test and needs-rebase labels and removed the needs-ok-to-test label May 3, 2023
@sawsa307 sawsa307 force-pushed the track-gce-k8s-errors branch from 98f175b to 50ee81c Compare May 10, 2023 20:26
@k8s-ci-robot k8s-ci-robot removed the needs-rebase label May 10, 2023
@sawsa307 sawsa307 force-pushed the track-gce-k8s-errors branch 6 times, most recently from 24d82fe to 8acdbf3 Compare May 12, 2023 16:34
Member

@swetharepakula swetharepakula left a comment

Some of these errors are returned at a central point; should we just emit the metric then?

@sawsa307 sawsa307 force-pushed the track-gce-k8s-errors branch from 8acdbf3 to 7deab1f Compare May 13, 2023 00:28
@k8s-ci-robot k8s-ci-robot added the size/M label and removed the size/L label May 13, 2023
@sawsa307 sawsa307 force-pushed the track-gce-k8s-errors branch 3 times, most recently from 55df15e to d3d2a02 Compare May 16, 2023 19:23
@k8s-ci-robot k8s-ci-robot added the size/L label and removed the size/M label May 16, 2023
@sawsa307 sawsa307 force-pushed the track-gce-k8s-errors branch from d3d2a02 to b25fe05 Compare May 16, 2023 20:51
Member

@swetharepakula swetharepakula left a comment

Errors in transaction.go that result in a failed sync should not be counted as ignored errors.

@@ -573,6 +573,7 @@ func (s *transactionSyncer) commitTransaction(err error, networkEndpointMap map[
// This is to prevent if the NEG object is deleted or misconfigured by user
s.needInit = true
needRetry = true
metrics.PublishNegControllerErrorCountMetrics(err, true)
Member

Isn't this error already being counted? Wouldn't it be counted where commitTransaction is called?

Member

Perhaps we should count the metrics at the end of the sync instead?

Contributor Author

@sawsa307 sawsa307 May 17, 2023

> Perhaps we should count the metrics at the end of the sync instead?

We collect the error from the sync in syncer.go, where we call s.core.sync(). Here we are collecting errors from each goroutine spawned by the attach/detach calls.

Member

I was thinking that, instead of operating on an error that is passed in, we could count it where it occurs. However, in this case this is where the error is handled, so I am okay with this.

Member

This shouldn't be counted as ignored. An error here triggers a retry.

Contributor Author

Updated. Thanks!
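
For readers following this thread, the ignored versus non-ignored distinction can be sketched roughly as below. This is only an illustration under stated assumptions, not the PR's implementation: `errorCounter`, its methods, the label strings, and `errCommitFailed` are hypothetical stand-ins, and a plain map replaces the real metrics library. Only the `(err, isIgnored)` argument shape mirrors the `PublishNegControllerErrorCountMetrics(err, true)` call in the hunk above.

```go
package main

import (
	"errors"
	"sync"
)

// Hypothetical label values; the constants in the PR may be named differently.
const (
	ignoredErrorLabel = "ignored_error"
	otherErrorLabel   = "other_error"
)

// errCommitFailed is a made-up error used only for illustration.
var errCommitFailed = errors.New("commit transaction failed")

// errorCounter is a stand-in for a labeled counter metric.
type errorCounter struct {
	mu     sync.Mutex
	counts map[string]int
}

func newErrorCounter() *errorCounter {
	return &errorCounter{counts: make(map[string]int)}
}

// publish records one error under the ignored label when isIgnored is true,
// and under the other label otherwise. A nil error is not counted.
func (c *errorCounter) publish(err error, isIgnored bool) {
	if err == nil {
		return
	}
	label := otherErrorLabel
	if isIgnored {
		label = ignoredErrorLabel
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[label]++
}

// count returns how many errors were recorded under label.
func (c *errorCounter) count(label string) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.counts[label]
}
```

Under this sketch, an error that sets `needRetry = true` would be published with `isIgnored` set to false, which is the change requested in the review comment above.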

@sawsa307
Contributor Author

> The errors in transaction.go that result in a failed sync should not be an ignored error

Understood. So ignored_error is defined as errors that are not handled, is that correct?

@sawsa307 sawsa307 requested a review from swetharepakula May 17, 2023 17:06
@sawsa307 sawsa307 force-pushed the track-gce-k8s-errors branch from b25fe05 to 2a5ba46 Compare May 17, 2023 21:02
Comment on lines 56 to 58
ignoredControllerError = "ignored_controller_error"
otherControllerError = "other_controller_error"
totalNegControllerError = "total_neg_controller_error"
Member

nit: remove "controller" from these names

Contributor Author

Updated. Thanks!

@@ -186,6 +187,7 @@ func (d *Migrator) Continue(err error) {
}

if err != nil {
metrics.PublishNegControllerErrorCountMetrics(err, true)
Member

This error will also be passed into commitTransaction, so it will be tracked there. It does not need to be tracked here.

Contributor Author

Updated. Thanks!

@sawsa307 sawsa307 requested a review from swetharepakula May 22, 2023 17:07
Add NegControllerErrorCount metrics to track the count of all errors
from NEG controller, and counts of server errors from GCE/K8s.
@sawsa307 sawsa307 force-pushed the track-gce-k8s-errors branch from 2a5ba46 to 24283c1 Compare May 22, 2023 17:09
@swetharepakula
Member

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm label May 23, 2023
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sawsa307, swetharepakula

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved label May 23, 2023
@k8s-ci-robot k8s-ci-robot merged commit 23956d1 into kubernetes:master May 23, 2023
}
for {
if apiErr, ok := err.(*googleapi.Error); ok {
return apiErr.Code >= http.StatusInternalServerError
Member

Don't we want to check the range [500, 599] rather than >= 500?

}
for {
if apiErr, ok := err.(*k8serrors.StatusError); ok {
return apiErr.ErrStatus.Code >= http.StatusInternalServerError
Member

Don't we want to check the range [500, 599] rather than >= 500?

@sawsa307 sawsa307 deleted the track-gce-k8s-errors branch September 2, 2023 20:42
Labels
approved, cncf-cla: yes, lgtm, ok-to-test, size/L
4 participants