Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add metrics for sync result #1911

Merged
merged 3 commits into from
Feb 25, 2023

Conversation

sawsa307
Copy link
Contributor

Created sync metrics collector to collect sync related metrics. Added metrics to collect the cumulative count of sync results. Specified metrics emit interval in flags.go.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 24, 2023
@k8s-ci-robot
Copy link
Contributor

Hi @sawsa307. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jan 24, 2023
@k8s-ci-robot k8s-ci-robot requested review from aojea and bowei January 24, 2023 20:04
@sawsa307
Copy link
Contributor Author

/assign @swetharepakula

@sawsa307 sawsa307 force-pushed the neg-sync-result-metric branch 7 times, most recently from 2b4b322 to 16db295 Compare January 26, 2023 20:02
@swetharepakula
Copy link
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 28, 2023
@sawsa307 sawsa307 force-pushed the neg-sync-result-metric branch 3 times, most recently from d2af05f to b36169a Compare January 30, 2023 17:51
}

// UpdateSyncer updates the count of sync results based on the result/error of sync
func (im *SyncerMetrics) UpdateSyncer(key negtypes.NegSyncerKey, err error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe sm or just m?

if err == nil {
im.countSinceLastExport[Success] += 1
} else {
syncErr := errors.Unwrap(err).(syncError)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we verify that the error is of type syncError?

Or can we guarantee that this is the type of error always?

im.mu.Lock()
defer im.mu.Unlock()
if im.countSinceLastExport == nil {
klog.Fatalf("Syncer Metrics failed to initialize correctly, countSinceLastExport: %v", im.countSinceLastExport)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fatal will cause the binary to exit. should we just log the error and initialize here?

@@ -0,0 +1,86 @@
/*
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these errors should be in the syncer code that will return this errors.

@@ -0,0 +1,86 @@
/*
Copyright 2020 The Kubernetes Authors.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: 2023

@@ -256,7 +257,8 @@ L7 load balancing. CSV values accepted. Example: -node-port-ranges=80,8080,400-5
flag.BoolVar(&F.EnableL4ILBDualStack, "enable-l4ilb-dual-stack", false, "Enable Dual-Stack handling for L4 Internal Load Balancers")
flag.BoolVar(&F.EnableMultipleIGs, "enable-multiple-igs", false, "Enable using multiple unmanaged instance groups")
flag.IntVar(&F.MaxIGSize, "max-ig-size", 1000, "Max number of instances in Instance Group")
flag.DurationVar(&F.MetricsExportInterval, "metrics-export-interval", 10*time.Minute, `Period for calculating and exporting metrics related to state of managed objects.`)
flag.DurationVar(&F.UsageMetricsExportInterval, "usage-metrics-export-interval", 10*time.Minute, `Period for calculating and exporting metrics related to state of managed objects.`)
flag.DurationVar(&F.SyncMetricsExportInterval, "sync-metrics-export-interval", 5*time.Second, `Period for calculating and exporting metrics related to state of syncer.`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe neg-metrics-export-interval

In the description, add that these are neg controller internal metrics, not usage.

cmd/glbc/main.go Outdated
@@ -329,6 +329,7 @@ func runControllers(ctx *ingctx.ControllerContext) {
ctx.SvcNegInformer,
ctx.HasSynced,
ctx.ControllerMetrics,
ctx.SyncerMetrics,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since moving to the controller, we don't nee to add to context.

@@ -217,7 +217,7 @@ func (s *transactionSyncer) syncInternalImpl() error {

currentMap, err := retrieveExistingZoneNetworkEndpointMap(s.NegSyncerKey.NegName, s.zoneGetter, s.cloud, s.NegSyncerKey.GetAPIVersion(), s.endpointsCalculator.Mode())
if err != nil {
return err
return fmt.Errorf("%w, reason: %v", metrics.ErrCurrentEPNotFound, err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This slightly obscures what is the error we are returning. Typically we wrap the error we received instead of wrapping the context we are returning.

Implement a result instead where you can provide the reason. Look at what was done in L4lb:

type L4ILBSyncResult struct {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated. Thanks!

// NewNEGMetricsCollector initializes SyncerMetrics and starts a go routine to compute and export metrics periodically.
func NewNegMetricsCollector(exportInterval time.Duration) *SyncerMetrics {
return &SyncerMetrics{
countSinceLastExport: map[syncError]int{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is a counter, we don't need to store this state. This will be necessary only when we have gauge metrics.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated. Thanks!

@sawsa307 sawsa307 force-pushed the neg-sync-result-metric branch from b36169a to bb394ad Compare January 31, 2023 18:29
@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 31, 2023
@sawsa307 sawsa307 force-pushed the neg-sync-result-metric branch from bb394ad to 3b2b1c7 Compare January 31, 2023 18:36
@k8s-ci-robot k8s-ci-robot requested a review from thockin February 14, 2023 01:06
@sawsa307 sawsa307 force-pushed the neg-sync-result-metric branch 2 times, most recently from b14be83 to 4d3b024 Compare February 16, 2023 05:37
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 17, 2023
@sawsa307 sawsa307 force-pushed the neg-sync-result-metric branch from 4d3b024 to f5b9857 Compare February 17, 2023 09:29
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 17, 2023
}
s.logEndpoints(addEndpoints, "adding endpoint")
s.logEndpoints(removeEndpoints, "removing endpoint")

return s.syncNetworkEndpoints(addEndpoints, removeEndpoints)
err = s.syncNetworkEndpoints(addEndpoints, removeEndpoints)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we just make syncNetworkEndpoints return the sync result?

@sawsa307 sawsa307 force-pushed the neg-sync-result-metric branch 2 times, most recently from 48bc2e8 to b64514b Compare February 24, 2023 00:51
//
// 1. The endpoint count from endpointData doesn't equal to the one from endpointPodMap:
// endpiontPodMap removes the duplicated endpoints, and dupCount stores the number of duplicated it removed
// and we compare the endpoint counts with duplicates
// 2. The endpoint count from endpointData or the one from endpointPodMap is 0
func (s *transactionSyncer) isValidEndpointInfo(eds []negtypes.EndpointsData, endpointPodMap negtypes.EndpointPodMap, dupCount int) (bool, string) {
func (s *transactionSyncer) checkValidEndpointInfo(eds []negtypes.EndpointsData, endpointPodMap negtypes.EndpointPodMap, dupCount int) (bool, *negtypes.NegSyncResult) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitL checkEndpointInfo or validateEndpointInfo

if errors.Is(err, ErrEPMissingNodeName) {
// checkValidEPField checks if endpoints have valid field
// It return the result boolean with the corresponding reason
func (s *transactionSyncer) checkValidEPField(err error) (bool, *negtypes.NegSyncResult) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: checkEPField or validateEPField

Comment on lines 520 to 527
errorState := s.errorState
if errorState == negtypes.ResultInvalidAPIResponse {
syncResult = negtypes.NewNegSyncResult(negtypes.ErrInvalidAPIResponse, errorState)
}
if errorState == negtypes.ResultInvalidEPAttach {
syncResult = negtypes.NewNegSyncResult(negtypes.ErrInvalidEPAttach, errorState)
}
if errorState == negtypes.ResultInvalidEPDetach {
syncResult = negtypes.NewNegSyncResult(negtypes.ErrInvalidEPDetach, errorState)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can use a switch and make the default case do the last one

@sawsa307 sawsa307 force-pushed the neg-sync-result-metric branch from b64514b to 72c4296 Compare February 24, 2023 22:36
defer s.syncLock.Unlock()

var syncResult *negtypes.NegSyncResult
if s.inErrorState() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need to check for error state here? We just need the syncResult, so there should be a nil case and the detach case added to the switch.

@sawsa307 sawsa307 force-pushed the neg-sync-result-metric branch from 72c4296 to 9933f59 Compare February 24, 2023 23:12
if got, reason := transactionSyncer.isValidEndpointInfo(tc.endpointsData, tc.endpointPodMap, tc.dupCount); got != tc.expect && reason != tc.expectedReason {
t.Errorf("invalidEndpointInfo() = %t, expected %t", got, tc.expect)
if got, result := transactionSyncer.CheckEndpointInfo(tc.endpointsData, tc.endpointPodMap, tc.dupCount); got != tc.expect || result.Result != tc.expectedResult.Result || !errors.Is(result.Error, tc.expectedResult.Error) {
t.Errorf("CheckEndpointInfo() = %t|%s|%t, expected %t|%s|%t", got, result.Result, result.Error, tc.expect, tc.expectedResult.Result, tc.expectedResult.Error)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should do got then wanted or expected based on go style guide.

Break it up a little more so that it is easier to understand the test and debug in the future:

got, result := transactionSyncer.CheckEndpointInfo(tc.endpointsData, tc.endpointPodMap, tc.dupCount)
if got != tc.expect {
    t.Errorf(...)
}
if result.Result != tc.expectedResult {
    t.Errorf(...)
}

same for the other test.

@sawsa307 sawsa307 force-pushed the neg-sync-result-metric branch from 9933f59 to dd0242e Compare February 24, 2023 23:32
Added metrics to collect the cumulative count of sync results.
@sawsa307 sawsa307 force-pushed the neg-sync-result-metric branch from dd0242e to ff4751f Compare February 24, 2023 23:58
Copy link
Member

@swetharepakula swetharepakula left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 25, 2023
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sawsa307, swetharepakula

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 25, 2023
@k8s-ci-robot k8s-ci-robot merged commit 88f329d into kubernetes:master Feb 25, 2023
sawsa307 added a commit to sawsa307/ingress-gce that referenced this pull request Feb 28, 2023
sawsa307 added a commit to sawsa307/ingress-gce that referenced this pull request Feb 28, 2023
sawsa307 added a commit to sawsa307/ingress-gce that referenced this pull request Feb 28, 2023
@sawsa307 sawsa307 deleted the neg-sync-result-metric branch April 3, 2023 16:48
code-elinka pushed a commit to akhmanova/ingress-gce that referenced this pull request Apr 18, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants