fix(controller): handle race when starting metrics server. Fixes #10807 #13731
Conversation
…er. Fixes argoproj#10807

While investigating the flaky `MetricsSuite/TestMetricsEndpoint` test that's been failing periodically for a while now, I noticed this in the controller logs ([example](https://github.com/argoproj/argo-workflows/actions/runs/11221357877/job/31191811077)):

```
controller: time="2024-10-07T18:22:14.793Z" level=info msg="Starting dummy metrics server at localhost:9090/metrics"
server: time="2024-10-07T18:22:14.793Z" level=info msg="Creating event controller" asyncDispatch=false operationQueueSize=16 workerCount=4
server: time="2024-10-07T18:22:14.800Z" level=info msg="GRPC Server Max Message Size, MaxGRPCMessageSize, is set" GRPC_MESSAGE_SIZE=104857600
server: time="2024-10-07T18:22:14.800Z" level=info msg="Argo Server started successfully on http://localhost:2746" url="http://localhost:2746"
controller: I1007 18:22:14.800947 25045 leaderelection.go:260] successfully acquired lease argo/workflow-controller
controller: time="2024-10-07T18:22:14.801Z" level=info msg="new leader" leader=local
controller: time="2024-10-07T18:22:14.801Z" level=info msg="Generating Self Signed TLS Certificates for Telemetry Servers"
controller: time="2024-10-07T18:22:14.802Z" level=info msg="Starting prometheus metrics server at localhost:9090/metrics"
controller: panic: listen tcp :9090: bind: address already in use
controller:
controller: goroutine 37 [running]:
controller: github.com/argoproj/argo-workflows/v3/util/telemetry.(*Metrics).RunPrometheusServer.func2()
controller: /home/runner/work/argo-workflows/argo-workflows/util/telemetry/exporter_prometheus.go:94 +0x16a
controller: created by github.com/argoproj/argo-workflows/v3/util/telemetry.(*Metrics).RunPrometheusServer in goroutine 36
controller: /home/runner/work/argo-workflows/argo-workflows/util/telemetry/exporter_prometheus.go:91 +0x53c
2024/10/07 18:22:14 controller: process exited 25045: exit status 2
controller: exit status 2
2024/10/07 18:22:14 controller: backing off 4s
```

I believe this is a race condition introduced in argoproj#11295. Here's the sequence of events that trigger this:

1. Controller starts
2. Dummy metrics server started on port 9090
3. Leader election takes place and controller starts leading
4. Context for dummy metrics server cancelled
5. Metrics server shuts down
6. Prometheus metrics server started on 9090

The problem is that steps 5 and 6 can happen out of order, because the shutdown happens after the context is cancelled. Per the docs, "a CancelFunc does not wait for the work to stop" (https://pkg.go.dev/context#CancelFunc). The controller needs to explicitly wait for the dummy metrics server to shut down properly before starting the Prometheus metrics server. There are many ways of doing that, and this uses a `WaitGroup`, as that's the simplest approach I could think of.

Signed-off-by: Mason Malone <651224+MasonM@users.noreply.github.com>
some ideas below
LGTM, thanks for investigating the test flake and adding all the details here!
I am still curious why it started happening more frequently (per above), though, in case you already uncovered that. If not, I wouldn't spend more time on it, since this is a root-cause fix anyway.
Thank you for fixing this. I am not sure why #13265 made this worse, but I suspect it did. Whatever is happening under the hood between the otel and Prometheus Go libraries will be radically different. I would probably prefer a fix where the `WaitGroup` is encapsulated on the metrics side, so the tests exercise the same code path as the actual running controller - a redesign of the interface for switching Prometheus metrics on and off. This is good for now though, so thank you. I was going to revisit shutdown of telemetry at some point, because it becomes more fun when you're trying to ensure traces have all been dispatched before you terminate - so it might come out of that.
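A rough sketch of what the suggested encapsulation could look like (entirely hypothetical names, invented here for illustration; nothing below is in this PR or the argo-workflows codebase), where the wait lives behind the metrics interface so production code and tests go through the same switch-over path:

```go
// Package metricsdemo sketches a hypothetical interface for toggling between
// a dummy metrics server and the real Prometheus exporter, with the
// shutdown/wait handled internally rather than by the caller.
package metricsdemo

import "context"

// PrometheusSwitcher is a made-up name for an object that owns both servers.
type PrometheusSwitcher interface {
	// StartDummy serves placeholder metrics until EnablePrometheus is called.
	StartDummy(ctx context.Context) error
	// EnablePrometheus shuts the dummy server down, waits until its port is
	// released, and only then starts the real Prometheus exporter on the
	// same address.
	EnablePrometheus(ctx context.Context) error
	// Shutdown stops whichever server is running and blocks until it exits.
	Shutdown(ctx context.Context) error
}
```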
Partial fix for #10807
Motivation
While investigating the flaky `MetricsSuite/TestMetricsEndpoint` test, I noticed `listen tcp :9090: bind: address already in use` errors in the controller logs (example). The controller gracefully handles this by automatically restarting, but it definitely shouldn't be happening. I believe this is a race condition introduced in #11295. Here's the sequence of events that trigger this:
1. Controller starts
2. Dummy metrics server started on port 9090 (`cmd/workflow-controller/main.go` line 139 in 07703ab)
3. Leader election takes place and controller starts leading (`cmd/workflow-controller/main.go` lines 141 to 151 in 07703ab)
4. Context for dummy metrics server cancelled (`cmd/workflow-controller/main.go` line 152 in 07703ab)
5. Metrics server shuts down (`util/telemetry/exporter_prometheus.go` lines 106 to 112 in 07703ab)
6. Prometheus metrics server started on 9090 (`cmd/workflow-controller/main.go` line 154 in 07703ab)
The problem is that steps 5 and 6 can happen out of order, because the shutdown happens after the context is cancelled. Per the docs, "a CancelFunc does not wait for the work to stop" (https://pkg.go.dev/context#CancelFunc).
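To make the quoted semantics concrete, here is a tiny, generic Go example (not argo-workflows code; the names and timings are made up) showing that `cancel()` returns immediately while cleanup triggered by `ctx.Done()` may still be running:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())

	go func() {
		<-ctx.Done()
		time.Sleep(100 * time.Millisecond) // stand-in for shutdown work, e.g. releasing a port
		fmt.Println("cleanup finished")
	}()

	cancel()
	fmt.Println("cancel returned") // almost always prints before "cleanup finished"

	time.Sleep(200 * time.Millisecond) // give the goroutine time to finish for the demo
}
```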
Modifications
The controller needs to explicitly wait for the dummy metrics server to shut down properly before starting the Prometheus metrics server. There are many ways of doing that, and this uses a `WaitGroup`, as that's the simplest approach I could think of.
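A minimal sketch of that pattern (hypothetical names, not the actual diff; per the permalinks above, the real code lives in `cmd/workflow-controller/main.go` and `util/telemetry/exporter_prometheus.go`): the goroutine that shuts the dummy server down calls `wg.Done()` only after `Shutdown` returns, and the caller calls `wg.Wait()` before binding port 9090 again.

```go
package main

import (
	"context"
	"net/http"
	"sync"
)

// runDummyMetricsServer is a hypothetical stand-in for the controller's dummy
// metrics server. It serves on :9090 until ctx is cancelled and signals wg
// once the port has actually been released.
func runDummyMetricsServer(ctx context.Context, wg *sync.WaitGroup) {
	srv := &http.Server{Addr: ":9090"}
	wg.Add(1)
	go func() {
		defer wg.Done() // signal only after Shutdown has returned
		<-ctx.Done()
		_ = srv.Shutdown(context.Background())
	}()
	go func() { _ = srv.ListenAndServe() }()
}

func main() {
	dummyCtx, cancel := context.WithCancel(context.Background())
	var wg sync.WaitGroup
	runDummyMetricsServer(dummyCtx, &wg)

	// ... leader election would happen here ...

	cancel()  // stop the dummy server (step 4)
	wg.Wait() // block until :9090 is free (step 5), avoiding the race with step 6

	prometheusSrv := &http.Server{Addr: ":9090"}
	_ = prometheusSrv.ListenAndServe() // now safe to bind the same port (step 6)
}
```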
Verification
Ran tests locally