fix(controller): handle race when starting metrics server. Fixes #10807 #13731
Conversation
…er. Fixes argoproj#10807

While investigating the flaky `MetricsSuite/TestMetricsEndpoint` test that's been failing periodically for a while now, I noticed this in the controller logs ([example](https://github.com/argoproj/argo-workflows/actions/runs/11221357877/job/31191811077)):

```
controller: time="2024-10-07T18:22:14.793Z" level=info msg="Starting dummy metrics server at localhost:9090/metrics"
server: time="2024-10-07T18:22:14.793Z" level=info msg="Creating event controller" asyncDispatch=false operationQueueSize=16 workerCount=4
server: time="2024-10-07T18:22:14.800Z" level=info msg="GRPC Server Max Message Size, MaxGRPCMessageSize, is set" GRPC_MESSAGE_SIZE=104857600
server: time="2024-10-07T18:22:14.800Z" level=info msg="Argo Server started successfully on http://localhost:2746" url="http://localhost:2746"
controller: I1007 18:22:14.800947 25045 leaderelection.go:260] successfully acquired lease argo/workflow-controller
controller: time="2024-10-07T18:22:14.801Z" level=info msg="new leader" leader=local
controller: time="2024-10-07T18:22:14.801Z" level=info msg="Generating Self Signed TLS Certificates for Telemetry Servers"
controller: time="2024-10-07T18:22:14.802Z" level=info msg="Starting prometheus metrics server at localhost:9090/metrics"
controller: panic: listen tcp :9090: bind: address already in use
controller:
controller: goroutine 37 [running]:
controller: github.com/argoproj/argo-workflows/v3/util/telemetry.(*Metrics).RunPrometheusServer.func2()
controller: /home/runner/work/argo-workflows/argo-workflows/util/telemetry/exporter_prometheus.go:94 +0x16a
controller: created by github.com/argoproj/argo-workflows/v3/util/telemetry.(*Metrics).RunPrometheusServer in goroutine 36
controller: /home/runner/work/argo-workflows/argo-workflows/util/telemetry/exporter_prometheus.go:91 +0x53c
2024/10/07 18:22:14 controller: process exited 25045: exit status 2
controller: exit status 2
2024/10/07 18:22:14 controller: backing off 4s
```

I believe this is a race condition introduced in argoproj#11295. Here's the sequence of events that trigger this:

1. Controller starts
2. Dummy metrics server started on port 9090
3. Leader election takes place and controller starts leading
4. Context for dummy metrics server cancelled
5. Metrics server shuts down
6. Prometheus metrics server started on 9090

The problem is that steps 5 and 6 can happen out of order, because the shutdown happens after the context is cancelled. Per the docs, "a CancelFunc does not wait for the work to stop" (https://pkg.go.dev/context#CancelFunc). The controller needs to explicitly wait for the dummy metrics server to shut down properly before starting the Prometheus metrics server. There are many ways of doing that, and this uses a `WaitGroup`, as that's the simplest approach I could think of.

Signed-off-by: Mason Malone <651224+MasonM@users.noreply.github.com>
some ideas below
LGTM, thanks for investigating the test flake and adding all the details here!
I am still curious why it started happening more frequently (per above), though, in case you already uncovered that. If not, I wouldn't spend more time on it, since this is a root-cause fix anyway.
Thank you for fixing this. I am not sure why #13265 made this worse, but I suspect it did. Whatever is happening under the hood between the otel and Prometheus Go libraries will be radically different. I would probably prefer a fix where the `WaitGroup` is encapsulated on the metrics side, so the tests exercise the same code path as the actual running controller - a redesign of the interface for switching Prometheus metrics on and off. This is good for now though, so thank you. I was going to revisit shutdown of telemetry at some point, because it becomes more fun when you're trying to ensure traces have all been dispatched before you terminate - so it might come out of that.
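A rough sketch of what the suggested encapsulation could look like (entirely hypothetical names, invented here for illustration; nothing below is in this PR or the argo-workflows codebase), where the wait lives behind the metrics interface so production code and tests go through the same switch-over path:

```go
// Package metricsdemo sketches a hypothetical interface for toggling between
// a dummy metrics server and the real Prometheus exporter, with the
// shutdown/wait handled internally rather than by the caller.
package metricsdemo

import "context"

// PrometheusSwitcher is a made-up name for an object that owns both servers.
type PrometheusSwitcher interface {
	// StartDummy serves placeholder metrics until EnablePrometheus is called.
	StartDummy(ctx context.Context) error
	// EnablePrometheus shuts the dummy server down, waits until its port is
	// released, and only then starts the real Prometheus exporter on the
	// same address.
	EnablePrometheus(ctx context.Context) error
	// Shutdown stops whichever server is running and blocks until it exits.
	Shutdown(ctx context.Context) error
}
```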
Partial fix for #10807
Motivation
While investigating the flaky `MetricsSuite/TestMetricsEndpoint` test, I noticed `listen tcp :9090: bind: address already in use` errors in the controller logs (example). The controller gracefully handles this by automatically restarting, but it definitely shouldn't be happening. I believe this is a race condition introduced in #11295. Here's the sequence of events that trigger this:
1. Controller starts
2. Dummy metrics server started on port 9090 (`cmd/workflow-controller/main.go` line 139 in 07703ab)
3. Leader election takes place and controller starts leading (`cmd/workflow-controller/main.go` lines 141 to 151 in 07703ab)
4. Context for dummy metrics server cancelled (`cmd/workflow-controller/main.go` line 152 in 07703ab)
5. Metrics server shuts down (`util/telemetry/exporter_prometheus.go` lines 106 to 112 in 07703ab)
6. Prometheus metrics server started on 9090 (`cmd/workflow-controller/main.go` line 154 in 07703ab)
The problem is that steps 5 and 6 can happen out of order, because the shutdown happens after the context is cancelled. Per the docs, "a CancelFunc does not wait for the work to stop" (https://pkg.go.dev/context#CancelFunc).
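To make the quoted semantics concrete, here is a tiny, generic Go example (not argo-workflows code; the names and timings are made up) showing that `cancel()` returns immediately while cleanup triggered by `ctx.Done()` may still be running:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())

	go func() {
		<-ctx.Done()
		time.Sleep(100 * time.Millisecond) // stand-in for shutdown work, e.g. releasing a port
		fmt.Println("cleanup finished")
	}()

	cancel()
	fmt.Println("cancel returned") // almost always prints before "cleanup finished"

	time.Sleep(200 * time.Millisecond) // give the goroutine time to finish for the demo
}
```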
Modifications
The controller needs to explicitly wait for the dummy metrics server to shut down properly before starting the Prometheus metrics server. There are many ways of doing that, and this uses a `WaitGroup`, as that's the simplest approach I could think of.
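A minimal sketch of that pattern (hypothetical names, not the actual diff; per the permalinks above, the real code lives in `cmd/workflow-controller/main.go` and `util/telemetry/exporter_prometheus.go`): the goroutine that shuts the dummy server down calls `wg.Done()` only after `Shutdown` returns, and the caller calls `wg.Wait()` before binding port 9090 again.

```go
package main

import (
	"context"
	"net/http"
	"sync"
)

// runDummyMetricsServer is a hypothetical stand-in for the controller's dummy
// metrics server. It serves on :9090 until ctx is cancelled and signals wg
// once the port has actually been released.
func runDummyMetricsServer(ctx context.Context, wg *sync.WaitGroup) {
	srv := &http.Server{Addr: ":9090"}
	wg.Add(1)
	go func() {
		defer wg.Done() // signal only after Shutdown has returned
		<-ctx.Done()
		_ = srv.Shutdown(context.Background())
	}()
	go func() { _ = srv.ListenAndServe() }()
}

func main() {
	dummyCtx, cancel := context.WithCancel(context.Background())
	var wg sync.WaitGroup
	runDummyMetricsServer(dummyCtx, &wg)

	// ... leader election would happen here ...

	cancel()  // stop the dummy server (step 4)
	wg.Wait() // block until :9090 is free (step 5), avoiding the race with step 6

	prometheusSrv := &http.Server{Addr: ":9090"}
	_ = prometheusSrv.ListenAndServe() // now safe to bind the same port (step 6)
}
```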
Verification
Ran tests locally