Fix indefinite memory and CPU consumption when waiting fleet to be ready #5034
This pull request does not have a backport label. Could you fix it @AndersonQ? 🙏
Force-pushed from 027dc1c to 7ff0626.
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)
Looks good and the test covers the fix. Nice!
What does this PR do?
Fixes the wait for Fleet Server to be ready
Why is it important?
While waiting for Fleet Server to become ready, the Elastic Agent does not honour the timeout.
Currently, when the timeout is reached, the operation is not interrupted: the goroutine waiting for Fleet Server gets stuck in an infinite loop with no delay between iterations, continually printing a log line like:
{"log.level":"info","@timestamp":"2024-07-02T13:18:59.354Z","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":812},"message":"Waiting for Elastic Agent to start: rpc error: code = Canceled desc = context canceled","ecs.version":"1.6.0"}
This causes a spike in memory and CPU consumption until the agent is killed by the OS, potentially jeopardising the normal operation of the host.
Checklist
[ ] I have made corresponding change to the default configuration files
[ ] I have added tests that prove my fix is effective or that my feature works
[ ] I have added an entry in ./changelog/fragments using the changelog tool
[ ] I have added an integration test or an E2E test
Disruptive User Impact
How to test this PR locally
Try to reproduce #5033; with this fix, the issue should no longer occur.
Related issues