Fix startup with failing configuration #26126
Conversation
Pinging @elastic/agent (Team:Agent)
💚 Build Succeeded. Flaky test report: tests succeeded.
@@ -192,16 +196,32 @@ func (a *Application) watch(ctx context.Context, p app.Taggable, proc *process.Info) {
		case ps := <-a.waitProc(proc.Process):
			procState = ps
		case <-a.bgContext.Done():
			a.Stop()
			return
Just curious: is there any connection between a.bgContext and ctx? Does ctx have to be cancelled when a.bgContext is cancelled?
bgContext is like a cancellation token passed from the top of the app and cancelled on exit or unenroll, so the agent cleans up and backs up everything in a nice manner. We want to avoid just shutting the agent down without any cleanup, as that may turn out to be problematic.
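A minimal sketch of that pattern, assuming hypothetical names (runAgent, the signal wiring) rather than the actual elastic-agent code: a single context is created at the top of the process, handed down to every component, and cancelled on exit so everything below gets a chance to clean up.

```go
package main

import (
	"context"
	"os"
	"os/signal"
)

// runAgent receives bgCtx, the top-level "cancellation token": it is cancelled
// on exit (or, in the agent's case, on unenroll) so components can clean up
// and back up state instead of being shut down abruptly.
func runAgent(bgCtx context.Context) {
	<-bgCtx.Done()
	// graceful cleanup would happen here
}

func main() {
	bgCtx, cancel := context.WithCancel(context.Background())
	defer cancel()

	// Cancel the background context on SIGINT; an unenroll handler would
	// call the same cancel function.
	sig := make(chan os.Signal, 1)
	signal.Notify(sig, os.Interrupt)
	go func() {
		<-sig
		cancel()
	}()

	runAgent(bgCtx)
}
```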
			return
		}

		a.appLock.Lock()
		if a.state.ProcessInfo != proc {
			// kill original process if possible
			if proc != nil && proc.Process != nil {
				_ = proc.Process.Kill()
shall we wait or check for other errors?
this is best effort
		// was already stopped by Stop, do not restart
		if a.state.Status == state.Stopped {
			return
Are we missing an a.appLock.Unlock() here?
true
Checking the complete function body, we always use Unlock on each exit path. Better to add the defer unlock right after the lock, in order to reduce the chance of introducing new deadlocks in the future.
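A small self-contained illustration of that suggestion, using a made-up app type and watchExit method rather than the real Application.watch code: the unlock is deferred immediately after the lock, so every early return releases the mutex.

```go
package main

import (
	"fmt"
	"sync"
)

type app struct {
	mu      sync.Mutex
	stopped bool
}

// watchExit takes the lock and defers the unlock in the very next statement,
// so the early return below (and any exit path added later) cannot leave the
// mutex held and deadlock other callers.
func (a *app) watchExit() {
	a.mu.Lock()
	defer a.mu.Unlock()

	if a.stopped {
		return // already stopped; the defer still unlocks
	}

	// restart logic would go here
	fmt.Println("scheduling restart")
}

func main() {
	a := &app{}
	a.watchExit()
	a.stopped = true
	a.watchExit()
}
```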
This pull request is now in conflicts. Could you fix it? 🙏
		}

		// send stop signal to request stop
		proc.Process.Signal(os.Interrupt)
Does this work on Windows? I thought process.Info had a per-OS implementation; that is why it was added. Should we just move this logic into the Stop() function of process.Info?
You're right, Windows does not implement sending an interrupt. Using the Stop function instead.
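For reference, a standalone sketch of the Windows limitation and a best-effort fallback. The stopProcess helper here is hypothetical and only approximates the idea behind the project's own Stop function; the underlying fact is that Process.Signal(os.Interrupt) is not implemented on Windows in Go.

```go
package main

import (
	"log"
	"os"
	"os/exec"
	"runtime"
)

// stopProcess asks a child process to stop. On POSIX systems os.Interrupt
// (SIGINT) gives it a chance to shut down gracefully; on Windows sending
// os.Interrupt via Process.Signal is not implemented, so fall back to Kill.
func stopProcess(p *os.Process) error {
	if runtime.GOOS == "windows" {
		return p.Kill()
	}
	return p.Signal(os.Interrupt)
}

func main() {
	// "sleep" is assumed to exist on the host (POSIX); substitute another
	// long-running command when trying this on Windows.
	cmd := exec.Command("sleep", "60")
	if err := cmd.Start(); err != nil {
		log.Fatal(err)
	}
	if err := stopProcess(cmd.Process); err != nil {
		log.Println("stop failed:", err)
	}
	_ = cmd.Wait()
}
```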
With the latest change to using Stop this looks good.
Not sure if the 7.13.2 label helps, but I would definitely like to see this backported to 7.13.x - @michalpristas
Fix startup with failing configuration (elastic#26126)
This reverts commit 5a294a4.
* master: (26 commits)
  Report total and free CPU for vSphere virtual machines (elastic#26167)
  [filebeat] Add preserve_original_event option to o365audit input (elastic#26273)
  Change xml processor names in script processor to match convention (elastic#26263)
  [Oracle] Fixing default values for paths in config template (elastic#26276)
  Add more ECS fields to logs (elastic#25998)
  [Heartbeat] Fix broken invocation of synth package (elastic#26228)
  rename sqs file name (elastic#26227)
  Populate the agent action result if there is no matching action handlers (elastic#26152)
  Add ISO8601 as supported timestamp type (elastic#25564)
  Move Filebeat azure module to GA (elastic#26168)
  Filebeat azure module pipeline fixes and changes (elastic#26148)
  libbeat: monitor version (elastic#26214)
  Add new parser to filestream input: container (elastic#26115)
  [Metricbeat] Add state_statefulset replicas.ready (elastic#26088)
  Disable test processors system test for windows 10 (elastic#26216)
  Fix startup with failing configuration (elastic#26126)
  Remove 32 bits version of Elastic Agent. (elastic#25708)
  Chane fleetmode detection to ony use management.enabled (elastic#26180)
  Make `filestream` input GA (elastic#26127)
  libbeat/idxmgmt/ilm: fix alias creation (elastic#26146)
  ...
What does this PR do?
What is going on is a bit of a weird race. Basically we start OK, but trouble comes on restart.

On restart we apply the stored config to start MB (mb1) with a failing configuration. MB does not exit, it just reports a Failed state because it cannot apply system/load on Windows. At this point a 10 second recovery timer is started, after which we plan a restart.

Then we pull config from Fleet and decide MB should be started. We start MB mb2 while mb1 is still running. We start it because the check only looks for terminal statuses and Failed is one of them.

mb2 starts and reports Running, and the timer for killing mb1 is stopped, because "MB is running" and we do not differentiate between processes. mb2 then fails on a data.path conflict with mb1.

The watcher detects a stopped Metricbeat and mb3 is started; mb1 is still running, and we no longer have any information about mb1 anywhere. mb3 fails on the same thing mb2 failed on and exits. mb1 is still running in a Failed state and we keep trying mbX over and over again.

This PR adds some closers to the watcher, some checks on Failure termination, and passes proc around so the watcher and terminator kill the process they were designed to kill (a sketch of that idea follows below).

While this fix works, I don't like the whole approach where we handle start/stop/restart from 4-5 places that can conflict with each other. I would like to see this redesigned. But I've spent 5 days chasing this and need some social distancing from the topic, so this fix is OK for the moment.
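A condensed sketch of the "kill the process you were designed to kill" idea described above, using made-up procInfo and supervisor types rather than the real Application/watcher code: each watcher remembers the exact process it was started for and only acts on that one, so a stale watcher cannot restart or kill a newer instance.

```go
package main

import (
	"fmt"
	"sync"
)

type procInfo struct{ pid int }

type supervisor struct {
	mu      sync.Mutex
	current *procInfo // the process the supervisor currently believes is active
}

// watch is handed the exact process it is responsible for. If a newer process
// has replaced it by the time it exits, the stale watcher backs off instead of
// blindly scheduling another restart and leaking the old instance.
func (s *supervisor) watch(proc *procInfo) {
	s.mu.Lock()
	defer s.mu.Unlock()

	if s.current != proc {
		fmt.Printf("pid %d is stale, not restarting\n", proc.pid)
		return
	}
	fmt.Printf("pid %d exited, scheduling restart\n", proc.pid)
}

func main() {
	s := &supervisor{}
	mb1 := &procInfo{pid: 100}
	s.current = mb1

	mb2 := &procInfo{pid: 200} // a second start replaces the current process
	s.current = mb2

	s.watch(mb1) // stale watcher: must not touch mb2
	s.watch(mb2) // active watcher: restart path
}
```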
How to test:
How it behaved before the fix: we had 3 Metricbeats, 2 with a stable PID (one for monitoring, one is mb1), and a third MB process that kept changing PID (crash-restart loop).

Why is it important?
Fixes #25829
Checklist
CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.