Cherry-pick #26126 to 7.x: Fix startup with failing configuration #26262

michalpristas · 2021-06-11T15:22:23Z

Cherry-pick of PR #17391 to 7.x branch. Original message:

What does this PR do?

What is going on is a bit weird race.
Basically we start ok but troubles come on restart.

On restart we apply stored config to start MB (mb1) with failing configuration,
MB does not exit just reports Failed state because it cannot apply system/load on windows.
At this time 10 second timer to recover is started otherwise we plan a restart.

Then we pull config from fleet and decide MB should be started.
We start MB mb2 while mb1 is still running. We are starting it because check checks for terminal statuses and Failed is one of them.
This mb2 starts reports running, timer for killing mb1 is stopped. Because MB is running and it does not differentiate between processes.
mb2 fails on data.path conflict with mb1

watcher detects stopped metricbeat and mb3 is started, mb1 is still running, we dont have any information about mb1 anymore anywhere.
mb3 fails on same thing mb2 failed and exits.

mb1 still running in Failed state and we try mbX over and over again.

This PR adds some closers to watcher, some checks on Failure termination and passing proc so watcher and terminator are killing the process they were designed to kill.

While this fix works, i dont like the whole approach where we handle start/stop/restart from 4-5 places and they can be conflicting. I would like to see this redesigned. But i've spent 5 days chasing this and need some social distancing from the topic so this fix is OK at the moment for me.

How to test:

Build
Create config with fleet-server, system and endpoint integration
Deploy on windows machine
Wait for everything to settle and restart a lot of times to check if you see 3 metricbeat processes running in Task Manager.

How it behaved before the fix: we had 3 metricbeats, 2 with stable PID (one for monitoring, one is mb1) and then thirds MB process kept changing PID (crash-restart loop)

Why is it important?

Fixes #25829

Checklist

My code follows the style guidelines of this project
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
I have made corresponding change to the default configuration files
I have added tests that prove my fix is effective or that my feature works
I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Fix startup with failing configuration (elastic#26126)

elasticmachine · 2021-06-11T15:22:25Z

Pinging @elastic/agent (Team:Agent)

elasticmachine · 2021-06-11T16:48:30Z

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS

Expand to view the summary

Build stats

Build Cause: Pull request #26262 opened
Start Time: 2021-06-11T15:22:40.900+0000
Duration: 86 min 39 sec
Commit: 4ca5651

Test stats 🧪

Test	Results
Failed	0
Passed	6860
Skipped	24
Total	6884

Trends 🧪

💚 Flaky test report

Tests succeeded.

Expand to view the summary

Test stats 🧪

Test	Results
Failed	0
Passed	6860
Skipped	24
Total	6884

michalpristas added 8 commits April 30, 2021 14:10

Do not use unversioned home with watcher binary path (elastic#25423)

27a606b

Merge branch '7.x' of github.com:elastic/beats into 7.x

a6f1e76

Merge branch '7.x' of github.com:elastic/beats into 7.x

420935c

Merge branch '7.x' of github.com:elastic/beats into 7.x

66a2ae3

Merge branch '7.x' of github.com:elastic/beats into 7.x

a06b06e

Merge branch '7.x' of github.com:elastic/beats into 7.x

2bbc22e

Merge branch '7.x' of github.com:elastic/beats into 7.x

38d759c

Fix startup with failing configuration (elastic#26126)

4ca5651

Fix startup with failing configuration (elastic#26126)

michalpristas added bug backport Team:Elastic-Agent Label for the Agent team labels Jun 11, 2021

michalpristas self-assigned this Jun 11, 2021

botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels Jun 11, 2021

michalpristas merged commit e73b271 into elastic:7.x Jun 11, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cherry-pick #26126 to 7.x: Fix startup with failing configuration #26262

Cherry-pick #26126 to 7.x: Fix startup with failing configuration #26262

michalpristas commented Jun 11, 2021 •

edited by zube bot

Loading

elasticmachine commented Jun 11, 2021

elasticmachine commented Jun 11, 2021

Build stats

Test stats 🧪

Trends 🧪

Test stats 🧪

Cherry-pick #26126 to 7.x: Fix startup with failing configuration #26262

Cherry-pick #26126 to 7.x: Fix startup with failing configuration #26262

Conversation

michalpristas commented Jun 11, 2021 • edited by zube bot Loading

What does this PR do?

Why is it important?

Checklist

elasticmachine commented Jun 11, 2021

elasticmachine commented Jun 11, 2021

💚 Build Succeeded

Build stats

Test stats 🧪

Trends 🧪

💚 Flaky test report

Test stats 🧪

michalpristas commented Jun 11, 2021 •

edited by zube bot

Loading