Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Start/stop components in a more synchronised manner #2226

Merged
merged 4 commits into from
Feb 7, 2023

Conversation

michalpristas
Copy link
Contributor

@michalpristas michalpristas commented Feb 3, 2023

What does this PR do?

Two things being fixed here:

First one is order of start/stop.
Currently we start new components before stopping old. This is usually fine unless component is a service (start/stop is slower).
When new service is started old one can be still running and using resources needed for new service to start.
We can than see something like bind port already in use.

Second thing fixed here is waiting for uninstall to finish.
Again problem is when we change output and we try to uninstall component service-output-one and start service-output-two
We kick off uninstall of service-output-one and we actually don't wait for operation to finish. Then we start service-output-two
This may be fine as service-output-one may already be stopped but uninstall may be still in progress and this uninstall will remove service even though it's needed for service-output-two

Why is it important?

Service failures when

  • we swap outputs
  • we upgrade agent.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool

@michalpristas michalpristas added bug Something isn't working Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team Team:Elastic-Agent Label for the Agent team backport-v8.6.0 Automated backport with mergify labels Feb 3, 2023
@michalpristas michalpristas requested a review from a team as a code owner February 3, 2023 15:42
@michalpristas michalpristas self-assigned this Feb 3, 2023
@michalpristas michalpristas requested review from blakerouse and pchila and removed request for a team February 3, 2023 15:42
@elasticmachine
Copy link
Contributor

elasticmachine commented Feb 3, 2023

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Start Time: 2023-02-07T06:48:15.892+0000

  • Duration: 19 min 42 sec

Test stats 🧪

Test Results
Failed 0
Passed 4893
Skipped 13
Total 4906

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

Expand to view the GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages.

  • run integration tests : Run the Elastic Agent Integration tests.

  • run end-to-end tests : Generate the packages and run the E2E Tests.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@elasticmachine
Copy link
Contributor

elasticmachine commented Feb 3, 2023

🌐 Coverage report

Name Metrics % (covered/total) Diff
Packages 98.333% (59/60) 👍
Files 69.082% (143/207) 👍
Classes 68.514% (272/397) 👍
Methods 53.802% (835/1552) 👍 0.03
Lines 39.031% (9107/23333) 👎 -0.034
Conditionals 100.0% (0/0) 💚

// stop is async, wait for operation to finish,
// otherwise new instance may be started and components
// may fight for resources (e.g ports, files, locks)
m.waitForStopped(existing)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpicking here: can't we stop and wait for stopped services in parallel using goroutines and a WaitGroup? This would allow us to be a bit faster instead of shutting down one component at a time... Maybe this can be a further optimization down the line ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this should absolutely happen in parallel.

Copy link
Contributor

@blakerouse blakerouse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should handle the stopped in parallel, as @pchila stated.

This also really needs to have a test. We have many tests that cover the runtime manager and this being critical behavior this needs to be tested and covered.

Should be easy to subscribe to component state changes from the runtime and watch that the all components that should be stopped are stopped before even spawning the new components.

// stop is async, wait for operation to finish,
// otherwise new instance may be started and components
// may fight for resources (e.g ports, files, locks)
m.waitForStopped(existing)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this should absolutely happen in parallel.

@michalpristas
Copy link
Contributor Author

@blakerouse added parallelism and test

Copy link
Contributor

@blakerouse blakerouse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding the parallelism and the test!

@jlind23
Copy link
Contributor

jlind23 commented Feb 6, 2023

@pchila @blakerouse where you able to test this new behaviour locally? @cmacknz @michalpristas any corner cases we might have missed?

@michalpristas michalpristas merged commit 599680a into elastic:main Feb 7, 2023
mergify bot pushed a commit that referenced this pull request Feb 7, 2023
Start/stop components in a more synchronised manner (#2226)

(cherry picked from commit 599680a)
michalpristas added a commit that referenced this pull request Feb 7, 2023
Start/stop components in a more synchronised manner (#2226)

(cherry picked from commit 599680a)

Co-authored-by: Michal Pristas <michal.pristas@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-v8.6.0 Automated backport with mergify bug Something isn't working Team:Elastic-Agent Label for the Agent team Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants