Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CoordinatorState can block / deliver state updates out of order #2735

Closed
faec opened this issue May 26, 2023 · 0 comments · Fixed by #2849
Closed

CoordinatorState can block / deliver state updates out of order #2735

faec opened this issue May 26, 2023 · 0 comments · Fixed by #2849
Assignees
Labels
bug Something isn't working Team:Elastic-Agent Label for the Agent team

Comments

@faec
Copy link
Contributor

faec commented May 26, 2023

CoordinatorState, in internal/pkg/agent/application/coordinator/state.go, has several race conditions / potential blockages:

  • Congestion in one subscriber can block the whole Coordinator. This is already noted in (*CoordinatorState).Subscribe, where the documentation says // Note: Not reading from a subscription channel will cause the Coordinator to block.. (The primary client of the CoordinatorState change notifications is the ElasticAgentControl.StateWatch RPC stream, which is subject to congestion based on network conditions.)
  • Multiple state changes in a short interval can be sent out of order, causing subscribers to finalize on the wrong value: a mutex lock on subMx ensures that a changed state is sent to all subscribers before advancing to the next one, but if multiple changes accumulate while it is being sent there is nothing that checks which order the accumulated changes are sent in.
  • Related to the first two, because every change tries to send fully to every subscriber, even when states are sent in the right order they may not be the most current values at the time they are sent.
  • Subscribers may not receive state changes that happen shortly after they subscribe: the subscriber list isn't updated until after the initial state has been queued, and changes that happen during that interval will be missed.
  • Subscribers that become congested may never receive some updates even after they recover -- after a one-second timeout, that state change will no longer be sent to that subscriber.

This should probably be addressed by making state notifications a variable-size select with reflect.Select -- that way updates will always happen immediately for any active listeners, and a congested subscriber won't affect other subscribers or the Coordinator.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working Team:Elastic-Agent Label for the Agent team
Projects
None yet
1 participant