CoordinatorState can block / deliver state updates out of order #2735

faec · 2023-05-26T19:54:11Z

CoordinatorState, in internal/pkg/agent/application/coordinator/state.go, has several race conditions / potential blockages:

- Congestion in one subscriber can block the whole Coordinator. This is already noted in (*CoordinatorState).Subscribe, whose documentation says "// Note: Not reading from a subscription channel will cause the Coordinator to block." The primary client of the CoordinatorState change notifications is the ElasticAgentControl.StateWatch RPC stream, which is subject to congestion based on network conditions.
- Multiple state changes in a short interval can be sent out of order, causing subscribers to settle on the wrong final value. A mutex lock on subMx ensures that a changed state is sent to all subscribers before advancing to the next one, but if multiple changes accumulate while it is being sent, nothing controls the order in which the accumulated changes are sent.
- Related to the first two: because every change tries to send fully to every subscriber, even states that are sent in the right order may no longer be the most current values by the time they are sent.
- Subscribers may not receive state changes that happen shortly after they subscribe: the subscriber list isn't updated until after the initial state has been queued, and changes that happen during that interval are missed.
- Subscribers that become congested may never receive some updates even after they recover: after a one-second timeout, that state change will no longer be sent to that subscriber.
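
To make the first and last points concrete, here is a minimal sketch of the general pattern described above. It is not the actual code in state.go: `naiveNotifier`, `notify`, `demo`, and the shortened `sendTimeout` are hypothetical names chosen for illustration, and the sends are assumed to happen one subscriber at a time while a mutex is held.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// sendTimeout stands in for the one-second timeout mentioned in the issue;
// it is shortened here so the example runs quickly.
const sendTimeout = 50 * time.Millisecond

// naiveNotifier is a hypothetical, simplified stand-in for the pattern
// described in the issue, not the real CoordinatorState.
type naiveNotifier struct {
	mx   sync.Mutex
	subs []chan int
}

// notify sends state to each subscriber in turn while holding the lock.
// A subscriber that is not reading delays everyone behind it, and after
// the timeout this state is simply never delivered to it.
func (n *naiveNotifier) notify(state int) (dropped int) {
	n.mx.Lock()
	defer n.mx.Unlock()
	for _, ch := range n.subs {
		select {
		case ch <- state:
		case <-time.After(sendTimeout):
			dropped++
		}
	}
	return dropped
}

// demo runs one notification with a stalled subscriber ahead of a healthy one.
func demo() (dropped int, received int, blocked bool) {
	stalled := make(chan int) // nobody ever reads this channel
	healthy := make(chan int)
	n := &naiveNotifier{subs: []chan int{stalled, healthy}}

	got := make(chan int, 1)
	go func() { got <- <-healthy }()

	start := time.Now()
	dropped = n.notify(1)
	blocked = time.Since(start) >= sendTimeout
	received = <-got
	return dropped, received, blocked
}

func main() {
	dropped, received, blocked := demo()
	fmt.Println("dropped:", dropped, "healthy received:", received, "caller blocked for timeout:", blocked)
}
```

In this sketch the caller of `notify` is held up for the full timeout by the stalled subscriber, the healthy subscriber only gets its update afterwards, and the stalled subscriber never sees that state at all.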

This should probably be addressed by making state notifications a variable-size select with reflect.Select -- that way updates will always happen immediately for any active listeners, and a congested subscriber won't affect other subscribers or the Coordinator.
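
One way the variable-size select could look is sketched below. This is an illustration under stated assumptions, not the fix that was eventually merged: `State`, `Broadcaster`, `Set`, and `Subscribe` are hypothetical names, and unsubscribe and shutdown handling are left out for brevity. A single goroutine owns the current state and uses `reflect.Select` to wait on incoming changes, new subscriptions, and a send to every subscriber that has not yet seen the latest value.

```go
package main

import (
	"fmt"
	"reflect"
)

// State is a hypothetical stand-in for the Coordinator's state value.
type State int

// subscriber tracks one listener and whether it still needs the latest state.
type subscriber struct {
	ch      chan State
	pending bool
}

// Broadcaster is a hypothetical helper; one goroutine owns all of its state.
type Broadcaster struct {
	updates chan State
	subs    chan chan State
}

func NewBroadcaster(initial State) *Broadcaster {
	b := &Broadcaster{updates: make(chan State), subs: make(chan chan State)}
	go b.run(initial)
	return b
}

// Set hands a new state to the broadcaster goroutine.
func (b *Broadcaster) Set(s State) { b.updates <- s }

// Subscribe registers a new listener, which is first offered the current state.
func (b *Broadcaster) Subscribe() <-chan State {
	ch := make(chan State)
	b.subs <- ch
	return ch
}

func (b *Broadcaster) run(current State) {
	var subs []*subscriber
	for {
		// Cases 0 and 1 are always present, so a stalled subscriber can
		// never keep the loop from accepting changes or subscriptions.
		cases := []reflect.SelectCase{
			{Dir: reflect.SelectRecv, Chan: reflect.ValueOf(b.updates)},
			{Dir: reflect.SelectRecv, Chan: reflect.ValueOf(b.subs)},
		}
		// Add one send case per subscriber that has not seen `current`.
		var targets []*subscriber
		for _, s := range subs {
			if s.pending {
				cases = append(cases, reflect.SelectCase{
					Dir:  reflect.SelectSend,
					Chan: reflect.ValueOf(s.ch),
					Send: reflect.ValueOf(current),
				})
				targets = append(targets, s)
			}
		}
		chosen, recv, _ := reflect.Select(cases)
		switch chosen {
		case 0: // new state: everyone now needs the latest value
			current = recv.Interface().(State)
			for _, s := range subs {
				s.pending = true
			}
		case 1: // new subscriber: it is owed the current state
			subs = append(subs, &subscriber{ch: recv.Interface().(chan State), pending: true})
		default: // a pending subscriber accepted the latest state
			targets[chosen-2].pending = false
		}
	}
}

func main() {
	b := NewBroadcaster(0)
	slow := b.Subscribe() // not read until the end, simulating congestion
	fast := b.Subscribe()
	fmt.Println("fast initial:", <-fast) // 0
	b.Set(1)
	b.Set(2)
	b.Set(3)
	fmt.Println("fast:", <-fast) // 3: intermediate states are skipped
	fmt.Println("slow:", <-slow) // 3: latest value once it recovers
}
```

Because each send case is rebuilt from the current value on every loop iteration, a subscriber is only ever offered the most recent state, so ordering problems and stale deliveries cannot occur in this design. The trade-off is that subscribers see the latest state rather than every intermediate one.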


faec added the bug and Team:Elastic-Agent labels May 26, 2023

faec self-assigned this May 26, 2023

faec mentioned this issue May 26, 2023

Fix possible blocking in the Coordinator and out-of-order state reporting in CoordinatorState #2736

Closed

7 tasks

faec mentioned this issue Jun 12, 2023

Create non-blocking broadcaster helper and use it to manage Coordinator state notifications #2849

Merged

7 tasks

faec closed this as completed in #2849 Jun 26, 2023
