
Create non-blocking broadcaster helper and use it to manage Coordinator state notifications #2849

Merged
faec merged 44 commits into elastic:main from coordinator-broadcaster on Jun 26, 2023

Conversation

@faec faec commented Jun 12, 2023

Create a new helper component, Broadcaster, which broadcasts changes in an observed value to a variable list of subscribers while providing various performance / non-blocking guarantees (see #2819). Replace Coordinator's current racy state broadcasting with calls to this component.
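
For illustration, here is a minimal sketch of the kind of non-blocking broadcast described above. This is not the PR's actual implementation (see #2819 for the real requirements); it only shows the latest-value idea: each subscriber gets a one-slot buffer, and a slow subscriber never blocks the broadcaster, it just skips intermediate values.

```go
package main

import "fmt"

// Broadcaster fans out the latest value of type T to any number of
// subscribers without ever blocking on a slow subscriber.
type Broadcaster[T any] struct {
	input chan T
	subs  chan chan T
}

func NewBroadcaster[T any]() *Broadcaster[T] {
	b := &Broadcaster[T]{
		input: make(chan T),
		subs:  make(chan chan T),
	}
	go b.run()
	return b
}

func (b *Broadcaster[T]) run() {
	var subscribers []chan T
	for {
		select {
		case sub := <-b.subs:
			subscribers = append(subscribers, sub)
		case v := <-b.input:
			for _, sub := range subscribers {
				// Discard any unread value, then deliver the new one.
				// The one-slot buffer is always empty after the drain,
				// so the send below can never block the broadcast loop.
				select {
				case <-sub:
				default:
				}
				sub <- v
			}
		}
	}
}

// Set publishes a new value to all current subscribers.
func (b *Broadcaster[T]) Set(v T) { b.input <- v }

// Subscribe returns a channel that holds the most recent value published
// since the last read; intermediate values may be skipped.
func (b *Broadcaster[T]) Subscribe() <-chan T {
	ch := make(chan T, 1)
	b.subs <- ch
	return ch
}

func main() {
	b := NewBroadcaster[string]()
	states := b.Subscribe()
	b.Set("healthy")
	// The subscriber sees the latest value; if several Set calls raced in
	// before this read, intermediate values would simply be dropped.
	fmt.Println(<-states)
}
```

The drop-and-replace select is what keeps the broadcast loop non-blocking even when a subscriber stops reading entirely.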

While the core functionality is provided in Broadcaster, this also required significant changes to Coordinator itself, since we need to generate a reliable stream of discrete state values to pass to it as input.

The biggest design change to enable this was to move all state changes into the Coordinator goroutine, so there's a single source of truth for the value and ordering of every change to Coordinator.state. (Most were already there, but there were some exceptions.) This allowed some important simplifications, like removing the three mutexes that were previously needed to check any state value. In exchange, some state changes previously run externally now need to go through Coordinator -- e.g. SetOverrideState now sends to an internal channel in Coordinator, which applies the change synchronously on its main goroutine, instead of locking the relevant mutex and broadcasting it from an external goroutine.
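
As a rough sketch of the channel-based pattern described above (field and method names here are illustrative, not the actual Coordinator code), building on the Broadcaster sketch earlier in this description:

```go
// coordinator is a toy stand-in for Coordinator; the identifiers are
// hypothetical and only illustrate the single-writer pattern.
type coordinator struct {
	state           string
	overrideStateCh chan string          // requests from SetOverrideState
	broadcaster     *Broadcaster[string] // as sketched above
}

// SetOverrideState can be called from any goroutine; it only sends a
// request and never touches c.state directly.
func (c *coordinator) SetOverrideState(s string) { c.overrideStateCh <- s }

// runLoop is the single owner of c.state: every change is applied and
// broadcast here, so ordering is well-defined and no mutexes are needed.
func (c *coordinator) runLoop() {
	for s := range c.overrideStateCh {
		c.state = s
		c.broadcaster.Set(s)
	}
}
```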

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

Related issues

@faec faec added the bug (Something isn't working), enhancement (New feature or request), and Team:Elastic-Agent (Label for the Agent team) labels on Jun 12, 2023
@faec faec self-assigned this Jun 12, 2023
mergify bot commented Jun 12, 2023

This pull request does not have a backport label. Could you fix it @faec? 🙏
To fix up this pull request, you need to add the backport labels for the needed branches, such as:

  • backport-v\d.\d.\d is the label to automatically backport to the 8.\d branch, where \d is the digit

NOTE: backport-skip has been added to this pull request.

elasticmachine commented Jun 12, 2023

💚 Build Succeeded


Build stats

  • Start Time: 2023-06-23T13:47:40.582+0000

  • Duration: 21 min 52 sec

Test stats 🧪

Test Results
Failed 0
Passed 6091
Skipped 19
Total 6110

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments


To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages.

  • run integration tests : Run the Elastic Agent Integration tests.

  • run end-to-end tests : Generate the packages and run the E2E Tests.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

elasticmachine commented Jun 13, 2023

🌐 Coverage report

Name Metrics % (covered/total) Diff
Packages 98.684% (75/76) 👍 0.018
Files 69.231% (180/260) 👍 0.239
Classes 67.894% (332/489) 👍 0.265
Methods 54.555% (1042/1910) 👍 0.615
Lines 40.625% (11991/29516) 👍 0.925
Conditionals 100.0% (0/0) 💚

faec commented Jun 13, 2023

/test

faec commented Jun 21, 2023

I've finished all the new tests I'm expecting to write -- Broadcaster now has 98.8% unit test coverage (and I'm glad I did it, I kept finding issues at each stage), and Coordinator went from 66.5% to 74.4%, with a lot of the new tests covering much more fine-grained behavior than the previous ones, including all the scenarios mentioned in #2868. The remaining gaps are mostly related to manager/upgrade states that are still hard to test, and will require followup in other packages.

The last thing I'm definitely planning before it's ready for merge is a rewrite of the preexisting TestDiagnosticHooks, which is currently occasionally flaky and enormous (~600 hand-written lines + 1200 of auto-generated mocks)... using the new tools should make it much smaller and more consistent.

cmacknz commented Jun 22, 2023

The new test cases all LGTM, looking forward to the diagnostic test refactor.

If #2928 to run the integration tests on PRs isn't merged before you are done, trigger the integration tests manually from this branch before merging.

faec commented Jun 22, 2023

I ended up creating two new internal Coordinator variables to facilitate the diagnostics tests, one for the derived config (AST plus variable substitutions) and one for the component model sent to the runtime manager. It doesn't change any behavior but it avoids having to call some heavy-duty internal helpers from the diagnostics, which in turn avoids having to prepare/mock those internals from the diagnostic tests. (I was planning to add these anyway as part of #2852 but it simplified this final step a lot.)

Out of time for today but I expect to finish in the morning, just a few short tests left 🤞 and the diagnostics test refactor has pushed my line count dramatically into the negative :-)

faec commented Jun 23, 2023

The reworked diagnostic tests turned up a bug: #2940

faec commented Jun 23, 2023

All tests are now written and passing locally; barring CI issues, it's ready for a final look.

@pchila pchila (Member) commented Jun 23, 2023

Nitpicking here: the previous version of the test was meant to take a certain (realistic) configuration as input and verify that we generated the full set of diagnostic files with the correct information for that config.
The idea being that, if we found bugs or tricky configurations, we could add the config here as taken from Fleet and have a unit test for it.
The tests now use simplified bits of config and test one hook at a time against the specific bits of config.
Each approach has its own merits (the one in this PR has brevity and simplicity on its side), but here we are losing the ability to test the whole of the coordinator diagnostic output for certain policies/configurations.

I am not asking to change it back, I just want us to be aware of what this change implies.

@faec (Contributor, Author) replied:

We chatted about this a bunch offline, to summarize my perspective:

  • The existing test covered a lot of ground but (flakiness aside) there was no way to tell if it was doing the right thing, since no human could audit 10K lines of golden files
  • Checking internal state through the diagnostics interface, where it gets serialized back and forth to YAML, and we need to do complicated sanitization to even tell if there is a match, adds complication that is completely orthogonal to the behavior being tested -- if we want to test internal values, we can write tests that look at them directly.
  • Many conditions that were previously only covered by the diagnostics test are now also covered directly in the unit tests (and the new diagnostics tests have already caught an error that was previously missed)
  • Setting up "realistic" configurations can only go so far when all except one object is mocked -- this should really be the domain of the new integration/e2e tests. A complicated config with only one real component doesn't buy much... a lot of these structures are effectively opaque pointers from Coordinator's perspective, and the important part is that we send them to the right places.

All that said, we should definitely continue to expand test coverage and find ways to verify more realistic scenarios :-)

@faec (Contributor, Author) replied:

(Related followup that I think both @pchila and I agree on: the extreme verbosity of our diagnostics often isn't buying us much. I don't know anyone who has used those really enormous files productively in an SDH... the important things are the logs and a few key fields, and sometimes we produce these large files "just in case" when what is really needed is more careful verification of important errors and their causes. This comes up in e.g. #2852 -- we might benefit from making the diagnostics smaller but more targeted, and preserving better records when we encounter errors.)

A Member replied:

99% of the time I am only looking at pre-config.yaml, state.yaml, and the logs. The computed/expected/actual only get looked at when variable substitutions or conditions are involved, which is almost always for standalone agents and usually agents on K8S.

I would hesitate to remove them, but they definitely aren't used frequently. If we can figure out how to output those files as actual YAML matching the policy format instead of the crazily nested google.protobuf.Struct fields, that would be a nice improvement.

@faec (Contributor, Author) replied:

👍 I think part of the problem there is that we use those protobufs not just for API calls but as our core internal representation of that data... I'm not sure how heavily we depend on it though, maybe there is a way to avoid it, or at least to work around it when encoding the most annoying cases. I will try and poke at it when I'm revisiting the error state reporting next sprint.
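
As a speculative example of the kind of workaround mentioned here (not code from the agent, and the sample config is made up), a google.protobuf.Struct can be flattened to plain Go values with structpb's AsMap before YAML encoding, which yields policy-style YAML rather than the nested Struct/Value representation:

```go
package main

import (
	"fmt"

	"google.golang.org/protobuf/types/known/structpb"
	"gopkg.in/yaml.v3"
)

func main() {
	// Hypothetical config fragment, stored as a protobuf Struct the way the
	// internal representation does.
	cfg, err := structpb.NewStruct(map[string]interface{}{
		"inputs": []interface{}{
			map[string]interface{}{"type": "filestream", "enabled": true},
		},
	})
	if err != nil {
		panic(err)
	}

	// AsMap flattens the Struct/Value wrappers into ordinary Go values,
	// which yaml.Marshal renders as normal policy-style YAML.
	out, err := yaml.Marshal(cfg.AsMap())
	if err != nil {
		panic(err)
	}
	fmt.Print(string(out))
}
```

Whether something like this is practical depends on how deeply the Struct types are woven into the internal representation, as noted above.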

faec commented Jun 26, 2023

As of this morning the integration tests are still failing at head (https://buildkite.com/elastic/elastic-agent/builds/1278#0188f77a-867b-4072-9107-8d3bd3fa9ce1) despite the rest of the build being fixed. Merging anyway after checking with @cmacknz; we'll keep an eye on the status as the unrelated errors in the integration tests are fixed.

@faec faec merged commit 729636a into elastic:main Jun 26, 2023
@faec faec deleted the coordinator-broadcaster branch June 26, 2023 12:20
AndersonQ pushed a commit to AndersonQ/elastic-agent that referenced this pull request Jul 10, 2023
Labels: backport-skip, bug (Something isn't working), enhancement (New feature or request), skip-changelog, Team:Elastic-Agent (Label for the Agent team)
5 participants