Create non-blocking broadcaster helper and use it to manage Coordinator state notifications #2849
Conversation
This pull request does not have a backport label. Could you fix it @faec? 🙏
/test
I've finished all the new tests I'm expecting to write -- the last thing I'm definitely planning before it's ready for merge is a rewrite of the preexisting diagnostics tests.
The new test cases all LGTM, looking forward to the diagnostic test refactor. If #2928 to run the integration tests on PRs isn't merged before you are done, trigger the integration tests manually from this branch before merging.
I ended up creating two new internal Coordinator variables to facilitate the diagnostics tests, one for the derived config (AST plus variable substitutions) and one for the component model sent to the runtime manager. It doesn't change any behavior but it avoids having to call some heavy-duty internal helpers from the diagnostics, which in turn avoids having to prepare/mock those internals from the diagnostic tests. (I was planning to add these anyway as part of #2852 but it simplified this final step a lot.) Out of time for today but I expect to finish in the morning, just a few short tests left 🤞 and the diagnostics test refactor has pushed my line count dramatically into the negative :-)
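For illustration, here is a rough, self-contained sketch of that idea in Go. All names here are hypothetical (they are not the actual fields or hook names added in this PR): the point is only that the Coordinator caches the results of its expensive derivation steps, and the diagnostics hooks simply serialize those cached values instead of re-running the internal helpers that compute them.

```go
package main

import (
	"fmt"

	"gopkg.in/yaml.v3"
)

// diagnosticHook pairs a diagnostic file name with a function that
// produces its contents.
type diagnosticHook struct {
	Name string
	Hook func() ([]byte, error)
}

// coordinator caches its derived values so diagnostics can read them
// directly (hypothetical field names).
type coordinator struct {
	// derivedConfig: the policy AST after variable substitution.
	derivedConfig map[string]interface{}
	// componentModel: the component model last sent to the runtime manager.
	componentModel []map[string]interface{}
}

// diagnosticHooks marshals the cached values directly, so the diagnostics
// code never has to call (or mock) the heavy internal helpers that
// compute them.
func (c *coordinator) diagnosticHooks() []diagnosticHook {
	return []diagnosticHook{
		{Name: "computed-config", Hook: func() ([]byte, error) { return yaml.Marshal(c.derivedConfig) }},
		{Name: "components", Hook: func() ([]byte, error) { return yaml.Marshal(c.componentModel) }},
	}
}

func main() {
	c := &coordinator{
		derivedConfig:  map[string]interface{}{"outputs": map[string]interface{}{"default": map[string]interface{}{"type": "elasticsearch"}}},
		componentModel: []map[string]interface{}{{"id": "filestream-default"}},
	}
	for _, h := range c.diagnosticHooks() {
		out, _ := h.Hook()
		fmt.Printf("--- %s ---\n%s", h.Name, out)
	}
}
```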
The reworked diagnostic tests turned up a bug: #2940
All tests are now written and passing locally; subject to CI issues, it's ready for a final look.
Nitpicking here: the previous version of the test was intended to take a certain (realistic) configuration as input and verify that we generated the full set of diagnostic files with the correct information given that config.
The idea being that, if we found some bugs or some tricky configurations, we could add the config here as taken from Fleet and have a unit test with that.
The tests now use simplified bits of config and test one hook at a time for the specific bits of config.
Each approach has its own merits (the one in this PR has brevity and simplicity on its side), but here we are losing the possibility of testing the whole of the coordinator diagnostic output given certain policies/configurations.
I am not asking to change it back; I just want us to be aware of what this change implies.
We chatted about this a bunch offline, to summarize my perspective:
- The existing test covered a lot of ground but (flakiness aside) there was no way to tell if it was doing the right thing, since no human could audit 10K lines of golden files
- Checking internal state through the diagnostics interface, where it gets serialized back and forth to YAML and we need complicated sanitization just to tell whether there is a match, adds complexity that is completely orthogonal to the behavior being tested -- if we want to test internal values, we can write tests that look at them directly.
- Many conditions that were previously only covered by the diagnostics test are now also covered directly in the unit tests (and the new diagnostics tests have already caught an error that was previously missed)
- Setting up "realistic" configurations can only go so far when everything except one object is mocked -- this should really be the domain of the new integration/e2e tests. A complicated config with only one real component doesn't buy much: a lot of these structures are effectively opaque pointers from Coordinator's perspective, and the important part is that we send them to the right places.
All that said, we should definitely continue to expand test coverage and find ways to verify more realistic scenarios :-)
(Related followup that I think both @pchila and I agree on: the extreme verbosity of our diagnostics often isn't buying us much. I don't know anyone who has used those really enormous files productively in an SDH... the important things are the logs and a few key fields, and sometimes we produce these large files "just in case" when what is really needed is more careful verification of important errors and their causes. This comes up in e.g. #2852 -- we might benefit from making the diagnostics smaller but more targeted, and preserving better records when we encounter errors.)
99% of the time I am only looking at pre-config.yaml, state.yaml, and the logs. The computed/expected/actual only get looked at when variable substitutions or conditions are involved, which is almost always for standalone agents and usually agents on K8S.
I would hesitate to remove them, but they definitely aren't used frequently. If we can figure out how to output those files as actual YAML matching the policy format instead of the crazily nested google.protobuf.Struct fields, that would be a nice improvement.
👍 I think part of the problem there is that we use those protobufs not just for API calls but as our core internal representation of that data... I'm not sure how heavily we depend on it though, maybe there is a way to avoid it, or at least to work around it when encoding the most annoying cases. I will try and poke at it when I'm revisiting the error state reporting next sprint.
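As a hedged illustration of the conversion suggested above (not the agent's actual diagnostics code): structpb's AsMap() flattens the google.protobuf.Struct wrapper types into plain Go maps, which then marshal into YAML shaped like the original policy. The sample config keys below are made up for the example.

```go
package main

import (
	"fmt"

	"google.golang.org/protobuf/types/known/structpb"
	"gopkg.in/yaml.v3"
)

func main() {
	// Stand-in for a config fragment that arrives as a protobuf Struct.
	src, err := structpb.NewStruct(map[string]interface{}{
		"id":         "filestream-monitoring",
		"type":       "filestream",
		"use_output": "monitoring",
	})
	if err != nil {
		panic(err)
	}

	// Marshaling the Struct directly would expose its internal Kind/Value
	// wrappers; converting to a plain map first yields policy-shaped YAML.
	out, err := yaml.Marshal(src.AsMap())
	if err != nil {
		panic(err)
	}
	fmt.Print(string(out))
}
```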
As of this morning the integration tests are still failing at head (https://buildkite.com/elastic/elastic-agent/builds/1278#0188f77a-867b-4072-9107-8d3bd3fa9ce1) despite the rest of the build being fixed. Merging anyway after checking with @cmacknz; we'll keep an eye on the status as the unrelated errors in the integration tests are fixed.
…or state notifications (elastic#2849)
Create a new helper component, Broadcaster, which broadcasts changes in an observed value to a variable list of subscribers while providing various performance / non-blocking guarantees (see #2819). Replace Coordinator's current racy state broadcasting with calls to this component.

While the core functionality is provided in Broadcaster, this also required significant changes to Coordinator itself, since we need to generate a reliable stream of discrete state values to pass it as input.

The biggest design change to enable this was to move all state changes into the Coordinator goroutine, so there's a single source of truth for the value and ordering of every change to Coordinator.state. (Most were already there, but there were some exceptions.) This allowed some important simplifications, like removing the three mutexes that were previously needed to check any state value. In exchange, some state changes previously run externally now need to go through Coordinator -- e.g. SetOverrideState now sends to an internal channel in Coordinator, which applies the change synchronously on its main goroutine, instead of locking the relevant mutex and broadcasting it from an external goroutine.
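For illustration, here is a minimal, self-contained sketch of the broadcast pattern described above, assuming coalescing semantics (slow subscribers skip intermediate values and always see the latest one). The names and API are hypothetical and simplified; this is not the actual Broadcaster introduced in this PR, which provides additional buffering and performance guarantees.

```go
package main

import "fmt"

// broadcaster fans out the latest value of type T to all subscribers.
type broadcaster[T any] struct {
	input chan T
	subCh chan chan T
}

func newBroadcaster[T any](initial T) *broadcaster[T] {
	b := &broadcaster[T]{
		input: make(chan T),
		subCh: make(chan chan T),
	}
	go b.run(initial)
	return b
}

// run owns the current value: every read and write goes through this
// single goroutine, so no mutexes are needed.
func (b *broadcaster[T]) run(current T) {
	var subs []chan T
	for {
		select {
		case v := <-b.input:
			current = v
			for _, s := range subs {
				select {
				case s <- v:
				default:
					// Subscriber hasn't drained the previous value:
					// replace it with the newer one (coalescing).
					select {
					case <-s:
					default:
					}
					s <- v
				}
			}
		case s := <-b.subCh:
			s <- current // new subscribers immediately see the current value
			subs = append(subs, s)
		}
	}
}

// Set publishes a new value; it never blocks on slow subscribers.
func (b *broadcaster[T]) Set(v T) { b.input <- v }

// Subscribe returns a channel that yields the current value and then
// the most recent value after each change.
func (b *broadcaster[T]) Subscribe() <-chan T {
	s := make(chan T, 1) // buffer of one enables the coalescing in run
	b.subCh <- s
	return s
}

func main() {
	b := newBroadcaster("starting")
	sub := b.Subscribe()
	fmt.Println(<-sub) // starting
	b.Set("healthy")
	fmt.Println(<-sub) // healthy
}
```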
Checklist
- I have made corresponding changes to the documentation
- I have made corresponding change to the default configuration files
- I have added an entry in ./changelog/fragments using the changelog tool
- I have added an integration test or an E2E test

Related issues
- Coordinator #2868