
Create non-blocking broadcaster helper and use it to manage Coordinator state notifications #2849

Merged
faec merged 44 commits into elastic:main from coordinator-broadcaster on Jun 26, 2023

Conversation

@faec faec commented Jun 12, 2023

Create a new helper component, Broadcaster, which broadcasts changes in an observed value to a variable list of subscribers while providing various performance / non-blocking guarantees (see #2819). Replace Coordinator's current racy state broadcasting with calls to this component.
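
For illustration, here is a minimal sketch of the kind of non-blocking broadcast described above. This is not the PR's actual implementation (see #2819 for the real requirements); it only shows the latest-value idea: each subscriber gets a one-slot buffer, and a slow subscriber never blocks the broadcaster, it just skips intermediate values.

```go
package main

import "fmt"

// Broadcaster fans out the latest value of type T to any number of
// subscribers without ever blocking on a slow subscriber.
type Broadcaster[T any] struct {
	input chan T
	subs  chan chan T
}

func NewBroadcaster[T any]() *Broadcaster[T] {
	b := &Broadcaster[T]{
		input: make(chan T),
		subs:  make(chan chan T),
	}
	go b.run()
	return b
}

func (b *Broadcaster[T]) run() {
	var subscribers []chan T
	for {
		select {
		case sub := <-b.subs:
			subscribers = append(subscribers, sub)
		case v := <-b.input:
			for _, sub := range subscribers {
				// Discard any unread value, then deliver the new one.
				// The one-slot buffer is always empty after the drain,
				// so the send below can never block the broadcast loop.
				select {
				case <-sub:
				default:
				}
				sub <- v
			}
		}
	}
}

// Set publishes a new value to all current subscribers.
func (b *Broadcaster[T]) Set(v T) { b.input <- v }

// Subscribe returns a channel that holds the most recent value published
// since the last read; intermediate values may be skipped.
func (b *Broadcaster[T]) Subscribe() <-chan T {
	ch := make(chan T, 1)
	b.subs <- ch
	return ch
}

func main() {
	b := NewBroadcaster[string]()
	states := b.Subscribe()
	b.Set("healthy")
	// The subscriber sees the latest value; if several Set calls raced in
	// before this read, intermediate values would simply be dropped.
	fmt.Println(<-states)
}
```

The drop-and-replace select is what keeps the broadcast loop non-blocking even when a subscriber stops reading entirely.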

While the core functionality is provided in Broadcaster, this also required significant changes to Coordinator itself, since we need to generate a reliable stream of discrete state values to pass to it as input.

The biggest design change to enable this was to move all state changes into the Coordinator goroutine, so there's a single source of truth for the value and ordering of every change to Coordinator.state. (Most were already there, but there were some exceptions.) This allowed some important simplifications, like removing the three mutexes that were previously needed to check any state value. In exchange, some state changes previously run externally now need to go through Coordinator -- e.g. SetOverrideState now sends to an internal channel in Coordinator, which applies the change synchronously on its main goroutine, instead of locking the relevant mutex and broadcasting it from an external goroutine.
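
As a rough sketch of the channel-based pattern described above (field and method names here are illustrative, not the actual Coordinator code), building on the Broadcaster sketch earlier in this description:

```go
// coordinator is a toy stand-in for Coordinator; the identifiers are
// hypothetical and only illustrate the single-writer pattern.
type coordinator struct {
	state           string
	overrideStateCh chan string          // requests from SetOverrideState
	broadcaster     *Broadcaster[string] // as sketched above
}

// SetOverrideState can be called from any goroutine; it only sends a
// request and never touches c.state directly.
func (c *coordinator) SetOverrideState(s string) { c.overrideStateCh <- s }

// runLoop is the single owner of c.state: every change is applied and
// broadcast here, so ordering is well-defined and no mutexes are needed.
func (c *coordinator) runLoop() {
	for s := range c.overrideStateCh {
		c.state = s
		c.broadcaster.Set(s)
	}
}
```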

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

Related issues

@faec faec added the bug (Something isn't working), enhancement (New feature or request), and Team:Elastic-Agent (Label for the Agent team) labels on Jun 12, 2023
@faec faec self-assigned this Jun 12, 2023
mergify bot commented Jun 12, 2023

This pull request does not have a backport label. Could you fix it @faec? 🙏
To fix up this pull request, you need to add the backport labels for the needed branches, such as:

  • backport-v\d.\d.\d is the label to automatically backport to the 8.\d branch, where \d is the digit

NOTE: backport-skip has been added to this pull request.

elasticmachine commented Jun 12, 2023

💚 Build Succeeded


Build stats

  • Start Time: 2023-06-23T13:47:40.582+0000

  • Duration: 21 min 52 sec

Test stats 🧪

Test Results
Failed 0
Passed 6091
Skipped 19
Total 6110

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments


To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages.

  • run integration tests : Run the Elastic Agent Integration tests.

  • run end-to-end tests : Generate the packages and run the E2E Tests.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

elasticmachine commented Jun 13, 2023

🌐 Coverage report

Name Metrics % (covered/total) Diff
Packages 98.684% (75/76) 👍 0.018
Files 69.231% (180/260) 👍 0.239
Classes 67.894% (332/489) 👍 0.265
Methods 54.555% (1042/1910) 👍 0.615
Lines 40.625% (11991/29516) 👍 0.925
Conditionals 100.0% (0/0) 💚

faec commented Jun 13, 2023

/test

faec commented Jun 21, 2023

I've finished all the new tests I'm expecting to write -- Broadcaster now has 98.8% unit test coverage (and I'm glad I did it, I kept finding issues at each stage), and Coordinator went from 66.5% to 74.4%, with a lot of the new tests covering much more fine-grained behavior than the previous ones, including all the scenarios mentioned in #2868. The remaining gaps are mostly related to manager/upgrade states that are still hard to test, and will require followup in other packages.

The last thing I'm definitely planning before it's ready for merge is a rewrite of the preexisting TestDiagnosticHooks, which is currently occasionally flaky and enormous (~600 hand-written lines + 1200 of auto-generated mocks)... using the new tools should make it much smaller and more consistent.

cmacknz commented Jun 22, 2023

The new test cases all LGTM, looking forward to the diagnostic test refactor.

If #2928 to run the integration tests on PRs isn't merged before you are done, trigger the integration tests manually from this branch before merging.

faec commented Jun 22, 2023

I ended up creating two new internal Coordinator variables to facilitate the diagnostics tests, one for the derived config (AST plus variable substitutions) and one for the component model sent to the runtime manager. It doesn't change any behavior but it avoids having to call some heavy-duty internal helpers from the diagnostics, which in turn avoids having to prepare/mock those internals from the diagnostic tests. (I was planning to add these anyway as part of #2852 but it simplified this final step a lot.)

Out of time for today but I expect to finish in the morning, just a few short tests left 🤞 and the diagnostics test refactor has pushed my line count dramatically into the negative :-)

faec commented Jun 23, 2023

The reworked diagnostic tests turned up a bug: #2940

faec commented Jun 23, 2023

All tests are now written and passing locally; barring CI issues, it's ready for a final look.

@pchila pchila (Member) commented Jun 23, 2023

Nitpicking here: the previous version of the test was meant to take a certain (realistic) configuration as input and verify that we generated the full set of diagnostic files with the correct information for that config.
The idea being that, if we found bugs or tricky configurations, we could add the config here as taken from Fleet and have a unit test for it.
The tests now use simplified bits of config and test one hook at a time against the specific bits of config.
Each approach has its own merits (the one in this PR has brevity and simplicity on its side), but here we are losing the ability to test the whole of the coordinator diagnostic output for certain policies/configurations.

I am not asking to change it back, I just want us to be aware of what this change implies.

@faec (Contributor, Author) replied:

We chatted about this a bunch offline, to summarize my perspective:

  • The existing test covered a lot of ground but (flakiness aside) there was no way to tell if it was doing the right thing, since no human could audit 10K lines of golden files
  • Checking internal state through the diagnostics interface, where it gets serialized back and forth to YAML, and we need to do complicated sanitization to even tell if there is a match, adds complication that is completely orthogonal to the behavior being tested -- if we want to test internal values, we can write tests that look at them directly.
  • Many conditions that were previously only covered by the diagnostics test are now also covered directly in the unit tests (and the new diagnostics tests have already caught an error that was previously missed)
  • Setting up "realistic" configurations can only go so far when all except one object is mocked -- this should really be the domain of the new integration/e2e tests. A complicated config with only one real component doesn't buy much... a lot of these structures are effectively opaque pointers from Coordinator's perspective, and the important part is that we send them to the right places.

All that said, we should definitely continue to expand test coverage and find ways to verify more realistic scenarios :-)

@faec (Contributor, Author) replied:

(Related followup that I think both @pchila and I agree on: the extreme verbosity of our diagnostics often isn't buying us much. I don't know anyone who has used those really enormous files productively in an SDH... the important things are the logs and a few key fields, and sometimes we produce these large files "just in case" when what is really needed is more careful verification of important errors and their causes. This comes up in e.g. #2852 -- we might benefit from making the diagnostics smaller but more targeted, and preserving better records when we encounter errors.)

A Member replied:

99% of the time I am only looking at pre-config.yaml, state.yaml, and the logs. The computed/expected/actual only get looked at when variable substitutions or conditions are involved, which is almost always for standalone agents and usually agents on K8S.

I would hesitate to remove them, but they definitely aren't used frequently. If we can figure out how to output those files as actual YAML matching the policy format instead of the crazily nested google.protobuf.Struct fields, that would be a nice improvement.

@faec (Contributor, Author) replied:

👍 I think part of the problem there is that we use those protobufs not just for API calls but as our core internal representation of that data... I'm not sure how heavily we depend on it though, maybe there is a way to avoid it, or at least to work around it when encoding the most annoying cases. I will try and poke at it when I'm revisiting the error state reporting next sprint.
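
As a speculative example of the kind of workaround mentioned here (not code from the agent, and the sample config is made up), a google.protobuf.Struct can be flattened to plain Go values with structpb's AsMap before YAML encoding, which yields policy-style YAML rather than the nested Struct/Value representation:

```go
package main

import (
	"fmt"

	"google.golang.org/protobuf/types/known/structpb"
	"gopkg.in/yaml.v3"
)

func main() {
	// Hypothetical config fragment, stored as a protobuf Struct the way the
	// internal representation does.
	cfg, err := structpb.NewStruct(map[string]interface{}{
		"inputs": []interface{}{
			map[string]interface{}{"type": "filestream", "enabled": true},
		},
	})
	if err != nil {
		panic(err)
	}

	// AsMap flattens the Struct/Value wrappers into ordinary Go values,
	// which yaml.Marshal renders as normal policy-style YAML.
	out, err := yaml.Marshal(cfg.AsMap())
	if err != nil {
		panic(err)
	}
	fmt.Print(string(out))
}
```

Whether something like this is practical depends on how deeply the Struct types are woven into the internal representation, as noted above.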

faec commented Jun 26, 2023

As of this morning the integration tests are still failing at head (https://buildkite.com/elastic/elastic-agent/builds/1278#0188f77a-867b-4072-9107-8d3bd3fa9ce1) despite the rest of the build being fixed. Merging anyway after checking with @cmacknz; we'll keep an eye on the status as the unrelated errors in the integration tests are fixed.

@faec faec merged commit 729636a into elastic:main Jun 26, 2023
@faec faec deleted the coordinator-broadcaster branch June 26, 2023 12:20
AndersonQ pushed a commit to AndersonQ/elastic-agent that referenced this pull request Jul 10, 2023
Labels: backport-skip, bug (Something isn't working), enhancement (New feature or request), skip-changelog, Team:Elastic-Agent (Label for the Agent team)
5 participants