Avoid reporting empty version_info for components that just started #5333
Conversation
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)
I don't fully understand why this is being done. Do we really need it to have that guarantee? I don't feel like we do.
I don't like the idea that we are adjusting core code to fix a condition that the test should probably just handle.
Why not have the test fetch the state twice, instead of doing it internally in the agent?
my reasoning was that we chose to allow this to happen, so we might as well try to prevent it from happening in the diagnostics. As it's just the diagnostics hook code, I believe it's localised enough not to lead to unexpected behaviour. Also it's been happening in the tests, making it not as unlikely as expected.
I don't really see how this is fixing anything, as it just calls the same function again in that one case. So it's just reducing the chance that it could happen, but it still will happen.
This is the build info that should come during the first check-in. Most likely the 1st check-in happens and the coordinator internally has it all updated, but because of the broadcaster, when the diagnostics hook runs, the received state is stale. However, a second call should return the right state, because by the time the call happens again 1) the stale value has been read by the 1st call and 2) the new value is on the broadcaster. Thus a 2nd call should be enough.
@AndersonQ how are you confirming that the new value is present?
I'm not. If the component is healthy, the check-in happened and the versionInfo was sent, so there isn't much reason to believe several retries would be needed. Besides, if the versionInfo weren't correctly updated, 1) the diagnostics should proceed and not get stuck anyway and 2) we'd catch it in our tests. Also the diagnostic from the flaky test has all the versionInfo, showing a retry might be all we need.
It is still unclear how a single retry provides the guarantee that versionInfo is now set. That is what I do not understand, and why I am not okay with this PR at the moment. In my head this change is just reducing the chance slightly, but it is still possible.
The healthy status and the version_info should arrive together, therefore the update should be pretty much instantaneous. That's why I believe it's more than a slight reduction in the chance of getting an empty version_info. Anyway, I added more retries with a timeout between them. If all the retries run, it adds 250ms on top of the normal code execution time, which should be negligible compared to the total time to collect a diagnostics. Does it look more robust now?
cs.State.VersionInfo.BuildHash != "" &&
cs.State.VersionInfo.Meta["commit"] != "" {
	break outerLoop
}
This whole block would be much better as its own function, either in getState or another wrapper around it.
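For illustration only, a minimal sketch of what such a wrapper might look like. The names (getStateWithRetry, versionInfoComplete) and the trimmed-down State type are hypothetical, not the actual coordinator types; the real hook would check the per-component VersionInfo fields shown in the excerpt above.

```go
package diagnostics

import "time"

// State is a stand-in for the coordinator state consumed by the
// diagnostics hook; only the fields relevant to the retry are shown.
type State struct {
	Components []ComponentState
}

// ComponentState carries the per-component version information the
// diagnostics output is expected to contain.
type ComponentState struct {
	BuildHash string
	Commit    string
}

// versionInfoComplete reports whether every component has its version
// information populated.
func versionInfoComplete(s State) bool {
	for _, cs := range s.Components {
		if cs.BuildHash == "" || cs.Commit == "" {
			return false
		}
	}
	return true
}

// getStateWithRetry wraps a state getter and re-fetches the state a few
// times, sleeping briefly between attempts, until the version info is
// present or the attempts are exhausted. With 5 attempts and a 50ms wait
// the worst case adds roughly 250ms to diagnostics collection.
func getStateWithRetry(getState func() State, attempts int, wait time.Duration) State {
	s := getState()
	for i := 1; i < attempts && !versionInfoComplete(s); i++ {
		time.Sleep(wait)
		s = getState()
	}
	return s
}
```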
I still am not a fan of this; again, you're just improving the odds and not solving the true issue. Could we somehow guarantee that when the status is updated to Healthy the VersionInfo is set at the same time? Why are they not?
If there is a bad component that never provides the versionInfo, it is also going to delay this result by the full 250 milliseconds.
that's the whole idea. The broadcaster was built with the possibility of reporting stale states and the coordinator uses it that way. Changing it all isn't worth it, thus improving the odds seems good enough.
The broadcaster might report a stale state: elastic-agent/pkg/utils/broadcaster/broadcaster.go lines 194 to 199 in 729636a
How the coordinator uses it: elastic-agent/internal/pkg/agent/application/coordinator/coordinator.go lines 372 to 387 in 41ee2bb
The new state is sent in elastic-agent/internal/pkg/agent/application/coordinator/coordinator_state.go lines 127 to 132 in c929f79
triggered by the coordinator's runLoop: elastic-agent/internal/pkg/agent/application/coordinator/coordinator.go lines 1095 to 1099 in 41ee2bb
and eventually the broadcaster updates the value it holds.
there is a lot of indirection until the new state finally arrives at the broadcaster to be consumed by the diagnostics hook. That's why I'm not trying to fix it completely, just avoiding the problem with a good old retry. Our integration tests will show how good the fix is and we can adjust accordingly. That's why I'd rather have a single retry for now, which would have minimal impact on the time spent collecting diagnostics and I believe should be enough to get an up-to-date state.
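To make the staleness argument concrete, here is a toy model, not the actual pkg/utils/broadcaster implementation, of a value holder that only absorbs a published update after serving one more read, so the first read after a publish can return the old state while a second read returns the new one:

```go
package main

import "fmt"

// broadcaster is a toy model: it keeps the latest applied value in a
// buffer and only absorbs a pending update after serving one more read,
// which is the window in which a reader can observe a stale state.
type broadcaster struct {
	buffer  string      // value currently served to readers
	pending chan string // published value not yet applied to the buffer
}

func newBroadcaster(initial string) *broadcaster {
	return &broadcaster{buffer: initial, pending: make(chan string, 1)}
}

// Publish enqueues a new value; it does not touch the buffer directly.
func (b *broadcaster) Publish(v string) { b.pending <- v }

// Get returns the buffered value. If an update is pending it is moved
// into the buffer, but the caller still receives the old value, so the
// first Get after Publish is stale and the second Get is fresh.
func (b *broadcaster) Get() string {
	select {
	case v := <-b.pending:
		old := b.buffer
		b.buffer = v
		return old
	default:
		return b.buffer
	}
}

func main() {
	b := newBroadcaster("starting: version_info empty")
	b.Publish("healthy: version_info set")
	fmt.Println(b.Get()) // stale: "starting: version_info empty"
	fmt.Println(b.Get()) // fresh: "healthy: version_info set"
}
```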
I still don't like this change, even with the explanation. Honestly I don't believe it is even fixing the problem, just masking it with better odds. If the component is being updated to healthy, which package_version_test.go ensures, then it should have that version information. Another option is to just adjust the test and make it wait until the version information is present. You can use the …
This does not address the problem at all; it just hides the issue. That's why I'm against that approach.
That's the idea ;) We're seeing this test failing on CI, and tweaking the test to ignore the issue isn't a good approach or a good precedent. If we merge this in its current state we'll be able to see whether the problem persists. If it persists, another approach is needed; if it doesn't, problem solved.
Given that we know it's possible for the version information to be delayed (which, even with all the explanation in this PR, still doesn't make sense to me; why would it be separate from a single Healthy update?), and that we accept that delay, my suggestion is to just fix the test to ensure the version information is always there before performing diagnostics, as the test is explicitly testing the contents of the diagnostic output. I don't think we actually care in the real world that it is missing on a freshly started Elastic Agent. That is why I am suggesting not to make a change inside the core of the Elastic Agent at all. I have not seen an argument that we need this guarantee, except in this one test case.
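A rough sketch of what the test-side fix could look like: wait until every component is healthy and reports version info before requesting the diagnostics archive. The agentState/componentState types and the getState callback are hypothetical placeholders for whatever the integration test fixture exposes; only the require.Eventually pattern is the point here.

```go
package integration

import (
	"testing"
	"time"

	"github.com/stretchr/testify/require"
)

// componentState and agentState are hypothetical placeholders for whatever
// the integration test fixture returns when querying the running agent.
type componentState struct {
	Healthy   bool
	BuildHash string
}

type agentState struct {
	Components []componentState
}

// waitForVersionInfo blocks until every component is healthy and reports
// its version info, failing the test after the timeout. Only after this
// would the test request the diagnostics archive and assert on its contents.
func waitForVersionInfo(t *testing.T, getState func() agentState) {
	t.Helper()
	require.Eventually(t, func() bool {
		st := getState()
		if len(st.Components) == 0 {
			return false
		}
		for _, c := range st.Components {
			if !c.Healthy || c.BuildHash == "" {
				return false
			}
		}
		return true
	}, 2*time.Minute, time.Second,
		"components never became healthy with version info set")
}
```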
After reading the whole discussion and having a look at the code modification, I tend to agree with @blakerouse about fixing this in test code rather than adding retries to the diagnostic hooks.
If we accept that the version_info will eventually be available and dumped in the diagnostics, I would prefer to change the asserts in the test to reflect that.
Conversely, if we need stronger guarantees about the state returned by the broadcaster, that's a much bigger change and probably not worth it, since the chance of diagnostics being generated right after components performed their first check-in is rather small, and we can always ask for more diagnostics.
There is something off, but let's get the flaky failures out of the way. I think this solution is good enough for now.
Thank you for just adjusting the test!
Looks good.
(cherry picked from commit d8bdd71)
What does this PR do?
The state diagnostics hook now re-fetches the state once when necessary, improving the accuracy of component status reporting.
Why is it important?
By design, the coordinator's reported state might be stale, leading to healthy components lacking version_info immediately after they start.
Checklist
[ ] I have made corresponding changes to the documentation
[ ] I have made corresponding changes to the default configuration files
[ ] I have added an entry in ./changelog/fragments using the changelog tool
[ ] I have added an integration test or an E2E test
Disruptive User Impact
How to test this PR locally
Run the TestDiagnosticState test.
Related issues
Questions to ask yourself