
Avoid reporting empty version_info for components that just started #5333

Conversation

AndersonQ
Member

What does this PR do?

The state diagnostics hook now re-fetches the state once when necessary, improving the accuracy of component status reporting.

Why is it important?

Due to a design choice, the coordinator's reported state might be stale, which can leave healthy components without version_info immediately after they start.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • [ ] I have added an entry in ./changelog/fragments using the changelog tool
  • [ ] I have added an integration test or an E2E test

Disruptive User Impact

How to test this PR locally

Run the TestDiagnosticState test
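For example, assuming a plain go test invocation from the repository root (the exact package path, build tags, or mage target may differ):

go test -v -run TestDiagnosticState ./...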

Related issues

Questions to ask yourself

  • How are we going to support this in production?
  • How are we going to measure its adoption?
  • How are we going to debug this?
  • What are the metrics I should take care of?
  • ...

@AndersonQ AndersonQ added Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team flaky-test Unstable or unreliable test cases. skip-changelog backport-8.15 Automated backport to the 8.15 branch with mergify labels Aug 21, 2024
@AndersonQ AndersonQ self-assigned this Aug 21, 2024
@AndersonQ AndersonQ requested a review from a team as a code owner August 21, 2024 10:08
@elasticmachine
Contributor

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

Contributor

@blakerouse blakerouse left a comment

I don't fully understand why this is being done. Do we really need it to have that guarantee? I don't feel like we do.

I don't like the idea that we are adjusting core code to fix a condition that the test should probably just handle.

Why not have the test fetch the state twice, instead of doing it internally in agent?

@AndersonQ
Member Author

Why not have the test fetch the state twice, instead of doing it internally in agent?

My reasoning was that we chose to allow this to happen, so we might as well try to prevent it from happening in the diagnostics. As it's just the diagnostics hook code, I believe it's localised enough not to lead to unexpected behaviour.

Also, it's been happening in the tests, so it isn't as unlikely as expected.

@blakerouse
Contributor

I don't really see how this fixes anything, as it just calls the same function again in that one case. So it's just reducing the chance that it could happen, but it still will happen.

@AndersonQ
Member Author

This is the build info that should arrive during the first check-in. So most likely the 1st check-in happens and the coordinator has everything updated internally, but because of the broadcaster, when the diagnostics hook runs the state it receives is stale. However, a second call should return the right state, because by the time the call happens again 1) the stale value has already been read by the 1st call and 2) the new value is on the broadcaster. Thus a 2nd call should be enough.
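As a rough illustration of that reasoning, the hook's behaviour amounts to reading the broadcast state a second time when the first read looks stale. Everything below (the package, type names, the stateReader interface, and the BuildHash check) is a stand-in sketch, not the agent's actual diagnostics code:

package diaghook

// componentState is a stand-in for the coordinator's per-component state.
type componentState struct {
	Healthy   bool
	BuildHash string // populated on the component's first check-in
}

// stateReader is a stand-in for whatever the diagnostics hook uses to read
// the coordinator's broadcast state.
type stateReader interface {
	State() []componentState
}

// diagnosticsState reads the state once and, if a healthy component is still
// missing its version info (a sign the broadcaster handed out a value from
// before the first check-in), reads it one more time.
func diagnosticsState(r stateReader) []componentState {
	s := r.State()
	for _, c := range s {
		if c.Healthy && c.BuildHash == "" {
			return r.State() // second read should observe the refreshed state
		}
	}
	return s
}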

@blakerouse
Contributor

@AndersonQ how are you confirming that the new value is present?

@ycombinator ycombinator force-pushed the 4215-flaky-TestComponentBuildHashInDiagnostics branch from db1be75 to 16c638f Compare August 21, 2024 19:13
@AndersonQ
Member Author

@AndersonQ how are you confirming that the new value is present?

I'm not. If the component is healthy, the check-in happened and the versionInfo was sent, so there isn't much reason to believe several retries would be needed. Besides, if the versionInfo weren't correctly updated, 1) the diagnostics should proceed and not get stuck anyway and 2) we'd catch it in our tests. Also, the diagnostics from the flaky test have all the versionInfo, showing a retry might be all we need.

@blakerouse
Contributor

@AndersonQ how are you confirming that the new value is present?

I'm not. If the component is healthy, the check-in happened and the versionInfo was sent, so there isn't much reason to believe several retries would be needed. Besides, if the versionInfo weren't correctly updated, 1) the diagnostics should proceed and not get stuck anyway and 2) we'd catch it in our tests. Also, the diagnostics from the flaky test have all the versionInfo, showing a retry might be all we need.

It is still unclear how a single retry provides the guarantee that versionInfo is now set. That is what I do not understand, and why I am not okay with this PR at the moment. Currently in my head this change is just reducing the chance slightly, but it is still possible.

@AndersonQ
Member Author

It is still unclear how a single retry provides the guarantee that versionInfo is now set. That is what I do not understand, and why I am not okay with this PR at the moment. Currently in my head this change is just reducing the chance slightly, but it is still possible.

The healthy status and the version_info should arrive together, so the update should be pretty much instantaneous. That's why I believe it's more than a slight reduction in the chance of getting an empty version_info.

Anyway, I added more retries with a timeout between them. If all the retries run, it's 250ms plus the normal code execution time, which should be pretty much negligible compared to the total time to get a diagnostics. Does it look more robust now?

cs.State.VersionInfo.BuildHash != "" &&
cs.State.VersionInfo.Meta["commit"] != "" {
break outerLoop
}
Contributor

This whole block would be much better as its own function, either inside getState or in another wrapper around it.

I'm still not a fan of this; again, you're just improving the odds and not solving the true issue. Could we somehow guarantee that when the status is updated to Healthy, the VersionInfo is set at the same time? Why isn't it?

If there is a bad component that never provides the versionInfo, it is also going to delay this result by the full 250 milliseconds every time.
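For illustration only, the refactor suggested above (pulling the retry into its own wrapper around the state fetch) might look roughly like this; the helper name, the fetch signature, and the 6 attempts with 50 ms pauses (about 250 ms of waiting, matching the figure mentioned in the thread) are assumptions, not the PR's actual code:

package diagretry

import "time"

// versionedState is a stand-in for the component state the hook inspects.
type versionedState struct {
	BuildHash string
	Commit    string
}

// getStateWithVersionInfo wraps a state fetch in a bounded retry: it keeps
// re-fetching (with a short pause) until the version info is populated or the
// retry budget is exhausted, then returns whatever it last saw.
func getStateWithVersionInfo(getState func() versionedState) versionedState {
	const attempts = 6                  // first try plus up to 5 retries
	const pause = 50 * time.Millisecond // 5 pauses ≈ 250 ms worst case

	s := getState()
	for i := 1; i < attempts; i++ {
		if s.BuildHash != "" && s.Commit != "" {
			break
		}
		time.Sleep(pause)
		s = getState()
	}
	return s
}

As the comment above points out, any wrapper of this shape still pays the full waiting budget for a component that never reports its version info.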

@AndersonQ
Member Author

[...] again, you're just improving the odds and not solving the true issue

That's the whole idea. The broadcaster was built with the possibility of reporting stale states and the coordinator uses it that way. Changing all of that isn't worth it, so improving the odds seems good enough.

the broadcaster might report a stale state:

// Race warning: if inputBuffer > 0 then the input's current value may be
// slightly (~1 scheduler cycle) ahead of its subscribers. In particular this
// sequence:
//
// b.InputChan <- newValue
// b.Get()

and how the coordinator uses it:

// Note: the uses of a buffered input channel in our broadcaster (the
// third parameter to broadcaster.New) means that it is possible for
// immediately adjacent writes/reads not to match, e.g.:
//
// stateBroadcaster.Set(newState)
// reportedState := stateBroadcaster.Get() // may not match newState
//
// We accept this intentionally to make sure Coordinator itself blocks
// as rarely as possible. Within Coordinator's goroutine, we can always
// get the latest synchronized value by reading the state struct directly,
// so this only affects external callers, and we accept that some of those
// might be behind by a scheduler interrupt or so.
//
// If this ever changes and we decide we need absolute causal
// synchronization in the subscriber API, just set the input buffer to 0.
stateBroadcaster: broadcaster.New(state, 64, 32),

the new state is sent on refreshState()

// Forward the current state to the broadcaster and clear the stateNeedsRefresh
// flag. Must be called on the main Coordinator goroutine.
func (c *Coordinator) refreshState() {
	c.stateBroadcaster.InputChan <- c.generateReportableState()
	c.stateNeedsRefresh = false
}

triggered by the coordinator's runLoop:

case componentState := <-c.managerChans.runtimeManagerUpdate:
	// New component change reported by the runtime manager via
	// Coordinator.watchRuntimeComponents(), merge it with the
	// Coordinator state.
	c.applyComponentState(componentState)

the broadcaster will update the value its Get sends as soon as it receives a new value:

b.selectCases[indexGetCase].Send = reflect.ValueOf(b.currentValue())

There is a lot of indirection before the new state finally arrives at the broadcaster to be consumed by the diagnostics hook. That's why I'm not trying to fix it completely, just to avoid the problem with a good old retry.

Our integration tests will show how good the fix is, and we can adjust accordingly. That's why I'd rather have a single retry for now, which would have minimal impact on the time spent collecting diagnostics and, I believe, should be enough to get an up-to-date state.
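To make the stale-read window concrete, here is a simplified, self-contained toy with the same shape as the pattern described above. It is not the agent's broadcaster package; the buffered input channel and forwarder goroutine are just an illustration of why a read issued right after a write can still return the previous value:

package main

import "fmt"

// toyBroadcaster mimics the shape described above: Set writes into a buffered
// input channel, and a forwarder goroutine serves Get, so a Get issued right
// after a Set may still observe the previous value.
type toyBroadcaster struct {
	inputChan chan string
	getChan   chan string
}

func newToyBroadcaster(initial string) *toyBroadcaster {
	b := &toyBroadcaster{
		inputChan: make(chan string, 64), // buffered so Set never blocks the producer
		getChan:   make(chan string),
	}
	go func() {
		current := initial
		for {
			select {
			case v := <-b.inputChan: // absorb a pending value when scheduled
				current = v
			case b.getChan <- current: // serve whatever has been absorbed so far
			}
		}
	}()
	return b
}

func (b *toyBroadcaster) Set(v string) { b.inputChan <- v }
func (b *toyBroadcaster) Get() string  { return <-b.getChan }

func main() {
	b := newToyBroadcaster("no version_info yet")
	b.Set("healthy + version_info")

	// The forwarder may not have drained inputChan yet, so this first read can
	// still return the stale value; a second read is very likely (though not
	// strictly guaranteed) to see the new one, which is the intuition behind
	// retrying in the diagnostics hook.
	fmt.Println("first read: ", b.Get())
	fmt.Println("second read:", b.Get())
}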

@blakerouse
Contributor

I still don't like this change, even with the explanation. Honestly I don't believe it is even fixing the problem, just masking it with better odds.

If the component is being updated to healthy, which package_version_test.go is ensuring, then it should have that version information.

Another option is to just adjust the test and make the allHealthy function to ensure that version information is present before reporting that the Elastic Agent is healthy. https://github.com/elastic/elastic-agent/blob/main/testing/integration/package_version_test.go#L94

You can use c.VersionInfo to verify that the status information has the version information. If it's present there then it should definitely be in the diagnostics.
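A rough sketch of that test-side check (the package and type shapes below are placeholders; the real test would use the control-protocol client's component state and its VersionInfo field):

package teststate

// Placeholder shapes standing in for the client package's component state.
type versionInfo struct {
	BuildHash string
	Meta      map[string]string
}

type component struct {
	Healthy     bool
	VersionInfo versionInfo
}

// allHealthyWithVersionInfo only reports ready once every component is healthy
// and has already reported its version information, so diagnostics collected
// afterwards should contain the build hash.
func allHealthyWithVersionInfo(components []component) bool {
	for _, c := range components {
		if !c.Healthy {
			return false
		}
		if c.VersionInfo.BuildHash == "" || c.VersionInfo.Meta["commit"] == "" {
			return false
		}
	}
	return true
}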

@AndersonQ
Member Author

Another option is to just adjust the test and make the allHealthy function to ensure that version information is present before reporting that the Elastic Agent is healthy. https://github.com/elastic/elastic-agent/blob/main/testing/integration/package_version_test.go#L94

This does not address the problem at all, it just hides the issue. That's why I'm against that approach.

Honestly I don't believe it is even fixing the problem, just masking it with better odds.

That's the idea ;)
It isn't supposed to be a perfect fix. The issue does not compromise the normal behaviour of the agent and a complete fix isn't worth it. That's why I chose a palliative fix, improving the odds of it succeeding with minimal overhead and added complexity.

We're seeing this test failing on CI, and tweaking the test to ignore the issue isn't a good approach or a good precedent. If we merge this in the current state we'll be able to see if the problem persists or not. If it persists, another approach is needed; if it doesn't, problem solved.
This at least attempts to address the real issue instead of just ignoring it. Besides, all the evidence (the failed tests) suggests a retry might be enough.

@blakerouse
Contributor

Another option is to just adjust the test and make the allHealthy function to ensure that version information is present before reporting that the Elastic Agent is healthy. https://github.com/elastic/elastic-agent/blob/main/testing/integration/package_version_test.go#L94

This does not address the problem at all, it just hides the issue. That's why I'm against that approach.

Honestly I don't believe it is even fixing the problem, just masking it with better odds.

That's the idea ;) It isn't supposed to be a perfect fix. The issue does not compromise the normal behaviour of the agent and a complete fix isn't worth it. That's why I chose a palliative fix, improving the odds of it succeeding with minimal overhead and added complexity.

We're seeing this test failing on CI, and tweaking the test to ignore the issue isn't a good approach or a good precedent. If we merge this in the current state we'll be able to see if the problem persists or not. If it persists, another approach is needed; if it doesn't, problem solved. This at least attempts to address the real issue instead of just ignoring it. Besides, all the evidence (the failed tests) suggests a retry might be enough.

Given that we know it's possible for the version information to be delayed (which, even with all the explanation in the PR, still doesn't make sense to me, since it should not be separate from a single Healthy update), and that we accept that delay, my suggestion is to just fix the test to ensure the version information is always there before performing diagnostics, as the test is explicitly testing the contents of the diagnostic output.

I don't think we actually care in the real world that it is missing on a freshly started Elastic Agent. That is why I am suggesting not making a change inside the core of the Elastic Agent at all. I have not seen an argument that shows we need this guarantee, except in this one test case.

Member

@pchila pchila left a comment

After reading the whole discussion and having a look at the code modification, I tend to agree with @blakerouse about fixing this in test code rather than adding retries to the diagnostic hooks.

If we accept that the version_info will eventually be available and dumped in the diagnostics, I would prefer to change the asserts in the test to reflect that.

Conversely, if we need stronger guarantees about the state returned by the broadcaster, that is a much bigger change. It's probably not worth it, since the chance of diagnostics being generated really quickly after components perform their first check-in is rather small, and we can always ask for more diagnostics.

@AndersonQ AndersonQ force-pushed the 4215-flaky-TestComponentBuildHashInDiagnostics branch from bf01310 to f1ff6bb Compare August 30, 2024 14:22
@AndersonQ
Member Author

There is something off, but let's get the flaky failures out of the way. I think this solution is good enough for now.

@AndersonQ AndersonQ requested review from blakerouse and pchila August 30, 2024 14:24
Contributor

@blakerouse blakerouse left a comment

Thank you for just adjusting the test!

Looks good.

@jlind23 jlind23 enabled auto-merge (squash) September 3, 2024 08:39
@AndersonQ AndersonQ disabled auto-merge September 3, 2024 08:39
Labels
backport-8.15 Automated backport to the 8.15 branch with mergify flaky-test Unstable or unreliable test cases. skip-changelog Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team
Development

Successfully merging this pull request may close these issues.

[Flaky Test]: TestComponentBuildHashInDiagnostics – build hash is empty
5 participants