
Avoid reporting empty version_info for components that just started #5333

Conversation

AndersonQ
Member

What does this PR do?

The state diagnostics hook now re-fetches the state once when necessary, improving the accuracy of component status reporting.

Why is it important?

Due to a design choice, the coordinator's reported state might be stale, which can leave healthy components without version_info immediately after they start.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • [ ] I have added an entry in ./changelog/fragments using the changelog tool
  • [ ] I have added an integration test or an E2E test

Disruptive User Impact

How to test this PR locally

Run the TestDiagnosticState test
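For example, assuming a plain go test invocation from the repository root (the exact package path, build tags, or mage target may differ):

go test -v -run TestDiagnosticState ./...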

Related issues

Questions to ask yourself

  • How are we going to support this in production?
  • How are we going to measure its adoption?
  • How are we going to debug this?
  • What are the metrics I should take care of?
  • ...

@AndersonQ AndersonQ added Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team flaky-test Unstable or unreliable test cases. skip-changelog backport-8.15 Automated backport to the 8.15 branch with mergify labels Aug 21, 2024
@AndersonQ AndersonQ self-assigned this Aug 21, 2024
@AndersonQ AndersonQ requested a review from a team as a code owner August 21, 2024 10:08
@elasticmachine
Contributor

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

Contributor

@blakerouse blakerouse left a comment

I don't fully understand why this is being done. Do we really need it to have that guarantee? I don't feel like we do.

I don't like the idea that we are adjusting core code to fix a condition that the test should probably just handle.

Why not have the test fetch the state twice, instead of doing it internally in agent?

@AndersonQ
Member Author

Why not have the test fetch the state twice, instead of doing it internally in agent?

My reasoning was that we chose to allow this to happen, so we might as well try to prevent it from happening in the diagnostics. As it's just the diagnostics hook code, I believe it's localised enough not to lead to unexpected behaviour.

Also, it's been happening in the tests, so it isn't as unlikely as expected.

@blakerouse
Contributor

I don't really see how this fixes anything, as it just calls the same function again in that one case. So it's just reducing the chance that it could happen, but it still will happen.

@AndersonQ
Member Author

This is the build info that should arrive during the first check-in. So most likely the 1st check-in happens and the coordinator has everything updated internally, but because of the broadcaster, when the diagnostics hook runs the state it receives is stale. However, a second call should return the right state, because by the time the call happens again 1) the stale value has already been read by the 1st call and 2) the new value is on the broadcaster. Thus a 2nd call should be enough.
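As a rough illustration of that reasoning, the hook's behaviour amounts to reading the broadcast state a second time when the first read looks stale. Everything below (the package, type names, the stateReader interface, and the BuildHash check) is a stand-in sketch, not the agent's actual diagnostics code:

package diaghook

// componentState is a stand-in for the coordinator's per-component state.
type componentState struct {
	Healthy   bool
	BuildHash string // populated on the component's first check-in
}

// stateReader is a stand-in for whatever the diagnostics hook uses to read
// the coordinator's broadcast state.
type stateReader interface {
	State() []componentState
}

// diagnosticsState reads the state once and, if a healthy component is still
// missing its version info (a sign the broadcaster handed out a value from
// before the first check-in), reads it one more time.
func diagnosticsState(r stateReader) []componentState {
	s := r.State()
	for _, c := range s {
		if c.Healthy && c.BuildHash == "" {
			return r.State() // second read should observe the refreshed state
		}
	}
	return s
}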

@blakerouse
Contributor

@AndersonQ how are you confirming that the new value is present?

@ycombinator ycombinator force-pushed the 4215-flaky-TestComponentBuildHashInDiagnostics branch from db1be75 to 16c638f Compare August 21, 2024 19:13
@AndersonQ
Member Author

@AndersonQ how are you confirming that the new value is present?

I'm not. If the component is healthy, the check-in happened and the versionInfo was sent, so there isn't much reason to believe several retries would be needed. Besides, if the versionInfo weren't correctly updated, 1) the diagnostics should proceed and not get stuck anyway and 2) we'd catch it in our tests. Also, the diagnostics from the flaky test have all the versionInfo, showing a retry might be all we need.

@blakerouse
Contributor

@AndersonQ how are you confirming that the new value is present?

I'm not. If the component is healthy, the check-in happened and the versionInfo was sent, so there isn't much reason to believe several retries would be needed. Besides, if the versionInfo weren't correctly updated, 1) the diagnostics should proceed and not get stuck anyway and 2) we'd catch it in our tests. Also, the diagnostics from the flaky test have all the versionInfo, showing a retry might be all we need.

It is still unclear how a single retry provides the guarantee that versionInfo is now set. That is what I do not understand, and why I am not okay with this PR at the moment. Currently in my head this change is just reducing the chance slightly, but it is still possible.

@AndersonQ
Member Author

It is still unclear how a single retry provides the guarantee that versionInfo is now set. That is what I do not understand, and why I am not okay with this PR at the moment. Currently in my head this change is just reducing the chance slightly, but it is still possible.

The healthy status and the version_info should arrive together, so the update should be pretty much instantaneous. That's why I believe it's more than a slight reduction in the chance of getting an empty version_info.

Anyway, I added more retries with a timeout between them. If all the retries run, it's 250ms plus the normal code execution time, which should be pretty much negligible compared to the total time to get a diagnostics. Does it look more robust now?

cs.State.VersionInfo.BuildHash != "" &&
cs.State.VersionInfo.Meta["commit"] != "" {
break outerLoop
}
Contributor

This whole block would be much better as its own function, either inside getState or in another wrapper around it.

I'm still not a fan of this; again, you're just improving the odds and not solving the true issue. Could we somehow guarantee that when the status is updated to Healthy, the VersionInfo is set at the same time? Why isn't it?

If there is a bad component that never provides the versionInfo, it is also going to delay this result by the full 250 milliseconds every time.
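For illustration only, the refactor suggested above (pulling the retry into its own wrapper around the state fetch) might look roughly like this; the helper name, the fetch signature, and the 6 attempts with 50 ms pauses (about 250 ms of waiting, matching the figure mentioned in the thread) are assumptions, not the PR's actual code:

package diagretry

import "time"

// versionedState is a stand-in for the component state the hook inspects.
type versionedState struct {
	BuildHash string
	Commit    string
}

// getStateWithVersionInfo wraps a state fetch in a bounded retry: it keeps
// re-fetching (with a short pause) until the version info is populated or the
// retry budget is exhausted, then returns whatever it last saw.
func getStateWithVersionInfo(getState func() versionedState) versionedState {
	const attempts = 6                  // first try plus up to 5 retries
	const pause = 50 * time.Millisecond // 5 pauses ≈ 250 ms worst case

	s := getState()
	for i := 1; i < attempts; i++ {
		if s.BuildHash != "" && s.Commit != "" {
			break
		}
		time.Sleep(pause)
		s = getState()
	}
	return s
}

As the comment above points out, any wrapper of this shape still pays the full waiting budget for a component that never reports its version info.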

@AndersonQ
Member Author

[...] again, you're just improving the odds and not solving the true issue

That's the whole idea. The broadcaster was built with the possibility of reporting stale states and the coordinator uses it that way. Changing all of that isn't worth it, so improving the odds seems good enough.

the broadcaster might report a stale state:

// Race warning: if inputBuffer > 0 then the input's current value may be
// slightly (~1 scheduler cycle) ahead of its subscribers. In particular this
// sequence:
//
// b.InputChan <- newValue
// b.Get()

and how the coordinator uses it:

// Note: the uses of a buffered input channel in our broadcaster (the
// third parameter to broadcaster.New) means that it is possible for
// immediately adjacent writes/reads not to match, e.g.:
//
// stateBroadcaster.Set(newState)
// reportedState := stateBroadcaster.Get() // may not match newState
//
// We accept this intentionally to make sure Coordinator itself blocks
// as rarely as possible. Within Coordinator's goroutine, we can always
// get the latest synchronized value by reading the state struct directly,
// so this only affects external callers, and we accept that some of those
// might be behind by a scheduler interrupt or so.
//
// If this ever changes and we decide we need absolute causal
// synchronization in the subscriber API, just set the input buffer to 0.
stateBroadcaster: broadcaster.New(state, 64, 32),

the new state is sent on refreshState()

// Forward the current state to the broadcaster and clear the stateNeedsRefresh
// flag. Must be called on the main Coordinator goroutine.
func (c *Coordinator) refreshState() {
	c.stateBroadcaster.InputChan <- c.generateReportableState()
	c.stateNeedsRefresh = false
}

triggered by the coordinator's runLoop:

case componentState := <-c.managerChans.runtimeManagerUpdate:
	// New component change reported by the runtime manager via
	// Coordinator.watchRuntimeComponents(), merge it with the
	// Coordinator state.
	c.applyComponentState(componentState)

the broadcaster will update the value its Get sends as soon as it receives a new value:

b.selectCases[indexGetCase].Send = reflect.ValueOf(b.currentValue())

There is a lot of indirection before the new state finally arrives at the broadcaster to be consumed by the diagnostics hook. That's why I'm not trying to fix it completely, just to avoid the problem with a good old retry.

Our integration tests will show how good the fix is, and we can adjust accordingly. That's why I'd rather have a single retry for now, which would have minimal impact on the time spent collecting diagnostics and, I believe, should be enough to get an up-to-date state.
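To make the stale-read window concrete, here is a simplified, self-contained toy with the same shape as the pattern described above. It is not the agent's broadcaster package; the buffered input channel and forwarder goroutine are just an illustration of why a read issued right after a write can still return the previous value:

package main

import "fmt"

// toyBroadcaster mimics the shape described above: Set writes into a buffered
// input channel, and a forwarder goroutine serves Get, so a Get issued right
// after a Set may still observe the previous value.
type toyBroadcaster struct {
	inputChan chan string
	getChan   chan string
}

func newToyBroadcaster(initial string) *toyBroadcaster {
	b := &toyBroadcaster{
		inputChan: make(chan string, 64), // buffered so Set never blocks the producer
		getChan:   make(chan string),
	}
	go func() {
		current := initial
		for {
			select {
			case v := <-b.inputChan: // absorb a pending value when scheduled
				current = v
			case b.getChan <- current: // serve whatever has been absorbed so far
			}
		}
	}()
	return b
}

func (b *toyBroadcaster) Set(v string) { b.inputChan <- v }
func (b *toyBroadcaster) Get() string  { return <-b.getChan }

func main() {
	b := newToyBroadcaster("no version_info yet")
	b.Set("healthy + version_info")

	// The forwarder may not have drained inputChan yet, so this first read can
	// still return the stale value; a second read is very likely (though not
	// strictly guaranteed) to see the new one, which is the intuition behind
	// retrying in the diagnostics hook.
	fmt.Println("first read: ", b.Get())
	fmt.Println("second read:", b.Get())
}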

@blakerouse
Contributor

I still don't like this change, even with the explanation. Honestly I don't believe it is even fixing the problem, just masking it with better odds.

If the component is being updated to healthy, which package_version_test.go is ensuring, then it should have that version information.

Another option is to just adjust the test and make the allHealthy function to ensure that version information is present before reporting that the Elastic Agent is healthy. https://github.com/elastic/elastic-agent/blob/main/testing/integration/package_version_test.go#L94

You can use c.VersionInfo to verify that the status information has the version information. If it's present there then it should definitely be in the diagnostics.
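A rough sketch of that test-side check (the package and type shapes below are placeholders; the real test would use the control-protocol client's component state and its VersionInfo field):

package teststate

// Placeholder shapes standing in for the client package's component state.
type versionInfo struct {
	BuildHash string
	Meta      map[string]string
}

type component struct {
	Healthy     bool
	VersionInfo versionInfo
}

// allHealthyWithVersionInfo only reports ready once every component is healthy
// and has already reported its version information, so diagnostics collected
// afterwards should contain the build hash.
func allHealthyWithVersionInfo(components []component) bool {
	for _, c := range components {
		if !c.Healthy {
			return false
		}
		if c.VersionInfo.BuildHash == "" || c.VersionInfo.Meta["commit"] == "" {
			return false
		}
	}
	return true
}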

@AndersonQ
Member Author

Another option is to just adjust the test and make the allHealthy function to ensure that version information is present before reporting that the Elastic Agent is healthy. https://github.com/elastic/elastic-agent/blob/main/testing/integration/package_version_test.go#L94

This does not address the problem at all, it just hides the issue. That's why I'm against that approach.

Honestly I don't believe it is even fixing the problem, just masking it with better odds.

That's the idea ;)
It isn't supposed to be a perfect fix. The issue does not compromise the normal behaviour of the agent and a complete fix isn't worth it. That's why I chose a palliative fix, improving the odds of it succeeding with minimal overhead and added complexity.

We're seeing this test failing on CI, and tweaking the test to ignore the issue isn't a good approach or a good precedent. If we merge this in the current state we'll be able to see if the problem persists or not. If it persists, another approach is needed; if it doesn't, problem solved.
This at least attempts to address the real issue instead of just ignoring it. Besides, all the evidence (the failed tests) suggests a retry might be enough.

@blakerouse
Contributor

Another option is to just adjust the test and make the allHealthy function to ensure that version information is present before reporting that the Elastic Agent is healthy. https://github.com/elastic/elastic-agent/blob/main/testing/integration/package_version_test.go#L94

This does not address the problem at all, it just hides the issue. That's why I'm against that approach.

Honestly I don't believe it is even fixing the problem, just masking it with better odds.

That's the idea ;) It isn't supposed to be a perfect fix. The issue does not compromise the normal behaviour of the agent and a complete fix isn't worth it. That's why I chose a palliative fix, improving the odds of it succeeding with minimal overhead and added complexity.

We're seeing this test failing on CI, and tweaking the test to ignore the issue isn't a good approach or a good precedent. If we merge this in the current state we'll be able to see if the problem persists or not. If it persists, another approach is needed; if it doesn't, problem solved. This at least attempts to address the real issue instead of just ignoring it. Besides, all the evidence (the failed tests) suggests a retry might be enough.

Given that we know it's possible for the version information to be delayed (which, even with all the explanation in the PR, still doesn't make sense to me, since it should not be separate from a single Healthy update), and that we accept that delay, my suggestion is to just fix the test to ensure the version information is always there before performing diagnostics, as the test is explicitly testing the contents of the diagnostic output.

I don't think we actually care in the real world that it is missing on a freshly started Elastic Agent. That is why I am suggesting not making a change inside the core of the Elastic Agent at all. I have not seen an argument that shows we need this guarantee, except in this one test case.

Member

@pchila pchila left a comment

After reading the whole discussion and having a look at the code modification, I tend to agree with @blakerouse about fixing this in test code rather than adding retries to the diagnostic hooks.

If we accept that the version_info will eventually be available and dumped in the diagnostics, I would prefer to change the asserts in the test to reflect that.

Conversely, if we need stronger guarantees about the state returned by the broadcaster, that is a much bigger change. It's probably not worth it, since the chance of diagnostics being generated really quickly after components perform their first check-in is rather small, and we can always ask for more diagnostics.

@AndersonQ AndersonQ force-pushed the 4215-flaky-TestComponentBuildHashInDiagnostics branch from bf01310 to f1ff6bb Compare August 30, 2024 14:22
@AndersonQ
Member Author

There is something off, but let's get the flaky failures out of the way. I think this solution is good enough for now.

@AndersonQ AndersonQ requested review from blakerouse and pchila August 30, 2024 14:24
Contributor

@blakerouse blakerouse left a comment

Thank you for just adjusting the test!

Looks good.

@jlind23 jlind23 enabled auto-merge (squash) September 3, 2024 08:39
@AndersonQ AndersonQ disabled auto-merge September 3, 2024 08:39
Labels
backport-8.15 Automated backport to the 8.15 branch with mergify flaky-test Unstable or unreliable test cases. skip-changelog Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team
Development

Successfully merging this pull request may close these issues.

[Flaky Test]: TestComponentBuildHashInDiagnostics – build hash is empty
5 participants