Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Flaky Test]: fix TestFakeInputSuite/TestManager_OutputChange #4290

Conversation

AndersonQ
Copy link
Member

@AndersonQ AndersonQ commented Feb 20, 2024

What does this PR do?

  • add debug logs on Manager.update for completeness
  • Refactor and rename TestManager_OutputChange
    • use eventually instead of time.Sleep
    • rename to better reflect what it tests
  • Remove TestManager_StartStop as TestManager_StartStopComponent already covers it

Why is it important?

TestFakeInputSuite/TestManager_OutputChange was flaky

Checklist

  • My code follows the style guidelines of this project
  • [ ] I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • [ ] I have added an entry in ./changelog/fragments using the changelog tool
  • [ ] I have added an integration test or an E2E test

How to test this PR locally

Run FakeInputSuite/TestManager_StartStopComponent, it should not be flaky

Related issues

Questions to ask yourself

  • How are we going to support this in production?
  • How are we going to measure its adoption?
  • How are we going to debug this?
  • What are the metrics I should take care of?
  • ...

@AndersonQ AndersonQ added Team:Elastic-Agent Label for the Agent team skip-changelog labels Feb 20, 2024
@AndersonQ AndersonQ self-assigned this Feb 20, 2024
Copy link
Contributor

mergify bot commented Feb 20, 2024

This pull request does not have a backport label. Could you fix it @AndersonQ? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v./d./d./d is the label to automatically backport to the 8./d branch. /d is the digit

NOTE: backport-skip has been added to this pull request.

@AndersonQ AndersonQ force-pushed the 4286-flaky-TestFakeInputSuite/TestManager_OutputChange branch 2 times, most recently from 806f833 to 1bef395 Compare February 20, 2024 14:26
@AndersonQ AndersonQ changed the title [Flaky Test]: debug TestFakeInputSuite/TestManager_OutputChange [Flaky Test]: fix TestFakeInputSuite/TestManager_OutputChange Feb 20, 2024
@AndersonQ AndersonQ marked this pull request as ready for review February 20, 2024 14:27
@AndersonQ AndersonQ requested a review from a team as a code owner February 20, 2024 14:27
@elasticmachine
Copy link
Contributor

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@AndersonQ AndersonQ enabled auto-merge (squash) February 20, 2024 14:27
@AndersonQ AndersonQ force-pushed the 4286-flaky-TestFakeInputSuite/TestManager_OutputChange branch from 1bef395 to fca9e82 Compare February 20, 2024 14:32
@@ -753,6 +753,7 @@ func (m *Manager) update(model component.Model, teardown bool) error {
if ok {
// existing component; send runtime updated value
existing.setCurrent(comp)
m.logger.Debugf("Updating component %q", comp.ID)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added for completeness as there is a log for stopping components. There makes sense to have for starting and updating components as well

@@ -800,6 +801,7 @@ func (m *Manager) update(model component.Model, teardown bool) error {
m.currentMx.Lock()
m.current[comp.ID] = state
m.currentMx.Unlock()
m.logger.Debugf("Starting component %q", comp.ID)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added for completeness as there is a log for stopping components. There makes sense to have for starting and updating components as well

@@ -98,127 +99,6 @@ func (suite *FakeInputSuite) setupTestPaths() {
paths.SetVersionHome(false)
}

func (suite *FakeInputSuite) TestManager_StartStop() {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TestManager_StartStopComponent convers this test

@AndersonQ AndersonQ force-pushed the 4286-flaky-TestFakeInputSuite/TestManager_OutputChange branch from fca9e82 to 6c69e1d Compare February 20, 2024 15:01
 - add debug logs on Manager.update for completeness
 - Refactor and rename TestManager_OutputChange
    - use eventually instead of time.Sleep
    - rename to better reflect what it tests
 - Remove TestManager_StartStop as TestManager_StartStopComponent already covers it
@AndersonQ AndersonQ force-pushed the 4286-flaky-TestFakeInputSuite/TestManager_OutputChange branch from 6c69e1d to 8ee2e02 Compare February 20, 2024 15:41
Comment on lines 3272 to 3288
switch step.componentID {
case IDComp0:
switch step.state.State {
case client.UnitStateStarting:
comp0Started = true
case client.UnitStateStopped:
comp0Stopped = true
case client.UnitStateDegraded, client.UnitStateFailed:
comp0Failed = true
}
case IDComp1:
switch step.state.State {
case client.UnitStateStarting:
comp1Started = true
case client.UnitStateDegraded, client.UnitStateFailed:
comp1Failed = true
}
Copy link
Member

@cmacknz cmacknz Feb 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not convinced you can ignore the order of operations.

If you don't do this, you risk bringing back a previous bug. See #4286 (comment) and the description of the change that introduced this test: #2226

Currently we start new components before stopping old. This is usually fine unless component is a service (start/stop is slower).
When new service is started old one can be still running and using resources needed for new service to start.
We can than see something like bind port already in use.

We care about the order of operations for the same input type, otherwise we can create situations where the inputs can't start.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not convinced you can ignore the order of operations.

If you don't do this, you risk bringing back a previous bug. See #4286 (comment) and the description of the change that introduced this test: #2226

Currently we start new components before stopping old. This is usually fine unless component is a service (start/stop is slower).
When new service is started old one can be still running and using resources needed for new service to start.
We can than see something like bind port already in use.

We care about the order of operations for the same input type, otherwise we can create situations where the inputs can't start.

done, I kept the check on the order of the events

Comment on lines 834 to 839
m.currentMx.RLock()
if _, exists := m.current[compID]; !exists {
m.currentMx.RUnlock()
return
}
m.currentMx.RUnlock()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@michalpristas do you remember why it'd be needed? It seems unnecessary to me. And if no tests fails, I'd believe it indeed is unnecessary

@AndersonQ AndersonQ force-pushed the 4286-flaky-TestFakeInputSuite/TestManager_OutputChange branch from 9dee688 to d207b4a Compare February 22, 2024 11:23
Comment on lines +3146 to +3193
// Wait the 1st config to be applied and the comp0ID to start
require.Eventuallyf(t,
func() bool {
filtered := logs.FilterMessageSnippet(
fmt.Sprintf("Starting component %q", comp0ID)).
TakeAll()
return len(filtered) > 0
},
30*time.Second,
200*time.Millisecond,
"component %s did not start", comp0ID)

updateSleep := 300 * time.Millisecond
if runtime.GOOS == component.Windows {
// windows is slow, preventing flakiness
updateSleep = time.Second
}
time.Sleep(updateSleep)
m.Update(component.Model{Components: components2})
err = <-m.errCh
require.NoError(t, err)
require.NoError(t, err, "expected no error from the manager when applying"+
"the 2nd component model")

// Wait the 2nd config to be applied and the comp1ID to start
require.Eventuallyf(t,
func() bool {
filtered := logs.FilterMessageSnippet(
fmt.Sprintf("Starting component %q", comp1ID)).
TakeAll()
return len(filtered) > 0
},
30*time.Second,
200*time.Millisecond,
"component %s did not start", comp1ID)

count := 0
timeout := 30 * time.Second
endTimer := time.NewTimer(timeout)
defer endTimer.Stop()
// component 1 started, we can stop the manager
cancel()

LOOP:
for {
select {
case <-endTimer.C:
t.Fatalf("timed out after %s seconds, "+
"did not receive enought state changes", timeout)
case err := <-errCh:
require.NoError(t, err)
case err := <-subErrCh0:
t.Logf("[subErrCh0] received: %v", err)
require.NoError(t, err)
count++
if count >= 2 {
break LOOP
}
case err := <-subErrCh1:
t.Logf("[subErrCh1] received: %v", err)
require.NoError(t, err)
count++
if count >= 2 {
break LOOP
}
}
}
comp0StartLogs := logs.FilterMessageSnippet(
fmt.Sprintf("Starting component %q", comp0ID)).TakeAll()
comp0StopLogs := logs.FilterMessageSnippet(
fmt.Sprintf("Stopping component %q", comp0ID)).TakeAll()
comp1StartLogs := logs.FilterMessageSnippet(
fmt.Sprintf("Starting component %q", comp1ID)).TakeAll()

subCancel()
cancel()
assert.Len(t, comp0StartLogs, 1,
"component %d started more than once", comp0ID)
assert.Len(t, comp0StopLogs, 1,
"component %d stopped more than once", comp0ID)
assert.Len(t, comp1StartLogs, 1,
"component %d started more than once", comp1ID)

// check progression, require stop fake-0 before start fake-1
stateProgressionWG.Wait()
comp0Stopped := false
for _, step := range stateProgression {
if step.componentID == IDComp0 &&
step.state.State == client.UnitStateStopped {
comp0Stopped = true
}
if step.componentID == IDComp1 &&
step.state.State == client.UnitStateStarting {
require.True(t, comp0Stopped)
break
assert.Truef(t, comp0StopLogs[0].Time.Before(comp1StartLogs[0].Time),
"component %s stopped after %s", comp0ID, comp1ID)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a fan of relying on logs for tests, but due to the concurrent nature of how the state is reported and the test was collecting them, checking directly the logs are the simplest and mo reliable way to check the correct order of the events.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree this is fine.

You could additionally verify that the component 0 start time is before the component 0 stop time.

Right now the following ordering is allowed by the test (even though our logic should not produce it):

comp0stop -> comp1start -> comp0start

@AndersonQ AndersonQ requested a review from cmacknz February 22, 2024 11:26
restore `waitForStopped` checking if the component is still on the manager's component list
… of github.com:AndersonQ/elastic-agent into 4286-flaky-TestFakeInputSuite/TestManager_OutputChange
Copy link

Quality Gate passed Quality Gate passed

The SonarQube Quality Gate passed, but some issues were introduced.

1 New issue
0 Security Hotspots
54.5% 54.5% Coverage on New Code
0.0% 0.0% Duplication on New Code

See analysis details on SonarQube

@AndersonQ AndersonQ merged commit a5262e1 into elastic:main Feb 22, 2024
9 checks passed
@cmacknz cmacknz added backport-v8.12.0 Automated backport with mergify backport-v8.13.0 Automated backport with mergify labels Feb 22, 2024
mergify bot pushed a commit that referenced this pull request Feb 22, 2024
- add debug logs on Manager.update for completeness
 - Refactor and rename TestManager_OutputChange
    - use eventually instead of time.Sleep
    - rename to better reflect what it tests
 - Remove TestManager_StartStop as TestManager_StartStopComponent already covers it

(cherry picked from commit a5262e1)

# Conflicts:
#	pkg/component/runtime/manager_fake_input_test.go
mergify bot pushed a commit that referenced this pull request Feb 22, 2024
- add debug logs on Manager.update for completeness
 - Refactor and rename TestManager_OutputChange
    - use eventually instead of time.Sleep
    - rename to better reflect what it tests
 - Remove TestManager_StartStop as TestManager_StartStopComponent already covers it

(cherry picked from commit a5262e1)
AndersonQ added a commit that referenced this pull request Feb 23, 2024
- add debug logs on Manager.update for completeness
 - Refactor and rename TestManager_OutputChange
    - use eventually instead of time.Sleep
    - rename to better reflect what it tests
 - Remove TestManager_StartStop as TestManager_StartStopComponent already covers it

(cherry picked from commit a5262e1)

Co-authored-by: Anderson Queiroz <anderson.queiroz@elastic.co>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-skip backport-v8.12.0 Automated backport with mergify backport-v8.13.0 Automated backport with mergify skip-changelog Team:Elastic-Agent Label for the Agent team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Flaky Test]: TestFakeInputSuite/TestManager_OutputChange – Should be true
4 participants