-
Notifications
You must be signed in to change notification settings - Fork 154
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Flaky Test]: fix TestFakeInputSuite/TestManager_OutputChange #4290
[Flaky Test]: fix TestFakeInputSuite/TestManager_OutputChange #4290
Conversation
This pull request does not have a backport label. Could you fix it @AndersonQ? 🙏
NOTE: |
806f833
to
1bef395
Compare
Pinging @elastic/elastic-agent (Team:Elastic-Agent) |
1bef395
to
fca9e82
Compare
pkg/component/runtime/manager.go
Outdated
@@ -753,6 +753,7 @@ func (m *Manager) update(model component.Model, teardown bool) error { | |||
if ok { | |||
// existing component; send runtime updated value | |||
existing.setCurrent(comp) | |||
m.logger.Debugf("Updating component %q", comp.ID) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added for completeness as there is a log for stopping components. There makes sense to have for starting and updating components as well
@@ -800,6 +801,7 @@ func (m *Manager) update(model component.Model, teardown bool) error { | |||
m.currentMx.Lock() | |||
m.current[comp.ID] = state | |||
m.currentMx.Unlock() | |||
m.logger.Debugf("Starting component %q", comp.ID) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added for completeness as there is a log for stopping components. There makes sense to have for starting and updating components as well
@@ -98,127 +99,6 @@ func (suite *FakeInputSuite) setupTestPaths() { | |||
paths.SetVersionHome(false) | |||
} | |||
|
|||
func (suite *FakeInputSuite) TestManager_StartStop() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TestManager_StartStopComponent convers this test
fca9e82
to
6c69e1d
Compare
- add debug logs on Manager.update for completeness - Refactor and rename TestManager_OutputChange - use eventually instead of time.Sleep - rename to better reflect what it tests - Remove TestManager_StartStop as TestManager_StartStopComponent already covers it
6c69e1d
to
8ee2e02
Compare
switch step.componentID { | ||
case IDComp0: | ||
switch step.state.State { | ||
case client.UnitStateStarting: | ||
comp0Started = true | ||
case client.UnitStateStopped: | ||
comp0Stopped = true | ||
case client.UnitStateDegraded, client.UnitStateFailed: | ||
comp0Failed = true | ||
} | ||
case IDComp1: | ||
switch step.state.State { | ||
case client.UnitStateStarting: | ||
comp1Started = true | ||
case client.UnitStateDegraded, client.UnitStateFailed: | ||
comp1Failed = true | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not convinced you can ignore the order of operations.
If you don't do this, you risk bringing back a previous bug. See #4286 (comment) and the description of the change that introduced this test: #2226
Currently we start new components before stopping old. This is usually fine unless component is a service (start/stop is slower).
When new service is started old one can be still running and using resources needed for new service to start.
We can than see something like bind port already in use.
We care about the order of operations for the same input type, otherwise we can create situations where the inputs can't start.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not convinced you can ignore the order of operations.
If you don't do this, you risk bringing back a previous bug. See #4286 (comment) and the description of the change that introduced this test: #2226
Currently we start new components before stopping old. This is usually fine unless component is a service (start/stop is slower).
When new service is started old one can be still running and using resources needed for new service to start.
We can than see something like bind port already in use.We care about the order of operations for the same input type, otherwise we can create situations where the inputs can't start.
done, I kept the check on the order of the events
it does not return early if the stopping component is removed from the manager's current component
pkg/component/runtime/manager.go
Outdated
m.currentMx.RLock() | ||
if _, exists := m.current[compID]; !exists { | ||
m.currentMx.RUnlock() | ||
return | ||
} | ||
m.currentMx.RUnlock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@michalpristas do you remember why it'd be needed? It seems unnecessary to me. And if no tests fails, I'd believe it indeed is unnecessary
9dee688
to
d207b4a
Compare
// Wait the 1st config to be applied and the comp0ID to start | ||
require.Eventuallyf(t, | ||
func() bool { | ||
filtered := logs.FilterMessageSnippet( | ||
fmt.Sprintf("Starting component %q", comp0ID)). | ||
TakeAll() | ||
return len(filtered) > 0 | ||
}, | ||
30*time.Second, | ||
200*time.Millisecond, | ||
"component %s did not start", comp0ID) | ||
|
||
updateSleep := 300 * time.Millisecond | ||
if runtime.GOOS == component.Windows { | ||
// windows is slow, preventing flakiness | ||
updateSleep = time.Second | ||
} | ||
time.Sleep(updateSleep) | ||
m.Update(component.Model{Components: components2}) | ||
err = <-m.errCh | ||
require.NoError(t, err) | ||
require.NoError(t, err, "expected no error from the manager when applying"+ | ||
"the 2nd component model") | ||
|
||
// Wait the 2nd config to be applied and the comp1ID to start | ||
require.Eventuallyf(t, | ||
func() bool { | ||
filtered := logs.FilterMessageSnippet( | ||
fmt.Sprintf("Starting component %q", comp1ID)). | ||
TakeAll() | ||
return len(filtered) > 0 | ||
}, | ||
30*time.Second, | ||
200*time.Millisecond, | ||
"component %s did not start", comp1ID) | ||
|
||
count := 0 | ||
timeout := 30 * time.Second | ||
endTimer := time.NewTimer(timeout) | ||
defer endTimer.Stop() | ||
// component 1 started, we can stop the manager | ||
cancel() | ||
|
||
LOOP: | ||
for { | ||
select { | ||
case <-endTimer.C: | ||
t.Fatalf("timed out after %s seconds, "+ | ||
"did not receive enought state changes", timeout) | ||
case err := <-errCh: | ||
require.NoError(t, err) | ||
case err := <-subErrCh0: | ||
t.Logf("[subErrCh0] received: %v", err) | ||
require.NoError(t, err) | ||
count++ | ||
if count >= 2 { | ||
break LOOP | ||
} | ||
case err := <-subErrCh1: | ||
t.Logf("[subErrCh1] received: %v", err) | ||
require.NoError(t, err) | ||
count++ | ||
if count >= 2 { | ||
break LOOP | ||
} | ||
} | ||
} | ||
comp0StartLogs := logs.FilterMessageSnippet( | ||
fmt.Sprintf("Starting component %q", comp0ID)).TakeAll() | ||
comp0StopLogs := logs.FilterMessageSnippet( | ||
fmt.Sprintf("Stopping component %q", comp0ID)).TakeAll() | ||
comp1StartLogs := logs.FilterMessageSnippet( | ||
fmt.Sprintf("Starting component %q", comp1ID)).TakeAll() | ||
|
||
subCancel() | ||
cancel() | ||
assert.Len(t, comp0StartLogs, 1, | ||
"component %d started more than once", comp0ID) | ||
assert.Len(t, comp0StopLogs, 1, | ||
"component %d stopped more than once", comp0ID) | ||
assert.Len(t, comp1StartLogs, 1, | ||
"component %d started more than once", comp1ID) | ||
|
||
// check progression, require stop fake-0 before start fake-1 | ||
stateProgressionWG.Wait() | ||
comp0Stopped := false | ||
for _, step := range stateProgression { | ||
if step.componentID == IDComp0 && | ||
step.state.State == client.UnitStateStopped { | ||
comp0Stopped = true | ||
} | ||
if step.componentID == IDComp1 && | ||
step.state.State == client.UnitStateStarting { | ||
require.True(t, comp0Stopped) | ||
break | ||
assert.Truef(t, comp0StopLogs[0].Time.Before(comp1StartLogs[0].Time), | ||
"component %s stopped after %s", comp0ID, comp1ID) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not a fan of relying on logs for tests, but due to the concurrent nature of how the state is reported and the test was collecting them, checking directly the logs are the simplest and mo reliable way to check the correct order of the events.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree this is fine.
You could additionally verify that the component 0 start time is before the component 0 stop time.
Right now the following ordering is allowed by the test (even though our logic should not produce it):
comp0stop -> comp1start -> comp0start
restore `waitForStopped` checking if the component is still on the manager's component list
… of github.com:AndersonQ/elastic-agent into 4286-flaky-TestFakeInputSuite/TestManager_OutputChange
|
- add debug logs on Manager.update for completeness - Refactor and rename TestManager_OutputChange - use eventually instead of time.Sleep - rename to better reflect what it tests - Remove TestManager_StartStop as TestManager_StartStopComponent already covers it (cherry picked from commit a5262e1) # Conflicts: # pkg/component/runtime/manager_fake_input_test.go
- add debug logs on Manager.update for completeness - Refactor and rename TestManager_OutputChange - use eventually instead of time.Sleep - rename to better reflect what it tests - Remove TestManager_StartStop as TestManager_StartStopComponent already covers it (cherry picked from commit a5262e1)
- add debug logs on Manager.update for completeness - Refactor and rename TestManager_OutputChange - use eventually instead of time.Sleep - rename to better reflect what it tests - Remove TestManager_StartStop as TestManager_StartStopComponent already covers it (cherry picked from commit a5262e1) Co-authored-by: Anderson Queiroz <anderson.queiroz@elastic.co>
What does this PR do?
Why is it important?
TestFakeInputSuite/TestManager_OutputChange was flaky
Checklist
[ ] I have commented my code, particularly in hard-to-understand areas[ ] I have made corresponding changes to the documentation[ ] I have made corresponding change to the default configuration files[ ] I have added an entry in./changelog/fragments
using the changelog tool[ ] I have added an integration test or an E2E testHow to test this PR locally
Run
FakeInputSuite/TestManager_StartStopComponent
, it should not be flakyRelated issues
Questions to ask yourself