[Flaky Test]: fix TestFakeInputSuite/TestManager_OutputChange #4290

AndersonQ · 2024-02-20T10:47:21Z

What does this PR do?

add debug logs on Manager.update for completeness
Refactor and rename TestManager_OutputChange
- use eventually instead of time.Sleep
- rename to better reflect what it tests
Remove TestManager_StartStop as TestManager_StartStopComponent already covers it

Why is it important?

TestFakeInputSuite/TestManager_OutputChange was flaky

Checklist

My code follows the style guidelines of this project
~~[ ] I have commented my code, particularly in hard-to-understand areas~~
~~[ ] I have made corresponding changes to the documentation~~
~~[ ] I have made corresponding change to the default configuration files~~
I have added tests that prove my fix is effective or that my feature works
~~[ ] I have added an entry in ./changelog/fragments using the changelog tool~~
~~[ ] I have added an integration test or an E2E test~~

How to test this PR locally

Run FakeInputSuite/TestManager_StartStopComponent, it should not be flaky

Related issues

Closes [Flaky Test]: TestFakeInputSuite/TestManager_OutputChange – Should be true #4286

Questions to ask yourself

How are we going to support this in production?
How are we going to measure its adoption?
How are we going to debug this?
What are the metrics I should take care of?
...

mergify · 2024-02-20T10:47:58Z

This pull request does not have a backport label. Could you fix it @AndersonQ? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

backport-v./d./d./d is the label to automatically backport to the 8./d branch. /d is the digit

NOTE: backport-skip has been added to this pull request.

elasticmachine · 2024-02-20T14:27:17Z

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

AndersonQ · 2024-02-20T14:30:37Z

pkg/component/runtime/manager.go

@@ -753,6 +753,7 @@ func (m *Manager) update(model component.Model, teardown bool) error {
 		if ok {
 			// existing component; send runtime updated value
 			existing.setCurrent(comp)
+			m.logger.Debugf("Updating component %q", comp.ID)


Added for completeness as there is a log for stopping components. There makes sense to have for starting and updating components as well

AndersonQ · 2024-02-20T14:30:41Z

pkg/component/runtime/manager.go

@@ -800,6 +801,7 @@ func (m *Manager) update(model component.Model, teardown bool) error {
 		m.currentMx.Lock()
 		m.current[comp.ID] = state
 		m.currentMx.Unlock()
+		m.logger.Debugf("Starting component %q", comp.ID)


Added for completeness as there is a log for stopping components. There makes sense to have for starting and updating components as well

AndersonQ · 2024-02-20T14:31:12Z

pkg/component/runtime/manager_fake_input_test.go

@@ -98,127 +99,6 @@ func (suite *FakeInputSuite) setupTestPaths() {
 	paths.SetVersionHome(false)
 }

-func (suite *FakeInputSuite) TestManager_StartStop() {


TestManager_StartStopComponent convers this test

- add debug logs on Manager.update for completeness - Refactor and rename TestManager_OutputChange - use eventually instead of time.Sleep - rename to better reflect what it tests - Remove TestManager_StartStop as TestManager_StartStopComponent already covers it

cmacknz · 2024-02-20T18:27:17Z

pkg/component/runtime/manager_fake_input_test.go

+		switch step.componentID {
+		case IDComp0:
+			switch step.state.State {
+			case client.UnitStateStarting:
+				comp0Started = true
+			case client.UnitStateStopped:
+				comp0Stopped = true
+			case client.UnitStateDegraded, client.UnitStateFailed:
+				comp0Failed = true
+			}
+		case IDComp1:
+			switch step.state.State {
+			case client.UnitStateStarting:
+				comp1Started = true
+			case client.UnitStateDegraded, client.UnitStateFailed:
+				comp1Failed = true
+			}


I'm not convinced you can ignore the order of operations.

If you don't do this, you risk bringing back a previous bug. See #4286 (comment) and the description of the change that introduced this test: #2226

Currently we start new components before stopping old. This is usually fine unless component is a service (start/stop is slower).
When new service is started old one can be still running and using resources needed for new service to start.
We can than see something like bind port already in use.

We care about the order of operations for the same input type, otherwise we can create situations where the inputs can't start.

I'm not convinced you can ignore the order of operations.

If you don't do this, you risk bringing back a previous bug. See #4286 (comment) and the description of the change that introduced this test: #2226

Currently we start new components before stopping old. This is usually fine unless component is a service (start/stop is slower).
When new service is started old one can be still running and using resources needed for new service to start.
We can than see something like bind port already in use.

We care about the order of operations for the same input type, otherwise we can create situations where the inputs can't start.

done, I kept the check on the order of the events

it does not return early if the stopping component is removed from the manager's current component

AndersonQ · 2024-02-22T11:19:07Z

pkg/component/runtime/manager.go

-		m.currentMx.RLock()
-		if _, exists := m.current[compID]; !exists {
-			m.currentMx.RUnlock()
-			return
-		}
-		m.currentMx.RUnlock()


@michalpristas do you remember why it'd be needed? It seems unnecessary to me. And if no tests fails, I'd believe it indeed is unnecessary

…tputChange

AndersonQ · 2024-02-22T11:25:35Z

pkg/component/runtime/manager_fake_input_test.go

+	// Wait the 1st config to be applied and the comp0ID to start
+	require.Eventuallyf(t,
+		func() bool {
+			filtered := logs.FilterMessageSnippet(
+				fmt.Sprintf("Starting component %q", comp0ID)).
+				TakeAll()
+			return len(filtered) > 0
+		},
+		30*time.Second,
+		200*time.Millisecond,
+		"component %s did not start", comp0ID)

-	updateSleep := 300 * time.Millisecond
-	if runtime.GOOS == component.Windows {
-		// windows is slow, preventing flakiness
-		updateSleep = time.Second
-	}
-	time.Sleep(updateSleep)
 	m.Update(component.Model{Components: components2})
 	err = <-m.errCh
-	require.NoError(t, err)
+	require.NoError(t, err, "expected no error from the manager when applying"+
+		"the 2nd component model")
+
+	// Wait the 2nd config to be applied and the comp1ID to start
+	require.Eventuallyf(t,
+		func() bool {
+			filtered := logs.FilterMessageSnippet(
+				fmt.Sprintf("Starting component %q", comp1ID)).
+				TakeAll()
+			return len(filtered) > 0
+		},
+		30*time.Second,
+		200*time.Millisecond,
+		"component %s did not start", comp1ID)

-	count := 0
-	timeout := 30 * time.Second
-	endTimer := time.NewTimer(timeout)
-	defer endTimer.Stop()
+	// component 1 started, we can stop the manager
+	cancel()

-LOOP:
-	for {
-		select {
-		case <-endTimer.C:
-			t.Fatalf("timed out after %s seconds, "+
-				"did not receive enought state changes", timeout)
-		case err := <-errCh:
-			require.NoError(t, err)
-		case err := <-subErrCh0:
-			t.Logf("[subErrCh0] received: %v", err)
-			require.NoError(t, err)
-			count++
-			if count >= 2 {
-				break LOOP
-			}
-		case err := <-subErrCh1:
-			t.Logf("[subErrCh1] received: %v", err)
-			require.NoError(t, err)
-			count++
-			if count >= 2 {
-				break LOOP
-			}
-		}
-	}
+	comp0StartLogs := logs.FilterMessageSnippet(
+		fmt.Sprintf("Starting component %q", comp0ID)).TakeAll()
+	comp0StopLogs := logs.FilterMessageSnippet(
+		fmt.Sprintf("Stopping component %q", comp0ID)).TakeAll()
+	comp1StartLogs := logs.FilterMessageSnippet(
+		fmt.Sprintf("Starting component %q", comp1ID)).TakeAll()

-	subCancel()
-	cancel()
+	assert.Len(t, comp0StartLogs, 1,
+		"component %d started more than once", comp0ID)
+	assert.Len(t, comp0StopLogs, 1,
+		"component %d stopped more than once", comp0ID)
+	assert.Len(t, comp1StartLogs, 1,
+		"component %d started more than once", comp1ID)

-	// check progression, require stop fake-0 before start fake-1
-	stateProgressionWG.Wait()
-	comp0Stopped := false
-	for _, step := range stateProgression {
-		if step.componentID == IDComp0 &&
-			step.state.State == client.UnitStateStopped {
-			comp0Stopped = true
-		}
-		if step.componentID == IDComp1 &&
-			step.state.State == client.UnitStateStarting {
-			require.True(t, comp0Stopped)
-			break
+	assert.Truef(t, comp0StopLogs[0].Time.Before(comp1StartLogs[0].Time),
+		"component %s stopped after %s", comp0ID, comp1ID)


I'm not a fan of relying on logs for tests, but due to the concurrent nature of how the state is reported and the test was collecting them, checking directly the logs are the simplest and mo reliable way to check the correct order of the events.

Agree this is fine.

You could additionally verify that the component 0 start time is before the component 0 stop time.

Right now the following ordering is allowed by the test (even though our logic should not produce it):

comp0stop -> comp1start -> comp0start

restore `waitForStopped` checking if the component is still on the manager's component list

… of github.com:AndersonQ/elastic-agent into 4286-flaky-TestFakeInputSuite/TestManager_OutputChange

…tputChange

elastic-sonarqube · 2024-02-22T14:25:29Z

Quality Gate passed

The SonarQube Quality Gate passed, but some issues were introduced.

1 New issue
0 Security Hotspots
54.5% 54.5% Coverage on New Code
0.0% 0.0% Duplication on New Code

See analysis details on SonarQube

- add debug logs on Manager.update for completeness - Refactor and rename TestManager_OutputChange - use eventually instead of time.Sleep - rename to better reflect what it tests - Remove TestManager_StartStop as TestManager_StartStopComponent already covers it (cherry picked from commit a5262e1) # Conflicts: # pkg/component/runtime/manager_fake_input_test.go

- add debug logs on Manager.update for completeness - Refactor and rename TestManager_OutputChange - use eventually instead of time.Sleep - rename to better reflect what it tests - Remove TestManager_StartStop as TestManager_StartStopComponent already covers it (cherry picked from commit a5262e1)

- add debug logs on Manager.update for completeness - Refactor and rename TestManager_OutputChange - use eventually instead of time.Sleep - rename to better reflect what it tests - Remove TestManager_StartStop as TestManager_StartStopComponent already covers it (cherry picked from commit a5262e1) Co-authored-by: Anderson Queiroz <anderson.queiroz@elastic.co>

AndersonQ added Team:Elastic-Agent Label for the Agent team skip-changelog labels Feb 20, 2024

AndersonQ self-assigned this Feb 20, 2024

mergify bot added the backport-skip label Feb 20, 2024

AndersonQ force-pushed the 4286-flaky-TestFakeInputSuite/TestManager_OutputChange branch 2 times, most recently from 806f833 to 1bef395 Compare February 20, 2024 14:26

AndersonQ changed the title ~~[Flaky Test]: debug TestFakeInputSuite/TestManager_OutputChange~~ [Flaky Test]: fix TestFakeInputSuite/TestManager_OutputChange Feb 20, 2024

AndersonQ marked this pull request as ready for review February 20, 2024 14:27

AndersonQ requested a review from a team as a code owner February 20, 2024 14:27

AndersonQ requested review from michalpristas and pchila February 20, 2024 14:27

AndersonQ enabled auto-merge (squash) February 20, 2024 14:27

AndersonQ force-pushed the 4286-flaky-TestFakeInputSuite/TestManager_OutputChange branch from 1bef395 to fca9e82 Compare February 20, 2024 14:32

AndersonQ commented Feb 20, 2024

View reviewed changes

AndersonQ force-pushed the 4286-flaky-TestFakeInputSuite/TestManager_OutputChange branch from fca9e82 to 6c69e1d Compare February 20, 2024 15:01

AndersonQ force-pushed the 4286-flaky-TestFakeInputSuite/TestManager_OutputChange branch from 6c69e1d to 8ee2e02 Compare February 20, 2024 15:41

cmacknz requested changes Feb 20, 2024

View reviewed changes

AndersonQ added 6 commits February 21, 2024 07:24

reafctor Manager.waitForStopped

d37d32b

it does not return early if the stopping component is removed from the manager's current component

require start/stop to happen in order, but compare timestamps instead

703ab45

fix race condition

64f23a3

better error message

7cd7bfb

change when timestamp is taken

d071cc2

refactor and try to remove the possible deadlock

f06b3da

AndersonQ commented Feb 22, 2024

View reviewed changes

use manager logs to determine the order of events

d207b4a

AndersonQ force-pushed the 4286-flaky-TestFakeInputSuite/TestManager_OutputChange branch from 9dee688 to d207b4a Compare February 22, 2024 11:23

Merge branch 'main' into 4286-flaky-TestFakeInputSuite/TestManager_Ou…

d31a104

…tputChange

AndersonQ commented Feb 22, 2024

View reviewed changes

AndersonQ requested a review from cmacknz February 22, 2024 11:26

AndersonQ added 2 commits February 22, 2024 12:44

restore waitForStopped check

1bd0dc2

restore `waitForStopped` checking if the component is still on the manager's component list

Merge branch '4286-flaky-TestFakeInputSuite/TestManager_OutputChange'…

5dfd9a9

… of github.com:AndersonQ/elastic-agent into 4286-flaky-TestFakeInputSuite/TestManager_OutputChange

pchila approved these changes Feb 22, 2024

View reviewed changes

AndersonQ added 2 commits February 22, 2024 13:23

fix potential deadlock

447878c

Merge branch 'main' into 4286-flaky-TestFakeInputSuite/TestManager_Ou…

41ed11f

…tputChange

cmacknz approved these changes Feb 22, 2024

View reviewed changes

AndersonQ merged commit a5262e1 into elastic:main Feb 22, 2024
9 checks passed

cmacknz added backport-v8.12.0 Automated backport with mergify backport-v8.13.0 Automated backport with mergify labels Feb 22, 2024

This was referenced Feb 22, 2024

[8.12](backport #4290) [Flaky Test]: fix TestFakeInputSuite/TestManager_OutputChange #4323

Closed

[8.13](backport #4290) [Flaky Test]: fix TestFakeInputSuite/TestManager_OutputChange #4324

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Flaky Test]: fix TestFakeInputSuite/TestManager_OutputChange #4290

[Flaky Test]: fix TestFakeInputSuite/TestManager_OutputChange #4290

AndersonQ commented Feb 20, 2024 •

edited

Loading

mergify bot commented Feb 20, 2024

elasticmachine commented Feb 20, 2024

AndersonQ Feb 20, 2024

AndersonQ Feb 20, 2024

AndersonQ Feb 20, 2024

cmacknz Feb 20, 2024 •

edited

Loading

AndersonQ Feb 22, 2024

AndersonQ Feb 22, 2024

AndersonQ Feb 22, 2024

cmacknz Feb 22, 2024

elastic-sonarqube bot commented Feb 22, 2024

[Flaky Test]: fix TestFakeInputSuite/TestManager_OutputChange #4290

[Flaky Test]: fix TestFakeInputSuite/TestManager_OutputChange #4290

Conversation

AndersonQ commented Feb 20, 2024 • edited Loading

What does this PR do?

Why is it important?

Checklist

How to test this PR locally

Related issues

Questions to ask yourself

mergify bot commented Feb 20, 2024

elasticmachine commented Feb 20, 2024

AndersonQ Feb 20, 2024

Choose a reason for hiding this comment

AndersonQ Feb 20, 2024

Choose a reason for hiding this comment

AndersonQ Feb 20, 2024

Choose a reason for hiding this comment

cmacknz Feb 20, 2024 • edited Loading

Choose a reason for hiding this comment

AndersonQ Feb 22, 2024

Choose a reason for hiding this comment

AndersonQ Feb 22, 2024

Choose a reason for hiding this comment

AndersonQ Feb 22, 2024

Choose a reason for hiding this comment

cmacknz Feb 22, 2024

Choose a reason for hiding this comment

elastic-sonarqube bot commented Feb 22, 2024

Quality Gate passed

AndersonQ commented Feb 20, 2024 •

edited

Loading

cmacknz Feb 20, 2024 •

edited

Loading