Improve watcher and TestWatcher_AgentErrorQuick logs #5345

AndersonQ · 2024-08-23T07:41:47Z

What does this PR do?

It improves the watcher logs and prints the captured logs if TestWatcher_AgentErrorQuick fails.
If the test fails, the output looks like the following:

❯ go test -run TestWatcher_AgentErrorQuick ./internal/pkg/agent/application/upgrade/
--- FAIL: TestWatcher_AgentErrorQuick (1.00s)
    watcher_test.go:287: [info] Agent watcher started
    watcher_test.go:287: [info] Trying to connect to agent
    watcher_test.go:287: [info] Connected to agent
    watcher_test.go:287: [debug] received state: FAILED:force failure
    watcher_test.go:287: [info] Communicating with PID 0
    watcher_test.go:287: [debug] received state: HEALTHY:healthy
    watcher_test.go:287: [error] Agent reported failure (starting failed timer): agent reported failed state: force failure
    watcher_test.go:287: [info] Agent reported healthy (failed timer stopped)
    watcher_test.go:287: [debug] received state: error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
    watcher_test.go:287: [error] Lost connection: failed reading next state: rpc error: code = DeadlineExceeded desc = context deadline exceeded
FAIL
FAIL	github.com/elastic/elastic-agent/internal/pkg/agent/application/upgrade	1.351s
FAIL

Why is it important?

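TestWatcher_AgentErrorQuick was flaky before, but the failure hasn't recurred on CI. Even running the test repeatedly for 12 hours didn't reproduce the problem, so the logs are now printed on failure to help diagnose it if it happens again.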

Checklist

My code follows the style guidelines of this project
~~[ ] I have commented my code, particularly in hard-to-understand areas~~
~~[ ] I have made corresponding changes to the documentation~~
~~[ ] I have made corresponding changes to the default configuration files~~
~~[ ] I have added tests that prove my fix is effective or that my feature works~~
~~[ ] I have added an entry in ./changelog/fragments using the changelog tool~~
~~[ ] I have added an integration test or an E2E test~~

Disruptive User Impact

None

How to test this PR locally

Make TestWatcher_AgentErrorQuick fail by adding t.Fail() at the end of the test, then run the test.

❯ go test -count 43200 -run TestWatcher_AgentErrorQuick -timeout=0
PASS
ok      github.com/elastic/elastic-agent/internal/pkg/agent/application/upgrade 43320.474s

Related issues

Closes Flaky Test TestWatcher_AgentErrorQuick #3983

Questions to ask yourself

How are we going to support this in production?
How are we going to measure its adoption?
How are we going to debug this?
What are the metrics I should take care of?
...

AndersonQ · 2024-08-23T07:43:28Z

internal/pkg/agent/application/upgrade/watcher_test.go

+		if t.Failed() {
+			// Dump the captured logs only when the test has failed.
+			rawLogs := obs.All()
+			for _, rawLog := range rawLogs {
+				msg := fmt.Sprintf("[%s] %s", rawLog.Level, rawLog.Message)
+				for k, v := range rawLog.ContextMap() {
+					// Leading space keeps each key=value pair separate.
+					msg += fmt.Sprintf(" %s=%v", k, v)
+				}
+				t.Log(msg)
+			}
+		}
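For readers who want to try the formatting outside the test, here is a minimal standalone sketch of the same idea. The `entry` type and `formatEntry` function are hypothetical stand-ins invented for this example, not the observer types used in the diff. Sorting the keys and separating pairs with a space are choices of this sketch, made because Go map iteration order is not deterministic.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// entry is a hypothetical stand-in for a captured log entry
// (level, message, structured context). It is not the type used in the PR.
type entry struct {
	Level   string
	Message string
	Context map[string]any
}

// formatEntry renders "[level] message key=value ..." with keys sorted,
// because Go map iteration order is not deterministic.
func formatEntry(e entry) string {
	var sb strings.Builder
	fmt.Fprintf(&sb, "[%s] %s", e.Level, e.Message)

	keys := make([]string, 0, len(e.Context))
	for k := range e.Context {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	for _, k := range keys {
		// The leading space keeps each pair separate from what precedes it.
		fmt.Fprintf(&sb, " %s=%v", k, e.Context[k])
	}
	return sb.String()
}

func main() {
	e := entry{
		Level:   "info",
		Message: "Communicating with PID 0",
		Context: map[string]any{"pid": 0, "attempt": 1},
	}
	fmt.Println(formatEntry(e))
}
```

With sorted keys, two runs of a failing test produce identical log lines, which makes diffs between runs easier to read.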


I'll add a helper function, together with logger.NewTesting, to pretty-print the logs like that in another PR.

elasticmachine · 2024-08-23T07:43:38Z

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

mauri870 · 2024-08-23T11:31:45Z

internal/pkg/agent/application/upgrade/watcher.go

@@ -93,14 +93,14 @@ func (ch *AgentWatcher) Run(ctx context.Context) {
 					if failedErr == nil {
 						flipFlopCount++
 						failedTimer.Reset(ch.checkInterval)
-						ch.log.Error("Agent reported failure (starting failed timer): %s", err)
+						ch.log.Errorf("Agent reported failure (starting failed timer): %s", err)


NIT: I think Zap supports %w behind the scenes. Would it be better to use %w for the errors? I know they won't be unwrapped, but at least it would be a clear visual indicator that you're formatting an error type.

mauri870 · 2024-08-23T11:33:56Z

internal/pkg/agent/application/upgrade/watcher.go

 					}
 				} else {
 					if failedErr != nil {
 						failedTimer.Stop()
-						ch.log.Error("Agent reported healthy (failed timer stopped): %s", err)
+						ch.log.Info("Agent reported healthy (failed timer stopped)")


Are we not going to lose useful info by omitting the error?

Do we really want to add this as an Info log instead of an Error log?
I just want to ensure we won't generate logs that won't be useful for our users if they are in Info.

This is the else branch of err != nil, so the log is actually cleaner that way. Otherwise we'd have something like [...]<nil>, which doesn't make any sense unless you know how the log line was written.

I see now, sorry, I got confused by the failedErr != nil check. In that case, couldn't we log failedErr? Would it add anything meaningful to the message?

mauri870 · 2024-08-23T11:34:27Z

internal/pkg/agent/application/upgrade/watcher.go

@@ -138,7 +139,7 @@ LOOP:
 			connectCancel()
 			if err != nil {
 				ch.connectCounter++
-				ch.log.Error("Failed connecting to running daemon: ", err)
+				ch.log.Errorf("Failed connecting to running daemon: %s", err)


Ditto https://github.com/elastic/elastic-agent/pull/5345/files#r1728826302

mauri870 · 2024-08-23T11:37:35Z

LGTM; I’ve left some nitpicking comments.

elastic-sonarqube · 2024-08-23T13:42:56Z

Quality Gate passed

Issues
0 New issues
0 Fixed issues
0 Accepted issues

Measures
0 Security Hotspots
85.7% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube

pchila

Just a small nitpick on the message format when watcher loses connection to the agent (non-blocking).

Looks good overall

pchila · 2024-08-26T09:52:15Z

internal/pkg/agent/application/upgrade/watcher.go

 					// agent has crashed or exited
 					stateCancel()
 					ch.agentClient.Disconnect()
-					ch.log.Error("Lost connection: failed reading next state: ", err)
+					ch.log.Errorf("Lost connection: failed reading next state: %s", err)


Nitpick: maybe we can get a clearer message by first giving the context of what we were doing when we received the error, followed by the error itself. (If zap supports %w, as @mauri870 said, it could also be worth using it here.)

Suggested change

-					ch.log.Errorf("Lost connection: failed reading next state: %s", err)
+					ch.log.Errorf("reading next state: lost connection to the agent: %s", err)

AndersonQ added the Team:Elastic-Agent-Control-Plane, flaky-test, skip-changelog, and backport-8.15 labels Aug 23, 2024

AndersonQ self-assigned this Aug 23, 2024

AndersonQ force-pushed the 3983-falky-TestWatcher_AgentErrorQuick branch from a5aaa17 to 6be7151 Compare August 23, 2024 07:42

AndersonQ marked this pull request as ready for review August 23, 2024 07:43

AndersonQ requested a review from a team as a code owner August 23, 2024 07:43

AndersonQ requested review from andrzej-stencel and pchila August 23, 2024 07:43

AndersonQ force-pushed the 3983-falky-TestWatcher_AgentErrorQuick branch 2 times, most recently from 315bf99 to d5da3de Compare August 23, 2024 08:17

AndersonQ mentioned this pull request Aug 23, 2024

Improve testing logger #5346

Merged

3 tasks

improve watcher logs and TestWatcher_AgentErrorQuick logs

aae7675

AndersonQ force-pushed the 3983-falky-TestWatcher_AgentErrorQuick branch from d5da3de to aae7675 Compare August 23, 2024 08:25

mauri870 reviewed Aug 23, 2024

mauri870 approved these changes Aug 23, 2024

AndersonQ enabled auto-merge (squash) August 23, 2024 14:59

pchila approved these changes Aug 26, 2024

AndersonQ merged commit a9de876 into elastic:main Aug 26, 2024
13 checks passed

mergify bot pushed a commit that referenced this pull request Aug 26, 2024

improve watcher logs and TestWatcher_AgentErrorQuick logs (#5345)

428868c

(cherry picked from commit a9de876)

mergify bot mentioned this pull request Aug 26, 2024

[8.15](backport #5345) Improve watcher and TestWatcher_AgentErrorQuick logs #5357

Merged

1 task

AndersonQ deleted the 3983-falky-TestWatcher_AgentErrorQuick branch August 26, 2024 09:58

AndersonQ added the backport-v8.x label Sep 3, 2024

AndersonQ added a commit that referenced this pull request Sep 3, 2024

improve watcher logs and TestWatcher_AgentErrorQuick logs (#5345)

679874b

(cherry picked from commit a9de876)

This was referenced Sep 6, 2024

[8.15](backport #5346) Improve testing logger #5447

Merged

[8.x](backport #5346) Improve testing logger #5448

Closed

AndersonQ added a commit that referenced this pull request Sep 9, 2024

improve watcher logs and TestWatcher_AgentErrorQuick logs (#5345) (#5357)

fc28030

(cherry picked from commit a9de876)

Co-authored-by: Anderson Queiroz <anderson.queiroz@elastic.co>
