Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve watcher and TestWatcher_AgentErrorQuick logs #5345

Merged

Conversation

AndersonQ
Copy link
Member

@AndersonQ AndersonQ commented Aug 23, 2024

What does this PR do?

It improves the watcher logs and prints the logs if TestWatcher_AgentErrorQuick fails.
If the test would fail, the logs would be like the following:

❯ go test -run TestWatcher_AgentErrorQuick ./internal/pkg/agent/application/upgrade/
--- FAIL: TestWatcher_AgentErrorQuick (1.00s)
    watcher_test.go:287: [info] Agent watcher started
    watcher_test.go:287: [info] Trying to connect to agent
    watcher_test.go:287: [info] Connected to agent
    watcher_test.go:287: [debug] received state: FAILED:force failure
    watcher_test.go:287: [info] Communicating with PID 0
    watcher_test.go:287: [debug] received state: HEALTHY:healthy
    watcher_test.go:287: [error] Agent reported failure (starting failed timer): agent reported failed state: force failure
    watcher_test.go:287: [info] Agent reported healthy (failed timer stopped)
    watcher_test.go:287: [debug] received state: error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
    watcher_test.go:287: [error] Lost connection: failed reading next state: rpc error: code = DeadlineExceeded desc = context deadline exceeded
FAIL
FAIL	github.com/elastic/elastic-agent/internal/pkg/agent/application/upgrade	1.351s
FAIL

Why is it important?

TestWatcher_AgentErrorQuick was flaky before, but it hasn't happened again on CI. Even running it for 12 hours didn't reproduce the problem.

Checklist

  • My code follows the style guidelines of this project
  • [ ] I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding change to the default configuration files
  • [ ] I have added tests that prove my fix is effective or that my feature works
  • [ ] I have added an entry in ./changelog/fragments using the changelog tool
  • [ ] I have added an integration test or an E2E test

Disruptive User Impact

None

How to test this PR locally

make TestWatcher_AgentErrorQuick to fail by adding t.Fail() at the end of the test and then run the test.

❯ go test -count 43200 -run TestWatcher_AgentErrorQuick -timeout=0
PASS
ok      github.com/elastic/elastic-agent/internal/pkg/agent/application/upgrade 43320.474s

Related issues

Questions to ask yourself

  • How are we going to support this in production?
  • How are we going to measure its adoption?
  • How are we going to debug this?
  • What are the metrics I should take care of?
  • ...

@AndersonQ AndersonQ added Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team flaky-test Unstable or unreliable test cases. skip-changelog backport-8.15 Automated backport to the 8.15 branch with mergify labels Aug 23, 2024
@AndersonQ AndersonQ self-assigned this Aug 23, 2024
@AndersonQ AndersonQ force-pushed the 3983-falky-TestWatcher_AgentErrorQuick branch from a5aaa17 to 6be7151 Compare August 23, 2024 07:42
Comment on lines +280 to +289
if t.Failed() {
rawLogs := obs.All()
for _, rawLog := range rawLogs {
msg := fmt.Sprintf("[%s] %s", rawLog.Level, rawLog.Message)
for k, v := range rawLog.ContextMap() {
msg += fmt.Sprintf("%s=%v", k, v)
}
t.Log(msg)
}
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add a helper function together with logger.NewTesting to pretty print the logs like that on another PR

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AndersonQ AndersonQ marked this pull request as ready for review August 23, 2024 07:43
@AndersonQ AndersonQ requested a review from a team as a code owner August 23, 2024 07:43
@elasticmachine
Copy link
Contributor

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@AndersonQ AndersonQ force-pushed the 3983-falky-TestWatcher_AgentErrorQuick branch 2 times, most recently from 315bf99 to d5da3de Compare August 23, 2024 08:17
@AndersonQ AndersonQ mentioned this pull request Aug 23, 2024
3 tasks
@AndersonQ AndersonQ force-pushed the 3983-falky-TestWatcher_AgentErrorQuick branch from d5da3de to aae7675 Compare August 23, 2024 08:25
@@ -93,14 +93,14 @@ func (ch *AgentWatcher) Run(ctx context.Context) {
if failedErr == nil {
flipFlopCount++
failedTimer.Reset(ch.checkInterval)
ch.log.Error("Agent reported failure (starting failed timer): %s", err)
ch.log.Errorf("Agent reported failure (starting failed timer): %s", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: I think Zap supports %w behind the scenes. Would it be better to use %w for the errors? I know they won't be unwrapped, but at least it would be a clear visual indicator that you're formatting an error type.

}
} else {
if failedErr != nil {
failedTimer.Stop()
ch.log.Error("Agent reported healthy (failed timer stopped): %s", err)
ch.log.Info("Agent reported healthy (failed timer stopped)")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we not going to lose useful info by ommiting the error?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really want to add this as an Info log instead of an Error log?
I just want to ensure we won't generate logs that won't be useful for our users if they are in Info.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the else branch of err != nil, so actually the log is cleaner that way otherwise we'd have something like [...]<nil> what does not make any sense unless you know how the log line was written

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see now, sorry, I got confused by the failedErr != nil check. In that case couldn't we log failedErr? Does that have any meaning to the error message?

@@ -138,7 +139,7 @@ LOOP:
connectCancel()
if err != nil {
ch.connectCounter++
ch.log.Error("Failed connecting to running daemon: ", err)
ch.log.Errorf("Failed connecting to running daemon: %s", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mauri870
Copy link
Member

LGTM; I’ve left some nitpicking comments.

@AndersonQ AndersonQ enabled auto-merge (squash) August 23, 2024 14:59
Copy link
Member

@pchila pchila left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a small nitpick on the message format when watcher loses connection to the agent (non-blocking).

Looks good overall

// agent has crashed or exited
stateCancel()
ch.agentClient.Disconnect()
ch.log.Error("Lost connection: failed reading next state: ", err)
ch.log.Errorf("Lost connection: failed reading next state: %s", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick, maybe we can get a clearer message by offering context of what we were doing when we received the error followed by the error itself (if zap supports %w as @mauri870 said it could also be worth using it here)

Suggested change
ch.log.Errorf("Lost connection: failed reading next state: %s", err)
ch.log.Errorf("reading next state: lost connection to the agent: %s", err)

@AndersonQ AndersonQ merged commit a9de876 into elastic:main Aug 26, 2024
13 checks passed
mergify bot pushed a commit that referenced this pull request Aug 26, 2024
@AndersonQ AndersonQ deleted the 3983-falky-TestWatcher_AgentErrorQuick branch August 26, 2024 09:58
AndersonQ added a commit that referenced this pull request Sep 3, 2024
AndersonQ added a commit that referenced this pull request Sep 9, 2024
)

(cherry picked from commit a9de876)

Co-authored-by: Anderson Queiroz <anderson.queiroz@elastic.co>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-8.15 Automated backport to the 8.15 branch with mergify flaky-test Unstable or unreliable test cases. skip-changelog Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Flaky Test TestWatcher_AgentErrorQuick
5 participants