-
Notifications
You must be signed in to change notification settings - Fork 154
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve watcher and TestWatcher_AgentErrorQuick logs #5345
Improve watcher and TestWatcher_AgentErrorQuick logs #5345
Conversation
a5aaa17
to
6be7151
Compare
if t.Failed() { | ||
rawLogs := obs.All() | ||
for _, rawLog := range rawLogs { | ||
msg := fmt.Sprintf("[%s] %s", rawLog.Level, rawLog.Message) | ||
for k, v := range rawLog.ContextMap() { | ||
msg += fmt.Sprintf("%s=%v", k, v) | ||
} | ||
t.Log(msg) | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll add a helper function together with logger.NewTesting
to pretty print the logs like that on another PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane) |
315bf99
to
d5da3de
Compare
d5da3de
to
aae7675
Compare
@@ -93,14 +93,14 @@ func (ch *AgentWatcher) Run(ctx context.Context) { | |||
if failedErr == nil { | |||
flipFlopCount++ | |||
failedTimer.Reset(ch.checkInterval) | |||
ch.log.Error("Agent reported failure (starting failed timer): %s", err) | |||
ch.log.Errorf("Agent reported failure (starting failed timer): %s", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT: I think Zap supports %w behind the scenes. Would it be better to use %w for the errors? I know they won't be unwrapped, but at least it would be a clear visual indicator that you're formatting an error type.
} | ||
} else { | ||
if failedErr != nil { | ||
failedTimer.Stop() | ||
ch.log.Error("Agent reported healthy (failed timer stopped): %s", err) | ||
ch.log.Info("Agent reported healthy (failed timer stopped)") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we not going to lose useful info by ommiting the error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really want to add this as an Info log instead of an Error log?
I just want to ensure we won't generate logs that won't be useful for our users if they are in Info.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is the else branch of err != nil
, so actually the log is cleaner that way otherwise we'd have something like [...]<nil>
what does not make any sense unless you know how the log line was written
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see now, sorry, I got confused by the failedErr != nil check. In that case couldn't we log failedErr? Does that have any meaning to the error message?
@@ -138,7 +139,7 @@ LOOP: | |||
connectCancel() | |||
if err != nil { | |||
ch.connectCounter++ | |||
ch.log.Error("Failed connecting to running daemon: ", err) | |||
ch.log.Errorf("Failed connecting to running daemon: %s", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM; I’ve left some nitpicking comments. |
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a small nitpick on the message format when watcher loses connection to the agent (non-blocking).
Looks good overall
// agent has crashed or exited | ||
stateCancel() | ||
ch.agentClient.Disconnect() | ||
ch.log.Error("Lost connection: failed reading next state: ", err) | ||
ch.log.Errorf("Lost connection: failed reading next state: %s", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpick, maybe we can get a clearer message by offering context of what we were doing when we received the error followed by the error itself (if zap supports %w as @mauri870 said it could also be worth using it here)
ch.log.Errorf("Lost connection: failed reading next state: %s", err) | |
ch.log.Errorf("reading next state: lost connection to the agent: %s", err) |
(cherry picked from commit a9de876)
(cherry picked from commit a9de876)
What does this PR do?
It improves the watcher logs and prints the logs if TestWatcher_AgentErrorQuick fails.
If the test would fail, the logs would be like the following:
Why is it important?
TestWatcher_AgentErrorQuick
was flaky before, but it hasn't happened again on CI. Even running it for 12 hours didn't reproduce the problem.Checklist
[ ] I have commented my code, particularly in hard-to-understand areas[ ] I have made corresponding changes to the documentation[ ] I have made corresponding change to the default configuration files[ ] I have added tests that prove my fix is effective or that my feature works[ ] I have added an entry in./changelog/fragments
using the changelog tool[ ] I have added an integration test or an E2E testDisruptive User Impact
None
How to test this PR locally
make
TestWatcher_AgentErrorQuick
to fail by addingt.Fail()
at the end of the test and then run the test.Related issues
Questions to ask yourself