-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
logmon: recover from shutting down call locally #5616
Conversation
Retry if we detect shutting down during Start() api call is started, locally.
} | ||
|
||
func (h *logmonHook) prestartOneLoop(ctx context.Context, req *interfaces.TaskPrestartRequest) error { | ||
// attach to a running logmon if state indicates one |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I opted not to change the logic here as the logic is somewhat brittle and I don't want to make it worse. But the idea is that if grpc call fails with shutting down, h.logmonPluginClient
would be non-nill and would be marked as Exited
, so we'll create a new logmon instance on retries.
@@ -73,5 +73,8 @@ func (p *Plugin) GRPCServer(broker *plugin.GRPCBroker, s *grpc.Server) error { | |||
} | |||
|
|||
func (p *Plugin) GRPCClient(ctx context.Context, broker *plugin.GRPCBroker, c *grpc.ClientConn) (interface{}, error) { | |||
return &logmonClient{client: proto.NewLogMonClient(c)}, nil | |||
return &logmonClient{ | |||
doneCtx: ctx, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not one 100% percent following the plugin ctx
and how it's set during the initialization phase - but this follows the pattern in other driver plugins , e.g. https://github.com/hashicorp/nomad/blob/v0.9.0/plugins/drivers/plugin.go#L39-L49 .
Interestingly the logmon plugin client doesn't embed BasePluginClient
, so might want to follow up on that with 0.9.2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know this code is very difficult to test, but it'd be nice to get some sort of coverage on it even if it's just small unit tests that rely on mocks.
c0114c7
to
94c9c57
Compare
94c9c57
to
978fc65
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Phenomenal job on the test! We should probably try synthesizing arbitrary pauses with SIGSTOP more!
// then we kill process while Start call is running | ||
require.NoError(t, proc.Signal(syscall.SIGSTOP)) | ||
// sleep for the signal to take effect | ||
time.Sleep(1 * time.Second) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How can you guarantee that the sleep is sufficient? Wondering if this will be flaky in CI
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know of a good way to test if a process is stopped - unlike killing. If it's flaky in CI, we can react - but considering that we will overhaul logmon plugin method to be inline with other plugin clients and this code would probably change, I'd avoid overengineering and react to CI problems as they come.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I take that back :). ended up using gopsutil.Process
utility to verify process sleep status in 658a734!
h.logger.Warn("logmon shutdown while making request", "error", err) | ||
|
||
if tries > 3 { | ||
return err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe log here that its out of retries before returning error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one nit and one question, otherwise LGTM
logmon: recover from shutting down call locally
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions. |
Alternative to #5615 - where we detect logmon plugin shutting down during
logmon.Start
call and retry it after killing the process.Here, we try 4 times with 1-second backoff. I opted to use a very simple backoff strategy, and punt on a more robust strategy for logmon failures to 0.9.2.
The condition is very hard to test, as plugin needs to be exiting but not exited yet prior to Start call. Open for suggestions.