logmon: recover from shutting down call locally #5616

notnoop · 2019-04-25T19:21:12Z

Alternative to #5615 - where we detect logmon plugin shutting down during logmon.Start call and retry it after killing the process.

Here, we try 4 times with 1-second backoff. I opted to use a very simple backoff strategy, and punt on a more robust strategy for logmon failures to 0.9.2.

The condition is very hard to test, as the plugin needs to be exiting but not yet exited prior to the Start call. Open to suggestions.

Retry locally if we detect that logmon is shutting down while the Start() API call is in progress.

notnoop · 2019-04-25T19:24:18Z

client/allocrunner/taskrunner/logmon_hook.go

+}
+
+func (h *logmonHook) prestartOneLoop(ctx context.Context, req *interfaces.TaskPrestartRequest) error {
+	// attach to a running logmon if state indicates one


I opted not to change the logic here, as it is somewhat brittle and I don't want to make it worse. The idea is that if the gRPC call fails because the plugin is shutting down, h.logmonPluginClient would be non-nil and would be marked as Exited, so we'll create a new logmon instance on retries.

notnoop · 2019-04-25T19:26:36Z

client/logmon/plugin.go

@@ -73,5 +73,8 @@ func (p *Plugin) GRPCServer(broker *plugin.GRPCBroker, s *grpc.Server) error {
 }

 func (p *Plugin) GRPCClient(ctx context.Context, broker *plugin.GRPCBroker, c *grpc.ClientConn) (interface{}, error) {
-	return &logmonClient{client: proto.NewLogMonClient(c)}, nil
+	return &logmonClient{
+		doneCtx: ctx,


I'm not 100% following how the plugin ctx is set during the initialization phase, but this follows the pattern in other driver plugins, e.g. https://github.com/hashicorp/nomad/blob/v0.9.0/plugins/drivers/plugin.go#L39-L49.

Interestingly, the logmon plugin client doesn't embed BasePluginClient, so we might want to follow up on that in 0.9.2.

schmichael

I know this code is very difficult to test, but it'd be nice to get some sort of coverage on it even if it's just small unit tests that rely on mocks.

schmichael

Phenomenal job on the test! We should probably try synthesizing arbitrary pauses with SIGSTOP more!

preetapan · 2019-04-25T21:26:29Z

client/allocrunner/taskrunner/logmon_hook_unix_test.go

+	// then we kill process while Start call is running
+	require.NoError(t, proc.Signal(syscall.SIGSTOP))
+	// sleep for the signal to take effect
+	time.Sleep(1 * time.Second)


How can you guarantee that the sleep is sufficient? Wondering if this will be flaky in CI.

I don't know of a good way to test whether a process is stopped, unlike killing. If it's flaky in CI, we can react. But considering that we will overhaul the logmon plugin method to be in line with other plugin clients, and this code would probably change, I'd avoid overengineering and react to CI problems as they come.

So I take that back :). I ended up using the gopsutil.Process utility to verify the process sleep status in 658a734!

preetapan · 2019-04-25T21:27:55Z

client/allocrunner/taskrunner/logmon_hook.go

+			h.logger.Warn("logmon shutdown while making request", "error", err)
+
+			if tries > 3 {
+				return err


Maybe log here that it's out of retries before returning the error?

preetapan

one nit and one question, otherwise LGTM

Mahmood Ali added 2 commits April 25, 2019 14:32

logmon client to handle grpc closing errors

c23d673

logmon: retry starting logmon if it exits

b21849c

schmichael approved these changes Apr 25, 2019

notnoop force-pushed the b-retry-logmon-start-errs-2 branch from c0114c7 to 94c9c57 Compare April 25, 2019 20:31

add a test that simulates logmon dying during Start() call

978fc65

notnoop force-pushed the b-retry-logmon-start-errs-2 branch from 94c9c57 to 978fc65 Compare April 25, 2019 20:41

schmichael approved these changes Apr 25, 2019

try sleeping for stop signal to take effect

ba373fe

preetapan reviewed Apr 25, 2019

preetapan approved these changes Apr 25, 2019

Mahmood Ali added 3 commits April 25, 2019 18:09

add logging about attempts

1f1551a

try checking process status

658a734

retry grpc unavailable errors even if not shutting down

a321901

notnoop merged commit 1497b8e into master Apr 26, 2019

notnoop pushed a commit that referenced this pull request Apr 26, 2019

Merge pull request #5616 from hashicorp/b-retry-logmon-start-errs-2

7caa796

logmon: recover from shutting down call locally

notnoop pushed a commit that referenced this pull request Apr 26, 2019

update changelog for GH-5609 and GH-5616

4f52b1c

This was referenced Apr 26, 2019

WIP - logmon: recover from Start failures #5615

Closed

Tasks that are signaled to restart fail due to logmon errors #5574

Closed

notnoop deleted the b-retry-logmon-start-errs-2 branch May 1, 2019 00:40

fprovencher mentioned this pull request Jun 11, 2019

Task failing with logmon failed to create fifo for extracting logs #5803

Closed

github-actions bot locked as resolved and limited conversation to collaborators Feb 11, 2023
