Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid killing server before it had chance to shut down gracefully #1660

Open
radeksimko opened this issue May 11, 2021 · 11 comments
Open

Avoid killing server before it had chance to shut down gracefully #1660

radeksimko opened this issue May 11, 2021 · 11 comments
Labels
feedback requested language server issue Issues related to language servers communicating with this plugin

Comments

@radeksimko
Copy link
Contributor

radeksimko commented May 11, 2021

Describe the bug

I was trying to debug a memory consumption issue in terraform-ls and typically the memory profile is written upon (graceful) exit of the server. I realized that the memory profile is never being created because the server is just killed (i.e. SIGKILL-ed) so it has no time to do anything.

It appears that the client (LSP plugin) does send exit notification as expected, but (contrary to the LSP spec) doesn't wait for the server to exit itself gracefully, or even start processing that request.

Here I admit the spec can be a little bit vague, but it does say:

Exit Notification
A notification to ask the server to exit its process. The server should exit with success code 0 if the shutdown request has been received before; otherwise with error code 1.

The way I'd interpret that is that "client asks and expects server to exit", when the implementation seems to be "client asks, but doesn't expect the server to exit".

I am aware that most of the cleanup should be done as part of shutdown - and in case of terraform-ls it indeed is.
From client's perspective it's sensible to expect exit to exit within a very short time period as all cleanup that takes time should already be finished by then. However it seems to me the server is not given any time to do anything, so it practically never exits with 0.

Sublime Text LSP

2021/05/11 11:50:40 server.go:266: Checking request for "textDocument/didClose": {"textDocument":{"uri":"file:///var/workspace/tf-test/github/main.tf"}}
2021/05/11 11:50:40 rpc_logger.go:29: Incoming notification for "textDocument/didClose": {"textDocument":{"uri":"file:///var/workspace/tf-test/github/main.tf"}}
2021/05/11 11:50:43 server.go:591: Received 1 new requests
2021/05/11 11:50:43 server.go:175: Processing 1 requests
2021/05/11 11:50:43 server.go:266: Checking request for "shutdown":
2021/05/11 11:50:43 rpc_logger.go:29: Incoming request for "shutdown" (ID 2):
2021/05/11 11:50:43 service.go:361: stopping walker for session ...
2021/05/11 11:50:43 service.go:363: walker stopped
2021/05/11 11:50:43 service.go:367: stopping watcher for session ...
2021/05/11 11:50:43 service.go:372: watcher stopped
2021/05/11 11:50:43 service.go:377: cancelling any module loading ...
2021/05/11 11:50:43 service.go:379: module loading cancelled
2021/05/11 11:50:43 rpc_logger.go:50: Response to "shutdown" (ID 2): null
2021/05/11 11:50:43 server.go:247: Completed 1 requests [1.879563ms elapsed]
2021/05/11 11:50:43 module_loader.go:88: Cancelling module loader...
2021/05/11 11:50:43 server.go:591: Received 1 new requests
2021/05/11 11:50:43 server.go:175: Processing 1 requests
2021/05/11 11:50:43 server.go:266: Checking request for "exit":
2021/05/11 11:50:43 rpc_logger.go:29: Incoming notification for "exit":

VS Code (Terraform extension using the official LSP client library)

Admittedly the implementation there also doesn't seem to follow the spec exactly as it first sends EOF and then never gets to send the exit notification. However, with the EOF the server is at least given chance to stop itself gracefully.

2021/05/11 12:14:57 server.go:266: Checking request for "textDocument/didClose": {"textDocument":{"uri":"file:///var/workspace/tf-test/github/main.tf"}}
2021/05/11 12:14:57 rpc_logger.go:29: Incoming notification for "textDocument/didClose": {"textDocument":{"uri":"file:///var/workspace/tf-test/github/main.tf"}}
2021/05/11 12:15:05 server.go:591: Received 1 new requests
2021/05/11 12:15:05 server.go:175: Processing 1 requests
2021/05/11 12:15:05 server.go:266: Checking request for "shutdown":
2021/05/11 12:15:05 rpc_logger.go:29: Incoming request for "shutdown" (ID 5):
2021/05/11 12:15:05 service.go:361: stopping walker for session ...
2021/05/11 12:15:05 service.go:363: walker stopped
2021/05/11 12:15:05 service.go:367: stopping watcher for session ...
2021/05/11 12:15:05 service.go:372: watcher stopped
2021/05/11 12:15:05 service.go:377: cancelling any module loading ...
2021/05/11 12:15:05 service.go:379: module loading cancelled
2021/05/11 12:15:05 rpc_logger.go:50: Response to "shutdown" (ID 5): null
2021/05/11 12:15:05 server.go:247: Completed 1 requests [1.003389ms elapsed]
2021/05/11 12:15:05 module_loader.go:88: Cancelling module loader...
2021/05/11 12:15:05 server.go:517: Server signaled to stop with err=EOF
2021/05/11 12:15:05 server.go:146: Reading next request: EOF
2021/05/11 12:15:05 service.go:352: session stopped unexpectedly (err: <nil>)
2021/05/11 12:15:05 service.go:361: stopping walker for session ...
2021/05/11 12:15:05 service.go:363: walker stopped
2021/05/11 12:15:05 service.go:367: stopping watcher for session ...
2021/05/11 12:15:05 service.go:372: watcher stopped
2021/05/11 12:15:05 service.go:377: cancelling any module loading ...
2021/05/11 12:15:05 service.go:379: module loading cancelled
2021/05/11 12:15:05 langserver.go:81: Stopping server (pid 11440) ...
2021/05/11 12:15:05 langserver.go:85: Server (pid 11440) stopped.

To Reproduce

Steps to reproduce the behavior:

  1. Install & activate terraform-ls
  2. Configure it so that logs go into a file on the filesystem
  3. tail -f /path/to/log-file
  4. Open any folder with *.tf files in
  5. Close the tab and window
  6. Watch last lines after receiving shutdown request and exit notification

Expected behavior

Server is given some short time to exit itself and only if it doesn't exit, SIGINT is attempted and only then SIGKILL.

Environment (please complete the following information):

  • OS: macOS 10.15.7
  • Sublime Text version: Dev 4105
  • LSP version: v1.3.0
  • Language servers used: terraform-ls

Additional context

This is a snippet of my settings for the LS:

{
	"clients":
	{
		"terraform":
		{
			"enabled": true,
			"languageId": "terraform",
			"scopes":
			[
				"source.terraform"
			],
			"syntaxes":
			[
				"Packages/Terraform/Terraform.sublime-syntax"
			],
			"command": [
				"terraform-ls",
				"serve",
				"-log-file=/tmp/tf-ls-sublime-{{pid}}.log"
			]
		}
	},
@predragnikolic
Copy link
Member

predragnikolic commented May 11, 2021

It appears that the client (LSP plugin) does send exit notification as expected, but (contrary to the LSP spec) doesn't wait for the server to exit itself gracefully, or even start processing that request.

So if the exit notification is a notification
and the specs says the following about notifications:

A notification message. A processed notification message must not send a response back. They work like events.

I conclude that the client should never exect a response for the exit notification. Or did I get the spec wrong?

@radeksimko
Copy link
Contributor Author

@predragnikolic That is a reasonable conclusion to draw from there - if it is applied to notifications in general. I am not proposing for the client to wait for any response to any notification, but very specifically to wait for the server to shut down gracefully after exit.

I do believe that server should not be just straight killed after the notification is sent, because then it then defeats the purpose of sending the notification if it the server is killed prior to processing it IMO?

@radeksimko
Copy link
Contributor Author

And when I say wait I mean wait ~subsecond after sending the notification with a timeout - it still seems reasonable to SIGKILL a server which didn't shut itself down gracefully and was given the chance.

Relatedly I believe that SIGINT should be attempted before SIGKILL as in Unix SIGKILL is usually used as last resort - not a standard way of shutting a process down, but that might be a separate discussion and not necessarily related to this.

@rwols
Copy link
Member

rwols commented May 11, 2021

When the server is closed gracefully, we should be waiting at least 1 second after notifying with “exit”. Does this not happen?

@rwols
Copy link
Member

rwols commented May 11, 2021

See

exit_code = self._process.wait(1)

@radeksimko
Copy link
Contributor Author

Oh, interesting 👀 , I must have missed that while reading through what appeared to be relevant code here

LSP/boot.py

Lines 175 to 176 in 988cd02

def on_exit(self) -> None:
kill_all_subprocesses()

def kill_all_subprocesses() -> None:
global _subprocesses
subprocesses = list(_subprocesses)
for p in subprocesses:
try:
p.kill()
except Exception:
pass
for p in subprocesses:
try:
p.wait()
except Exception:
pass

I do not believe this is happening but I'm also having hard times verifying this purely from the server side as catching SIGKILL is impossible.

I can try to add some debugging around line you mentioned and see if the plugin is hitting it before sending the signal.

@rwols
Copy link
Member

rwols commented May 11, 2021

In the case of ST app shutdown, we are given a blocking exit handler, so we must kill all subprocesses immediately, unfortunately (although I have a few ideas around this).

graceful shutdown of language servers happens when you close the last document applicable to an LS. After 3 seconds we do the shutdown-exit dance.

@radeksimko
Copy link
Contributor Author

I see, FWIW I was able to reproduce this on Darwin by closing the window via Cmd+W (i.e. not Cmd+Q), so I'm assuming ST app still somehow treats this as some kind of shut down event for plugins and forces you do kill subprocesses too?

I can try to reproduce by leaving the window open, but closing all tabs then.

@radeksimko
Copy link
Contributor Author

Even in case of the 3 second timeout after closing last relevant tab it doesn't seem to shut down the server gracefully:

2021/05/11 19:36:02 server.go:266: Checking request for "textDocument/didClose": {"textDocument":{"uri":"file:///var/workspace/tf-test/github/main.tf"}}
2021/05/11 19:36:02 rpc_logger.go:29: Incoming notification for "textDocument/didClose": {"textDocument":{"uri":"file:///var/workspace/tf-test/github/main.tf"}}
2021/05/11 19:36:05 server.go:591: Received 1 new requests
2021/05/11 19:36:05 server.go:175: Processing 1 requests
2021/05/11 19:36:05 server.go:266: Checking request for "shutdown":
2021/05/11 19:36:05 rpc_logger.go:29: Incoming request for "shutdown" (ID 2):
2021/05/11 19:36:05 service.go:361: stopping walker for session ...
2021/05/11 19:36:05 service.go:363: walker stopped
2021/05/11 19:36:05 service.go:367: stopping watcher for session ...
2021/05/11 19:36:05 service.go:372: watcher stopped
2021/05/11 19:36:05 service.go:377: cancelling any module loading ...
2021/05/11 19:36:05 service.go:379: module loading cancelled
2021/05/11 19:36:05 rpc_logger.go:50: Response to "shutdown" (ID 2): null
2021/05/11 19:36:05 server.go:247: Completed 1 requests [1.237525ms elapsed]
2021/05/11 19:36:05 module_loader.go:88: Cancelling module loader...
2021/05/11 19:36:05 server.go:591: Received 1 new requests
2021/05/11 19:36:05 server.go:175: Processing 1 requests
2021/05/11 19:36:05 server.go:266: Checking request for "exit":
2021/05/11 19:36:05 rpc_logger.go:29: Incoming notification for "exit":

@rwols
Copy link
Member

rwols commented May 12, 2021

I downloaded the latest terraform-ls binary and am using this config

		"terraform": {
			"command": [
				"$home/Downloads/terraform-ls", "serve"
			],
			"selector": "source.terraform",
			"enabled": true
		},

and used this hello world example: https://github.com/gruntwork-io/terratest/tree/master/examples/terraform-hello-world-example

Using the following diff:

diff --git a/plugin/core/transports.py b/plugin/core/transports.py
index 542b4b4..d2768e6 100644
--- a/plugin/core/transports.py
+++ b/plugin/core/transports.py
@@ -1,4 +1,4 @@
-from .logging import exception_log, debug
+from .logging import exception_log, debug, trace
 from .types import TCP_CONNECT_TIMEOUT
 from .types import TransportConfig
 from .typing import Dict, Any, Optional, IO, Protocol, List, Callable, Tuple
@@ -113,23 +113,33 @@ class JsonRpcTransport(Transport):
 
     def _end(self, exception: Optional[Exception]) -> None:
         exit_code = 0
+        trace()
         if not exception:
             try:
                 # Allow the process to stop itself.
+                trace()
                 exit_code = self._process.wait(1)
+                self._process = None
+                trace()
             except (AttributeError, ProcessLookupError, subprocess.TimeoutExpired):
+                trace()
                 pass
         if self._process:
             try:
                 # The process didn't stop itself. Terminate!
+                trace()
                 self._process.kill()
                 # still wait for the process to die, or zombie processes might be the result
                 # Ignore the exit code in this case, it's going to be something non-zero because we sent SIGKILL.
                 self._process.wait()
+                trace()
             except (AttributeError, ProcessLookupError):
+                trace()
                 pass
             except Exception as ex:
+                trace()
                 exception = ex  # TODO: Old captured exception is overwritten
+        trace()
 
         def invoke() -> None:
             callback_object = self._callback_object()

the following is printed to the console after three seconds after closing the last applicable tab:

LSP: TRACE _end                             plugin/core/transports.py:116
LSP: TRACE _end                             plugin/core/transports.py:120
LSP: TRACE _end                             plugin/core/transports.py:125
LSP: TRACE _end                             plugin/core/transports.py:130
LSP: TRACE _end                             plugin/core/transports.py:135
LSP: TRACE _end                             plugin/core/transports.py:142

It's not visible in text but I can also see a one second waiting period between the trace of line 120 and line 125. So everything seems to be correct from the client's POV. I do notice that, even running ./terraform-ls serve my terminal, shutting down is "too slow" (slower than 1 second).

Would it perhaps be an idea to do all of the tear down in the shutdown request? Every client is supposed to wait for a null response. So you can take your time handling that request. After that, wait for the exit notification and kill with exit code 0.

@rwols
Copy link
Member

rwols commented May 12, 2021

Also, according to the spec:

Clients must not send any notifications other than exit or requests to a server to which they have sent a shutdown request. If a server receives requests after a shutdown request those requests should error with InvalidRequest.

so you really have all of your logging data at the point of receiving the shutdown request, as any other request/notification of the client is considered invalid.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feedback requested language server issue Issues related to language servers communicating with this plugin
Projects
None yet
Development

No branches or pull requests

3 participants