-
Notifications
You must be signed in to change notification settings - Fork 313
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Navigating in the JupyterLab UI can prevent idle kernels from being culled. #1360
Comments
We discussed this in the weekly meeting today. To make the activity tracking work as expected, we need to keep track of the last time we knew the kernel was active in response to a user request. @Zsailer suggested that the message types that should be categorized as user requests are:
On top of that, the Note that this seems to indicate that somewhere we need to keep track of whether or not the kernel was previously running a user request, so that we can continue to treat the kernel as being active until we get confirmation from the kernel that it has switched back to My current thinking is to add a flag (so that users can preserve the existing behavior), and then change the
For the last check we would err on the side of updating the activity timestamp rather than erring on the side of not updating it. This last activity logic currently exists in the |
I'm now leaning towards not having a flag for preserving the old behavior. It seems that the old behavior was simply incorrect rather than just representing a differing opinion on what constitutes "activity". In particular, the status message contents are used to update the This means control messages would cause the I'm going to put together a PR with my understanding of the right way to fix this, and share it here for early feedback. The fix itself looks relatively simple, but there will be a lot of work required for testing it. |
@Zsailer thanks for your offer during today's community meeting to discuss the details about how to fix this issue. I've put together what I am now relatively confident is the right way to resolve this issue and would like to solicit feedback on it. For now this is just in a commit here. Once I get some more time to work on it, I'll add some tests for the various edge-case scenarios that I can think of and then send it out in a pull request. |
I would also add |
I think there are two issues conflated in the busy/idle message problem that is brought up here. For activity tracking, we are trying to distinguish between regular system-spawned messages, and messages that convey user activity intent. In execute_request, we have the notion of a "silent" execution that sort of tries to do this - a system can send a "silent" execution to slip an execution in between other user activity, and not have the user activity (like the execution counter) be disrupted by the system execution. Perhaps we could use some flag like that for all messages? If the flag was in the header metadata, then it would be carried into all responses as well. That would let the originator of a message indicate whether it indicated user activity or not. In this example, some debug messages do indicate user activity (for example, stepping through a function), whereas others may not (like an inspect_variables request to auto-update a variable listing). We see the same for comm messages - some comm messages may be system-generated from frontend components, while others may indicate actual user activity, and you can't tell just by looking at the message type. The other issue is that it is not easy to determine exactly what is busy/idle from the busy/idle message. One of the issues here is that messages to the control channel spawn busy/idle messages on the iopub channel. However, it is not clear from the busy/idle message that the control thread is busy, rather than the shell thread. So unless you maintain a mapping of messages that can be sent to either channel (and the frontend respects that mapping), you don't know if the shell channel is actually busy or not. Should we have busy/idle messages for activity on the non-shell channel? Or should busy/idle messages carry clear info about which thread is going to be busy/idle? This becomes even more important to solve if we make progress on the subshell JEP, where messages on the shell channel may cause only one subshell to be busy. Note that only the kernel can really tell if the shell thread is going to be busy. For example, old ipykernel versions had a single thread managing both control and shell, while newer versions have separate threads for control and shell. In the old ipykernel, a control message arguably should issue a busy/idle message for the shell channel, while in new ipykernel, control messages do not block the shell channel. |
That said, both of the points I brought up require more fundamental changes to the protocol, or at least more fundamental changes to the conventions we use in the protocol. Inferring user activity from message type may be the best temporary fix, though I'd caution against ramifications, such as:
|
@jasongrout Thanks for the detailed and informative responses! You've convinced me that there are two separate issues in play here, so I've updated my change to try to address them separately:
As before, any additional feedback/comments are much appreciated. Thanks! |
@jasongrout Thanks for the suggestion. I've been digging into this a bit more, and had a follow up question: Are you sure that Is my understanding about that incorrect? |
related issue i raised (on the wrong project :)) jupyter/notebook#7260 looked at @ojarjur's changes, looks mostly the same as what i attempted to achieve in m yexample logic. my 2 cents is i think having a configurable list of message types is a decent middleground, if not specified it maintains the existing behaviour until there's better distinction between user activity and control activity. With cull_connected=True i think the current list of message types that @ojarjur has is probably fair game in regard to idle culling.
in terms of ramifications such as these, if used in conjunction with longer cull_idle_timeouts (several hours++) they would be unlikely to affect a users' experience, and again if the msg_type is configurable then its up to respective jupyter server admins to tailor it to their infrastructure and user expectations. |
A kernel could be "active" whilst performing a long lived operation e.g. some heavy processing, watching some files for changes, monitoring a socket, etc. So it might be possible for a kernel to appear inactive, but still be in use, i.e. the user might still have the terminal open in the UI. It depends what do we consider to be an active kernel:
It might be worth mentioning that websockets have a ping-pong mechanism which can be used to detect broken websocket connections. This ping can identify broken/closed websocket connections even when the connection is quiet (because pings are sent according to a set interval rather than when there is data to send). Currently Jupyter Server does not configure a default ping interval, so there is no ping at all. I don't think there's any early detection of client-severed websocket connections because of this. The default ping interval is now exposed as a Jupyter Server config: #1391 So it might be possible that a valid approach to determining whether a kernel is "active" or "idle" is to configure a default ping interval (to ensure client-severed kernel connections are closed from the server end) and count the number of websockets connected to a kernel. If there are no websockets open on a kernel (within a configured time window), then no user has the kernel open in the UI. In a Jupyter Server extension that I maintain, we use this mechanism to handle the same issue. In our case the websockets are connected to GraphQL subscriptions rather than kernels, so its a different wss subprotocol, but the same principle, i.e. some long-lived server side process that sends data back to the client when events occur. We kill our "subscriptions" when the last connected websocket connection is closed (i.e. when the user isn't using it any more). |
Thanks for adding this PR, but I think this was always configurable through |
Description
When a Jupyter server is configured to cull idle kernels that have connections, the JupyterLab interactions with the Jupyter server can prevent this culling from actually happening.
The simplest such case is when the user navigates between different tabs within JupyterLab, but I've also seen this when refreshing the browser tab that the JupyterLab UI is in.
The root issue seems to be an unexpected interaction between the behavior of the JupyterLab UI and the behavior of the kernel activity tracking.
Specifically, the JupyterLab UI will (on certain events) send a
debug_request
on the control channel to the running kernel, which results in both adebug_reply
on the control channel from the kernel and multiple (in my case I keep seeing two of them)status
messages from the kernel on the iopub channel.This causes issues because it seems that the server will update the kernel's
last_activity
timestamp on all IOPub messages rather than justexecute_input
/execute_result
messagesThat, in turn, prevents the kernel from being considered
idle
, which then prevents it from being culled.I am very surprised to see that
status
messages update thelast_activity
timestamp. I would have expected those to be no-ops from the standpoint of activity tracking.Reproduce
Set the following config options (either in a config file or on the
jupyter lab ...
command line):MappingKernelManager.cull_idle_timeout = 900
MappingKernelManager.cull_connected = True
MappingKernelManager.cull_interval = 10
Open the JupyterLab UI, open a new notebook, then close the JupyterLab UI and instead open the
.../api/kernels
page in your browser.After 15 minutes refresh the
.../api/kernels
page and see that the idle kernel has been culled.Then re-open the JupyterLab UI, open a new notebook, and then open a new launcher tab at the same time.
Without ever executing anything in the notebook, switch between the two tabs in JupyterLab once every minute.
Note that the kernel is never culled.
Open the
.../api/kernels
page and note how every time you switch tabs thelast_activity
timestamp of the kernel is updated.Expected behavior
I would expect the
last_activity
timestamp to not be updated due to thestatus
messages sent by the kernel.Context
The text was updated successfully, but these errors were encountered: