-
Notifications
You must be signed in to change notification settings - Fork 313
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve the busy/idle execution state tracking for kernels. #1429
Improve the busy/idle execution state tracking for kernels. #1429
Conversation
This commit changes the way Jupyter server tracks both activity and execution state so that both of those are based solely on user actions rather than also incorporating control messages and other channels. This is important for both correctly tracking the state of the kernel, and also for culling kernels. Previously, the last activity timestamp was updated on every message on the iopub channel regardless of the type of message, and execution state was based on every `status` message on that channel regardless of what other message the `status` was in response to. That behavior causes incorrect behavior for both kernel culling and kernel status. In particular, it can result in both kernels being culled too early and too late (depending on what the user is doing). For example, if a user ran a long running code cell, and then switched tabs within the JupyterLab UI, then the JupyterLab UI would send a control message to the kernel and the kernel would respond with a status message referencing that control message as its parent. As a result of that, the Jupyter server would update its execution state to be `idle` even though the long running code cell was still executing. This could cause the kernel to be culled too soon. Alternatively, if the user was not running anything and just switched tabs within the JupyterLab UI, then the last activity timestamp would be updated even though the user didn't run any code cells. That would cause the kernel to be culled too late. This change fixes both of those scenarios by making sure that it is only user actions (and kernel messages in response to those user actions) that cause the execution state and last activity timestamps to be updated. This initial commit contains just the code change so that I can solicit feedback on this proposed change early. The core of the change will have to be new tests, which will come in a subsequent commit as part of the same pull request.
…st the corresponding busy status message.
… into ojarjur/fix-kernel-status
…ory cache of message IDs
"execute_input", | ||
"execute_reply", | ||
"execute_request", | ||
"inspect_request", |
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see it was also discussed in depth in #1360 (comment). I am not sure if there was a resolution though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure if there was a resolution though.
We didn't come to a consensus on the set of message types to include, but I think we did get a pretty clear consensus that the set of message types should be configurable so that server admins can override whatever default we choose.
I don't know if we can get a consensus on the set of types to leave in the default value for the config, but if we do get a consensus that something should be added to (or removed from) this list, then please let me know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed on the call today, it might be more conservative to change this list to a not_tracked_message_types
, as it would be less risk of accidentally forgetting that a certain message type should be counted as kernel activity. It would also mean that the risk changes from "culling a kernel that shouldn't be culled" (bad because of potential data loss) to "not culling a kernel that can be culled" (bad because of potential cost / resource use), where the former seems more disruptive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've put together a tentative list of what I think would be the default set of not-tracked message types.
I got this by going through this page, and selecting all of the message types that were either:
- Not sent on the shell channel, or
- Of the form
*_info_(request|reply)
.
This is almost disjoint from the proposed list for tracked message types. The difference is that the execute_input
message that was going to be tracked is included here in the not-tracked list because it is sent on the IOPub channel instead of the shell channel.
The full list is:
comm_info_request
,
comm_info_reply
,
kernel_info_request
,
kernel_info_reply
,
shutdown_request
,
shutdown_reply
,
interrupt_request
,
interrupt_reply
,
debug_request
,
debug_reply
,
stream
,
display_data
,
update_display_data
,
execute_input
,
execute_result
,
error
,
status
,
clear_output
,
debug_event
,
input_request
,
input_reply
@krassowski @vidartf @Zsailer Can you all please help me double check this?
In particular:
- Is my methodology described above the right approach to take?
- Does the list I came up with look right?
Thanks, in advance!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @ojarjur. I'll give this a closer look later today.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This list looks right to me.
Thanks, @ojarjur. I believe this is good to go.
It looks like the newly added test is flaky and I was able to reproduce that on my local machine; 6 out of 100 runs failed with the same error. I'll try to eliminate the flakyness and then update this PR |
Fixed now; I ran the test 100 times locally and it passed every time. |
… the threshold for determining that the kernel state is consistent
I think there's a second race condition contributing to the test flakiness; the second one isn't an issue in the test but rather a preexisting race condition in the code that the test uncovered. Specifically, this code can wind up being run after this code if the kernel starts up quickly enough... That, in turn, can result in a correctly set "busy" and/or "idle" kernel execution state being overwritten with an erroneous state of "starting". I'm not sure if the kernel execution state should be set at all in the |
…ad of retrying the call to get the kernel state just mark the whole test as flaky so it gets retried
…monitoring to a list of untracked message types
…lakiness caused by race conditions
Amazing work, @ojarjur! Merging 🚀 |
This PR makes two adjustments/enhancements for the way kernel execution state is tracked:
Distinguish between "busy"/"idle" status messages with different parent IDs.
A "busy" or "idle" status message is not sent in isolation, but rather in response to a specific parent message, and
the kernel might still be busy with one parent message even after reporting that it is done (e.g. has an "idle" status)
with a different parent message.
Accordingly, the overall kernel execution state is only switched to "idle" once all previously seen "busy" status
messages have had a corresponding "idle" message sent with the same parent message ID.
Distinguish between different types of parent message when determining whether or not the overall kernel is "busy".
Not every message type corresponds to a user action, so not all of them should be included in the logic for determining
that a kernel is "busy". This is especially true for the case of kernel culling where the distinction between "idle" and
"busy" is used to decide whether or not to cull the kernels.
Since there might be different setups where the server admin might want to include/exclude different types of
messages from this calculation, the set of message types used for status tracking is configurable.
Fixes #1360