Skip to content

Commit

Permalink
Be much more gracious with supervisor startup time (#6100)
Browse files Browse the repository at this point in the history
This change addresses an issue frequently seen in tests wherein the
kernel supervisor appears to fail to connect at startup. After several
rounds of debugging, it eventually became clear that there was no actual
reconnection failure; the problem was simply that the CI machines
(especially on Windows hardware) are very slow, and the client gave up
retrying before the supervisor process was fully started.

The fix is wait much longer before giving up, and to base the number of
retries on wall clock time rather than attempts. Formerly, we could
exhaust retries in as little as 1.5-2 seconds; now we wait up to 10
seconds.

Here are 2 e2e test runs from a branch with this change:

https://github.com/posit-dev/positron/actions/runs/12934857967
https://github.com/posit-dev/positron/actions/runs/12934185231

Addresses #5337.

### QA Notes

This change is primarily intended to address the issue in CI and
shouldn't have much impact on the product; the only downside of this
change is that now if there really _is_ some issue that causes a
connectivity failure, it takes 10 seconds for it to show up instead of
2.

These situations should not be common in the wild, since practically all
of the issues we have seen manifest as an error starting/launching the
supervisor, which we can detect without waiting for a timeout.
  • Loading branch information
jmcphers authored Jan 23, 2025
1 parent 843bee4 commit 4748306
Showing 1 changed file with 10 additions and 11 deletions.
21 changes: 10 additions & 11 deletions extensions/positron-supervisor/src/KallichoreAdapterApi.ts
Original file line number Diff line number Diff line change
Expand Up @@ -390,9 +390,9 @@ export class KCApi implements KallichoreAdapterApi {
this._api.setDefaultAuthentication(bearer);

// List the sessions to verify that the server is up. The process is
// alive for a few milliseconds before the HTTP server is ready, so we
// may need to retry a few times.
for (let retry = 0; retry < 40; retry++) {
// alive for a few milliseconds (or more, on slower systems) before the
// HTTP server is ready, so we may need to retry a few times.
for (let retry = 0; retry < 100; retry++) {
try {
const status = await this._api.serverStatus();
this._log.appendLine(`Kallichore ${status.body.version} server online with ${status.body.sessions} sessions`);
Expand All @@ -405,20 +405,19 @@ export class KCApi implements KallichoreAdapterApi {
throw new Error(`The supervisor process exited before the server was ready.`);
}

// ECONNREFUSED is a normal condition during startup; the server
// isn't ready yet. Keep trying until we hit the retry limit,
// about 2 seconds from the time we got a process ID
// established.
// ECONNREFUSED is a normal condition during startup; the
// server isn't ready yet. Keep trying up to 10 seconds from
// the time we got a process ID established.
if (err.code === 'ECONNREFUSED') {
if (retry < 19) {
if (elapsed < 10000) {
// Log every few attempts. We don't want to overwhelm
// the logs, and it's normal for us to encounter a few
// connection refusals before the server is ready.
if (retry % 5 === 0) {
this._log.appendLine(`Waiting for Kallichore server to start (attempt ${retry + 1}, ${elapsed}ms)`);
if (retry > 0 && retry % 5 === 0) {
this._log.appendLine(`Waiting for Kallichore server to start (attempt ${retry}, ${elapsed}ms)`);
}
// Wait a bit and try again
await new Promise((resolve) => setTimeout(resolve, 50));
await new Promise((resolve) => setTimeout(resolve, 100));
continue;
} else {
// Give up; it shouldn't take this long to start
Expand Down

0 comments on commit 4748306

Please sign in to comment.