Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

don't panic the kernel upon a device error #4414

Merged
merged 3 commits into from
Jan 28, 2022
Merged

Conversation

warner
Copy link
Member

@warner warner commented Jan 28, 2022

fix(swingset): don't kernel panic upon device error

Previously, any error thrown by a device invocation would set the kernel to
the "panic" state, mark the calling vat as scheduled for termination, then
return an error indication to the calling vat, which makes the
syscall.callNow() throw an exception. From the calling vat's perspective,
their D() invocation throw a "you killed my kernel, prepare to die" error,
the delivery would complete, then the entire kernel would shut down.
controller.run() rejects its return promise, causing the host application
to halt without committing state.

This was fail-stop safe against surprises in the device code, but is too
difficult to use in practice. Devices could not use assert() to validate
caller arguments without giving the calling vat the power to shut down the
entire kernel.

With this change, errors during device invocation cause the calling vat to
get an error, as before, but the error is neither vat-fatal nor kernel-fatal.

It is still vat-fatal to pass promises in arguments to devices, where the
mistake is caught by the vat's outbound translator, rather than the device's
inbound translator. We'll probably relax this restriction later.

The device inbound translator used to throw an error if it got a foreign
device node, or a promise that somehow got past the guard above. In both
cases, the translator error caused a kernel panic. This commit changes the
inbound translator to handle both foreign device nodes and promises, which
ought to prevent vats from triggering kernel panics via this route.
Deviceslots will still reject both, but since that error occurs in the
device invocation, the calling vat should get a (catchable) error, and the
kernel will not panic.

GC handling is still really sketchy around devices: do not expect objects
passed to a device to ever be dropped.

closes #4326

@warner warner added the SwingSet package: SwingSet label Jan 28, 2022
@warner warner requested a review from FUDCo January 28, 2022 07:51
@warner warner self-assigned this Jan 28, 2022
@warner
Copy link
Member Author

warner commented Jan 28, 2022

Apparently I'm still fighting with the typechecker, but I think the rest of the PR should be reviewable.

@warner warner force-pushed the 4326-nonfatal-device-errors branch 2 times, most recently from f776a01 to 95e3bbd Compare January 28, 2022 21:37
Copy link
Contributor

@FUDCo FUDCo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modulo a couple of minor possible issues flagged in comments this seems fine.

/** @type { DeviceInvocationResult } */
const deviceResults = dispatch.invoke(target, method, args);
// common error: raw devices returning capdata instead of ['ok', capdata]
assert.equal(deviceResults.length, 2, deviceResults);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this assert. What is the meaning of the 3rd argument?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's the message emitted if the assertion fails (or at least I think that's what it's for), so the developer at the console can figure out why it threw without going back and inserting a console.log. optDetails seems to be the official name. The default for optDetails reports the two values being compared, in this case "2" and whatever .length is. In the failure mode I ran into, where my raw device returned { body: .., slots: ..} instead of ['ok', { body, slots }], the assertion printed undefined !== 2, which didn't help me very much.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also thought the 3rd arg was the message string, which is why I am puzzled because deviceResults is not a string.

insistValidVatstoreKey(key);
kdebug(`syscall[${deviceID}].vatstoreDelete(${key})`);
return harden(['vatstoreDelete', deviceID, key]);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like there's quite a bit of code duplication with vatTranslator.js. Should this be factored into some common code to ensure that changes to the syscall interface don't diverge under maintenance?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, maybe. For now, devices don't get the same syscalls as vats: they don't do promises, so syscall.send isn't quite right, they can't syscall.exit, they don't participate in GC so no syscall.dropImports.

But in the long run, yeah, we want devices to be more like vats, and at that point it'll make sense to drop deviceTranslator.js and use vatTranslator in its place.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking more along the lines of a syscall translator that's parameterized according to whether it thinks it's doing the job for a vat or a device, which would allow differences between the two to be accommodated but would allow the 95% that's the same to be a single piece of code to maintain.

Previously, any error thrown by a device invocation would set the kernel to
the "panic" state, mark the calling vat as scheduled for termination, then
return an error indication to the calling vat, which makes the
`syscall.callNow()` throw an exception. From the calling vat's perspective,
their `D()` invocation throw a "you killed my kernel, prepare to die" error,
the delivery would complete, then the entire kernel would shut down.
`controller.run()` rejects its return promise, causing the host application
to halt without committing state.

This was fail-stop safe against surprises in the device code, but is too
difficult to use in practice. Devices could not use assert() to validate
caller arguments without giving the calling vat the power to shut down the
entire kernel.

With this change, errors during device invocation cause the calling vat to
get an error, as before, but the error is neither vat-fatal nor kernel-fatal.

It is still vat-fatal to pass promises in arguments to devices, where the
mistake is caught by the vat's outbound translator, rather than the device's
inbound translator. We'll probably relax this restriction later.

The device inbound translator used to throw an error if it got a foreign
device node, or a promise that somehow got past the guard above. In both
cases, the translator error caused a kernel panic. This commit changes the
inbound translator to handle both foreign device nodes and promises, which
ought to prevent vats from triggering kernel panics via this route.
Deviceslots will still reject both, but since that error occurs in the
device *invocation*, the calling vat should get a (catchable) error, and the
kernel will not panic.

GC handling is still really sketchy around devices: do not expect objects
passed to a device to ever be dropped.

closes #4326
These were really out of date. They could still use examples of how to manage
state correctly, because it's pretty tricky.
@warner warner force-pushed the 4326-nonfatal-device-errors branch from 95e3bbd to cef1f88 Compare January 28, 2022 22:25
@warner warner added the automerge:no-update (expert!) Automatically merge without updates label Jan 28, 2022
@mergify mergify bot merged commit 07f2552 into master Jan 28, 2022
@mergify mergify bot deleted the 4326-nonfatal-device-errors branch January 28, 2022 22:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
automerge:no-update (expert!) Automatically merge without updates SwingSet package: SwingSet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

device errors should not (always) kill the kernel
2 participants