vattp sometimes drops messages? #3039
We're seeing something weird, where vattp is dropping messages randomly, which could cause a consensus failure. So we're going to pin it to a local worker for now, instead of letting it run on XS. refs #3039
On branch
The gap between 725 and 730 is the proximate cause: comms was expecting 726 but received 730. All further messages will be rejected until it sees 726. Looking at the vat-tp deliveries in the same slogfile, we see it received
The crank which delivered messages 721-729 emitted the following syscalls:
The code which executes this crank is: agoric-sdk/packages/SwingSet/src/vats/vat-tp.js Lines 170 to 180 in f052169
The syscalls suggest that the loop was interrupted somewhere after the @FUDCo analyzed a similar trace whose We're struggling to imagine something that could cause this. We've brainstormed:
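For readers unfamiliar with why a single gap stalls everything, here is a minimal sketch of an in-order receiver. It only illustrates the behaviour described above (comms expects 726, sees 730, and rejects everything until 726 shows up). The name `makeInOrderReceiver` and its methods are hypothetical and are not the comms vat's actual API.

```javascript
// Hypothetical sketch of strict in-order delivery; not the real comms vat code.
// Assumption: any message whose sequence number is not exactly the next
// expected one is rejected, and the expectation does not advance.
function makeInOrderReceiver(firstExpected) {
  let nextExpected = firstExpected;
  const accepted = [];
  const rejected = [];

  function receive(seqNum) {
    if (seqNum !== nextExpected) {
      // A gap (or a repeat): reject and keep waiting for nextExpected.
      rejected.push(seqNum);
      return false;
    }
    accepted.push(seqNum);
    nextExpected += 1;
    return true;
  }

  return {
    receive,
    getAccepted: () => [...accepted],
    getRejected: () => [...rejected],
    getNextExpected: () => nextExpected,
  };
}

// Mimic the trace: 725 arrives, then 730 and later messages, but never 726.
const receiver = makeInOrderReceiver(725);
for (const seqNum of [725, 730, 731, 732]) {
  receiver.receive(seqNum);
}
console.log('accepted:', receiver.getAccepted());
console.log('rejected:', receiver.getRejected());
console.log('still waiting for:', receiver.getNextExpected());
```

In this sketch only 725 is accepted and the receiver keeps waiting for 726, which is why a handful of dropped messages is enough to wedge the connection.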
@FUDCo and I are copying the kernelDB from a chain run that hit this failure, and are attempting to replay the swingset transcript for the vat in question, to see if it does the same thing upon replay. If so, it suggests that, whatever is going on, it's at least a function of the state of that vat. The particular failure we're investigating is in a run we've named The "in-vivo" replay I did (by just restarting the chain, after adding a log message to show when syscalls are being made during replay, and replacing the
We're able to reproduce the problem in a somewhat-reduced test environment, where we copy the kernelDB from the failing chain, strip out all the vats except for The problem manifests at a simple The fun part is that the crank which exhibits the error changes as a function of the code that xsnap is given (supervisor, liveslots, and vat code). We updated our transcript replay code to sense not only when a vat makes a syscall which differs from the ones recorded in the transcript, but also when it fails to make one of the recorded ones. If we run the original code, we see no transcript mismatches, because the crank that showed the error managed to show exactly the same error in the replay. If we modify the code a bit, and the bug happens during some different (earlier) crank, then we see fewer syscalls than we expected, and the replay terminates early with an error. The good news is that, from what I can tell, it's a deterministic function of that code, so at least the bug is stable, but it tends to move or go away if you add debugging code. It seems to be insensitive to whitespace. Our current theory is some sort of memory allocation bug, maybe during GC. Maybe something which is causing the short array to get truncated, or to corrupt the variable holding
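As a rough illustration of the replay check described in this comment, the sketch below compares the syscalls recorded for a crank against the ones made during replay, and flags both a differing syscall and a recorded syscall that never happened. `checkReplay` and the syscall shapes are hypothetical; the actual SwingSet replay code is not shown in this thread and may well work differently.

```javascript
// Hypothetical replay checker; an illustration only, not SwingSet's code.
// Assumption: each syscall is a JSON-serializable array such as
// ['send', target, payload], so comparing JSON strings is good enough.
function checkReplay(recordedSyscalls, replayedSyscalls) {
  const limit = Math.min(recordedSyscalls.length, replayedSyscalls.length);
  for (let i = 0; i < limit; i += 1) {
    const expected = JSON.stringify(recordedSyscalls[i]);
    const actual = JSON.stringify(replayedSyscalls[i]);
    if (expected !== actual) {
      // The vat made a syscall that differs from the transcript.
      return { ok: false, kind: 'mismatch', index: i };
    }
  }
  if (replayedSyscalls.length < recordedSyscalls.length) {
    // The vat stopped early: a recorded syscall was never made.
    return { ok: false, kind: 'missing', index: replayedSyscalls.length };
  }
  if (replayedSyscalls.length > recordedSyscalls.length) {
    // The vat made more syscalls than the transcript recorded.
    return { ok: false, kind: 'extra', index: recordedSyscalls.length };
  }
  return { ok: true };
}

// A crank recorded with three sends, replayed with only two of them.
const recorded = [
  ['send', 'target', 'msg-1'],
  ['send', 'target', 'msg-2'],
  ['send', 'target', 'msg-3'],
];
console.log(checkReplay(recorded, recorded.slice(0, 2)));
```

The 'missing' case is the one that matters here: a loop that exits early produces a strict prefix of the recorded syscalls, which a check for differing syscalls alone would not notice.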
The most concise way I have found for packaging up such things is a stand-alone https://gist.github.com/dckc/19451649697bb7cb231ab5a776fde0d1#file-makefile p.s. This assumes the bug will show up on machines other than your own. We've seen bugs where the symptoms were hard to reproduce on another machine; the culprit seemed to be that paths were different between machines and the paths made their way into the runtime heap.
Yeah, that's where I'll aim. On @FUDCo's advice, I'm going to start big+fast (à la "I regret this letter is so long, I did not have the time to make it shorter"), and defer refinement/reduction until after we've delivered something sufficient for them to reproduce it.
When I tried on Monday to reproduce this, I got
@warner Moddable sent us a WeakMap GC fix; I checked that in as agoric-labs/moddable@c533fb7 (on a https://github.com/agoric-labs/moddable/tree/weak-gc-fix branch), and I created a https://github.com/Agoric/agoric-sdk/tree/3039-vattp-xs-gc branch that is master as of May 5, noted above (f052169), with the moddable submodule set to the GC fix. Care to try it out? Or maybe I could figure out how...
Which weakmap gc problem is this supposed to fix?
Er... The one where vattp sometimes drops messages, as detailed at length above.
In today's testnet meeting: Deprioritizing for now
@warner do I have access to the And please sketch how to use
Describe the bug
#3021 seems to be caused by vattp sometimes dropping some messages. We're investigating.