-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Debug hangs in CI #423
Debug hangs in CI #423
Conversation
I am going to rerun the tests several times to see if I can flush out more CI failures. |
…arms that trip our unit tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So if I understand correctly, the hang is fixed by the cacheDebugger
additions in FasterKV
? Mostly around the takeFromOutput
checks? Not sure I understand how that led to the CI to hang. It would help me if you could explain that. Thanks!
foreach ((string name, Task task) in scenarios.StartAllScenarios(includeTimers: true, includeLarge: true)) | ||
foreach ((string name, Task task) in scenarios.StartAllScenarios(includeTimers: false, includeLarge: true)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why did this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I saw timer tests failing spuriously (because of large time delays leading to unexpected results) so I did not think it was worth keeping them in there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand correctly, you're seeing that timers make creating a testing spec difficult (as in it's difficult to assert what exactly will happen when). Ok, that makes sense
takeValueFromOutput = (output.Val != null); // we have observed that src.Val is null sometimes, so if present, we use output instead | ||
if (takeValueFromOutput) | ||
{ | ||
this.cacheDebugger?.Record(key.Val, CacheDebugger.CacheEvent.SingleWriterCopyToTailFromOutput, output.Version, default, info.Address); | ||
} | ||
else | ||
{ | ||
this.cacheDebugger?.Record(key.Val, CacheDebugger.CacheEvent.SingleWriterCopyToTail, src.Version, default, info.Address); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a good place to elaborate a bit more on the significance of src.Val
being null. Is this a bug in FASTER, or expected, or something else? Can you please add a small blurb about that here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's sort of what the comment on line 2084 is supposed to be.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure that gives me enough context though. The way the comment is phrased ("we have observed...sometimes") makes me thing it is not expected that src.Val
would be null, and if that's true then I think that's worth calling out explicitly. It would tell me whether or not there's something weird that FASTER is doing here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I actually also don't know. I can add a bit more text to say more explicitly that I don't really know what is supposed to be happening.
We have seen a lot of hangs recently in the CI, which seem to not be caused by recent changes but look a bit like errors in the test framework. I created this PR as a way to further investigate and hopefully fix these hangs.
In this first step, I am trying to understand hangs in the QueriesCopyToTail test which I have seen repeatedly in CI. The hangs seem to have something to do with the checkpoint injection mechanism used by the test. To narrow this down, we revise this state machine and add lots of tracing.