Simulation Failure/Timeout #13088
Comments
I also don't see any errors. I printed the group IDs and member addresses and it seems to get stuck after getting the group ID:
This smells like an ORM bug, maybe? |
I've debugged some group sims in the past, so I can take a look at this. |
@AmauryM it's either this call:
var groupMember group.GroupMember
_, err = memIt.LoadNext(&groupMember)
if errors.ErrORMIteratorDone.Is(err) {
    break
}
Or this call:
memIt, err := groupMemberByGroupIndex.Get(ctx.KVStore(key), groupInfo.Id)
if err != nil {
    msg += fmt.Sprintf("error while returning group member iterator for group with ID %d\n%v\n", groupInfo.Id, err)
    return msg, broken
}
defer memIt.Close() |
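To illustrate why a deferred `Close` matters when iterators are created per group: a `defer` only runs when the enclosing function returns, so a `defer memIt.Close()` placed inside a loop body keeps every iterator open until the whole function exits. Below is a minimal sketch, not the actual x/group code. `memberIterator`, `tracker`, `groupWeight`, `totalWeights` and `errIteratorDone` are all hypothetical stand-ins; the point is only that moving the per-group work into its own function makes each iterator close before the next one opens.

```go
package main

import (
	"errors"
	"fmt"
)

// errIteratorDone is a hypothetical stand-in for the ORM's "iterator done" sentinel.
var errIteratorDone = errors.New("iterator done")

// tracker counts how many iterators are open right now and the peak seen so far.
type tracker struct{ open, peak int }

// memberIterator is a hypothetical, cut-down iterator over member weights.
type memberIterator struct {
	weights []uint64
	pos     int
	t       *tracker
}

func newMemberIterator(weights []uint64, t *tracker) *memberIterator {
	t.open++
	if t.open > t.peak {
		t.peak = t.open
	}
	return &memberIterator{weights: weights, t: t}
}

// LoadNext copies the next weight into dest or reports errIteratorDone.
func (it *memberIterator) LoadNext(dest *uint64) error {
	if it.pos >= len(it.weights) {
		return errIteratorDone
	}
	*dest = it.weights[it.pos]
	it.pos++
	return nil
}

func (it *memberIterator) Close() { it.t.open-- }

// groupWeight sums one group's weights. Because it is its own function,
// the deferred Close runs as soon as this group has been processed.
func groupWeight(weights []uint64, t *tracker) (uint64, error) {
	it := newMemberIterator(weights, t)
	defer it.Close()

	var total uint64
	for {
		var w uint64
		err := it.LoadNext(&w)
		if errors.Is(err, errIteratorDone) {
			return total, nil
		}
		if err != nil {
			return 0, err
		}
		total += w
	}
}

// totalWeights returns each group's total and the peak number of open iterators.
func totalWeights(groups [][]uint64) ([]uint64, int, error) {
	t := &tracker{}
	totals := make([]uint64, 0, len(groups))
	for _, g := range groups {
		sum, err := groupWeight(g, t)
		if err != nil {
			return nil, t.peak, err
		}
		totals = append(totals, sum)
	}
	return totals, t.peak, nil
}

func main() {
	totals, peak, err := totalWeights([][]uint64{{1, 2}, {3}, {}})
	fmt.Println(totals, peak, err)
}
```

With this shape, at most one iterator is open at any time, regardless of how many groups are walked.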
I confirm via See my branch with some logging:
which means it hangs on this line: cosmos-sdk/store/cachekv/store.go, line 392 in 8704983
This leads directly into tm-db code. Before I dig into this deeper, @alexanderbez, do you have any idea why tm-db might hang here? |
Interesting. I have zero clue lol. All my PR did was remove a few types and add a new message type. As for the cache store, I do know there is a mutex and that EDIT: NVM, the deadlock would be coming from tm-db actually |
Perhaps we add some logging to |
OK, I think I got the output to display the lock contention: https://pastebin.com/TA70tkz7 But this seems to be coming from levelDB? Or at least our usage of it? Not clear... I also patched my local copy of tm-db to run with https://github.com/sasha-s/go-deadlock and got the following output, which seems more useful and relevant: https://pastebin.com/ZNDW4ADN Looks like we have some recursive locking in tm-db :/ |
@AmauryM take a look at that last pastebin. It's not clear to me whether it's a bug in tm-db's mem iterator or the ORM's potentially incorrect usage of it. To me, and to others who took a look, it seems to be the ORM's incorrect usage: creating an iterator within a loop. Or maybe it's the invariant, with its two for loops? |
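For anyone following along, here is a minimal sketch of how an open iterator can wedge a store guarded by a `sync.RWMutex`. It assumes, without having confirmed it against the tm-db source, that an in-memory iterator takes the read lock on creation and only releases it in `Close`. `toyStore`, `toyIterator` and `TrySet` are hypothetical names. `TryLock` (Go 1.18+) is used so the sketch reports the contention instead of hanging; a plain `Lock` in the same spot would block forever on a single goroutine.

```go
package main

import (
	"fmt"
	"sync"
)

// toyStore is a hypothetical in-memory store guarded by a single RWMutex.
type toyStore struct {
	mtx  sync.RWMutex
	data map[string]string
}

func newToyStore() *toyStore {
	return &toyStore{data: make(map[string]string)}
}

// toyIterator holds the store's read lock from creation until Close (assumption).
type toyIterator struct {
	store  *toyStore
	closed bool
}

// Iterator acquires the read lock and hands it to the returned iterator.
func (s *toyStore) Iterator() *toyIterator {
	s.mtx.RLock()
	return &toyIterator{store: s}
}

// Close releases the read lock exactly once.
func (it *toyIterator) Close() {
	if !it.closed {
		it.closed = true
		it.store.mtx.RUnlock()
	}
}

// TrySet writes only if the write lock is free right now. It returns false
// in exactly the situation where a blocking Lock would wait on open readers.
func (s *toyStore) TrySet(key, value string) bool {
	if !s.mtx.TryLock() {
		return false
	}
	defer s.mtx.Unlock()
	s.data[key] = value
	return true
}

func main() {
	s := newToyStore()

	it := s.Iterator()
	fmt.Println("write while iterator open:", s.TrySet("a", "1"))

	it.Close()
	fmt.Println("write after Close:", s.TrySet("a", "1"))
}
```

The Go documentation for `sync.RWMutex` also prohibits recursive read locking: once a writer is waiting, a second `RLock` blocks behind it, so iterators opened inside a loop and never closed are enough to produce this kind of hang if another goroutine wants the write lock.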
It has been brought to my attention that non-PR CI simulations are timing out. Namely, they were running to completion up until this PR, which I implemented, and started to time out after it.
Thanks for raising this @kocubinski. I initially thought it was a bug or something in my PR, however, after digging into it a bit more, I don't think it is. Namely, if you run the sim with seed=11 for example, you'll get to a point where it's stuck on EndBlock, specifically, x/crisis EndBlock. If you look at that, you'll see it gets stuck on:
Taking a look at this invariant, you see it's a naked for loop (very bad!). In any case, I don't see how this has anything to do with my PR and I think my PR just surfaced a mutation in the simulation execution branch path that simply uncovers a pre-existing issue in the x/group Group-TotalWeight invariant.
Take a look: https://github.com/cosmos/cosmos-sdk/blob/main/x/group/keeper/invariants.go#L41-L98
I'm convinced this has nothing to do with my PR. Someone with x/group expertise should look into this. Here is the output with seed 11 for the invariant:
Also, I noticed the last message to be executed prior to this getting stuck was the following: