Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simulation Failure/Timeout #13088

Closed
alexanderbez opened this issue Aug 30, 2022 · 8 comments · Fixed by #13116
Closed

Simulation Failure/Timeout #13088

alexanderbez opened this issue Aug 30, 2022 · 8 comments · Fixed by #13116

Comments

@alexanderbez
Copy link
Contributor

So it has been raised to my attention that non-PR CI simulations are timing out. Namely, they were passing/running to completion up and until this PR, however, they started to timeout after this PR which I implemented.

Thanks for raising this @kocubinski. I initially thought it was a bug or something in my PR, however, after digging into it a bit more, I don't think it is. Namely, if you run the sim with seed=11 for example, you'll get to a point where it's stuck on EndBlock, specifically, x/crisis EndBlock. If you look at that, you'll see it gets stuck on:

ASSERTING CRISIS INVARIANTS; INV: 8/12, NAME: group/Group-TotalWeight

Taking a look at this invariant, you see it's a naked for loop (very bad!). In any case, I don't see how this has anything to do with my PR and I think my PR just surfaced a mutation in the simulation execution branch path that simply uncovers a pre-existing issue in the x/group Group-TotalWeight invariant.

Take a look: https://github.com/cosmos/cosmos-sdk/blob/main/x/group/keeper/invariants.go#L41-L98

I'm convinced this has nothing to do with PR. Someone with x/group expertise should look into this. Here is the output with seed 11 for the invariant:

...
groupWeight: 18, membersWeight: 18
groupWeight: 23, membersWeight: 23
groupWeight: 20, membersWeight: 20
groupWeight: 5, membersWeight: 5
groupWeight: 9, membersWeight: 9

Also, I noticed the last message to be executed prior to this getting stuck was the following:

OP MSG: {"route":"/cosmos.authz.v1beta1.MsgRevoke","name":"/cosmos.authz.v1beta1.MsgRevoke","comment":"","ok":true,"msg":{"type":"cosmos-sdk/MsgRevoke","value":{"grantee":"cosmos1l3wl6d57krm4ujdgtg4p5pdh9fw9evhn3t8e2a","granter":"cosmos18ph9tar8en6z8f9tn529utk2xlmuk6waw0ltm0","msg_type_url":"/cosmos.bank.v1beta1.MsgSend"}}}
@alexanderbez
Copy link
Contributor Author

I also don't see any errors. I printed the group IDs and member addresses and it seems to get stuck after getting the group ID:

...
GROUP ID: 708
GROUP MEMBER ID: cosmos1tljq9xse7wvfuuajee80ukch7lu3x8rc8fs7kw
GROUP MEMBER ID: cosmos13f2hqukegfz75dqzpdc36uy7aq9elhal72emmc
GROUP MEMBER ID: cosmos1ed2j0xrdaattf82xc6rnxhsdel2gxeea8r2zqw
GROUP MEMBER ID: cosmos16nfnckv83a7n34l624flkqfcmvqn43mg8tlyhe
GROUP ID: 709
GROUP MEMBER ID: cosmos13f2hqukegfz75dqzpdc36uy7aq9elhal72emmc
GROUP ID: 710
GROUP MEMBER ID: cosmos1tljq9xse7wvfuuajee80ukch7lu3x8rc8fs7kw
GROUP MEMBER ID: cosmos13f2hqukegfz75dqzpdc36uy7aq9elhal72emmc
GROUP ID: 711
<HANGS HERE>

This tell smells like an ORM bug maybe?

@amaury1093
Copy link
Contributor

I already debugged some group sims in the past, can take a look at this.

@amaury1093 amaury1093 self-assigned this Aug 30, 2022
@alexanderbez
Copy link
Contributor Author

@AmauryM it's either this call:

			var groupMember group.GroupMember
			_, err = memIt.LoadNext(&groupMember)
			if errors.ErrORMIteratorDone.Is(err) {
				break
			}

Or this call:

		memIt, err := groupMemberByGroupIndex.Get(ctx.KVStore(key), groupInfo.Id)
		if err != nil {
			msg += fmt.Sprintf("error while returning group member iterator for group with ID %d\n%v\n", groupInfo.Id, err)
			return msg, broken
		}
		defer memIt.Close()

@amaury1093
Copy link
Contributor

amaury1093 commented Aug 31, 2022

I confirm via git bisect that this bug appeared in #12852. @alexanderbez It seems that your analysis is correct, it's hanging on the second call. However, it's not hanging because of group/ORM logic, but in the cachekv store.

See my branch with some logging: main..am/13088-group-sims. Using the same seed 11, I get:

ABC Running GroupTotalWeightInvariant on block 45
ABC Running GroupTotalWeightInvariant on block 50
ABC clearUnsortedCacheSubset BEFORE sortedCache.Set (HERE IT HANGS...)

which means it hangs on this line

store.sortedCache.Set(item.Key, []byte{})

This brings directly to tm-db code. Before I dig into this deeper, @alexanderbez do you have an idea why tm-db might hang here?

@amaury1093 amaury1093 moved this to 📝 Todo in Cosmos-SDK Aug 31, 2022
@amaury1093 amaury1093 moved this from 📝 Todo to 💪 In Progress in Cosmos-SDK Aug 31, 2022
@alexanderbez
Copy link
Contributor Author

alexanderbez commented Aug 31, 2022

Interesting. I have zero clue lol. All my PR did was remove a few types and add a new message type.

As for the cache store, I do know there is a mutex and that iterator gets a write lock, which could maybe cause contention? I'm not really sure. Seems like we need to dig into the cache KVStore more on this one.

EDIT: NVM, the deadlock would be coming from tm-db actually

@alexanderbez
Copy link
Contributor Author

alexanderbez commented Aug 31, 2022

Perhaps we add some logging to iterator to see if there could be any thread contention that would cause a deadlock? Or we could even run the simulation with the race detector on?

@alexanderbez
Copy link
Contributor Author

alexanderbez commented Aug 31, 2022

Ok I think I got the output to display the lock contention: https://pastebin.com/TA70tkz7

But this seems to be coming from levelDB? Or at least our usage of it? Not clear...

I also tweaked the local version of tm-db when running with https://github.com/sasha-s/go-deadlock and I got the following output (which seems to be more useful and relevant): https://pastebin.com/ZNDW4ADN

Looks like we got some recursive locking in tm-db :/

@alexanderbez
Copy link
Contributor Author

alexanderbez commented Aug 31, 2022

@AmauryM take a look at that last pastebin...it's not clear to me if it's an issue/bug in tm-db mem's iterator or the ORM's potentially incorrect usage of it.

It looks like to me and others took a look too, that it's in ORM's incorrect usage -- creating an iterator within a loop. Or maybe it's the invariant...two for loops?

Repository owner moved this from 💪 In Progress to 👏 Done in Cosmos-SDK Sep 1, 2022
@tac0turtle tac0turtle removed this from Cosmos-SDK Sep 5, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants