Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

lib/grandpa: ensure grandpa catch-up logic works #2983

Closed
wants to merge 19 commits into from

Conversation

kishansagathiya
Copy link
Contributor

@kishansagathiya kishansagathiya commented Nov 29, 2022

Changes

  • ensure catch-up logic works when interacting with other gossamer nodes
  • If round in the neighbour message is ahead of our current round by a
    threshold, send a catch up request
  • process the catch up response, if we can't process it at the moment,
    store it to process later.

Tests

  • Run westend local devnet with 4 substrate nodes and one gossamer node.
  • Let them run for a few minutes, let few rounds finish.
  • Stop gossamer node. Wait for few more rounds to finish (at least 2 more rounds).
  • Start the gossamer node.
  • Check through logs that it sends a catch up request to other nodes and catches up to the latest round

Issues

Primary Reviewer

@timwu20

- ensure grandpa catch-up logic works when interacting with substrate nodes
- If round in the neighbour message is ahead of our current round by a
threshold, send a catch up request
- process the catch up response, if we can't process it at the moment,
store it to process later.

Closes #1531
return fmt.Errorf("failed to send catch up request: %w", err)
}

logger.Debugf("successfully sent a catch up request to node %s, for round number %d and set ID %d",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
logger.Debugf("successfully sent a catch up request to node %s, for round number %d and set ID %d",
logger.Debugf("successfully sent a catch up request to peer %s, for round %d and set ID %d",

case <-c.catchUpResponseCh:
return nil
case <-timer.C:
return errors.New("timeout")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would be nice to have this as an sentinel error like: ErrCatchUpTimeout

logger.Debugf("successfully sent a catch up request to node %s, for round number %d and set ID %d",
to, round, setID)

c.waitingOnResponse.Store(true)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you call this c.waitingOnResponse.Store(true) function in sendCatchUpRequest and call here as well, should you remove this one?

}

logger.Debugf(
"processing catch up response with hash %s for round %d and set id %d",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"processing catch up response with hash %s for round %d and set id %d",
"processing catch up response with hash %s for round %d and set ID %d",

msg.Hash, msg.Round, msg.SetID)

// if we aren't currently expecting a catch up response, return
if !c.waitingOnResponse.Load().(bool) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if !c.waitingOnResponse.Load().(bool) {
// TODO: decrease the peer reputation
if !c.waitingOnResponse.Load().(bool) {

Comment on lines +128 to +130
if msg.Hash.IsEmpty() || msg.Number == 0 {
return ErrGHOSTlessCatchUp
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this could be the second verification step in this function placed right after the if !c.grandpa.authority

c.requestsSent = make(map[peer.ID]CatchUpRequest)
c.lock.Unlock()

logger.Debugf("caught up to round; unpaused service and grandpa state round is %d", c.grandpa.state.round)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
logger.Debugf("caught up to round; unpaused service and grandpa state round is %d", c.grandpa.state.round)
logger.Debugf("caught up to round; starting at round %d", c.grandpa.state.round)

return nil
}

func (c *catchUp) verifyPreCommitJustification(msg *CatchUpResponse) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this function already exists in lib/grandpa/message_handler.go, why not use it?

return nil
}

func (c *catchUp) verifyPreVoteJustification(msg *CatchUpResponse) (common.Hash, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this function already exists in lib/grandpa/message_handler.go, why not use it?

Comment on lines +54 to +61
// TODO: Clean up all request sent before 5 min / (neighbour message interval)
c.lock.Lock()
_, ok := c.requestsSent[to]
c.lock.Unlock()
if ok {
logger.Debugf("ignoring neighbour message since we already sent a catch-up request to this peer: %s", to)
return nil
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we implement a TTL on this map?

Will there be cases where you could ask for multiple catch up messages to the same peer?

@@ -0,0 +1,268 @@
// Copyright 2021 ChainSafe Systems (ON)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you rename to catch_up.go.

}

c.lock.Lock()
c.requestsSent[to] = *req
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason why we need to copy the req here? Can this map hold *CatchupRequest?

requestsSent map[peer.ID]CatchUpRequest

catchUpResponseCh chan *CatchUpResponse
waitingOnResponse *atomic.Value
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
waitingOnResponse *atomic.Value
waitingOnResponse *atomic.Bool

case VoteMessageType, CommitMessageType:
s.network.GossipMessage(msg)
return true, nil
case NeighborMessageType, CatchUpRequestMessageType, CatchUpResponseMessageType:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we have a default case that returns an error for unsupported messages?

it seems like polkadot sends current round in neighbor message instead
of last finalised round. So, I had to send a catch request with one
round previous. Otherwise, polkadot would discard our request.
@kishansagathiya
Copy link
Contributor Author

Current testing status,
Screenshot from 2023-01-13 21-25-20
Screenshot from 2023-01-13 21-25-39

On seeing a neighbour message, gossamer sends a catch up request. Substrate responds with catch up response.

Gossamer doesnot receive the catch message or doesn't process it properly.

@timwu20
Copy link
Contributor

timwu20 commented Aug 14, 2023

can you mark as draft or close @kishansagathiya ?

@kishansagathiya kishansagathiya marked this pull request as draft August 15, 2023 05:50
@q9f
Copy link
Contributor

q9f commented Jan 12, 2024

Shall we merge this some day or put the code somewhere else?

@q9f q9f added A-stale issue or PR is deprecated and needs to be closed. S-grandpa issues related to block finality. labels Jan 16, 2024
@kishansagathiya
Copy link
Contributor Author

kishansagathiya commented Jan 16, 2024

Shall we merge this some day or put the code somewhere else?

When I tried to do this, I could not finish this task completely. I was able to send a catch up request to polkadot, I could see the request on logs of polkadot. I could see in polkadot logs that it was sending a response back, but on gossamer side I was not able to see the response. I tried to do this for quite a few days. Got some other folks to help as well, but could not get any results. And eventually gave up.

In gossamer, there is an active task of grandpa refactor is going on (in branch feat/grandpa). That might be the reason no one else has picked it up as well.

@kishansagathiya
Copy link
Contributor Author

closing for too old

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-stale issue or PR is deprecated and needs to be closed. S-grandpa issues related to block finality.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

lib/grandpa: ensure catch-up logic works when interacting with substrate nodes
4 participants