Improve resilience to failed packets on ordered channels #3540
Comments
@soareschen suggested that we could maintain a priority queue of packets to be relayed, ordered by sequence numbers. As I see it, we would update the queue whenever we get a new event batch. After successfully submitting a tx, we would remove from the queue the packets we just submitted and move on to the remaining ones. Whenever we fail to do so, we can refresh the queue by querying for the packets yet to be relayed on the channel. Does that make sense to you, @ljoss17 @soareschen?
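For illustration, a minimal sketch of such a queue — `PendingQueue` and its method names are hypothetical, not the Hermes API; a sorted set gives the same ordering guarantee as a priority queue, with cheaper removal of arbitrary elements:

```rust
use std::collections::BTreeSet;

/// Hypothetical sketch of a pending-packet queue ordered by sequence number.
#[derive(Default)]
struct PendingQueue {
    sequences: BTreeSet<u64>,
}

impl PendingQueue {
    /// Merge the sequences from a freshly received event batch.
    fn extend_from_batch(&mut self, batch: impl IntoIterator<Item = u64>) {
        self.sequences.extend(batch);
    }

    /// Drop the sequences that were just submitted successfully.
    fn mark_submitted(&mut self, submitted: &[u64]) {
        for seq in submitted {
            self.sequences.remove(seq);
        }
    }

    /// On a failed submission, rebuild the queue from an on-chain query of
    /// packets not yet relayed (the query itself is left abstract here).
    fn refresh(&mut self, unreceived: impl IntoIterator<Item = u64>) {
        self.sequences = unreceived.into_iter().collect();
    }

    /// The next packets to relay, in ascending sequence order.
    fn next_batch(&self, max: usize) -> Vec<u64> {
        self.sequences.iter().take(max).copied().collect()
    }
}
```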
Why can't we trigger/call packet clearing, if needed, for seq < n before we process the event for n?
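A sketch of that guard, with hypothetical closures standing in for the relayer's actual queries and submission path:

```rust
/// Hypothetical sketch: before processing the event for sequence `n` on an
/// ordered channel, clear any earlier sequences still pending downstream.
fn relay_ordered_event(
    n: u64,
    query_next_sequence_recv: impl Fn() -> u64,
    clear_pending: impl Fn(),
    relay_event: impl Fn(u64),
) {
    let next_expected = query_next_sequence_recv();
    if next_expected < n {
        // Sequences next_expected..n were never received on the destination;
        // relaying n now would fail, so clear the backlog first.
        clear_pending();
    }
    relay_event(n);
}
```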
Quick notes from the call with @ljoss17 and @seanchen1991:

```diff
diff --git a/crates/relayer/src/link/relay_path.rs b/crates/relayer/src/link/relay_path.rs
index dc01a71f7..e4b74a1b6 100644
--- a/crates/relayer/src/link/relay_path.rs
+++ b/crates/relayer/src/link/relay_path.rs
@@ -673,10 +673,40 @@ impl<ChainA: ChainHandle, ChainB: ChainHandle> RelayPath<ChainA, ChainB> {
                 }
                 Err(LinkError(error::LinkErrorDetail::Send(e), _)) => {
                     // This error means we could retry
-                    error!("error {}", e.event);
+                    error!("error {}", e.event); // TODO: better message here
+
+                    // If on an ordered channel, do a packet clearing, but only
+                    // if we are not doing a packet clearing already.
+                    if self.ordered_channel() && i == 0 && !odata.tracking_id.is_clearing() {
+                        // Need to specify a height for clearing?
+                        // Probably not, since no progress will have been made.
+                        self.relay_pending_packets(None)?;
+                    }
+
+                    // fine up to:    112
+                    // next expected: 113
+                    // submitting:    120, 121
+                    //
+                    // a1) pending: 114, 115, 116, 117, 118, 119, 120, 121
+                    //     cleared: 114, 115, 116, 117, 118, 119, 120, 121
+                    //
+                    // a2) pending: 114, 115, 116, 117, 118, 119
+                    //     cleared: 114, 115, 116, 117, 118, 119
+                    //
+                    // b1) pending: 114, 115, 116, xxx, 118, 119
+                    //     cleared: 114, 115, 116 (?)
+                    //
+                    // b2) pending: 114, 115, 116, xxx, 118, 119
+                    //     cleared: none
+                    //
+                    // c)  pending: 114, 115, 116, 117, 118, 119
+                    //     cleared: none
+
                     if i + 1 == MAX_RETRIES {
                         error!("{}/{} retries exhausted. giving up", i + 1, MAX_RETRIES)
                     } else {
+                        // TODO: output the current retry step here in debug!
+
                         // If we haven't exhausted all retries, regenerate the op. data & retry
                         match self.regenerate_operational_data(odata.clone()) {
                             None => return Ok(S::Reply::empty()), // Nothing to retry
```
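The a/b/c notes in the diff hinge on ordered-channel semantics: the destination only accepts the next expected sequence, so clearing can at best make progress on a contiguous prefix of the pending packets. A self-contained sketch of that constraint under the scenario-b1 interpretation (the helper is hypothetical, not Hermes code):

```rust
/// Returns the contiguous prefix of `pending` (assumed sorted) that an
/// ordered channel would accept, starting from `next_expected`.
fn clearable_prefix(pending: &[u64], next_expected: u64) -> Vec<u64> {
    let mut cleared = Vec::new();
    let mut expected = next_expected;
    for &seq in pending {
        if seq != expected {
            // A gap (e.g. the missing 117 in b1/b2) stops clearing here.
            break;
        }
        cleared.push(seq);
        expected += 1;
    }
    cleared
}

fn main() {
    // Scenario b1: 117 is missing, so only 114, 115, 116 can be cleared.
    let pending = [114, 115, 116, 118, 119];
    assert_eq!(clearable_prefix(&pending, 114), vec![114, 115, 116]);
    // Scenario a2: no gap, so everything pending can be cleared.
    let pending = [114, 115, 116, 117, 118, 119];
    assert_eq!(clearable_prefix(&pending, 114).len(), 6);
}
```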
Can #3695 also be solved as part of this?
Yes, it seems to be resolved with the change.
Closes: #3695
Summary
There is a risk that Hermes enters a state where it fails to relay all packets on an ordered channel until packet clearing is triggered.
This can happen when a packet fails to be relayed without Hermes noticing the failure. Hermes will then try to relay the subsequent packets, but they will all fail due to a sequence number mismatch.
Hermes shouldn't blindly try to relay further packets if it fails to relay one on an ordered channel.
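To make the failure mode concrete, here is a minimal, self-contained model of ordered-channel receipt (an illustration, not Hermes code): once a single packet is missed, every later sequence is rejected until clearing replays the missing one.

```rust
/// Minimal model of an ordered channel: receives must arrive in strictly
/// increasing sequence order.
struct OrderedChannel {
    next_expected: u64,
}

impl OrderedChannel {
    fn receive(&mut self, seq: u64) -> Result<(), String> {
        if seq != self.next_expected {
            return Err(format!(
                "sequence mismatch: got {seq}, expected {}",
                self.next_expected
            ));
        }
        self.next_expected += 1;
        Ok(())
    }
}

fn main() {
    let mut chan = OrderedChannel { next_expected: 113 };
    // Packet 113 was lost without Hermes noticing: 114 onward all fail.
    assert!(chan.receive(114).is_err());
    assert!(chan.receive(115).is_err());
    // Packet clearing replays 113 first, unblocking the channel.
    assert!(chan.receive(113).is_ok());
    assert!(chan.receive(114).is_ok());
}
```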