Improve resilience to failed packets on ordered channels #3540
Comments
@soareschen suggested that we could maintain a priority queue of packets to be relayed, ordered by sequence numbers. As I see it, we would update the queue whenever we get a new event batch. After successfully submitting a tx, we would remove from the queue the packets we just submitted and move on to the remaining ones. Whenever we fail to do so, we can refresh the queue by querying for the packets yet to be relayed on the channel. Does that make sense to you, @ljoss17 @soareschen?
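For illustration, a minimal sketch of such a queue — `PendingQueue` and its method names are hypothetical, not the Hermes API; a sorted set gives the same ordering guarantee as a priority queue, with cheaper removal of arbitrary elements:

```rust
use std::collections::BTreeSet;

/// Hypothetical sketch of a pending-packet queue ordered by sequence number.
#[derive(Default)]
struct PendingQueue {
    sequences: BTreeSet<u64>,
}

impl PendingQueue {
    /// Merge the sequences from a freshly received event batch.
    fn extend_from_batch(&mut self, batch: impl IntoIterator<Item = u64>) {
        self.sequences.extend(batch);
    }

    /// Drop the sequences that were just submitted successfully.
    fn mark_submitted(&mut self, submitted: &[u64]) {
        for seq in submitted {
            self.sequences.remove(seq);
        }
    }

    /// On a failed submission, rebuild the queue from an on-chain query of
    /// packets not yet relayed (the query itself is left abstract here).
    fn refresh(&mut self, unreceived: impl IntoIterator<Item = u64>) {
        self.sequences = unreceived.into_iter().collect();
    }

    /// The next packets to relay, in ascending sequence order.
    fn next_batch(&self, max: usize) -> Vec<u64> {
        self.sequences.iter().take(max).copied().collect()
    }
}
```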
Why can't we trigger/call packet clearing, if needed, for seq < n before we process the event for n?
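A sketch of that guard, with hypothetical closures standing in for the relayer's actual queries and submission path:

```rust
/// Hypothetical sketch: before processing the event for sequence `n` on an
/// ordered channel, clear any earlier sequences still pending downstream.
fn relay_ordered_event(
    n: u64,
    query_next_sequence_recv: impl Fn() -> u64,
    clear_pending: impl Fn(),
    relay_event: impl Fn(u64),
) {
    let next_expected = query_next_sequence_recv();
    if next_expected < n {
        // Sequences next_expected..n were never received on the destination;
        // relaying n now would fail, so clear the backlog first.
        clear_pending();
    }
    relay_event(n);
}
```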
Quick notes from the call with @ljoss17 and @seanchen1991:

```diff
diff --git a/crates/relayer/src/link/relay_path.rs b/crates/relayer/src/link/relay_path.rs
index dc01a71f7..e4b74a1b6 100644
--- a/crates/relayer/src/link/relay_path.rs
+++ b/crates/relayer/src/link/relay_path.rs
@@ -673,10 +673,40 @@ impl<ChainA: ChainHandle, ChainB: ChainHandle> RelayPath<ChainA, ChainB> {
                 }
                 Err(LinkError(error::LinkErrorDetail::Send(e), _)) => {
                     // This error means we could retry
-                    error!("error {}", e.event);
+                    error!("error {}", e.event); // TODO: better message here
+
+                    // If on an ordered channel, do a packet clearing, but only
+                    // if we are not doing a packet clearing already.
+                    if self.ordered_channel() && i == 0 && !odata.tracking_id.is_clearing() {
+                        // Need to specify a height for clearing?
+                        // Probably not, since no progress will have been made.
+                        self.relay_pending_packets(None)?;
+                    }
+
+                    // fine up to:    112
+                    // next expected: 113
+                    // submitting:    120, 121
+                    //
+                    // a1) pending: 114, 115, 116, 117, 118, 119, 120, 121
+                    //     cleared: 114, 115, 116, 117, 118, 119, 120, 121
+                    //
+                    // a2) pending: 114, 115, 116, 117, 118, 119
+                    //     cleared: 114, 115, 116, 117, 118, 119
+                    //
+                    // b1) pending: 114, 115, 116, xxx, 118, 119
+                    //     cleared: 114, 115, 116 (?)
+                    //
+                    // b2) pending: 114, 115, 116, xxx, 118, 119
+                    //     cleared: none
+                    //
+                    // c)  pending: 114, 115, 116, 117, 118, 119
+                    //     cleared: none
+
                     if i + 1 == MAX_RETRIES {
                         error!("{}/{} retries exhausted. giving up", i + 1, MAX_RETRIES)
                     } else {
+                        // TODO: output the current retry step here in debug!
+
                         // If we haven't exhausted all retries, regenerate the op. data & retry
                         match self.regenerate_operational_data(odata.clone()) {
                             None => return Ok(S::Reply::empty()), // Nothing to retry
```
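The a/b/c notes in the diff hinge on ordered-channel semantics: the destination only accepts the next expected sequence, so clearing can at best make progress on a contiguous prefix of the pending packets. A self-contained sketch of that constraint under the scenario-b1 interpretation (the helper is hypothetical, not Hermes code):

```rust
/// Returns the contiguous prefix of `pending` (assumed sorted) that an
/// ordered channel would accept, starting from `next_expected`.
fn clearable_prefix(pending: &[u64], next_expected: u64) -> Vec<u64> {
    let mut cleared = Vec::new();
    let mut expected = next_expected;
    for &seq in pending {
        if seq != expected {
            // A gap (e.g. the missing 117 in b1/b2) stops clearing here.
            break;
        }
        cleared.push(seq);
        expected += 1;
    }
    cleared
}

fn main() {
    // Scenario b1: 117 is missing, so only 114, 115, 116 can be cleared.
    let pending = [114, 115, 116, 118, 119];
    assert_eq!(clearable_prefix(&pending, 114), vec![114, 115, 116]);
    // Scenario a2: no gap, so everything pending can be cleared.
    let pending = [114, 115, 116, 117, 118, 119];
    assert_eq!(clearable_prefix(&pending, 114).len(), 6);
}
```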
Can #3695 also be solved as part of this?
Yes, it seems to be resolved with the change.
Closes: #3695
Summary
There is a risk that Hermes enters a state where it fails to relay all packets on an ordered channel until packet clearing is triggered.
This can happen when a packet fails to be relayed without Hermes noticing the failure. Hermes will then try to relay the subsequent packets, but they will all fail due to a sequence number mismatch.
Hermes shouldn't blindly try to relay further packets if it fails to relay one on an ordered channel.
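To make the failure mode concrete, here is a minimal, self-contained model of ordered-channel receipt (an illustration, not Hermes code): once a single packet is missed, every later sequence is rejected until clearing replays the missing one.

```rust
/// Minimal model of an ordered channel: receives must arrive in strictly
/// increasing sequence order.
struct OrderedChannel {
    next_expected: u64,
}

impl OrderedChannel {
    fn receive(&mut self, seq: u64) -> Result<(), String> {
        if seq != self.next_expected {
            return Err(format!(
                "sequence mismatch: got {seq}, expected {}",
                self.next_expected
            ));
        }
        self.next_expected += 1;
        Ok(())
    }
}

fn main() {
    let mut chan = OrderedChannel { next_expected: 113 };
    // Packet 113 was lost without Hermes noticing: 114 onward all fail.
    assert!(chan.receive(114).is_err());
    assert!(chan.receive(115).is_err());
    // Packet clearing replays 113 first, unblocking the channel.
    assert!(chan.receive(113).is_ok());
    assert!(chan.receive(114).is_ok());
}
```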