
Relayer error handling specification #712

Closed
3 of 5 tasks
ancazamfir opened this issue Feb 25, 2021 · 9 comments
Labels
E: gravity (External: related to Gravity DEX), I: dependencies (Internal: related to dependencies), I: logic (Internal: related to the relaying logic), I: rpc (Internal: related to (g)RPC)
Milestone

Comments

@ancazamfir
Collaborator

ancazamfir commented Feb 25, 2021

Crate

docs

Summary

Currently the relayer performs retries for any error that occurs while building messages and transactions.

Problem Definition

There are RPC, light client, on-chain, and internal errors, among others. The most complex ones are the on-chain errors, which are a bit more subtle (leaving aside errors in the relayer code), and also they are not typed (we just get an error message and the reason is embedded in a string). Retrying may be appropriate in some cases; in others we should not retry (e.g. a transaction has already been submitted by another relayer); and in yet other cases we should rebuild the messages, etc.
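As a rough illustration of the kind of classification we need (the names below are hypothetical, not existing Hermes types):

// Hypothetical classification of errors hit while building/submitting txs.
#[derive(Debug)]
enum RelayerErrorKind {
    // Transport/RPC failure talking to the full node; usually worth a plain retry.
    Rpc(String),
    // Light client failure; typically needs a refresh/rebuild rather than a blind retry.
    LightClient(String),
    // Error reported by the chain in the tx result; untyped today, only a log string.
    OnChain { codespace: String, code: u32, log: String },
    // Bug or unexpected state in the relayer itself.
    Internal(String),
}

impl RelayerErrorKind {
    // Whether a plain retry of the same operation makes sense.
    fn is_retriable(&self) -> bool {
        match self {
            RelayerErrorKind::Rpc(_) => true,
            // e.g. "tx already submitted by another relayer" must not be retried;
            // the per-error decisions are exactly what this issue should document.
            RelayerErrorKind::OnChain { .. } => false,
            RelayerErrorKind::LightClient(_) | RelayerErrorKind::Internal(_) => false,
        }
    }
}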

Proposal

Analyze the code, and document all errors and the action to be taken in each case.
Also consider exponential backoff, as per @romac's informalsystems/ibc-rs#709 (comment), where applicable.
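For the retriable cases, a minimal hand-rolled exponential backoff sketch (not necessarily the exact mechanism proposed in #709):

use std::{thread, time::Duration};

// Retry `op` with exponential backoff, up to `max_attempts` attempts.
// Illustrative only; real code would cap the delay and add jitter.
fn retry_with_backoff<T, E>(
    max_attempts: u32,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut delay = Duration::from_millis(200);
    let mut attempt = 1;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            Err(e) if attempt >= max_attempts => return Err(e),
            Err(_) => {
                thread::sleep(delay);
                delay *= 2; // double the delay after every failed attempt
                attempt += 1;
            }
        }
    }
}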


For Admin Use

  • Not duplicate issue
  • Appropriate labels applied
  • Appropriate milestone (priority) applied
  • Appropriate contributors tagged
  • Contributor assigned/self-assigned
@ancazamfir ancazamfir added the I: logic Internal: related to the relaying logic label Feb 25, 2021
@ancazamfir ancazamfir added this to the 05.2021 milestone Feb 25, 2021
@ancazamfir ancazamfir modified the milestones: 05.2021, 06.2021 Apr 28, 2021
@adizere adizere modified the milestones: 06.2021, 05.2021 Apr 29, 2021
@romac romac added the E: gravity External: related to Gravity DEX label Apr 29, 2021
@ebuchman
Member

ebuchman commented May 5, 2021

also they are not typed (we just get an error message and the reason is embedded in a string)

May be worth trying to get these typed upstream in ibc

@soareschen
Contributor

soareschen commented May 7, 2021

I tried to do a full trace of an example error message to see how it is propagated. There is a complicated path from the Hermes relayer to the IBC module, and there are many potential errors that can occur in each step.

On the Go side, the error messages are formatted using helpers such as sdkerrors.Wrapf and end up converted into strings. It would be challenging to modify the Go code to use more structured error reporting, so we can only pattern-match on the string to find the types of errors we are interested in.

Here is an example error message:

Error: tx error: failed with underlying cause: tx response error: deliver_tx reports error: log=Log("failed to execute message; message index: 1: connection handshake open try failed: failed connection state verification for client (07-tendermint-4): chained membership proof failed to verify membership of value: 0A0F... in subroot C9FE...at index 0. Please ensure the path and value are both correct.: invalid proof")

And here is the manual stack trace:

I'm not sure if there is a good way to document all errors in the Go code, other than searching the code for keywords such as status.Errorf, sdkerrors.Wrap, sdkerrors.Wrapf, and fmt.Errorf, and looking for the standard errors in ibc-go/modules/*/errors.go. If we do document the errors, we could submit the documentation to ibc-go as well.
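For the pattern matching, a rough sketch of what it could look like on the relayer side; only the matched substrings come from actual logs like the one above, the function itself is hypothetical:

// Hypothetical coarse classification of a deliver_tx log string.
fn classify_log(log: &str) -> &'static str {
    if log.contains("invalid proof") {
        "proof verification failed on chain"
    } else if log.contains("invalid connection state") {
        "counterparty connection is not in the expected state"
    } else {
        "unclassified on-chain error"
    }
}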

@ancazamfir
Collaborator Author

we get the response from the Tx here: https://github.com/informalsystems/ibc-rs/blob/2592440b0c599304a1032746ce975b2cf7baa673/relayer/src/chain/cosmos.rs#L1479

The response has a check_tx and a deliver_tx result with type TxResult defined here https://github.com/informalsystems/tendermint-rs/blob/89b3fd0a236b54fcbd149aa9917a52f7d0a6de26/rpc/src/endpoint/broadcast/tx_commit.rs#L61-L91

We currently take the log and embed it in an error. I think we are interested in returning typed errors based on the code and letting the caller decide whether the error is recoverable or not.

For IBC the errors (codes) are defined in the errors.go file in each sub-protocol. One example: https://github.com/cosmos/ibc-go/blob/main/modules/core/03-connection/types/errors.go

If I try to send a chan-open-try for a connection that is not opened, an error is issued from the chain here (ErrInvalidConnectionState which is code 6):
https://github.com/cosmos/ibc-go/blob/f4123347754d814759c9208273fe6772880063cb/modules/core/04-channel/keeper/handshake.go#L154-L158

And we get this response (println from our tx_result_to_event):

Response {
    check_tx: TxResult {
        code: Ok,
        data: None,
        log: Log(
            "[]",
        ),
        info: Info(
            "",
        ),
        gas_wanted: Gas(
            0,
        ),
        gas_used: Gas(
            0,
        ),
        events: [],
        codespace: Codespace(
            "",
        ),
    },
    deliver_tx: TxResult {
        code: Err(
            6,
        ),
        data: None,
        log: Log(
            "failed to execute message; message index: 1: channel handshake open try failed: connection state is not OPEN (got STATE_INIT): invalid connection state",
        ),
        info: Info(
            "",
        ),
        gas_wanted: Gas(
            0,
        ),
        gas_used: Gas(
            0,
        ),
        events: [],
        codespace: Codespace(
            "connection",
        ),
    },
    hash: transaction::Hash(7D51A00CC530011768600E1BF64816D3CB99BE0EBBC1408E7DCC09CC4C9972B7),
    height: block::Height(39444),
}

I think we need to look at all the IBC errors.go files and map the error codes to Hermes typed errors. Not sure if a one-to-one mapping is needed, but it is a start.
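For example, a hypothetical mapping from the (codespace, code) pair in deliver_tx to a typed error; only the "connection" / 6 pair comes from the response above, the names are made up:

// Hypothetical typed errors derived from the deliver_tx (codespace, code) pair.
#[derive(Debug)]
enum ChainError {
    // codespace "connection", code 6: ErrInvalidConnectionState in ibc-go.
    InvalidConnectionState,
    // Everything not classified yet; keep the raw data for diagnostics.
    Unknown { codespace: String, code: u32, log: String },
}

fn classify_deliver_tx(codespace: &str, code: u32, log: &str) -> ChainError {
    match (codespace, code) {
        ("connection", 6) => ChainError::InvalidConnectionState,
        _ => ChainError::Unknown {
            codespace: codespace.to_string(),
            code,
            log: log.to_string(),
        },
    }
}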

@soareschen
Contributor

Yes, that is my impression as well. However, the error codes are just integers specific to each module, and I am not sure if they are guaranteed to not overlap when making a call. My initial thought was to do string pattern matching to determine the exact error. Also, the Go code does not just use sdkerrors.Wrapf, but also other more general functions such as fmt.Errorf. We could perhaps try to classify the more prominent errors in the errors.go files first, though I don't think they are exhaustive.

So do you mean the main choke point we need to handle is tx_result_to_event? This was not specified in the issue description, and I ended up hunting through the entire code base without knowing where the errors of interest originate from. 😅

@ancazamfir
Collaborator Author

and I am not sure if they are guaranteed to not overlap when making a call.

They don't overlap within the same codespace.

Also, the Go code does not just use sdkerrors.Wrapf, but also other more general functions such as fmt.Errorf

Yes, but on the relayer side we are interested in that code integer, and in translating it into a typed error.

So do you mean the main choke point we need to handle is tx_result_to_event? This was not specified in the issue description,

Sorry, I should have added more details to "... most complex ones are the on-chain errors". But looking at the traces you posted, it looked to me that you were on the right track.

@soareschen
Contributor

They don't overlap within the same codespace.

Aha! So that's the devil in the details! Then I'm confident in interpreting the error code directly.

Btw, I'm still quite confused that when an error happens, tx_result_to_event still succeeds, returning the error as an event:

    if response.check_tx.code.is_err() {
        return Ok(vec![IbcEvent::ChainError(format!(
            "check_tx reports error: log={:?}",
            response.check_tx.log
        ))]);
    }

Is that intentional? If not I will just change it to return them as newly classified errors in relayer::errors::Error.
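Roughly what I have in mind, assuming a new (currently non-existent) check_tx_failed constructor on relayer::errors::Error:

    // Hypothetical: surface the check_tx failure as a typed error instead of
    // wrapping the log into an IbcEvent::ChainError and still returning Ok.
    if response.check_tx.code.is_err() {
        return Err(Error::check_tx_failed(format!(
            "check_tx reports error: log={:?}",
            response.check_tx.log
        )));
    }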

@ancazamfir ancazamfir modified the milestones: 05.2021, 06.2021 May 25, 2021
@ancazamfir
Collaborator Author

Is that intentional? If not I will just change it to return them as newly classified errors in relayer::errors::Error.

That's our current way to distinguish between chain IBC errors and everything else. There are very few cases here for which we should retry, but we need to go over the full list of IBC SdkErrors first. And yes, we can work with typed errors here too.

@adizere adizere added this to the 10.2021 milestone Aug 3, 2021
@adizere adizere modified the milestones: 10.2021, 12.2021 Sep 6, 2021
@adizere adizere modified the milestones: v1.0.0, Backlog Feb 25, 2022
@adizere adizere added I: dependencies Internal: related to dependencies I: rpc Internal: related to (g)RPC labels Feb 25, 2022
@adizere adizere modified the milestones: Backlog, v2 May 9, 2022
@adizere
Member

adizere commented Jun 3, 2022

Seems like we have some dead code capturing various errors that come from IBC-go. Ref: https://github.com/informalsystems/ibc-rs/pull/2266#issuecomment-1145951834

@adizere adizere modified the milestones: v2, v1.1 Jun 29, 2022
@ancazamfir
Collaborator Author

Closing, as most of the work has been done in #988 and other PRs over the years. Let's reopen if needed.
