
Relayer error handling specification #712

Closed
3 of 5 tasks
ancazamfir opened this issue Feb 25, 2021 · 9 comments
Labels
E: gravity (External: related to Gravity DEX), I: dependencies (Internal: related to dependencies), I: logic (Internal: related to the relaying logic), I: rpc (Internal: related to (g)RPC)
Milestone

Comments

@ancazamfir
Collaborator

ancazamfir commented Feb 25, 2021

Crate

docs

Summary

Currently the relayer performs retries for any error that occurs while building messages and transactions.

Problem Definition

There are RPC, light client, on-chain, and internal errors, among others. The most complex ones are the on-chain errors, which are a bit more subtle (leaving aside errors in the relayer code), and also they are not typed (we just get an error message and the reason is embedded in a string). Retrying may be appropriate in some cases; in others we should not retry (e.g. a transaction has already been submitted by another relayer); and in yet other cases we should rebuild the messages, etc.
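As a rough illustration of the kind of classification we need (the names below are hypothetical, not existing Hermes types):

// Hypothetical classification of errors hit while building/submitting txs.
#[derive(Debug)]
enum RelayerErrorKind {
    // Transport/RPC failure talking to the full node; usually worth a plain retry.
    Rpc(String),
    // Light client failure; typically needs a refresh/rebuild rather than a blind retry.
    LightClient(String),
    // Error reported by the chain in the tx result; untyped today, only a log string.
    OnChain { codespace: String, code: u32, log: String },
    // Bug or unexpected state in the relayer itself.
    Internal(String),
}

impl RelayerErrorKind {
    // Whether a plain retry of the same operation makes sense.
    fn is_retriable(&self) -> bool {
        match self {
            RelayerErrorKind::Rpc(_) => true,
            // e.g. "tx already submitted by another relayer" must not be retried;
            // the per-error decisions are exactly what this issue should document.
            RelayerErrorKind::OnChain { .. } => false,
            RelayerErrorKind::LightClient(_) | RelayerErrorKind::Internal(_) => false,
        }
    }
}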

Proposal

Analyze the code, and document all errors and the action to be taken in each case.
Also consider exponential backoff, as per @romac's informalsystems/ibc-rs#709 (comment), where applicable.
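For the retriable cases, a minimal hand-rolled exponential backoff sketch (not necessarily the exact mechanism proposed in #709):

use std::{thread, time::Duration};

// Retry `op` with exponential backoff, up to `max_attempts` attempts.
// Illustrative only; real code would cap the delay and add jitter.
fn retry_with_backoff<T, E>(
    max_attempts: u32,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut delay = Duration::from_millis(200);
    let mut attempt = 1;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            Err(e) if attempt >= max_attempts => return Err(e),
            Err(_) => {
                thread::sleep(delay);
                delay *= 2; // double the delay after every failed attempt
                attempt += 1;
            }
        }
    }
}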


For Admin Use

  • Not duplicate issue
  • Appropriate labels applied
  • Appropriate milestone (priority) applied
  • Appropriate contributors tagged
  • Contributor assigned/self-assigned
@ancazamfir ancazamfir added the I: logic Internal: related to the relaying logic label Feb 25, 2021
@ancazamfir ancazamfir added this to the 05.2021 milestone Feb 25, 2021
@ancazamfir ancazamfir modified the milestones: 05.2021, 06.2021 Apr 28, 2021
@adizere adizere modified the milestones: 06.2021, 05.2021 Apr 29, 2021
@romac romac added the E: gravity External: related to Gravity DEX label Apr 29, 2021
@ebuchman
Member

ebuchman commented May 5, 2021

also they are not typed (we just get an error message and the reason is embedded in a string)

May be worth trying to get these typed upstream in ibc

@soareschen
Contributor

soareschen commented May 7, 2021

I tried to do a full trace of an example error message to see how it is propagated. There is a complicated path from the Hermes relayer to the IBC module, and there are many potential errors that can occur in each step.

On the Go side, the error messages are formatted using helpers such as sdkerrors.Wrapf and end up converted into strings. It would be challenging to modify the Go code to use more structured error reporting, so we can only pattern-match on the string to find the types of errors we are interested in.

Here is an example error message:

Error: tx error: failed with underlying cause: tx response error: deliver_tx reports error: log=Log("failed to execute message; message index: 1: connection handshake open try failed: failed connection state verification for client (07-tendermint-4): chained membership proof failed to verify membership of value: 0A0F... in subroot C9FE...at index 0. Please ensure the path and value are both correct.: invalid proof")

And here is the manual stack trace:

I'm not sure if there is a good way to document all errors in the Go code, other than searching the code for keywords such as status.Errorf, sdkerrors.Wrap, sdkerrors.Wrapf, and fmt.Errorf, and looking for the standard errors in ibc-go/modules/*/errors.go. If we do document the errors, we could submit the documentation to ibc-go as well.
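For the pattern matching, a rough sketch of what it could look like on the relayer side; only the matched substrings come from actual logs like the one above, the function itself is hypothetical:

// Hypothetical coarse classification of a deliver_tx log string.
fn classify_log(log: &str) -> &'static str {
    if log.contains("invalid proof") {
        "proof verification failed on chain"
    } else if log.contains("invalid connection state") {
        "counterparty connection is not in the expected state"
    } else {
        "unclassified on-chain error"
    }
}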

@ancazamfir
Collaborator Author

we get the response from the Tx here: https://github.com/informalsystems/ibc-rs/blob/2592440b0c599304a1032746ce975b2cf7baa673/relayer/src/chain/cosmos.rs#L1479

The response has a check_tx and a deliver_tx result with type TxResult defined here https://github.com/informalsystems/tendermint-rs/blob/89b3fd0a236b54fcbd149aa9917a52f7d0a6de26/rpc/src/endpoint/broadcast/tx_commit.rs#L61-L91

We currently take the log and embed it in an error. I think we are interested in returning typed errors based on the code and letting the caller decide whether the error is recoverable or not.

For IBC the errors (codes) are defined in the errors.go file in each sub-protocol. One example: https://github.com/cosmos/ibc-go/blob/main/modules/core/03-connection/types/errors.go

If I try to send a chan-open-try for a connection that is not opened, an error is issued from the chain here (ErrInvalidConnectionState which is code 6):
https://github.com/cosmos/ibc-go/blob/f4123347754d814759c9208273fe6772880063cb/modules/core/04-channel/keeper/handshake.go#L154-L158

And we get this response (println from our tx_result_to_event):

Response {
    check_tx: TxResult {
        code: Ok,
        data: None,
        log: Log(
            "[]",
        ),
        info: Info(
            "",
        ),
        gas_wanted: Gas(
            0,
        ),
        gas_used: Gas(
            0,
        ),
        events: [],
        codespace: Codespace(
            "",
        ),
    },
    deliver_tx: TxResult {
        code: Err(
            6,
        ),
        data: None,
        log: Log(
            "failed to execute message; message index: 1: channel handshake open try failed: connection state is not OPEN (got STATE_INIT): invalid connection state",
        ),
        info: Info(
            "",
        ),
        gas_wanted: Gas(
            0,
        ),
        gas_used: Gas(
            0,
        ),
        events: [],
        codespace: Codespace(
            "connection",
        ),
    },
    hash: transaction::Hash(7D51A00CC530011768600E1BF64816D3CB99BE0EBBC1408E7DCC09CC4C9972B7),
    height: block::Height(39444),
}

I think we need to look at all the IBC errors.go files and map the error codes to Hermes typed errors. Not sure if a one-to-one mapping is needed, but it is a start.
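For example, a hypothetical mapping from the (codespace, code) pair in deliver_tx to a typed error; only the "connection" / 6 pair comes from the response above, the names are made up:

// Hypothetical typed errors derived from the deliver_tx (codespace, code) pair.
#[derive(Debug)]
enum ChainError {
    // codespace "connection", code 6: ErrInvalidConnectionState in ibc-go.
    InvalidConnectionState,
    // Everything not classified yet; keep the raw data for diagnostics.
    Unknown { codespace: String, code: u32, log: String },
}

fn classify_deliver_tx(codespace: &str, code: u32, log: &str) -> ChainError {
    match (codespace, code) {
        ("connection", 6) => ChainError::InvalidConnectionState,
        _ => ChainError::Unknown {
            codespace: codespace.to_string(),
            code,
            log: log.to_string(),
        },
    }
}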

@soareschen
Contributor

Yes, that is my impression as well. However, the error codes are just integers specific to each module, and I am not sure if they are guaranteed to not overlap when making a call. My initial thought was to do string pattern matching to determine the exact error. Also, the Go code does not just use sdkerrors.Wrapf, but also other more general functions such as fmt.Errorf. We could perhaps try to classify the more prominent errors in the errors.go files first, though I don't think they are exhaustive.

So do you mean the main choke point we need to handle is tx_result_to_event? This was not specified in the issue description, and I ended up hunting through the entire code base without knowing where the errors of interest originate from. 😅

@ancazamfir
Collaborator Author

and I am not sure if they are guaranteed to not overlap when making a call.

They don't overlap within the same codespace.

Also, the Go code does not just use sdkerrors.Wrapf, but also other more general functions such as fmt.Errorf

Yes, but on the relayer side we are interested in that code integer, and in translating it into a typed error.

So do you mean the main choke point we need to handle is tx_result_to_event? This was not specified in the issue description,

Sorry, I should have added more details to "... most complex ones are the on-chain errors". But looking at the traces you posted, it looked to me that you were on the right track.

@soareschen
Contributor

They don't overlap within the same codespace.

Aha! So that's the devil in the details! Then I'm confident in interpreting the error code directly.

Btw, I'm still quite confused that when an error happens, tx_result_to_event still succeeds, returning the error as an event:

    if response.check_tx.code.is_err() {
        return Ok(vec![IbcEvent::ChainError(format!(
            "check_tx reports error: log={:?}",
            response.check_tx.log
        ))]);
    }

Is that intentional? If not I will just change it to return them as newly classified errors in relayer::errors::Error.
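Roughly what I have in mind, assuming a new (currently non-existent) check_tx_failed constructor on relayer::errors::Error:

    // Hypothetical: surface the check_tx failure as a typed error instead of
    // wrapping the log into an IbcEvent::ChainError and still returning Ok.
    if response.check_tx.code.is_err() {
        return Err(Error::check_tx_failed(format!(
            "check_tx reports error: log={:?}",
            response.check_tx.log
        )));
    }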

@ancazamfir ancazamfir modified the milestones: 05.2021, 06.2021 May 25, 2021
@ancazamfir
Collaborator Author

Is that intentional? If not I will just change it to return them as newly classified errors in relayer::errors::Error.

That's our current way to distinguish between chain IBC errors and everything else. There are very few cases here for which we should retry, but we need to go over the full list of IBC SdkErrors first. And yes, we can work with typed errors here too.

@adizere adizere added this to the 10.2021 milestone Aug 3, 2021
@adizere adizere modified the milestones: 10.2021, 12.2021 Sep 6, 2021
@adizere adizere modified the milestones: v1.0.0, Backlog Feb 25, 2022
@adizere adizere added I: dependencies Internal: related to dependencies I: rpc Internal: related to (g)RPC labels Feb 25, 2022
@adizere adizere modified the milestones: Backlog, v2 May 9, 2022
@adizere
Member

adizere commented Jun 3, 2022

Seems like we have some dead code capturing various errors that come from IBC-go. Ref: https://github.com/informalsystems/ibc-rs/pull/2266#issuecomment-1145951834

@adizere adizere modified the milestones: v2, v1.1 Jun 29, 2022
@ancazamfir
Collaborator Author

Closing, as most of the work has been done in #988 and other PRs over the years. Let's reopen if needed.
