Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retry & backoff mechanism in worker loop #913

Merged
merged 18 commits into from
May 14, 2021
Merged

Retry & backoff mechanism in worker loop #913

merged 18 commits into from
May 14, 2021

Conversation

adizere
Copy link
Member

@adizere adizere commented May 6, 2021

Closes: #943
Follow-up to: #903

Description

  • Adds the Fibonacci-based retry mechanism to method run_uni_chan_path()

  • Ensure we don't retry for more than 2 seconds

  • Fix the sending on a disconnected channel dangling error (details in the trace below)

    Show trace
    May 06 21:29:44.358 ERROR worker loop{worker=channel-0/transfer:ibc-1->ibc-0}: ibc_relayer::supervisor: failed during query to chain id ibc-0 with underlying error: RPC error to endpoint http://localhost:26657/: error trying to connect: tcp connect error: Connection refused (os error 61) (code: 0)
    May 06 21:29:44.359 ERROR worker loop{worker=channel-0/transfer:ibc-1->ibc-0}: ibc_relayer::supervisor: worker error: UnidirectionalChannelPath worker failed after 6 retries
    May 06 21:29:44.359 INFO worker loop{worker=channel-0/transfer:ibc-1->ibc-0}: ibc_relayer::supervisor: worker exits
    May 06 21:29:45.331 ERROR ibc_relayer::supervisor: [ibc-1] error during batch processing: sending on a disconnected channel
    May 06 21:29:46.469 ERROR ibc_relayer::supervisor: [ibc-1] error during batch processing: sending on a disconnected channel
    ...
  • Ensure we can properly resume relaying. At the moment the relayer does not receive events from the chain that went down after the chain is restarted. Fixed: what happened was that the event monitor stopped retrying before the node went back up again.

  • Do not stop the workers when the node goes down because otherwise we don't pick up on packets that were sent after the worker got stopped until some more packets are sent so that the worker starts again and clears them.


For contributor use:

  • Updated the Unreleased section of CHANGELOG.md with the issue.
  • If applicable: Unit tests written, added test to CI.
  • Linked to Github issue with discussion and accepted design OR link to spec that describes this work.
  • Updated relevant documentation (docs/) and code comments.
  • Re-reviewed Files changed in the Github PR explorer.

@adizere adizere requested review from ancazamfir and romac as code owners May 6, 2021 19:26
@adizere adizere changed the title Limited retry & backoff mechanism in worker loop Retry & backoff mechanism in worker loop May 6, 2021
@adizere adizere marked this pull request as draft May 7, 2021 08:06
@romac romac marked this pull request as ready for review May 14, 2021 14:05
@romac romac removed the in progress label May 14, 2021
Copy link
Collaborator

@ancazamfir ancazamfir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, i tested by making the packet worker fail on send_packet event and successful in clearing packets. It switches nicely between event based relaying (passive mode) to packet clearing (active mode) and then back to event. Should be good for now.

@romac romac merged commit d2eaa6a into master May 14, 2021
@romac romac deleted the adi/worker-retry branch May 14, 2021 15:17
hu55a1n1 pushed a commit to hu55a1n1/hermes that referenced this pull request Sep 13, 2022
* Retry strategy in uni-directional worker

* Better retry in worker. Default

* Re-add retries in uni_chan_path worker

* Cleanup

* Ensure total duration of worker retries does not exceed 2 seconds

* Notify supervisor when unichan worker stops after exceeding max retries

* Increase max total retry delay in event monitor to 10min

* Make event monitor less verbose in log level higher than trace

* Fix bug where worker would retry with the next event batch for every retry

* Do not stop the worker after exhausting retry attempts

* Cleanup

* Fix compilation error

* Set clear packets again after failing to retry

* Update changelog

* Add unit test for clamp_delay and fix case where it would overshoot the max delay

Co-authored-by: Romain Ruetschi <romain@informal.systems>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make UniChanPath worker more resilient to node going down
3 participants