Retry & backoff mechanism in worker loop #913

adizere · 2021-05-06T19:26:50Z

Closes: #943
Follow-up to: #903

Description

Adds the Fibonacci-based retry mechanism to method run_uni_chan_path()
Ensure we don't retry for more than 2 seconds
Fix the sending on a disconnected channel dangling error (details in the trace below)

Show trace

May 06 21:29:44.358 ERROR worker loop{worker=channel-0/transfer:ibc-1->ibc-0}: ibc_relayer::supervisor: failed during query to chain id ibc-0 with underlying error: RPC error to endpoint http://localhost:26657/: error trying to connect: tcp connect error: Connection refused (os error 61) (code: 0)
May 06 21:29:44.359 ERROR worker loop{worker=channel-0/transfer:ibc-1->ibc-0}: ibc_relayer::supervisor: worker error: UnidirectionalChannelPath worker failed after 6 retries
May 06 21:29:44.359 INFO worker loop{worker=channel-0/transfer:ibc-1->ibc-0}: ibc_relayer::supervisor: worker exits
May 06 21:29:45.331 ERROR ibc_relayer::supervisor: [ibc-1] error during batch processing: sending on a disconnected channel
May 06 21:29:46.469 ERROR ibc_relayer::supervisor: [ibc-1] error during batch processing: sending on a disconnected channel
...
Ensure we can properly resume relaying. At the moment the relayer does not receive events from the chain that went down after the chain is restarted. Fixed: what happened was that the event monitor stopped retrying before the node went back up again.
Do not stop the workers when the node goes down because otherwise we don't pick up on packets that were sent after the worker got stopped until some more packets are sent so that the worker starts again and clears them.

For contributor use:

Updated the Unreleased section of CHANGELOG.md with the issue.
If applicable: Unit tests written, added test to CI.
Linked to Github issue with discussion and accepted design OR link to spec that describes this work.
Updated relevant documentation (docs/) and code comments.
Re-reviewed Files changed in the Github PR explorer.

…retry

…he max delay

ancazamfir

Looks good, i tested by making the packet worker fail on send_packet event and successful in clearing packets. It switches nicely between event based relaying (passive mode) to packet clearing (active mode) and then back to event. Should be good for now.

* Retry strategy in uni-directional worker * Better retry in worker. Default * Re-add retries in uni_chan_path worker * Cleanup * Ensure total duration of worker retries does not exceed 2 seconds * Notify supervisor when unichan worker stops after exceeding max retries * Increase max total retry delay in event monitor to 10min * Make event monitor less verbose in log level higher than trace * Fix bug where worker would retry with the next event batch for every retry * Do not stop the worker after exhausting retry attempts * Cleanup * Fix compilation error * Set clear packets again after failing to retry * Update changelog * Add unit test for clamp_delay and fix case where it would overshoot the max delay Co-authored-by: Romain Ruetschi <romain@informal.systems>

adizere added 2 commits May 6, 2021 21:24

Retry strategy in uni-directional worker

f93936e

Better retry in worker. Default

ef5d6a2

adizere requested review from ancazamfir and romac as code owners May 6, 2021 19:26

adizere changed the title ~~Limited retry & backoff mechanism in worker loop~~ Retry & backoff mechanism in worker loop May 6, 2021

adizere marked this pull request as draft May 7, 2021 08:06

Merge branch 'master' into adi/worker-retry

0c7044b

adizere assigned romac May 11, 2021

romac added 4 commits May 12, 2021 10:44

Merge branch 'master' into adi/worker-retry

4036c7d

Re-add retries in uni_chan_path worker

db733fd

Cleanup

384fddc

Ensure total duration of worker retries does not exceed 2 seconds

837a83d

romac added the in progress label May 12, 2021

romac added 9 commits May 12, 2021 19:06

Notify supervisor when unichan worker stops after exceeding max retries

a0d7f28

Merge branch 'master' into adi/worker-retry

b27af19

Increase max total retry delay in event monitor to 10min

8108877

Make event monitor less verbose in log level higher than trace

e074abf

Fix bug where worker would retry with the next event batch for every …

b2a54e1

…retry

Do not stop the worker after exhausting retry attempts

4d1ee76

Cleanup

7fb9cef

Fix compilation error

890cc81

Set clear packets again after failing to retry

c9e70d6

romac marked this pull request as ready for review May 14, 2021 14:05

romac added 2 commits May 14, 2021 16:09

Update changelog

09fca03

Add unit test for clamp_delay and fix case where it would overshoot t…

2914bf5

…he max delay

romac removed the in progress label May 14, 2021

ancazamfir reviewed May 14, 2021

View reviewed changes

ancazamfir approved these changes May 14, 2021

View reviewed changes

romac merged commit d2eaa6a into master May 14, 2021

romac deleted the adi/worker-retry branch May 14, 2021 15:17

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Retry & backoff mechanism in worker loop #913

Retry & backoff mechanism in worker loop #913

adizere commented May 6, 2021 •

edited by romac

Loading

ancazamfir left a comment •

edited

Loading

Retry & backoff mechanism in worker loop #913

Retry & backoff mechanism in worker loop #913

Conversation

adizere commented May 6, 2021 • edited by romac Loading

Description

ancazamfir left a comment • edited Loading

Choose a reason for hiding this comment

adizere commented May 6, 2021 •

edited by romac

Loading

ancazamfir left a comment •

edited

Loading