fix: determine F3 participants relative to current network name #12597

Merged
masih merged 8 commits into master from masih/repeat-f3-tests-on-ci on Oct 17, 2024

Conversation

@masih (Member) commented Oct 14, 2024

Related Issues

Fixes #12519

Proposed Changes

When the manifest changes, newly generated valid leases can, depending on timing, be removed if the sign message loop attempts to sign messages that result from progress on the previous network.

Here is an example scenario in a specific order that was causing itests to fail:

  • participants get a lease for network A up to instance 5
  • network A progresses to instance 6
  • manifest changes the network name to B
  • participants get a new lease for network B up to instance 5
  • sign loop receives a message from network A, instance 6
  • getParticipantsByInstance lazily removes the network B leases, since it
    only checks the instance and instance 6 is past their expiry at instance 5.
  • the node ends up with no participants and gets stuck.
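
The scenario above can be reproduced with a minimal sketch. This is not the code in chain/lf3/participation_lease.go: the `lease` and `buggyLeaser` types, their fields, and the participant ID 1000 are hypothetical, chosen only to show how an instance-only check discards a valid lease.

```go
package main

import "fmt"

// lease is a hypothetical, simplified participation lease: the participant
// may sign messages for the named network up to and including ValidUntil.
type lease struct {
	Network    string
	ValidUntil uint64
}

// buggyLeaser mimics the behavior described above (an assumption, not the
// Lotus implementation): expiry is judged by instance alone, with no regard
// for the network a lease belongs to.
type buggyLeaser struct {
	leases map[uint64]lease // keyed by participant ID
}

// getParticipantsByInstance returns the participants whose lease covers the
// instance and lazily deletes every lease that looks expired.
func (l *buggyLeaser) getParticipantsByInstance(instance uint64) []uint64 {
	var participants []uint64
	for id, ls := range l.leases {
		if instance > ls.ValidUntil {
			delete(l.leases, id) // also removes a valid lease for another network
			continue
		}
		participants = append(participants, id)
	}
	return participants
}

func main() {
	// Participant 1000 already holds a fresh lease for network B up to instance 5.
	l := &buggyLeaser{leases: map[uint64]lease{1000: {Network: "B", ValidUntil: 5}}}

	// A late message from network A at instance 6 reaches the sign loop.
	fmt.Println(len(l.getParticipantsByInstance(6))) // 0: nobody can sign
	fmt.Println(len(l.leases))                       // 0: the network B lease is gone
}
```

After the single lookup for instance 6, the lease that was valid for network B no longer exists, which matches the stuck state in the last bullet.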

To fix this:

  1. check whether the requested participants belong to the current network and, if not, refuse to participate.
  2. check the network name, as well as the instance, when lazily removing expired leases.
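
A minimal sketch of the two checks, under stated assumptions: the `lease` and `leaser` types, the `currentNetwork` field, and the extra `network` parameter are hypothetical and are not taken from the Lotus source. The real signature of `getParticipantsByInstance` may differ.

```go
package main

import "fmt"

// lease is a hypothetical, simplified participation lease.
type lease struct {
	Network    string
	ValidUntil uint64
}

// leaser is an illustrative stand-in for the lease store; currentNetwork
// would come from the active manifest in a real node.
type leaser struct {
	currentNetwork string
	leases         map[uint64]lease // keyed by participant ID
}

// getParticipantsByInstance applies both checks from the list above.
func (l *leaser) getParticipantsByInstance(network string, instance uint64) []uint64 {
	// Check 1: refuse to participate for any network other than the current
	// one, and leave the stored leases untouched.
	if network != l.currentNetwork {
		return nil
	}
	var participants []uint64
	for id, ls := range l.leases {
		// Check 2: a lease is stale if it belongs to another network or if
		// the instance has moved past its validity.
		if ls.Network != l.currentNetwork || instance > ls.ValidUntil {
			delete(l.leases, id)
			continue
		}
		participants = append(participants, id)
	}
	return participants
}

func main() {
	l := &leaser{
		currentNetwork: "B",
		leases: map[uint64]lease{
			1000: {Network: "B", ValidUntil: 5}, // fresh lease for the current network
			2000: {Network: "A", ValidUntil: 5}, // leftover lease from the old network
		},
	}

	// A late message from network A, instance 6: refused, nothing removed.
	fmt.Println(len(l.getParticipantsByInstance("A", 6)), len(l.leases)) // 0 2

	// A message from network B, instance 3: participant 1000 signs and the
	// leftover network A lease is lazily removed.
	fmt.Println(l.getParticipantsByInstance("B", 3), len(l.leases)) // [1000] 1
}
```

Returning early in the first check is what protects the new leases: a request for the old network never reaches the removal loop.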

To aid debugging of failing tests in the future, add an option, disabled by default, to print the progress of all nodes at every eventual assertion.
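
One way such an option could look is sketched below using only the standard library. The `eventually` helper, its `debug` flag, and the `progress` callback are hypothetical names for illustration; the actual change in itests/f3_test.go may be structured differently.

```go
package main

import (
	"fmt"
	"time"
)

// eventually polls cond every tick until it returns true or the timeout
// elapses. When debug is true, it prints the string returned by progress
// before each check; when debug is false, progress is never called.
func eventually(cond func() bool, timeout, tick time.Duration, debug bool, progress func() string) bool {
	deadline := time.Now().Add(timeout)
	for {
		if debug {
			fmt.Println(progress())
		}
		if cond() {
			return true
		}
		if time.Now().After(deadline) {
			return false
		}
		time.Sleep(tick)
	}
}

func main() {
	// A stand-in for a node that advances one instance per check.
	instance := 0
	cond := func() bool { instance++; return instance >= 3 }
	progress := func() string { return fmt.Sprintf("node 0: instance %d", instance) }

	ok := eventually(cond, time.Second, time.Millisecond, true, progress)
	fmt.Println("condition met:", ok)
}
```

Keeping the flag off by default leaves normal test output unchanged, while a failing run can be repeated with it on to see where each node stalled.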

Additional Info

Look closely at the commits in this PR. The commits introduce a dedicated CI job that repeatedly runs the flaky tests (50 times) to assert that they are indeed fixed. The job is then removed in later commits.

These commits are left here for the benefit of the reviewer as proof of the pudding.

@masih masih force-pushed the masih/repeat-f3-tests-on-ci branch 6 times, most recently from f957084 to 610be8c Compare October 15, 2024 14:10
Repeat F3 itests on CI to investigate intermittent failures.
@masih masih force-pushed the masih/repeat-f3-tests-on-ci branch 2 times, most recently from 1a2cd45 to ba33546 Compare October 16, 2024 14:58
masih added 2 commits October 16, 2024 16:51
@masih masih force-pushed the masih/repeat-f3-tests-on-ci branch from ba33546 to 0a15c68 Compare October 16, 2024 15:51
@masih masih self-assigned this Oct 16, 2024
Defaults are based on epoch of 30s and real RTT. Shorten Delta and
rebroadcast times.
@masih masih force-pushed the masih/repeat-f3-tests-on-ci branch from 0a15c68 to 0d3cb66 Compare October 16, 2024 15:59
@masih masih changed the title from "Investigate intermittent F3 itest failures on CI" to "fix: determine F3 participants relative to current network name" Oct 16, 2024
@masih masih marked this pull request as ready for review October 16, 2024 16:15
@masih masih requested review from Stebalien and Kubuxu October 16, 2024 16:15
Review threads (outdated, resolved): one on chain/lf3/participation_lease.go, two on itests/f3_test.go
@masih masih merged commit fda61d3 into master Oct 17, 2024
83 checks passed
@masih masih deleted the masih/repeat-f3-tests-on-ci branch October 17, 2024 14:32
Kubuxu pushed a commit that referenced this pull request Oct 21, 2024
* Investigate intermittent F3 itest failures on CI

Repeat F3 itests on CI to investigate intermittent failures.

* Fix participation lease removal for wrong network

When manifest changes, depending on the timing it is possible for newly
generated valid leases to get removed if the sign message loop attempts
to sign messages that are as a result of progressing previous network.

Here is an example scenario in a specific order that was causing itests
to fail:
 * participants get a lease for network A up to instance 5
 * network A progresses to instance 6
 * manifest changes the network name to B
 * participants get a new lease for network B up to instance 5
 * sign loop receives a message from network A, instance 6
 * `getParticipantsByInstance` lazily removes leases since it only
   checks the instance.
 * the node ends up with no participants, and stuck.

To fix this:
 1) check if participants asked for are within the current network, and
    if not refuse to participate.
 2) check network name, as well as instance, to lazily remove expired
    leases.

* Add debug capability to F3 itests to print current progress

To aid debugging failing tests add option to print progress of all nodes
at every eventual assertion, disabled by default.

* Shorten GPBFT settings for a more responsive timing

Defaults are based on epoch of 30s and real RTT. Shorten Delta and
rebroadcast times.

* Remove F3 itest repetitions on CI now that saul goodman

See proof of the pudding:
 * https://github.com/filecoin-project/lotus/actions/runs/11369403828/job/31626763159?pr=12597

* Update the changelog

* Address review comments

* Remove the sanity check that all nodes use the same initial manifest

Signed-off-by: Jakub Sztandera <oss@kubuxu.com>
rjan90 pushed a commit that referenced this pull request Oct 24, 2024
rjan90 pushed a commit that referenced this pull request Oct 28, 2024
Labels
None yet
Projects
Status: Done
Status: ☑️ Done (Archive)
Development

Successfully merging this pull request may close these issues.

Flaky F3 itests
4 participants