fix: determine F3 participants relative to current network name (#12597)
* Investigate intermittent F3 itest failures on CI

Repeat F3 itests on CI to investigate intermittent failures.

* Fix participation lease removal for wrong network

When the manifest changes, depending on timing, newly generated valid
leases can get removed if the signing loop attempts to sign messages that
result from progress on the previous network.

Here is an example scenario in a specific order that was causing itests
to fail:
 * participants get a lease for network A up to instance 5
 * network A progresses to instance 6
 * manifest changes the network name to B
 * participants get a new lease for network B up to instance 5
 * sign loop receives a message from network A, instance 6
 * `getParticipantsByInstance` lazily removes the new network B leases,
   since it only checks the instance and 6 > 5.
 * the node ends up with no participants and gets stuck.

To fix this:
 1) check whether the requested participants belong to the current
    network, and if not, refuse to participate.
 2) check the network name, as well as the instance, when lazily removing
    expired leases.
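
The two checks above can be sketched in isolation. This is a minimal illustration with simplified, hypothetical types (`lease`, `leaser`, `participants`, and plain string network names), not the actual `chain/lf3` code:

```go
package main

import "fmt"

// lease is a simplified, hypothetical stand-in for a participation lease.
type lease struct {
	Network    string
	ToInstance uint64
}

// leaser tracks one lease per miner ID; current is the current network name.
type leaser struct {
	current string
	leases  map[uint64]lease
}

// participants returns the miner IDs allowed to sign for the given network
// and instance, lazily deleting leases that are stale or expired.
func (l *leaser) participants(network string, instance uint64) []uint64 {
	if network != l.current {
		// Message from another network: refuse, and leave all leases alone.
		return nil
	}
	var ids []uint64
	for id, ls := range l.leases {
		switch {
		case ls.Network != l.current:
			delete(l.leases, id) // lease acquired under a prior manifest
		case instance > ls.ToInstance:
			delete(l.leases, id) // lease expired
		default:
			ids = append(ids, id)
		}
	}
	return ids
}

func main() {
	l := &leaser{current: "B", leases: map[uint64]lease{123: {Network: "B", ToInstance: 5}}}
	// A late message from network A, instance 6, must not evict the B lease.
	fmt.Println(len(l.participants("A", 6)), len(l.leases)) // prints "0 1"
	fmt.Println(l.participants("B", 3))                     // prints "[123]"
}
```

With only the instance check, the first call would have deleted the lease for miner 123, which is the failure described in the scenario.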

* Add debug capability to F3 itests to print current progress

To aid debugging of failing tests, add an option to print the progress of
all nodes at every eventual assertion. It is disabled by default.

* Shorten GPBFT settings for a more responsive timing

The defaults are based on a 30s epoch and real-world RTT. Shorten Delta
and the rebroadcast times for the tests.

* Remove F3 itest repetitions on CI now that it's all good

See proof of the pudding:
 * https://github.com/filecoin-project/lotus/actions/runs/11369403828/job/31626763159?pr=12597

* Update the changelog

* Address review comments

* Remove the sanity check that all nodes use the same initial manifest

Signed-off-by: Jakub Sztandera <oss@kubuxu.com>
masih authored and Kubuxu committed Oct 21, 2024
1 parent 36e6a63 commit ac5c89f
Showing 5 changed files with 92 additions and 110 deletions.
114 changes: 18 additions & 96 deletions CHANGELOG.md
@@ -1,106 +1,28 @@
 # Lotus changelog

-# Node and Miner v1.30.0-rc2 / 2024-10-14
+# UNRELEASED

-This is the second release candidate of the upcoming MANDATORY Lotus v1.30.0 release, which will deliver the Filecoin network version 24, codenamed Tuk Tuk 🛺. This release candidate sets the calibration network upgrade to `epoch 207879`, which corresponds to `2024-10-23T13:30:00Z`. F3 is set to be automatically activated one day later at epoch `2081674`, which corresponds to `2024-10-24T13:30:00Z`.
-
-> [!NOTE]
-> 1. This release candidate does NOT set the mainnet network upgrade epoch. It will be added in the final release (expected October 30th).
-
-- You can follow this release issue for keeping up with the release dates, epochs, and updates: https://github.com/filecoin-project/lotus/issues/12480
-
-## ☢️ Upgrade Warnings ☢️
-
-- If you are running the v1.28.x version of Lotus, please go through the Upgrade Warnings section for the v1.28.* releases and v1.29.*, before upgrading to this RC.
-- This release requires a minimum Go version of v1.22.7 or higher.
-- The `releases` branch has been deprecated with the 202408 split of 'Lotus Node' and 'Lotus Miner'. See https://github.com/filecoin-project/lotus/blob/master/LOTUS_RELEASE_FLOW.md#why-is-the-releases-branch-deprecated-and-what-are-alternatives for more info and alternatives for getting the latest release for both the 'Lotus Node' and 'Lotus Miner' based on the Branch and Tag Strategy.
-- To get the latest Lotus Node tag: git tag -l 'v*' | sort -V -r | head -n 1
-- To get the latest Lotus Miner tag: git tag -l 'miner/v*' | sort -V -r | head -n 1
-
-## 🏛️ Filecoin network version 24 FIPs
-
-- [FIP-0081: Introduce lower bound for sector initial pledge](https://github.com/filecoin-project/FIPs/blob/master/FIPS/fip-0081.md)
-- [FIP-0086: Fast Finality in Filecoin (F3)](https://github.com/filecoin-project/FIPs/blob/master/FIPS/fip-0086.md)
-- [FIP-0094: Add Support for EIP-5656 (MCOPY Opcode) in the FEVM](https://github.com/filecoin-project/FIPs/blob/master/FIPS/fip-0094.md)
-- [FIP-0095: Add FEVM precompile to fetch beacon digest from chain history](https://github.com/filecoin-project/FIPs/blob/master/FIPS/fip-0095.md)
-
-## 📦 v15 Builtin Actor Bundle
-
-This release candidate uses the [v15.0.0-rc1](https://github.com/filecoin-project/builtin-actors/releases/tag/v15.0.0-rc1)
-
-## 🚚 Migration
-
-All node operators, including storage providers, should be aware that ONE pre-migration is being scheduled 120 epochs before the network upgrade. The migration for the NV24 upgrade is expected to be light with no heavy pre-migrations:
-
-- Pre-Migration is expected to take less then 1 minute.
-- The migration on the upgrade epoch is expected to take less than 30 seconds on a node with a NVMe-drive and a newer CPU. For nodes running on slower disks/CPU, it is still expected to take less then 1 minute.
-- RAM usages is expected to be under 20GiB RAM for both the pre-migration and migration.
-
-We recommend node operators (who haven't enabled splitstore discard mode) that do not care about historical chain states, to prune the chain blockstore by syncing from a snapshot 1-2 days before the upgrade.
-
-For certain node operators, such as full archival nodes or systems that need to keep large amounts of state (RPC providers), we recommend skipping the pre-migration and run the non-cached migration (i.e., just running the migration at the network upgrade epoch), and schedule for some additional downtime. Operators of such nodes can read the [How to disable premigration in network upgrade tutorial](https://lotus.filecoin.io/kb/disable-premigration/).
-
-## 📝 Changelog
+## New features
+- Update `EthGetBlockByNumber` to return a pointer to ethtypes.EthBlock or nil for null rounds. ([filecoin-project/lotus#12529](https://github.com/filecoin-project/lotus/pull/12529))
+- Reduce size of embedded genesis CAR files by removing WASM actor blocks and compressing with zstd. This reduces the `lotus` binary size by approximately 10 MiB. ([filecoin-project/lotus#12439](https://github.com/filecoin-project/lotus/pull/12439))
+- Add ChainSafe operated Calibration archival node to the bootstrap list ([filecoin-project/lotus#12517](https://github.com/filecoin-project/lotus/pull/12517))
+- Fix hotloop in F3 pariticpation API ([filecoin-project/lotus#12575](https://github.com/filecoin-project/lotus/pull/12575))
+- `lotus chain head` now supports a `--height` flag to print just the epoch number of the current chain head ([filecoin-project/lotus#12609](https://github.com/filecoin-project/lotus/pull/12609))
+- `lotus-shed indexes inspect-indexes` now performs a comprehensive comparison of the event index data for each message by comparing the AMT root CID from the message receipt with the root of a reconstructed AMT. Previously `inspect-indexes` simply compared event counts, comparing AMT roots confirms all the event data is byte-perfect. ([filecoin-project/lotus#12570](https://github.com/filecoin-project/lotus/pull/12570))

-For the set of changes since the last stable release:
+## Bug Fixes
+- Fix a bug in the `lotus-shed indexes backfill-events` command that may result in either duplicate events being backfilled where there are existing events (such an operation *should* be idempotent) or events erroneously having duplicate `logIndex` values when queried via ETH APIs. ([filecoin-project/lotus#12567](https://github.com/filecoin-project/lotus/pull/12567))
+- Event APIs (Eth events and actor events) should only return reverted events if client queries by specific block hash / tipset. Eth and actor event subscription APIs should always return reverted events to enable accurate observation of real-time changes. ([filecoin-project/lotus#12585](https://github.com/filecoin-project/lotus/pull/12585))
+- Add logic to check if the miner's owner address is delegated (f4 address). If it is delegated, the `lotus-shed sectors termination-estimate` command now sends the termination state call using the worker ID. This fix resolves the issue where termination-estimate did not function correctly for miners with delegated owner addresses. ([filecoin-project/lotus#12569](https://github.com/filecoin-project/lotus/pull/12569))
+- Fix a bug in F3 participation API where valid leases may get removed due to dynamic manifest update. ([filecoin-project/lotus#12597](https://github.com/filecoin-project/lotus/pull/12597))

-* Node: https://github.com/filecoin-project/lotus/compare/v1.29.2...v1.30.0-rc2
-* Miner: https://github.com/filecoin-project/lotus/compare/v1.28.3...miner/v1.30.0-rc2
+## Deps

-## 👨‍👩‍👧‍👦 Contributors
+# UNRELEASED Node v1.30.0
+See https://github.com/filecoin-project/lotus/blob/release/v1.30.0/CHANGELOG.md

-| Contributor | Commits | Lines ± | Files Changed |
-|-------------|---------|---------|---------------|
-| Krishang | 2 | +34106/-0 | 109 |
-| Rod Vagg | 86 | +10643/-8291 | 456 |
-| Masih H. Derkani | 59 | +7700/-4725 | 298 |
-| Steven Allen | 55 | +6113/-3169 | 272 |
-| kamuik16 | 7 | +4618/-1333 | 285 |
-| Jakub Sztandera | 10 | +3995/-1226 | 94 |
-| Peter Rabbitson | 26 | +2313/-2718 | 275 |
-| Viraj Bhartiya | 5 | +2624/-580 | 50 |
-| Phi | 7 | +1337/-1519 | 257 |
-| Mikers | 1 | +1274/-455 | 23 |
-| Phi-rjan | 29 | +736/-600 | 92 |
-| Andrew Jackson (Ajax) | 3 | +732/-504 | 75 |
-| LexLuthr | 3 | +167/-996 | 8 |
-| Aarsh Shah | 12 | +909/-177 | 47 |
-| web3-bot | 40 | +445/-550 | 68 |
-| Piotr Galar | 6 | +622/-372 | 15 |
-| aarshkshah1992 | 18 | +544/-299 | 40 |
-| Steve Loeppky | 14 | +401/-196 | 22 |
-| Frrist | 1 | +403/-22 | 5 |
-| Łukasz Magiera | 4 | +266/-27 | 13 |
-| winniehere | 1 | +146/-144 | 3 |
-| Jon | 1 | +209/-41 | 4 |
-| Aryan Tikarya | 2 | +183/-8 | 7 |
-| adlrocha | 2 | +123/-38 | 21 |
-| dependabot[bot] | 11 | +87/-61 | 22 |
-| Jiaying Wang | 8 | +61/-70 | 12 |
-| Ian Davis | 2 | +60/-38 | 5 |
-| Aayush Rajasekaran | 2 | +81/-3 | 3 |
-| hanabi1224 | 4 | +46/-4 | 5 |
-| Laurent Senta | 1 | +44/-1 | 2 |
-| jennijuju | 6 | +21/-20 | 17 |
-| parthshah1 | 1 | +23/-13 | 1 |
-| Brendan O'Brien | 1 | +25/-10 | 2 |
-| Jennifer Wang | 4 | +24/-8 | 6 |
-| Matthew Rothenberg | 3 | +10/-18 | 6 |
-| riskrose | 1 | +8/-8 | 7 |
-| linghuying | 1 | +5/-5 | 5 |
-| fsgerse | 2 | +3/-7 | 3 |
-| PolyMa | 1 | +5/-5 | 5 |
-| zhangguanzhang | 1 | +3/-3 | 2 |
-| luozexuan | 1 | +3/-3 | 3 |
-| Po-Chun Chang | 1 | +6/-0 | 2 |
-| Kevin Martin | 1 | +4/-1 | 2 |
-| simlecode | 1 | +2/-2 | 2 |
-| ZenGround0 | 1 | +2/-2 | 2 |
-| GFZRZK | 1 | +2/-1 | 1 |
-| DemoYeti | 1 | +2/-1 | 1 |
-| qwdsds | 1 | +1/-1 | 1 |
-| Samuel Arogbonlo | 1 | +2/-0 | 2 |
-| Elias Rad | 1 | +1/-1 | 1 |
+# UNRELEASED Miner v1.30.0
+See https://github.com/filecoin-project/lotus/blob/release/miner/v1.30.0/CHANGELOG.md

 # Node v1.29.2 / 2024-10-03

2 changes: 1 addition & 1 deletion chain/lf3/f3.go
@@ -155,7 +155,7 @@ func (fff *F3) runSigningLoop(ctx context.Context) {
 		clear(alreadyParticipated)
 	}

-	participants := fff.leaser.getParticipantsByInstance(mb.Payload.Instance)
+	participants := fff.leaser.getParticipantsByInstance(mb.NetworkName, mb.Payload.Instance)
 	for _, id := range participants {
 		if _, ok := alreadyParticipated[id]; ok {
 			continue
16 changes: 13 additions & 3 deletions chain/lf3/participation_lease.go
@@ -112,15 +112,25 @@ func (l *leaser) participate(ticket api.F3ParticipationTicket) (api.F3Participat
 	return newLease, nil
 }

-func (l *leaser) getParticipantsByInstance(instance uint64) []uint64 {
+func (l *leaser) getParticipantsByInstance(network gpbft.NetworkName, instance uint64) []uint64 {
 	l.mutex.Lock()
 	defer l.mutex.Unlock()
+	currentManifest, _ := l.status()
+	currentNetwork := currentManifest.NetworkName
+	if currentNetwork != network {
+		return nil
+	}
 	var participants []uint64
 	for id, lease := range l.leases {
-		if instance > lease.ToInstance() {
+		if currentNetwork != lease.Network {
+			// Lazily delete any lease that does not belong to network, likely acquired from
+			// prior manifests.
+			delete(l.leases, id)
+			log.Warnf("lost F3 participation lease for miner %d at instance %d due to network mismatch: %s != %s", id, instance, currentNetwork, lease.Network)
+		} else if instance > lease.ToInstance() {
 			// Lazily delete the expired leases.
 			delete(l.leases, id)
-			log.Warnf("lost F3 participation lease for miner %d", id)
+			log.Warnf("lost F3 participation lease for miner %d due to instance (%d) > lease to instance (%d)", id, instance, lease.ToInstance())
 		} else {
 			participants = append(participants, id)
 		}
6 changes: 3 additions & 3 deletions chain/lf3/participation_lease_test.go
@@ -42,18 +42,18 @@ func TestLeaser(t *testing.T) {
 		require.NoError(t, err)

 		// Both participants should still be valid.
-		participants := subject.getParticipantsByInstance(11)
+		participants := subject.getParticipantsByInstance(testManifest.NetworkName, 11)
 		require.Len(t, participants, 2)
 		require.Contains(t, participants, uint64(123))
 		require.Contains(t, participants, uint64(456))

 		// After instance 16, only participant 456 should be valid.
-		participants = subject.getParticipantsByInstance(16)
+		participants = subject.getParticipantsByInstance(testManifest.NetworkName, 16)
 		require.Len(t, participants, 1)
 		require.Contains(t, participants, uint64(456))

 		// After instance 17, no participant must have a lease.
-		participants = subject.getParticipantsByInstance(17)
+		participants = subject.getParticipantsByInstance(testManifest.NetworkName, 17)
 		require.Empty(t, participants)
 	})
 	t.Run("expired ticket", func(t *testing.T) {
64 changes: 57 additions & 7 deletions itests/f3_test.go
@@ -2,6 +2,7 @@ package itests

 import (
 	"context"
+	"sync"
 	"testing"
 	"time"

@@ -36,6 +37,7 @@ type testEnv struct {
 	m *manifest.Manifest
 	t *testing.T
 	testCtx context.Context
+	debug bool
 }

 // Test that checks that F3 is enabled successfully,
@@ -194,6 +196,24 @@ func (e *testEnv) waitFor(f func(n *kit.TestFullNode) bool, timeout time.Duratio
 	e.t.Helper()
 	require.Eventually(e.t, func() bool {
 		e.t.Helper()
+		defer func() {
+			if e.debug {
+				var wg sync.WaitGroup
+				printProgress := func(index int, n *kit.TestFullNode) {
+					defer wg.Done()
+					if progress, err := n.F3GetProgress(e.testCtx); err != nil {
+						e.t.Logf("Node #%d progress: err: %v", index, err)
+					} else {
+						e.t.Logf("Node #%d progress: %v", index, progress)
+					}
+				}
+				for i, n := range e.minerFullNodes {
+					wg.Add(1)
+					go printProgress(i, n)
+				}
+				wg.Wait()
+			}
+		}()
 		for _, n := range e.minerFullNodes {
 			if !f(n) {
 				return false
@@ -209,8 +229,42 @@ func (e *testEnv) waitFor(f func(n *kit.TestFullNode) bool, timeout time.Duratio
 // and the second full-node is an observer that is not directly connected to
 // a miner. The last return value is the manifest sender for the network.
 func setup(t *testing.T, blocktime time.Duration) *testEnv {
-	manif := lf3.NewManifest(BaseNetworkName+"/1", DefaultFinality, DefaultBootstrapEpoch, blocktime, cid.Undef)
-	return setupWithStaticManifest(t, manif, false)
+	return setupWithStaticManifest(t, newTestManifest(blocktime), false)
 }

+func newTestManifest(blocktime time.Duration) *manifest.Manifest {
+	return &manifest.Manifest{
+		ProtocolVersion: manifest.VersionCapability,
+		BootstrapEpoch: DefaultBootstrapEpoch,
+		NetworkName: BaseNetworkName + "/1",
+		InitialPowerTable: cid.Undef,
+		CommitteeLookback: manifest.DefaultCommitteeLookback,
+		CatchUpAlignment: blocktime / 2,
+		Gpbft: manifest.GpbftConfig{
+			// Use smaller time intervals for more responsive test progress/assertion.
+			Delta: 250 * time.Millisecond,
+			DeltaBackOffExponent: 1.3,
+			MaxLookaheadRounds: 5,
+			RebroadcastBackoffBase: 500 * time.Millisecond,
+			RebroadcastBackoffSpread: 0.1,
+			RebroadcastBackoffExponent: 1.3,
+			RebroadcastBackoffMax: 1 * time.Second,
+		},
+		EC: manifest.EcConfig{
+			Period: blocktime,
+			Finality: DefaultFinality,
+			DelayMultiplier: manifest.DefaultEcConfig.DelayMultiplier,
+			BaseDecisionBackoffTable: manifest.DefaultEcConfig.BaseDecisionBackoffTable,
+			HeadLookback: 0,
+			Finalize: true,
+		},
+		CertificateExchange: manifest.CxConfig{
+			ClientRequestTimeout: manifest.DefaultCxConfig.ClientRequestTimeout,
+			ServerRequestTimeout: manifest.DefaultCxConfig.ServerRequestTimeout,
+			MinimumPollInterval: blocktime,
+			MaximumPollInterval: 4 * blocktime,
+		},
+	}
+}
+
 func setupWithStaticManifest(t *testing.T, manif *manifest.Manifest, testBootstrap bool) *testEnv {
@@ -262,10 +316,7 @@ func setupWithStaticManifest(t *testing.T, manif *manifest.Manifest, testBootstr
 		cancel()
 	}

-	m, err := n1.F3GetManifest(ctx)
-	require.NoError(t, err)
-
-	e := &testEnv{m: m, t: t, testCtx: context.Background()}
+	e := &testEnv{m: manif, t: t, testCtx: ctx}
 	// in case we want to use more full-nodes in the future
 	e.minerFullNodes = []*kit.TestFullNode{&n1, &n2, &n3}

@@ -275,7 +326,6 @@ func setupWithStaticManifest(t *testing.T, manif *manifest.Manifest, testBootstr
 		err = n.NetConnect(ctx, e.ms.PeerInfo())
 		require.NoError(t, err)
 	}
-
 	errgrp.Go(func() error {
 		defer func() {
 			require.NoError(t, manifestServerHost.Close())
