Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test with latest actors version #6998

Merged
merged 2 commits into from
Aug 11, 2021
Merged

Test with latest actors version #6998

merged 2 commits into from
Aug 11, 2021

Conversation

Stebalien
Copy link
Member

@Stebalien Stebalien commented Aug 6, 2021

We were already trying to do this, but had some issues:

  1. We were forced to run all upgrades early, in-sequence.
    1. This actually has some issues as can be seen in Deadlines can get messed up around the start of a new network specs-actors#1453.
    2. We had to carefully wait for upgrades to finish before moving on.
    3. This slows down the tests.
  2. We weren't consistently testing with the latest version.
    1. Lots of tests would just run the default upgrade schedule, and would sometimes fail if the upgrade did something unexpected.
    2. We weren't always testing the latest code.

This patch:

  1. Sets the genesis version, where applicable.
  2. Defaults to an empty upgrade schedule with genesis at the latest network version.
  3. Makes some of the tests more reliable (e.g., wait a full proving period instead of the remainder of the proving period).
  4. Fixes genesis at actors v2.

TODO figure out why it's failing:

  1. I'm seeing errors about missing t_aux files in tests that try to use the actual prover. I have no idea how these tests worked before.
  2. When I remove the check for t_aux files (do we need them with fauxseals?), post still doesn't work. I'm seeing an error about trying to prove deadline 0 when we're already in deadline 1, which is suspiciously like Deadlines can get messed up around the start of a new network specs-actors#1453.

Answer: in case 2, my machine was just too slow. CI and my new machine are faster.

@Stebalien Stebalien requested a review from ZenGround0 August 6, 2021 21:47
@Stebalien Stebalien force-pushed the feat/latest-genesis branch from 713afa3 to 337c65c Compare August 9, 2021 22:27
@Stebalien
Copy link
Member Author

Rebasing on #7011. The tests now pass locally when doing this. I believe this was a timing issue (I'm now using a faster computer).

However, this has me a bit concerned (for the tests, not the network). Now that we're starting from a later actors version, we may have to prove pre-seal sectors immediately where before the proving epoch would start later. In theory, we could start in the middle of of an active proving window. I'm pretty sure that's what was happening, actually...

@Stebalien Stebalien marked this pull request as ready for review August 9, 2021 23:12
@Stebalien Stebalien requested a review from a team as a code owner August 9, 2021 23:12
@Stebalien Stebalien marked this pull request as draft August 9, 2021 23:12
@Stebalien
Copy link
Member Author

Given that the tests pass reliably in CI, I think we're probably fine.

@Stebalien
Copy link
Member Author

Hm. So I still see this issue if I enable TestDealWithMarketAndMinerNode (at least locally). Now, this test was disabled for a reason, but this is still interesting.

@Stebalien Stebalien force-pushed the feat/latest-genesis branch from 337c65c to faf893e Compare August 11, 2021 18:29
@Stebalien Stebalien marked this pull request as ready for review August 11, 2021 18:29
@Stebalien Stebalien removed the request for review from ZenGround0 August 11, 2021 18:29
Properly handle genesis in actors tests. Fast-forward upgrading to
actors v13 doesn't work because there needs to be at least a day between
v0 genesis and v13 (due to differences in miner cron).
@Stebalien Stebalien force-pushed the feat/latest-genesis branch from faf893e to dcff06b Compare August 11, 2021 20:30
@Stebalien Stebalien requested a review from ZenGround0 August 11, 2021 20:57
5000, // after
-1, // before
162, // while sealing
560, // after upgrade deal
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was happening a bit too early and causing a test failure.

-1, // before
162, // while sealing
530, // after upgrade deal
5000, // after
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There wasn't really a point to this. It was just testing with the previous actors version.

Copy link
Contributor

@ZenGround0 ZenGround0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -584,14 +587,23 @@ func currentEpochBlockReward(ctx context.Context, vm *vm.VM, maddr address.Addre
}

// TODO: This hack should move to reward actor wrapper
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this TODO still represent our current plans? If so I can look into scoping this work on the actors side. However it seems like our idea of this has changed and this is no longer considered a hack. If so let's remove the TODO.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally this would move into the reward actor wrapper. That is, the shim in chain/actors/builtin/reward. We have some abstractions there around message sending.

@@ -208,11 +216,15 @@ func (n *Ensemble) Miner(miner *TestMiner, full *TestFullNode, opts ...NodeOpt)
genm *genesis.Miner
)

// Default 2KiB sector for the network version
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: seems like this should be an ensemble option so the default can be overridden, though maybe this should be left to whoever first wants to use non-2KiB sectors in an itest

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'm going to leave it to them.

}

var DefaultEnsembleOpts = ensembleOpts{
pastOffset: 10000000 * time.Second, // time sufficiently in the past to trigger catch-up mining.
upgradeSchedule: stmgr.UpgradeSchedule{{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: the fact that this is so nicely expressed as GenesisNetworkVersion(build.NewestNetworkVersion) makes me wonder if we'd be slightly better off with a DefaultEnsembleOpts of type []EnsembleOpts applied to a zero valued ensembleOpts. The name ensembleOpts could be changed to something different, ensembleCfg?, to make this a little less confusing.

Not very important just riffing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. I'd much rather have a set of default options (also avoids cases where the default "config" gets modified accidentally). But that'll be a future cleanup.

@@ -63,7 +65,7 @@ func testWindowPostUpgrade(t *testing.T, blocktime time.Duration, nSectors int,
require.NoError(t, err)

t.Log("Running one proving period")
waitUntil := di.PeriodStart + di.WPoStProvingPeriod + 2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this what fixed the "proving deadline 0 in deadline 1" issue?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, that issue appeared to be "my old laptop was too slow to compute 2k sector proofs and seal 2k sectors at the same time".

Here, I just needed to wait a full proving period instead of waiting until our next proving period started. It likely worked before because the sectors happened to align on the right proving periods for this to "just work". But the new code is "more correct".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can see that issue in the "+20". Termination happens at the end of the deadline always. If we need "slack", something is wrong.

@Stebalien Stebalien merged commit af9a71a into master Aug 11, 2021
@Stebalien Stebalien deleted the feat/latest-genesis branch August 11, 2021 21:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants