Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rename oni/lotus-soup imports to testplans/lotus-soup and go get lotus@master on trigger #4995

Merged
merged 17 commits into from
Feb 8, 2021

Conversation

nonsense
Copy link
Member

@nonsense nonsense commented Nov 24, 2020

This PR is:

  1. Locking a specific Lotus version in go.mod to be used with the testplans ran on TaaS (currently based on go get lotus@master but could be changed to use the commit from CircleCI as well).
  2. Removing imports from filecoin-project/oni.
  3. Removing go.mod replace to Lotus as in replace github.com/filecoin-project/lotus => ../../ since the parent folder is not included with the testplan, and this doesn't work with the test plan.
  4. Upgrading sectors from 2KiB to 8MiB - computationally sealing is similarly expensive, but this allows for future tests where we cancel transfer and restart nodes/miners in the middle of a transfer.
  5. Upgrade deal size from 1600 bytes to 5MB.

It's worth noting that without making too many changes to the tests from one version of Lotus to another, we are sometimes getting different flakyness / different success rates. These tests were very stable for a few months, then they got broken for a month, and now they seem to pass some of the time. It is worth integrating them and running them soon as part of the Lotus test suite, so that we get them to a stable state, which is not easy since most of the logic really happens behind multiple APIs and is abstracted away from the test itself.

I am not sure how much of this is timing issues due to much lower block delays / propagation delays / etc. vs actual undergoing work on Lotus / go-fil-markets / etc.


PRIOR TO MERGE:

  • remove lock-lotus-version-in-testplan branch from .circleci/config.yaml

TODO:

  • 1. e2e test is still flaky - sometimes clients detect fork (although not all clients detect it):
2020-12-07T13:45:09.016Z        WARN    chain   chain/sync.go:1418      (fork detected) synced header chain ([bafy2bzaceay2gcgqbzqumy7meq3u6yjjghyk4svdt6fetvdgpdkr6inaa7tp2] - 19) does not link to our best block ([bafy2bzacebp3ut76gfxygwqocacgykreiegg5pdqhfxqx6yfl2gsoadxtyjcc] - 1)
2020-12-07T13:45:09.017Z        WARN    chain   chain/sync.go:1418      (fork detected) synced header chain ([bafy2bzacea5dzdbsj3azff7dvdulqffedw5eekk6jbs5wcqrcadbfzhhmqsey] - 19) does not link to our best block ([bafy2bzacebp3ut76gfxygwqocacgykreiegg5pdqhfxqx6yfl2gsoadxtyjcc] - 1)
2020-12-07T13:45:09.021Z        ERROR   chain   chain/sync_manager.go:253       error during sync in [bafy2bzacea5dzdbsj3azff7dvdulqffedw5eekk6jbs5wcqrcadbfzhhmqsey]: collectChain failed: failed to sync fork: synced chain forked at genesis, refusing to sync; incoming: [bafy2bzacedcxdw6n62i32dmjjf6dr4ujzfflzr3zhfmjkxcre44aoj4aonxxa bafy2bzaceadvi5vs5e6pkzeolhwsel6n3mkknxm4ipkiflcfwskjco5sohxyg]
2020-12-07T13:45:09.021Z        ERROR   chain   chain/sync_manager.go:253       error during sync in [bafy2bzaceay2gcgqbzqumy7meq3u6yjjghyk4svdt6fetvdgpdkr6inaa7tp2]: collectChain failed: failed to sync fork: synced chain forked at genesis, refusing to sync; incoming: [bafy2bzacedcxdw6n62i32dmjjf6dr4ujzfflzr3zhfmjkxcre44aoj4aonxxa bafy2bzaceadvi5vs5e6pkzeolhwsel6n3mkknxm4ipkiflcfwskjco5sohxyg]
  • 2. e2e test is still flaky - sometimes miners supposedly fail to lock funds as pledge:

miner

2020-12-07T13:31:27.425Z        WARN    vm      vm/runtime.go:332       Abortf: failed to lock balance: failed to lock client funds: not enough balance to lock for addr t0102: escrow balance 0 < locked 0 + required 2561092000000          2020-12-07T13:31:27.425Z        WARN    vm      vm/runtime.go:145       VM.Call failure: failed to lock balance: failed to lock client funds: not enough balance to lock for addr t0102: escrow balance 0 < locked 0 + required 2561092000000
(RetCode=19):
    github.com/filecoin-project/specs-actors/v2/actors/builtin.RequireNoErr
        /go/pkg/mod/github.com/filecoin-project/specs-actors/v2@v2.3.2/actors/builtin/shared.go:60
2020-12-07T13:31:27.425Z        WARN    vm      vm/vm.go:528    Send actor error        {"from": "t0104", "to": "t05", "nonce": 1, "method": "4", "height": "28", "error": "failed to lock balance: failed to lock client funds: not enough ba
lance to lock for addr t0102: escrow balance 0 < locked 0 + required 2561092000000 (RetCode=19):\n    github.com/filecoin-project/specs-actors/v2/actors/builtin.RequireNoErr\n        /go/pkg/mod/github.com/filecoin-project/specs-actors/v2
@v2.3.2/actors/builtin/shared.go:60"}

client

{"ts":1607347882314916921,"msg":"","group_id":"clients","run_id":"bv72sdd5p7acvjljrr9g","event":{"message_event":{"message":"deal state: StorageDealClientFunding"}}}
2020-12-07T13:31:27.019Z        WARN    chainstore      store/store.go:282      chain head sub exit loop
2020-12-07T13:31:27.019Z        WARN    chainstore      store/store.go:282      chain head sub exit loop
2020-12-07T13:31:28.155Z        ERROR   storagemarket_impl      clientstates/client_states.go:304       deal bafyreif447ug2lstf5abwxzzjnsrh3flmhfcdq2sra3jhnhc36xv5zagie failed: deal failed: (State=26) error calling node: publishing deal: GasEstimateMessageGas error: estimating gas used: message execution failed: exit 19, reason: failed to lock balance: failed to lock client funds: not enough balance to lock for addr t0102: escrow balance 0 < locked 0 + required 2561092000000 (RetCode=19)
{"ts":1607347889019236712,"msg":"","group_id":"clients","run_id":"bv72sdd5p7acvjljrr9g","event":{"message_event":{"message":"got tipset: height 28"}}}
  • 3. e2e test is flaky: sometimes one of the clients blocks in StorageDealAwaitingPreCommit state

  • 4. WindowPoSt deadlines appear to have a one-off error with the configs in init.go. (This is not a problem for any of the existing tests, but worth figuring out and fixing. For the current tests, usually by the time the first WindowPoSt is to be submitted the deals should have been sealed by miners and retrieved from the clients).

@@ -43,5 +43,3 @@ require (
replace github.com/filecoin-project/filecoin-ffi => ../../extern/filecoin-ffi

replace github.com/supranational/blst => ../../extern/blst

replace github.com/filecoin-project/lotus => ../../
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure this plays well with Testground. filecoin-ffi and blst are bundled with the base image (as they don't change often), but we are not including ../../ when we ship the test plan to Testground, so I am a bit puzzled how this compiled and worked over the last few days.

@nonsense nonsense force-pushed the lock-lotus-version-in-testplan branch from b0fdf9a to bd6d814 Compare December 7, 2020 13:06
@nonsense nonsense force-pushed the lock-lotus-version-in-testplan branch 4 times, most recently from 12e9e8a to 301e2d0 Compare December 7, 2020 14:41
@nonsense nonsense changed the title rename oni/lotus-soup imports to testplans/lotus-soup and lock lotus version rename oni/lotus-soup imports to testplans/lotus-soup and go get lotus@master on trigger Dec 7, 2020
@nonsense nonsense marked this pull request as ready for review December 7, 2020 14:52
@nonsense
Copy link
Member Author

nonsense commented Dec 7, 2020

I am marking this as ready for review, because I think it is an improvement to master, even if the deals e2e is still flaky (issues described in the TODO list).

I think that many of the issues described are not directly related to the test plan, and something that might involve a bit more effort to address in the underlying APIs within Lotus.

For reference the lotus-soup/deals e2e test is very similar to the TestDealFlow test in Lotus, but AFAICT TestDealFlow is also not using multiple clients/miners, so might as well not hit the same problems as are visible here.


FROM golang:${GO_VERSION}-buster

RUN apt-get update && apt-get install -y ca-certificates llvm clang mesa-opencl-icd ocl-icd-opencl-dev jq gcc git pkg-config bzr libhwloc-dev

ARG FILECOIN_FFI_COMMIT=1985275547f222e8c97a8ab70b5cc26bc1fa50b1
ARG FILECOIN_FFI_COMMIT=1d9cb3e8ff53f51f9318fc57e5d00bc79bdc0128
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might want to extract this from the Docker image and compile on every test run. filecoin-ffi library is changing much more often than originally anticipated, so the fact that we have to maintain the correct version in the base image we use for the testplan is not ideal.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, agree. In fact, it's already outdated since you submitted the PR and I'm reviewing this. We can rely on docker caching.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can rely on docker caching.

Not sure what you mean with that. Could you elaborate?


I think we should just rely on Lotus and clone Lotus for a given commit. We discarded that option early on, but time and time again Lotus is changing underneath these tests and we have to replicate the Lotus setup - for example we no longer need blst to be included as an extern.


ARG LOTUS_COMMIT=1a170e18a
ARG LOTUS_COMMIT=b13226bc2f8a0233b3beb9bd18ab266547bc29c7
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is necessary for the debug image as we bundle lotus-miner and lotus binaries, so that we can inspect miners and clients while tests are running. This is NOT the version under test, but if it gets old enough, lotus and lotus-miner might stop working (due to CBOR serialisation changes for example).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be worth rescuing the change log from README-old-from-oni.md? To keep tabs on the changes we're making.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be useful to document what this image is for. Maybe add comments to the top of the Dockerfile?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(also to the others)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be worth rescuing the change log from README-old-from-oni.md? To keep tabs on the changes we're making.

I'd rather lock these tests with a specific Lotus commit. The initial thinking that "filecoin-ffi" does not change often is just not correct at this stage, we have updated it tens of times, it should be regarded as any other dependency in the go.mod of Lotus in my opinion.

@@ -0,0 +1,67 @@
[metadata]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am bundling a few more composition files for easier debugging, when we notice that tests are broken / flaky.

@@ -26,7 +26,7 @@
miners = "2"
genesis_timestamp_offset = "0"
balance = "20000000" # These balances will work for maximum 100 nodes, as TotalFilecoin is 2B
sectors = "10"
sectors = "4"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are moving to 8MiB sectors, so no need to preseal 10 sectors when starting the miners, as it takes unnecessary time.

Comment on lines +18 to +22
_ = log.SetLogLevel("*", "DEBUG")
_ = log.SetLogLevel("dht", "WARN")
_ = log.SetLogLevel("swarm2", "WARN")
_ = log.SetLogLevel("addrutil", "WARN")
_ = log.SetLogLevel("stats", "WARN")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've enabled DEBUG logging by default, so that we can monitor TaaS and try to fix flakiness continuously, and also reduced verbosity on some of the louder subsystems.

@nonsense nonsense requested review from raulk and vyzo December 7, 2020 15:03
@nonsense
Copy link
Member Author

nonsense commented Dec 7, 2020

Marking @vyzo and @raulk as they are familiar with these tests.

@nonsense nonsense force-pushed the lock-lotus-version-in-testplan branch from 301e2d0 to 4d69947 Compare December 11, 2020 09:13
Copy link
Member

@raulk raulk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Needs updating FFI and possibly BLST in the images (we really do need to find a better system; as @nonsense mentioned, we expected these modules to be mostly static, but they change more frequently than anticipated).

Also, needs a merge from master and a dependency upgrade, I think.


FROM golang:${GO_VERSION}-buster

RUN apt-get update && apt-get install -y ca-certificates llvm clang mesa-opencl-icd ocl-icd-opencl-dev jq gcc git pkg-config bzr libhwloc-dev

ARG FILECOIN_FFI_COMMIT=1985275547f222e8c97a8ab70b5cc26bc1fa50b1
ARG FILECOIN_FFI_COMMIT=1d9cb3e8ff53f51f9318fc57e5d00bc79bdc0128
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, agree. In fact, it's already outdated since you submitted the PR and I'm reviewing this. We can rely on docker caching.


FROM golang:${GO_VERSION}-buster

RUN apt-get update && apt-get install -y ca-certificates llvm clang mesa-opencl-icd ocl-icd-opencl-dev jq gcc git pkg-config bzr libhwloc-dev

ARG FILECOIN_FFI_COMMIT=1985275547f222e8c97a8ab70b5cc26bc1fa50b1
ARG FILECOIN_FFI_COMMIT=1d9cb3e8ff53f51f9318fc57e5d00bc79bdc0128
ARG FFI_DIR=/extern/filecoin-ffi

ARG BLST_COMMIT=1cbb16ed9580dcd3e9593b71221fcf2a048faaef
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above? (for FFI)


ARG LOTUS_COMMIT=1a170e18a
ARG LOTUS_COMMIT=b13226bc2f8a0233b3beb9bd18ab266547bc29c7
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be worth rescuing the change log from README-old-from-oni.md? To keep tabs on the changes we're making.


ARG LOTUS_COMMIT=1a170e18a
ARG LOTUS_COMMIT=b13226bc2f8a0233b3beb9bd18ab266547bc29c7
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be useful to document what this image is for. Maybe add comments to the top of the Dockerfile?


ARG LOTUS_COMMIT=1a170e18a
ARG LOTUS_COMMIT=b13226bc2f8a0233b3beb9bd18ab266547bc29c7
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(also to the others)

Comment on lines +75 to +78
// give some time to the miner, otherwise, we get errors like:
// deal errored deal failed: (State=26) error calling node: publishing deal: GasEstimateMessageGas
// error: estimating gas used: message execution failed: exit 19, reason: failed to lock balance: failed to lock client funds: not enough balance to lock for addr t0102: escrow balance 0 < locked 0 + required 640297000 (RetCode=19)
time.Sleep(50 * time.Second)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally the miner would be able to publish an event when it's ready to start accepting deals.

.circleci/config.yml Outdated Show resolved Hide resolved
docker build -t "iptestground/oni-runtime:v5" -f "docker-images/Dockerfile.oni-runtime" "docker-images"
docker build -t "iptestground/oni-buildbase:v13-lotus" -f "docker-images/Dockerfile.oni-buildbase" "docker-images"
docker build -t "iptestground/oni-runtime:v7" -f "docker-images/Dockerfile.oni-runtime" "docker-images"
docker build -t "iptestground/oni-runtime:v8-debug" -f "docker-images/Dockerfile.oni-runtime-debug" "docker-images"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to keep this -debug version?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we want to keep it, -debug includes the lotus binary and the lotus-miner binary, so it is useful to have it. Actually at the moment the manifest.toml of the testplan refers to the debug version.

Co-authored-by: Łukasz Magiera <magik6k@users.noreply.github.com>
@magik6k magik6k merged commit 0afe732 into master Feb 8, 2021
@magik6k magik6k deleted the lock-lotus-version-in-testplan branch February 8, 2021 14:39
bibibong pushed a commit to EpiK-Protocol/go-epik that referenced this pull request Feb 22, 2021
…otus-version-in-testplan

rename oni/lotus-soup imports to testplans/lotus-soup and go get lotus@master on trigger
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants