Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CI][Testing] Add BuildJet CI runners for unit tests, integration tests; cache Docker images #4906

Merged
merged 89 commits into from
Nov 10, 2023

Conversation

gomisha
Copy link
Contributor

@gomisha gomisha commented Oct 30, 2023

This PR encompasses many features / improvements to CI and flaky tests in order to

  • reduce CI run times (it now takes about 15 minutes for a clean CI run)
  • reduce test flakiness in CI so that it's less necessary to re-run CI jobs in order to merge PRs

Major Features

  • added managed CI runners from BuildJet to allow running tests on larger hardware; granularity is at the job level so different jobs have the choice of running different custom runners or default runners
  • caching Docker images used for integration tests in new CI job (Docker Build) on a powerful BuildJet runner and all the integration test jobs load the Docker images from cache - this eliminates the need for each integration test job to build Docker images; this job takes about 5 minutes to build and cache all Docker images

Custom CI Runners

  • replaced default GitHub runners with custom BuildJet runners for 13 jobs (6 unit tests, 2 module unit tests, Docker Build job, 4 integration tests)

Unit Tests Improvements

  • added ability to specify non-default CI runners to any unit test job package or sub-package with colon notation (e.g. engine/execution:buildjet-4vcpu-ubuntu-2204, network/test:buildjet-16vcpu-ubuntu-2204)
  • added ability to specify sub-sub packages (any nesting level) of a parent package to be tested in a separate job, while still being able to test all the remaining packages of the parent package in a separate job
    • engine job and all it's sub-packages are good examples of how this works:
    • engine/access engine/collection engine/common engine/consensus engine/execution/ingestion:buildjet-8vcpu-ubuntu-2204 engine/execution/computation engine/execution engine/verification engine:buildjet-4vcpu-ubuntu-2204
  • extraced state and storage packages out of others into separate jobs since others job was becoming too big

Unit Tests Splits (28 CI jobs)

  • split up engine unit tests into multiple smaller jobs (some with custom runners) because the single engine job was very large and flaky:
    • engine/access engine/collection engine/common engine/consensus engine/execution/ingestion:buildjet-8vcpu-ubuntu-2204 engine/execution/computation engine/execution engine/verification engine:buildjet-4vcpu-ubuntu-2204
  • network tests further split out into network/alsp network/test/cohort1:buildjet-16vcpu-ubuntu-2204 network/test/cohort2:buildjet-4vcpu-ubuntu-2204 network/p2p/connection network/p2p/p2pnode:buildjet-4vcpu-ubuntu-2204 network/p2p/scoring network/p2p network
  • module job split up into module/dkg, module

Integration Tests (15 CI jobs)

  • updated integration test jobs to allow specifying custom runner
  • split up access integration tests (integration/tests/access/access_api_test.go) into smaller tests because the job was very large and flaky

Quarantined Flaky Tests

The following tests were quarantined because they kept failing / being flaky even with increased CI runners:

Unquarantined Tests

The following tests were removed from quarantine because now we can run them on larger runners and they don't fail in CI. They were running without problems locally but used to fail in CI:

  • TestMeshNetTestSuite (network/test/cohort1/meshengine_test.go)
  • TestEchoEngineTestSuite (network/test/cohort2/echoengine_test.go)
  • network/p2p/p2pnode/resourceManager_test.go increased to full load test - previously this test had to operate at about 10% load capacity in order to pass in CI but now it's running on a larger runner and can run at full load

Other

  • max retries for unit and integration tests were increased from 3 to 5 and the timeout was increased from 25 to 35 minutes to allow more attempts for CI to run before having to manually re-try
  • removed localnet-test CI job as it's not relevant anymore and it was taking a long time to run

ref: https://github.com/dapperlabs/flow-go/issues/6894

@codecov-commenter
Copy link

codecov-commenter commented Nov 1, 2023

Codecov Report

Attention: 1 lines in your changes are missing coverage. Please review.

Comparison is base (6f601e4) 55.92% compared to head (d9d1b42) 56.11%.
Report is 42 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4906      +/-   ##
==========================================
+ Coverage   55.92%   56.11%   +0.19%     
==========================================
  Files         963      965       +2     
  Lines       89617    89689      +72     
==========================================
+ Hits        50115    50331     +216     
+ Misses      35752    35606     -146     
- Partials     3750     3752       +2     
Flag Coverage Δ
unittests 56.11% <96.66%> (+0.19%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
network/test/cohort2/echoengine.go 79.71% <ø> (ø)
utils/test_matrix/test_matrix.go 64.70% <96.66%> (+7.56%) ⬆️

... and 21 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@gomisha gomisha marked this pull request as ready for review November 9, 2023 17:07
Copy link
Contributor

@sideninja sideninja left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome

@sideninja
Copy link
Contributor

Do we know what is the new runtime approx with this runners?

@gomisha
Copy link
Contributor Author

gomisha commented Nov 9, 2023

Do we know what is the new runtime approx with this runners?

Yes - usually about 15 minutes but sometimes takes longer (if needed to retry some jobs) up to about 25 minutes.

Comment on lines 39 to 40
unittest.SkipUnless(s.T(), unittest.TEST_TODO, "flaky")
s.RunTestEpochJoinAndLeave(flow.RoleVerification, s.AssertNetworkHealthyAfterVNChange)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I saw @jordanschalm fixing this test in another PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I will remove this from quarantine once his PR is merged.

#4975

Copy link
Member

@jordanschalm jordanschalm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, great work Misha.

@@ -72,8 +80,10 @@ func generateTestMatrix(targetPackages map[string][]string, otherPackages []stri

// listTargetPackages returns a map-list of target packages to run as separate CI jobs, based on a list of target package prefixes.
// It also returns a list of the "seen" packages that can then be used to extract the remaining packages to run (in a separate CI job).
func listTargetPackages(targetPackagePrefixes []string, allFlowPackages []string) (map[string][]string, map[string]string) {
func listTargetPackages(targetPackagePrefixes []string, allFlowPackages []string) (targets, map[string]string) {
//func listTargetPackages(targetPackagePrefixes []string, allFlowPackages []string) (map[string][]string, map[string]string) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
//func listTargetPackages(targetPackagePrefixes []string, allFlowPackages []string) (map[string][]string, map[string]string) {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -34,5 +36,6 @@ func (s *EpochJoinAndLeaveVNSuite) SetupTest() {
// TestEpochJoinAndLeaveVN should update verification nodes and assert healthy network conditions
// after the epoch transition completes. See health check function for details.
func (s *EpochJoinAndLeaveVNSuite) TestEpochJoinAndLeaveVN() {
unittest.SkipUnless(s.T(), unittest.TEST_TODO, "flaky")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
unittest.SkipUnless(s.T(), unittest.TEST_TODO, "flaky")

can remove after merging with master :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unquarantined! Thanks for the fix!

b60901b

Comment on lines 8 to 11
# push:
# paths:
# - '.github/workflows/flaky-test-monitor.yml'
# - '.github/workflows/ci.yml'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be un-commented or removed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Flaky Test Monitor CI workflow needs to have a major overhaul now that the main CI workflow uses custom runners and Docker caching. I will address that in a separate PR.

6761410

@gomisha gomisha added this pull request to the merge queue Nov 10, 2023
Merged via the queue into master with commit 72e9efd Nov 10, 2023
54 checks passed
@gomisha gomisha deleted the misha/6894-buildjet-ci-test branch November 10, 2023 14:53
peterargue pushed a commit that referenced this pull request Nov 14, 2023
[CI][Testing] Add BuildJet CI runners for unit tests, integration tests; cache Docker images
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants