feat(actions)!: add full sync test #3582

gustavovalverde · 2022-02-18T10:10:38Z

Motivation

Most of the motivation for this PR is described in #1592

Specifications

We need to run a full sync test on every PR before merging to main.
Preferably, this test should run at least once before the merge process starts.
Several test runs are needed to find the best balance between performance and cost in GCP.

Designs

Use a custom entrypoint.sh to work around some limitations of the GCP docker parser

Solution

Execute the full sync test using an environment variable
WIP

Review

Reviewer Checklist

Code implements Specs and Designs
Tests for Expected Behaviour
Tests for Errors

Follow Up Work

error: Found argument '--nocapture' which wasn't expected, or isn't valid in this context

teor2345 · 2022-02-22T02:19:11Z

We got to 87%, then the job timed out at 6 hours:

Feb 22 02:03:48.658 INFO net="Main": zebrad::commands::start: estimated progress to chain tip sync_percent=87.748 % current_height=Height(1382111) remaining_sync_blocks=192980 time_since_last_state_block=PT0S

https://github.com/ZcashFoundation/zebra/runs/5279457879?check_suite_focus=true#step:7:8774
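
As a sanity check on the log line above, the reported sync_percent appears to be the share of blocks already synced: current_height / (current_height + remaining_sync_blocks). That formula is an assumption inferred from the numbers in this one log line, not taken from zebrad's source. A minimal Python sketch, with hypothetical helper names:

```python
import re

# The progress line quoted above, copied verbatim.
LOG_LINE = (
    'Feb 22 02:03:48.658 INFO net="Main": zebrad::commands::start: '
    "estimated progress to chain tip sync_percent=87.748 % "
    "current_height=Height(1382111) remaining_sync_blocks=192980 "
    "time_since_last_state_block=PT0S"
)


def parse_progress(line):
    """Return (current_height, remaining_blocks, reported_percent) from a progress line."""
    height = int(re.search(r"current_height=Height\((\d+)\)", line).group(1))
    remaining = int(re.search(r"remaining_sync_blocks=(\d+)", line).group(1))
    reported = float(re.search(r"sync_percent=([\d.]+)", line).group(1))
    return height, remaining, reported


def percent_from_heights(height, remaining):
    # Assumption: progress is the fraction of blocks already synced.
    return 100.0 * height / (height + remaining)


height, remaining, reported = parse_progress(LOG_LINE)
computed = percent_from_heights(height, remaining)
print(f"reported={reported:.3f}% computed={computed:.3f}%")
```

The two values agree to three decimal places for this line, so the remaining 192980 blocks are roughly the last 12% of the chain by block count.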

gustavovalverde · 2022-02-22T02:22:40Z

Yeah, after 50% it started to get slow. I'll try a bigger machine to see if there's an impact there. Any other recommendations (including changes in the configuration file) are welcome @teor2345

gustavovalverde · 2022-02-22T02:29:18Z

Still running here: https://cloudlogging.app.goo.gl/RxUMH5twUULrijba9

teor2345 · 2022-02-22T02:46:06Z

> Yeah, after 50% it started to get slow. I'll try a bigger machine to see if there's an impact there. Any other recommendations (including changes in the configuration file) are welcome @teor2345

I'd suggest:

try the biggest machine possible, just for now
set checkpoint_sync = true
update the mainnet and pre-NU5 testnet checkpoint lists (part of Update Zebra checkpoint lists & mandatory checkpoint before NU5 activation #2368)
Fix "synced block height too far ahead of the tip: dropped downloaded block" #3603, if we're seeing lots of those logs
work out if the test is slowed down by network, CPU, RAM, or disk limits

teor2345 · 2022-02-22T02:46:42Z

I can do the checkpoint lists now, they're the quick part of that ticket.

gustavovalverde · 2022-02-22T02:54:03Z

Could it be something related to the amount of data (blocks) we're getting to be processed? At the beginning, network and disk usage had a big spike, which ended a bit after the sync call (just where the red rectangle ends). That makes me think the disks aren't what's halting the process, as they handled the spikes nicely.

I'll increase the machine type anyway.

teor2345 · 2022-02-22T04:25:12Z

I updated the checkpoint lists in:

fix(consensus): update Zebra's hard-coded blockchain checkpoint lists #3606

If we use that PR and checkpoint_sync = true, the sync should be a lot faster.

teor2345 · 2022-02-22T04:28:02Z

> Could it be something related to the amount of data (blocks) we're getting to be processed?

It could be a concurrency issue. We can limit the number of blocks in the queue by fixing:

Fix "synced block height too far ahead of the tip: dropped downloaded block" #3603

But let's try the checkpoints first, they should speed things up a lot.

teor2345 · 2022-02-28T02:07:37Z

I'm currently running this branch with:

lookahead limit 1200 - 100% at 5:58 - test(sync): try a smaller lookahead limit #3661
lookahead limit 2000 - 100% at 5:30 - test(sync): try the default lookahead limit #3662 - the default
lookahead limit 2800 - 100% at 5:47 - test(sync): try a larger lookahead limit #3663

PR #3662 didn't change any code, but it was half an hour faster than the original. This could have happened due to the switch from Rust 1.58 to 1.59.

If this performance is consistent, we might not need to make any more performance improvements right now.
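
For reference, the gaps between the three runs above can be worked out as follows. This is a small sketch that assumes the "100% at" figures are hours:minutes of wall-clock time; the helper name is hypothetical.

```python
def to_minutes(hmm):
    """Convert an 'H:MM' duration string to whole minutes."""
    hours, minutes = hmm.split(":")
    return int(hours) * 60 + int(minutes)


# Lookahead limit -> time to reach 100%, as listed in the comment above.
runs = {1200: "5:58", 2000: "5:30", 2800: "5:47"}
DEFAULT_LIMIT = 2000

baseline = to_minutes(runs[DEFAULT_LIMIT])
# Positive values mean the run was slower than the default limit.
deltas = {limit: to_minutes(t) - baseline for limit, t in runs.items()}

for limit, delta in sorted(deltas.items()):
    print(f"lookahead limit {limit}: {delta:+d} min vs default")
```

With only one run per setting, differences of this size may be within run-to-run variation, which fits the caveat above about whether the performance is consistent.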

teor2345

I don't think this PR is ready to merge yet, because it disables a lot of existing tests.

docker/entrypoint.sh

.github/workflows/test.yml

We'll remove the blocking changes

teor2345

I think we're all done here, thanks for your patience working through all the sync issues.

* feat(actions)!: add full sync test (#3582)
* add(tests): full sync test
* fix(test): add build
* fix(deploy): escape double dashes '--' correctly
* fix(test): remove unexpected --no-capture arg
  error: Found argument '--nocapture' which wasn't expected, or isn't valid in this context
* refactor(docker): use default executable as entrypoint
* refactor(startup): add a custom entrypoint
* fix(test): add missing TEST_FULL_SYNC variable
* test(timeout): use the biggest machine
* fix
* fix(deploy): use latest successful image
* typo
* refactor(docker): generate config file at startup
* revert(build): changes were made to docker
* fix(docker): send variables correctly to the entrypoint
* test different conf file approach
* fix(env): add RUN_TEST env variable
* ref: use previous approach
* fix(color): use environment variable
* fix(resources): use our normal machine size
* fix(ci): double CPU and RAM for full sync test
* fix(test): check for zebrad test output in the correct order
  The mempool is only activated once, so we must check for that log first.
  After mempool activation, the stop regex is logged at least once. (It might be logged before as well, but we can't rely on that.)
  When checking that the mempool didn't activate, wait for the `zebrad` command to exit, then check the entire log.
* fix(ci): run full sync test with full compiler optimisations
* fix(tests): reintroduce tests and run full sync on approval
* fix(tests): reduce the changelog
  Co-authored-by: teor <teor@riseup.net>
* fix(ci): update CI job path triggers (#3692)
* ci(test): re-run tests when snapshot data changes
* fix(ci): rebuild state when disk format changes
* fix(ci): rebuild rust docs when code or dependencies change
* doc(ci): explain why we run jobs when files change
  Co-authored-by: Gustavo Valverde <gustavo@iterativo.do>
* fix(build): use the right multistage target (#3700)
* fix(review): only assign one reviewer to general Rust reviews (#3708)
  If we assign two teams, GitHub assigns two reviewers.
* fix(ci): change the color-eyre ignore to a tracing-subscriber ignore
* fix(ci): ignore duplicate darling dependencies
* doc(ci): remove an alternative resolution doc
  Co-authored-by: Gustavo Valverde <gustavo@iterativo.do>
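
The ordering rule described in the "check for zebrad test output in the correct order" commit message above can be sketched roughly as follows. This is an illustration in Python, not the project's Rust test code; the function names and the two patterns in the usage example are placeholders, not zebrad's actual log text.

```python
import re


def stop_logged_after_mempool(lines, mempool_pattern, stop_pattern):
    """Return True if a stop line appears after the first mempool activation line.

    Stop lines seen before mempool activation are ignored, because they
    cannot be relied on.
    """
    mempool_re = re.compile(mempool_pattern)
    stop_re = re.compile(stop_pattern)
    mempool_seen = False
    for line in lines:
        if not mempool_seen:
            # Step 1: look for the one-off mempool activation log first.
            mempool_seen = bool(mempool_re.search(line))
            continue
        # Step 2: only now look for the stop pattern.
        if stop_re.search(line):
            return True
    return False


def mempool_never_activated(full_log_lines, mempool_pattern):
    """Check the entire log, collected after the process has exited."""
    mempool_re = re.compile(mempool_pattern)
    return not any(mempool_re.search(line) for line in full_log_lines)


# Usage with placeholder patterns (hypothetical, for illustration only).
MEMPOOL = r"activating mempool"
STOP = r"sync_percent=100"
log = ["sync_percent=100.000 %", "activating mempool", "sync_percent=100.000 %"]
print(stop_logged_after_mempool(log, MEMPOOL, STOP))
```

Checking in the opposite order could miss the mempool line, since it is logged only once and a streaming reader may already have consumed it while searching for the stop pattern.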

* Upgrade some dependencies
* Upgrade some dependencies
* Upgrade dependencies for zebrad
* Upgrade tracing dependencies
* Revert `tor` & `arti`
* Upgrade `criterion` & `pin-project` in `deny.toml`
* Remove some dependencies from `skip-tree` in `deny.toml`
* Revert some the versions of dependencies because of duplicates
* Revert proptest regressions
* Upgrade dependencies, then ignore some more duplicates (#3716)
* feat(actions)!: add full sync test (#3582)
* add(tests): full sync test
* fix(test): add build
* fix(deploy): escape double dashes '--' correctly
* fix(test): remove unexpected --no-capture arg
  error: Found argument '--nocapture' which wasn't expected, or isn't valid in this context
* refactor(docker): use default executable as entrypoint
* refactor(startup): add a custom entrypoint
* fix(test): add missing TEST_FULL_SYNC variable
* test(timeout): use the biggest machine
* fix
* fix(deploy): use latest successful image
* typo
* refactor(docker): generate config file at startup
* revert(build): changes were made to docker
* fix(docker): send variables correctly to the entrypoint
* test different conf file approach
* fix(env): add RUN_TEST env variable
* ref: use previous approach
* fix(color): use environment variable
* fix(resources): use our normal machine size
* fix(ci): double CPU and RAM for full sync test
* fix(test): check for zebrad test output in the correct order
  The mempool is only activated once, so we must check for that log first.
  After mempool activation, the stop regex is logged at least once. (It might be logged before as well, but we can't rely on that.)
  When checking that the mempool didn't activate, wait for the `zebrad` command to exit, then check the entire log.
* fix(ci): run full sync test with full compiler optimisations
* fix(tests): reintroduce tests and run full sync on approval
* fix(tests): reduce the changelog
  Co-authored-by: teor <teor@riseup.net>
* fix(ci): update CI job path triggers (#3692)
* ci(test): re-run tests when snapshot data changes
* fix(ci): rebuild state when disk format changes
* fix(ci): rebuild rust docs when code or dependencies change
* doc(ci): explain why we run jobs when files change
  Co-authored-by: Gustavo Valverde <gustavo@iterativo.do>
* fix(build): use the right multistage target (#3700)
* fix(review): only assign one reviewer to general Rust reviews (#3708)
  If we assign two teams, GitHub assigns two reviewers.
* fix(ci): change the color-eyre ignore to a tracing-subscriber ignore
* fix(ci): ignore duplicate darling dependencies
* doc(ci): remove an alternative resolution doc
  Co-authored-by: Gustavo Valverde <gustavo@iterativo.do>
  Co-authored-by: teor <teor@riseup.net>
  Co-authored-by: Gustavo Valverde <gustavo@iterativo.do>

add(tests): full sync test

e37c9d8

zfnd-bot bot assigned gustavovalverde Feb 18, 2022

gustavovalverde added 6 commits February 18, 2022 06:17

fix(test): add build

3bc3455

fix(deploy): escape double dashes '--' correctly

f5d710c

fix(test): remove unexpected --no-capture arg

b38a1a5

error: Found argument '--nocapture' which wasn't expected, or isn't valid in this context

refactor(docker): use default executable as entrypoint

3f42944

refactor(startup): add a custom entrypoint

6ae180b

fix(test): add missing TEST_FULL_SYNC variable

f7014c1

gustavovalverde added 4 commits February 21, 2022 23:10

test(timeout): use the biggest machine

445a3dd

fix

563f8ca

fix(deploy): use latest successful image

882da52

typo

f37f044

gustavovalverde added 10 commits February 22, 2022 05:56

Merge branch 'main' into add-full-sync-test

28d4a87

refactor(docker): generate config file at startup

56e77c1

revert(build): changes were made to docker

707cb43

fix(docker): send variables correctly to the entrypoint

ec076a0

test different conf file approach

3ded9fa

fix(env): add RUN_TEST env variable

38c6a5b

ref: use previous approach

06ccd6e

fix(color): use environment variable

c46dd50

Merge branch 'main' into add-full-sync-test

e699175

fix(resources): use our normal machine size

2452df2

teor2345 previously requested changes Feb 28, 2022

docker/entrypoint.sh (resolved)

.github/workflows/test.yml (outdated, resolved)

gustavovalverde added 3 commits March 1, 2022 06:38

Merge branch 'main' into add-full-sync-test

236e9c1

fix(tests): reintroduce tests and run full sync on approval

90a8e36

fix(tests): reduce the changelog

794d3fe

gustavovalverde changed the title ~~add(actions): full sync test~~ feat(actions): full sync test Mar 1, 2022

gustavovalverde changed the title ~~feat(actions): full sync test~~ feat(actions): add full sync test Mar 1, 2022

gustavovalverde changed the title ~~feat(actions): add full sync test~~ feat(actions)!: add full sync test Mar 1, 2022

teor2345 approved these changes Mar 2, 2022

This was referenced Mar 2, 2022

merge-queue: embarking main (30cc048) and #3628 together #3686

Closed

merge-queue: embarking main (41d61a6) and #3630 together #3687

Closed

mergify bot added a commit that referenced this pull request Mar 2, 2022

Merge of #3582

03526eb

mergify bot mentioned this pull request Mar 2, 2022

merge-queue: embarking main (744aca9) and [#3402 + #3582] together #3689

Closed

48 tasks

mergify bot added a commit that referenced this pull request Mar 2, 2022

Merge of #3582

59ab273

Merge branch 'main' into add-full-sync-test

7582189

gustavovalverde requested a review from a team as a code owner March 2, 2022 12:18

mergify bot added a commit that referenced this pull request Mar 2, 2022

Merge of #3582

e239549

mergify bot mentioned this pull request Mar 2, 2022

merge-queue: embarking main (a0c4512) and #3582 together #3698

Closed

45 tasks

mergify bot added a commit that referenced this pull request Mar 2, 2022

Merge of #3582

a28a7b2

mergify bot mentioned this pull request Mar 2, 2022

merge-queue: embarking main (a0c4512), #3582 and #3692 together #3699

Closed

45 tasks

mergify bot merged commit db966f2 into main Mar 2, 2022

mergify bot deleted the add-full-sync-test branch March 2, 2022 14:15

gustavovalverde mentioned this pull request Mar 6, 2022

Run full sync tests on Mainnet #1592

Closed
