Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(ci): fix hangs in lightwalletd tests by checking concurrent process output in different threads #4828

Merged
merged 16 commits into from
Jul 28, 2022

Conversation

teor2345
Copy link
Contributor

@teor2345 teor2345 commented Jul 26, 2022

Motivation

lightwalletd is unstable in our CI, because zebrad hangs when its log buffer fills up.

Close #4820
Depends-On: #4826

Solution

  • Clear the zebrad and lightwalletd logs in different threads, to avoid deadlocks due to the log buffer filling up
  • Increase test timeouts

This increases the sync test time, but we can fix that in later PRs.

Related fixes:

  • Only copy the state once in the send transactions test, to avoid a "disk full" error
  • Wait longer for lightwalletd gRPC server startup, to avoid a "connection refused" error
  • Look for cached state disks for this commit and branch, then the main branch, then any branch
  • Do ZK parameter preloads in the lightwalletd tests that need them
  • Make code execution time logs shorter

Review

This is a high-priority sync fix.

Reviewer Checklist

  • Code implements Specs and Designs
  • CI passes

Follow Up Tasks

@teor2345 teor2345 added C-bug Category: This is a bug P-High 🔥 I-integration-fail Continuous integration fails, including build and test failures C-testing Category: These are tests A-rpc Area: Remote Procedure Call interfaces lightwalletd any work associated with lightwalletd labels Jul 26, 2022
@teor2345 teor2345 requested a review from a team as a code owner July 26, 2022 04:45
@teor2345 teor2345 self-assigned this Jul 26, 2022
@teor2345 teor2345 requested review from upbqdn and removed request for a team July 26, 2022 04:45
@teor2345 teor2345 changed the title fix(ci): restart lightwalletd when it hangs 3. fix(ci): restart lightwalletd when it hangs Jul 26, 2022
@teor2345 teor2345 marked this pull request as draft July 26, 2022 04:49
@codecov
Copy link

codecov bot commented Jul 26, 2022

Codecov Report

Merging #4828 (3cb3798) into main (404f682) will increase coverage by 0.07%.
The diff coverage is 57.89%.

@@            Coverage Diff             @@
##             main    #4828      +/-   ##
==========================================
+ Coverage   78.80%   78.88%   +0.07%     
==========================================
  Files         305      305              
  Lines       38733    38730       -3     
==========================================
+ Hits        30524    30551      +27     
+ Misses       8209     8179      -30     

@teor2345
Copy link
Contributor Author

I ran 77 of these tests locally overnight, and they all succeeded.
(CI could be different, but it's still a good sign the restart seems to be working reliably.)

I just need to fix a bug in the GitHub actions where we're using an old cached state.

@teor2345 teor2345 added the A-devops Area: Pipelines, CI/CD and Dockerfiles label Jul 26, 2022
@teor2345 teor2345 force-pushed the concurrent-get-headers branch from 032c6a4 to 9fdf5f5 Compare July 26, 2022 20:25
Base automatically changed from concurrent-get-headers to main July 26, 2022 20:26
@teor2345 teor2345 force-pushed the lightwalletd-restart branch from 82b0fd4 to 3ae946b Compare July 26, 2022 20:27
@teor2345 teor2345 marked this pull request as ready for review July 26, 2022 20:27
@teor2345 teor2345 requested a review from a team as a code owner July 26, 2022 20:27
@teor2345 teor2345 requested review from gustavovalverde and removed request for a team July 26, 2022 20:27
@teor2345
Copy link
Contributor Author

This should all work now, we're just waiting for CI.

@teor2345 teor2345 mentioned this pull request Jul 26, 2022
31 tasks
@teor2345 teor2345 requested review from oxarbitrage and removed request for upbqdn July 26, 2022 22:44
@teor2345 teor2345 force-pushed the lightwalletd-restart branch from c2e6eb9 to c2dc411 Compare July 27, 2022 06:44
@teor2345 teor2345 force-pushed the lightwalletd-restart branch from c2dc411 to 1d9ec5a Compare July 28, 2022 00:38
@teor2345 teor2345 changed the title fix(ci): fix hangs in lightwalletd tests fix(ci): fix hangs in lightwalletd tests by checking concurrent process output in different threads Jul 28, 2022
@teor2345 teor2345 removed the request for review from gustavovalverde July 28, 2022 00:40
@teor2345
Copy link
Contributor Author

I needed to re-do the fix, because the problem was actually deadlocks from not reading the zebrad logs.

@teor2345
Copy link
Contributor Author

There was a google cloud SSH failure:

Timeout, server 35.222.36.92 not responding.

Recommendation: To check for possible causes of SSH connectivity issues and get
recommendations, rerun the ssh command with the --troubleshoot option. Example:

gcloud None compute ssh example-instance --zone=us-central1-a --troubleshoot

Or, to investigate an IAP tunneling issue:

gcloud None compute ssh example-instance --zone=us-central1-a --tunnel-through-iap --troubleshoot

ERROR: (gcloud.compute.ssh) [/usr/bin/ssh] exited with return code [255].

https://github.com/ZcashFoundation/zebra/runs/7553151823?check_suite_focus=true#step:6:332

I've never seen one of these before, so I restarted those jobs.

@teor2345
Copy link
Contributor Author

Admin-merging because the one failing test is going to be fixed soon.

There's just one issue with a regex in the send transactions test, I'll fix it in PR #4840.

@teor2345 teor2345 merged commit 89a0410 into main Jul 28, 2022
@teor2345 teor2345 deleted the lightwalletd-restart branch July 28, 2022 21:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-devops Area: Pipelines, CI/CD and Dockerfiles A-rpc Area: Remote Procedure Call interfaces C-bug Category: This is a bug C-testing Category: These are tests I-integration-fail Continuous integration fails, including build and test failures lightwalletd any work associated with lightwalletd
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fix lightwalletd test hangs by checking process output in dedicated threads
2 participants