
An error on GitHub CI while running some erigon hive tests #11834

Closed

lystopad opened this issue Sep 2, 2024 · 5 comments

lystopad commented Sep 2, 2024

System information

It happens with the latest master as well as with v2.60.4

OS & Version: Ubuntu, 16 GB RAM (Kernel Version: 6.5.0-1025-azure)

Commit hash: 68f4196

Erigon Command (with flags/config):

Consensus Layer:

Consensus Layer Command (with flags/config):

Chain/Network:

Quoting a message from the partners:

Hi, Erigon team,
We have run into a strange error on GitHub CI while running some erigon hive tests. After 30-60 minutes, the job fails with exit code 143 (aborted; 143 = 128 + 15 is the conventional exit status for a process killed by SIGTERM). An example of such a failure: xx-xx-xx-xx/job/29501828480

It is hard to debug the issue because in most cases we cannot even get the GitHub Actions logs. Currently we know that:

  1. It never happens with Nethermind (even after a 4.5-hour run)
  2. Short suites (less than 30 minutes) work well
  3. The suite passes locally without any issues (tested on Linux Mint with 8 vCPUs and 32 GB RAM)
  4. It happens with the latest master as well as with v2.60.4 (without --sync.parallel-state-flushing=false)
  5. There are no errors on the hive / erigon side

Google suggests it may be related to CPU or RAM usage (see actions/runner-images#6680).
I assume RAM is OK (the GitHub runner has 16 GB), so maybe the problem is CPU-related.

If so, is there a way to decrease CPU usage inside the Docker container? I know it is possible on the Docker side, but that is not easy with hive, so maybe there is a way to do it on the erigon side? Also, any insights on how to debug the issue would be appreciated.

More details can be found in the internal messenger, in the "erigon3" channel.
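On the Docker-side question above: for reference, a minimal sketch of the two usual levers, assuming standard Docker CLI behavior and the standard Go runtime environment variable (the image name is a placeholder, not the actual hive setup):

   # Cap the container itself on the Docker side: at most 2 CPUs, 8 GB RAM
   docker run --cpus="2" --memory="8g" erigon-image   # "erigon-image" is a placeholder

   # Or cap the Go scheduler inside the container: GOMAXPROCS limits the number
   # of OS threads executing Go code simultaneously (standard Go runtime env var)
   docker run -e GOMAXPROCS=2 erigon-image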

lystopad commented Sep 4, 2024

Update from the partner:

We have run the same workflows against our self-hosted runners:

  • 8 CPUs and 64 GB RAM
  • 4 CPUs and 8 GB RAM

For both, the tests ran without any issues. So it looks like the issue is not related to the amount of RAM.

lystopad commented Sep 4, 2024

One more update: I also compared the two runners' configurations.

GitHub standard runner:

   Kernel Version: 6.5.0-1025-azure
   Operating System: Ubuntu ***.04.4 LTS
   OSType: linux
   Architecture: x86_64
   CPUs: 4
   Total Memory: 15.61GiB

Self-hosted runner:

   Kernel Version: 6.8.0-41-generic
   Operating System: Ubuntu 24.04.1 LTS
   OSType: linux
   Architecture: x86_64
   CPUs: 4
   Total Memory: 7.755GiB
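These fields match what docker info reports; assuming Docker is available on the runner, the same values can be captured in one line with its standard --format template flag:

   docker info --format '{{.KernelVersion}} | {{.OperatingSystem}} | {{.OSType}}/{{.Architecture}} | {{.NCPU}} CPUs | {{.MemTotal}} bytes'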

The notable difference is the kernel version. Maybe the problem is with the GitHub runner's Ubuntu image.

lystopad commented Sep 5, 2024

One more update regarding the workflow-cancellation issue: changing the Ubuntu version did not help.
Tested on: ubuntu-latest (22.04), ubuntu-24.04, ubuntu-20.04
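For the record, a runs-on matrix is the usual way to sweep runner images like this in a workflow; a minimal sketch (the job name and hive invocation below are illustrative assumptions, not taken from the actual workflow):

   jobs:
     hive-tests:                        # hypothetical job name
       strategy:
         matrix:
           os: [ubuntu-latest, ubuntu-24.04, ubuntu-20.04]
       runs-on: ${{ matrix.os }}
       steps:
         - uses: actions/checkout@v4
         - run: ./hive --sim ethereum/engine --client erigon   # illustrative suite/client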

Test passed on a self-hosted runner:

   Kernel Version: 6.8.0-41-generic
   Operating System: Ubuntu 24.04.1 LTS
   OSType: linux
   Architecture: x86_64
   CPUs: 4
   Total Memory: 7.755GiB

somnathb1 commented

It seems to me that the issue is not related to performance or resources. I tried running one of the failing workflows, namely dashboard_erigon_withdrawals.yml, and it runs fine for me: https://github.com/somnathb1/hive/actions/runs/10728609938/job/29753487168
I have also tried running several instances of the hive tests in parallel on my local machine, with low overall resource usage.
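For anyone reproducing this locally, a sketch of running two hive suites in parallel while snapshotting container resource usage (the suite and client names are illustrative; docker stats is a standard Docker command):

   # launch two suites concurrently (suite/client names illustrative)
   ./hive --sim ethereum/engine --client erigon &
   ./hive --sim ethereum/rpc --client erigon &

   # one-shot snapshot of per-container CPU and memory usage
   docker stats --no-stream
   wait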

somnathb1 commented

The issue was appearing only intermittently, on some GitHub runners. The hive failures on the main branch aren't related to CI and are tracked in a separate issue. Closing for now.
