ci: run test on self-hosted runners #3782

Merged: 19 commits from ci/self-hosted into master on Apr 25, 2023
Conversation

@galargh commented Apr 12, 2023

Description

This PR moves the Test jobs from the Continuous Integration workflow and the Interoperability Testing job to self-hosted runners.

The former run on machines backed by c5.large/m5.large, c5.xlarge/m5.xlarge, or c5.2xlarge AWS instances, while the latter runs on c5.4xlarge.

The self-hosted runners are set up using https://github.com/pl-strflt/tf-aws-gh-runner.

The jobs are being moved to self-hosted runners because we're exhausting the hosted GHA usage limits.
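For readers unfamiliar with the mechanics, the change essentially amounts to switching each job's runs-on from a hosted image to the self-hosted runner labels. A simplified, illustrative sketch (the label names below are placeholders, not necessarily the ones this PR uses):

```yaml
jobs:
  test:
    # before: runs-on: ubuntu-latest
    runs-on: [self-hosted, linux, x64]   # placeholder labels; the real workflow targets size-specific runner groups
    steps:
      - uses: actions/checkout@v3
      - run: cargo test --workspace
```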

Notes & open questions

  • c5.large/m5.large are quite small, which gives us plenty of room for improvement
  • c5.large/m5.large and c5.xlarge/m5.xlarge runners are currently configured with gp3 disks with default IOPS - also potential for improvement there
  • this should alleviate the pressure on hosted GHA runners
  • the interop workflow took 10m 5s after the changes
  • the ci workflow took 9m 30s after the changes
  • ⚠️ in one of the test runs, one of the jobs hung on Updating index
  • ❗ it'd be very interesting to explore S3-backed sccache as an alternative to the current caching setup on self-hosted runners (there's also https://github.com/mozilla/sccache/blob/main/docs/GHA.md for hosted); a rough sketch of the idea follows this list
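A minimal sketch of what the S3-backed sccache idea could look like, assuming sccache is preinstalled on the runner image; the bucket name, region, and runner labels below are placeholders, and the runner's IAM role would need access to the bucket:

```yaml
env:
  RUSTC_WRAPPER: sccache                # route rustc invocations through sccache
  SCCACHE_BUCKET: my-ci-sccache-bucket  # placeholder S3 bucket
  SCCACHE_REGION: us-east-1             # placeholder region

jobs:
  test:
    runs-on: [self-hosted, linux, x64]  # placeholder labels
    steps:
      - uses: actions/checkout@v3
      - run: cargo test --workspace
      - run: sccache --show-stats       # inspect hit/miss counts after the build
```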

Change checklist

  • I have performed a self-review of my own code
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • A changelog entry has been made in the appropriate crates

@galargh marked this pull request as ready for review on April 13, 2023 at 10:03

@thomaseizinger left a comment

Thanks!

Inline review comments on .github/workflows/ci.yml (resolved).

@thomaseizinger left a comment

Thanks! A few more comments :)

Inline review comments on .github/workflows/ci.yml and .github/workflows/interop-test.yml (resolved).

@thomaseizinger left a comment

Thank you for your help, much appreciated :)

@mergify (bot) merged commit 9d78331 into master on Apr 25, 2023
@mergify (bot) deleted the ci/self-hosted branch on Apr 25, 2023 at 14:57
@thomaseizinger

I believe since we merged this, our caches are no longer working, see https://github.com/libp2p/rust-libp2p/actions/runs/4808190866/jobs/8557865839#step:11:91 for example.

The caches are created here: https://github.com/libp2p/rust-libp2p/blob/master/.github/workflows/cache-factory.yml

Do we need to run these steps on the self-hosted runners too in order to access these caches?

@thomaseizinger commented Apr 26, 2023

This also seems to cause a much higher rate of random failures; I've just had three timeouts in a row:

@galargh Do you have an idea on how we can fix this? It doesn't seem to be very reliable unfortunately :(

@galargh commented Apr 26, 2023

I can see two kinds of errors happening here.

The first group is 500s from api.github.com. These look like they could be intermittent issues on GitHub's side. They did auto-resolve in an acceptable timeframe, so I'd be inclined to wait and see how often this occurs before investigating further.

Wed, 26 Apr 2023 11:51:59 GMT
Download action repository 'actions/cache@v3' (SHA:88522ab9f39a2ea568f7027eddc7d8d8bc9d59c8)
Wed, 26 Apr 2023 11:52:04 GMT
Warning: Failed to download action 'https://api.github.com/repos/actions/cache/tarball/88522ab9f39a2ea568f7027eddc7d8d8bc9d59c8'. Error: Response status code does not indicate success: 500 (Internal Server Error).
Wed, 26 Apr 2023 11:52:04 GMT
Warning: Back off 26.524 seconds before retry.
Wed, 26 Apr 2023 11:52:31 GMT
Run ./.github/actions/cargo-semver-checks

The second group is certainly more worrying, as it leads to job timeouts. It looks like the Updating index step sometimes hangs indefinitely when we call cargo.

Wed, 26 Apr 2023 11:52:32 GMT
Run cargo semver-checks check-release --package libp2p-swarm-derive --verbose
Wed, 26 Apr 2023 11:52:32 GMT
    Updating index
Wed, 26 Apr 2023 12:00:42 GMT
Error: The operation was canceled.

Do you know what exactly happens during the Updating index operation? Is there some log we could archive to get more insights?

We're not running Rust elsewhere on self-hosted runners yet, so I haven't investigated this specific issue. We did, however, face other network-related issues. E.g. we were being heavily rate limited by DockerHub because we're running behind a single NAT gateway, and we saw the official Go module proxy stop responding and eventually kill connections (we haven't been able to identify the root cause of that yet). Both of these issues were resolved by creating S3-backed read-through proxies to the respective services.
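For illustration only: the actual proxy endpoints live in the runner infrastructure, not in this repository, and the mirror URL below is a placeholder. Pointing the Docker daemon on a runner at a read-through registry mirror looks roughly like this:

```yaml
# Hypothetical runner-provisioning step, not part of this PR.
- name: Point Docker at the read-through registry mirror
  run: |
    echo '{ "registry-mirrors": ["https://dockerhub-mirror.example.internal"] }' | sudo tee /etc/docker/daemon.json
    sudo systemctl restart docker
```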

@thomaseizinger

The second group is certainly more worrying, as it leads to job timeouts. It looks like the Updating index step sometimes hangs indefinitely when we call cargo.

Wed, 26 Apr 2023 11:52:32 GMT
Run cargo semver-checks check-release --package libp2p-swarm-derive --verbose
Wed, 26 Apr 2023 11:52:32 GMT
    Updating index
Wed, 26 Apr 2023 12:00:42 GMT
Error: The operation was canceled.

Do you know what exactly happens during the Updating index operation? Is there some log we could archive to get more insights?

Updating index means that cargo is downloading the crates.io index, which is essentially this repository: https://github.com/rust-lang/crates.io-index

With Rust 1.68, there is now the sparse registry protocol, but it is not the default yet. See https://blog.rust-lang.org/2023/03/09/Rust-1.68.0.html#cargos-sparse-protocol.

Currently, cargo is cloning the above Git repository. I'd assume that GitHub has tuned that access on its own infrastructure, and now that we are using self-hosted runners we are accessing it from the outside.

Let's see if it helps if we set the CARGO_REGISTRIES_CRATES_IO_PROTOCOL=sparse environment variable.
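Concretely, that would look something like the sketch below in the workflow. CARGO_REGISTRIES_CRATES_IO_PROTOCOL is cargo's documented setting (honoured since Rust 1.68); the job name and runner labels are placeholders:

```yaml
jobs:
  semver-checks:
    runs-on: [self-hosted, linux, x64]   # placeholder labels
    env:
      # fetch the crates.io index over HTTP instead of cloning the git repository
      CARGO_REGISTRIES_CRATES_IO_PROTOCOL: sparse
    steps:
      - run: cargo semver-checks check-release --package libp2p-swarm-derive --verbose
```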

@thomaseizinger

I just realised that we already set this through dtolnay/rust-toolchain: https://github.com/libp2p/rust-libp2p/actions/runs/4808395536/jobs/8558331147#step:16:74

@galargh commented Apr 27, 2023

I found rust-lang/rust#64248, which led me to rust-lang/cargo#7662. Reading through the comments, I think we could try:

mergify bot pushed a commit that referenced this pull request on Apr 27, 2023:
This PR enables debug logging on requests from cargo to the registry in the semver-checks action (rust-lang/cargo#7662 (comment)). Hopefully, it will let us debug the network issue reported here: #3782 (comment)

Pull-Request: #3838.
@thomaseizinger

I've opened an issue with cargo semver-checks: obi1kenobi/cargo-semver-checks#443

@thomaseizinger

I believe since we merged this, our caches are no longer working, see libp2p/rust-libp2p/actions/runs/4808190866/jobs/8557865839#step:11:91 for example.

The caches are created here: master/.github/workflows/cache-factory.yml

Do we need to run these steps on the self-hosted runners too in order to access these caches?

@galargh Do you also have some input on this?

@galargh commented Apr 27, 2023

I believe since we merged this, our caches are no longer working, see libp2p/rust-libp2p/actions/runs/4808190866/jobs/8557865839#step:11:91 for example.
The caches are created here: master/.github/workflows/cache-factory.yml
Do we need to run these steps on the self-hosted runners too in order to access these caches?

@galargh Do you also have some input on this?

I looked into this. When restoring a cache, we take not only the cache/restore keys into account but also the cache paths. In the case of Rust, the cache paths would be the cargo home plus the paths to the targets. rust-cache uses absolute paths, which differ between hosted and self-hosted runners. I think the quickest way to fix it is to make sure we use the same paths on the self-hosted runners. Also, everyone would benefit from cache interop. I'll try to look into it tomorrow.
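For illustration (this is not the repository's actual cache-factory.yml setup): with the plain actions/cache action the cached paths are explicit, and they have to line up between the job that saved the cache and the job that restores it, which is why rust-cache's absolute paths differing between runner images can prevent restores even when the key matches.

```yaml
# Illustrative only; the paths and key are placeholders, not the project's real configuration.
- uses: actions/cache@v3
  with:
    path: |
      ~/.cargo/registry
      ~/.cargo/git
      target
    key: ${{ runner.os }}-cargo-${{ hashFiles('**/Cargo.lock') }}
    restore-keys: |
      ${{ runner.os }}-cargo-
```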

@thomaseizinger

I believe since we merged this, our caches are no longer working, see libp2p/rust-libp2p/actions/runs/4808190866/jobs/8557865839#step:11:91 for example.
The caches are created here: master/.github/workflows/cache-factory.yml
Do we need to run these steps on the self-hosted runners too in order to access these caches?

@galargh Do you also have some input on this?

I looked into this. When restoring a cache, we take not only the cache/restore keys into account but also the cache paths. In the case of Rust, the cache paths would be the cargo home plus the paths to the targets. rust-cache uses absolute paths, which differ between hosted and self-hosted runners. I think the quickest way to fix it is to make sure we use the same paths on the self-hosted runners. Also, everyone would benefit from cache interop. I'll try to look into it tomorrow.

Awesome, thank you!

@thomaseizinger

@galargh More funny timeouts :(

https://github.com/libp2p/rust-libp2p/actions/runs/4829313211/jobs/8604199371?pr=3746#step:3:44

@galargh commented Apr 28, 2023

@galargh More funny timeouts :(

https://github.com/libp2p/rust-libp2p/actions/runs/4829313211/jobs/8604199371?pr=3746#step:3:44

This one is from a GHA hosted runner (a Windows one), so it's likely unrelated to the ones we've been seeing.

BTW, the caches are now interoperable between hosted and self-hosted runners 🥳

@thomaseizinger

BTW, the caches are now interoperable between hosted and self-hosted runners 🥳

Exciting, thank you!

@galargh commented Apr 28, 2023

https://github.com/libp2p/rust-libp2p/actions/runs/4829816994/jobs/8605307310 <- this is a new one

I thought I already disabled auto-upgrades. I'll have a look if there's something else that might be calling apt-get.

edit: I previously disabled unattended updates, but apparently, that's not all. We should also disable periodic package list updates. I should be able to get to it later today before I start my holiday.
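For reference, a sketch of the kind of change this needs, written here as a workflow-style provisioning step; the real change lives in the runner provisioning in pl-strflt/tf-aws-gh-runner and may look different, and the config file name below is arbitrary:

```yaml
- name: Disable APT periodic package list updates
  run: |
    sudo tee /etc/apt/apt.conf.d/99-disable-periodic > /dev/null <<'EOF'
    APT::Periodic::Update-Package-Lists "0";
    APT::Periodic::Unattended-Upgrade "0";
    EOF
```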

@thomaseizinger

libp2p/rust-libp2p/actions/runs/4829816994/jobs/8605307310 <- this is a new one

I thought I already disabled auto-upgrades. I'll have a look if there's something else that might be calling apt-get.

edit: I previously disabled unattended updates, but apparently, that's not all. We should also disable periodic package list updates. I should be able to get to it later today before I start my holiday.

Could it be that the problem on self-hosted runners is that the individual jobs concurrently try to run apt-get install? In any case, we are going to remove this step very soon, so it is not a big problem. We only install protoc for legacy reasons (semver-checking against old packages that still need protoc to build).

@thomaseizinger

With the caches working, our CI is flying along! 9m for all jobs: https://github.com/libp2p/rust-libp2p/actions/runs/4831101435

Amazing.

@galargh commented Apr 28, 2023

Could it be that the problem on self-hosted runners is that the individual jobs concurrently try to run apt-get install? In any case, we are going to remove this step very soon, so it is not a big problem. We only install protoc for legacy reasons (semver-checking against old packages that still need protoc to build).

Not really, they're all running on different instances. I disabled periodic package list updates now - hopefully, it was just that.

Thank you for all your help finding issues and fixing them, we're really making things better for everyone here :) Greatly appreciated 🙇

@thomaseizinger

Could it be that the problem on self-hosted runners is that the individual jobs concurrently try to run apt-get install? In any case, we are going to remove this step very soon, so it is not a big problem. We only install protoc for legacy reasons (semver-checking against old packages that still need protoc to build).

Not really, they're all running on different instances. I disabled periodic package list updates now - hopefully, it was just that.

Great, thank you!

Thank you for all your help finding issues and fixing them, we're really making things better for everyone here :) Greatly appreciated 🙇

Haha, all good! Thank you for being so responsive and tuning the runners!
