ci: run test on self-hosted runners #3782

Merged: 19 commits from ci/self-hosted into master on Apr 25, 2023
Conversation

@galargh commented Apr 12, 2023

Description

This PR moves the Test jobs from the Continuous Integration workflow and the Interoperability Testing job to self-hosted runners.

The former run on machines backed by c5.large/m5.large, c5.xlarge/m5.xlarge, or c5.2xlarge AWS instances, while the latter runs on c5.4xlarge.

The self-hosted runners are set up using https://github.com/pl-strflt/tf-aws-gh-runner.

The jobs are being moved to self-hosted runners because we're exhausting the hosted GHA usage limits.
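For readers unfamiliar with the mechanics, the change essentially amounts to switching each job's runs-on from a hosted image to the self-hosted runner labels. A simplified, illustrative sketch (the label names below are placeholders, not necessarily the ones this PR uses):

```yaml
jobs:
  test:
    # before: runs-on: ubuntu-latest
    runs-on: [self-hosted, linux, x64]   # placeholder labels; the real workflow targets size-specific runner groups
    steps:
      - uses: actions/checkout@v3
      - run: cargo test --workspace
```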

Notes & open questions

  • c5.large/m5.large are quite small, which gives us plenty of room for improvement
  • c5.large/m5.large and c5.xlarge/m5.xlarge runners are currently configured with gp3 disks with default IOPS - also potential for improvement there
  • this should alleviate the pressure on hosted GHA runners
  • the interop workflow took 10m 5s after the changes
  • the ci workflow took 9m 30s after the changes
  • ⚠️ in one of the test runs, one of the jobs hung on Updating index
  • ❗ it'd be very interesting to explore S3-backed sccache as an alternative to the current caching setup on self-hosted runners (there's also https://github.com/mozilla/sccache/blob/main/docs/GHA.md for hosted); a rough sketch of the idea follows this list
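A minimal sketch of what the S3-backed sccache idea could look like, assuming sccache is preinstalled on the runner image; the bucket name, region, and runner labels below are placeholders, and the runner's IAM role would need access to the bucket:

```yaml
env:
  RUSTC_WRAPPER: sccache                # route rustc invocations through sccache
  SCCACHE_BUCKET: my-ci-sccache-bucket  # placeholder S3 bucket
  SCCACHE_REGION: us-east-1             # placeholder region

jobs:
  test:
    runs-on: [self-hosted, linux, x64]  # placeholder labels
    steps:
      - uses: actions/checkout@v3
      - run: cargo test --workspace
      - run: sccache --show-stats       # inspect hit/miss counts after the build
```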

Change checklist

  • I have performed a self-review of my own code
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • A changelog entry has been made in the appropriate crates

@galargh marked this pull request as ready for review on April 13, 2023 at 10:03

@thomaseizinger left a comment

Thanks!

Inline review comments on .github/workflows/ci.yml (resolved).

@thomaseizinger left a comment

Thanks! A few more comments :)

Inline review comments on .github/workflows/ci.yml and .github/workflows/interop-test.yml (resolved).

@thomaseizinger left a comment

Thank you for your help, much appreciated :)

@mergify (bot) merged commit 9d78331 into master on Apr 25, 2023
@mergify (bot) deleted the ci/self-hosted branch on Apr 25, 2023 at 14:57
@thomaseizinger

I believe since we merged this, our caches are no longer working, see https://github.com/libp2p/rust-libp2p/actions/runs/4808190866/jobs/8557865839#step:11:91 for example.

The caches are created here: https://github.com/libp2p/rust-libp2p/blob/master/.github/workflows/cache-factory.yml

Do we need to run these steps on the self-hosted runners too in order to access these caches?

@thomaseizinger commented Apr 26, 2023

This also seems to cause a much higher rate of random failures; I've just had three timeouts in a row:

@galargh Do you have an idea on how we can fix this? It doesn't seem to be very reliable unfortunately :(

@galargh commented Apr 26, 2023

I can see two kinds of errors happening here.

The first group is 500s from api.github.com. These look like they could be intermittent issues on GitHub's side. They did auto-resolve in an acceptable timeframe, so I'd be inclined to wait and see how often this occurs before investigating further.

Wed, 26 Apr 2023 11:51:59 GMT
Download action repository 'actions/cache@v3' (SHA:88522ab9f39a2ea568f7027eddc7d8d8bc9d59c8)
Wed, 26 Apr 2023 11:52:04 GMT
Warning: Failed to download action 'https://api.github.com/repos/actions/cache/tarball/88522ab9f39a2ea568f7027eddc7d8d8bc9d59c8'. Error: Response status code does not indicate success: 500 (Internal Server Error).
Wed, 26 Apr 2023 11:52:04 GMT
Warning: Back off 26.524 seconds before retry.
Wed, 26 Apr 2023 11:52:31 GMT
Run ./.github/actions/cargo-semver-checks

The second group is certainly more worrying, as it leads to job timeouts. It looks like the Updating index step sometimes hangs indefinitely when we call cargo.

Wed, 26 Apr 2023 11:52:32 GMT
Run cargo semver-checks check-release --package libp2p-swarm-derive --verbose
Wed, 26 Apr 2023 11:52:32 GMT
    Updating index
Wed, 26 Apr 2023 12:00:42 GMT
Error: The operation was canceled.

Do you know what exactly happens during the Updating index operation? Is there some log we could archive to get more insights?

We're not running Rust elsewhere on self-hosted runners yet, so I haven't investigated this specific issue. We did, however, face other network-related issues. E.g. we were being heavily rate limited by DockerHub because we're running behind a single NAT gateway, and we saw the official Go module proxy stop responding and eventually kill connections (we haven't been able to identify the root cause of that yet). Both of these issues were resolved by creating S3-backed read-through proxies to the respective services.
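For illustration only: the actual proxy endpoints live in the runner infrastructure, not in this repository, and the mirror URL below is a placeholder. Pointing the Docker daemon on a runner at a read-through registry mirror looks roughly like this:

```yaml
# Hypothetical runner-provisioning step, not part of this PR.
- name: Point Docker at the read-through registry mirror
  run: |
    echo '{ "registry-mirrors": ["https://dockerhub-mirror.example.internal"] }' | sudo tee /etc/docker/daemon.json
    sudo systemctl restart docker
```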

@thomaseizinger

The second group is certainly more worrying, as it leads to job timeouts. It looks like the Updating index step sometimes hangs indefinitely when we call cargo.

Wed, 26 Apr 2023 11:52:32 GMT
Run cargo semver-checks check-release --package libp2p-swarm-derive --verbose
Wed, 26 Apr 2023 11:52:32 GMT
    Updating index
Wed, 26 Apr 2023 12:00:42 GMT
Error: The operation was canceled.

Do you know what exactly happens during the Updating index operation? Is there some log we could archive to get more insights?

Updating index means that cargo is downloading the crates.io index, which is essentially this repository: https://github.com/rust-lang/crates.io-index

With Rust 1.68, there is now the sparse registry protocol, but it is not the default yet. See https://blog.rust-lang.org/2023/03/09/Rust-1.68.0.html#cargos-sparse-protocol.

Currently, cargo is cloning the above Git repository. I'd assume that GitHub has tuned that access on its own infrastructure, and now that we are using self-hosted runners we are accessing it from the outside.

Let's see if it helps if we set the CARGO_REGISTRIES_CRATES_IO_PROTOCOL=sparse environment variable.
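Concretely, that would look something like the sketch below in the workflow. CARGO_REGISTRIES_CRATES_IO_PROTOCOL is cargo's documented setting (honoured since Rust 1.68); the job name and runner labels are placeholders:

```yaml
jobs:
  semver-checks:
    runs-on: [self-hosted, linux, x64]   # placeholder labels
    env:
      # fetch the crates.io index over HTTP instead of cloning the git repository
      CARGO_REGISTRIES_CRATES_IO_PROTOCOL: sparse
    steps:
      - run: cargo semver-checks check-release --package libp2p-swarm-derive --verbose
```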

@thomaseizinger

I just realised that we already set this through dtolnay/rust-toolchain: https://github.com/libp2p/rust-libp2p/actions/runs/4808395536/jobs/8558331147#step:16:74

@galargh commented Apr 27, 2023

I found rust-lang/rust#64248, which led me to rust-lang/cargo#7662. Reading through the comments, I think we could try:

mergify bot pushed a commit that referenced this pull request on Apr 27, 2023:
This PR enables debug logging on requests from cargo to the registry in the semver-checks action (rust-lang/cargo#7662 (comment)). Hopefully, it will let us debug the network issue reported here: #3782 (comment)

Pull-Request: #3838.
@thomaseizinger

I've opened an issue with cargo semver-checks: obi1kenobi/cargo-semver-checks#443

@thomaseizinger

I believe since we merged this, our caches are no longer working, see libp2p/rust-libp2p/actions/runs/4808190866/jobs/8557865839#step:11:91 for example.

The caches are created here: master/.github/workflows/cache-factory.yml

Do we need to run these steps on the self-hosted runners too in order to access these caches?

@galargh Do you also have some input on this?

@galargh commented Apr 27, 2023

I believe since we merged this, our caches are no longer working, see libp2p/rust-libp2p/actions/runs/4808190866/jobs/8557865839#step:11:91 for example.
The caches are created here: master/.github/workflows/cache-factory.yml
Do we need to run these steps on the self-hosted runners too in order to access these caches?

@galargh Do you also have some input on this?

I looked into this. When restoring a cache, we take not only the cache/restore keys into account but also the cache paths. In the case of Rust, the cache paths would be the cargo home plus the paths to the targets. rust-cache uses absolute paths, which differ between hosted and self-hosted runners. I think the quickest way to fix it is to make sure we use the same paths on the self-hosted runners. Also, everyone would benefit from cache interop. I'll try to look into it tomorrow.
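For illustration (this is not the repository's actual cache-factory.yml setup): with the plain actions/cache action the cached paths are explicit, and they have to line up between the job that saved the cache and the job that restores it, which is why rust-cache's absolute paths differing between runner images can prevent restores even when the key matches.

```yaml
# Illustrative only; the paths and key are placeholders, not the project's real configuration.
- uses: actions/cache@v3
  with:
    path: |
      ~/.cargo/registry
      ~/.cargo/git
      target
    key: ${{ runner.os }}-cargo-${{ hashFiles('**/Cargo.lock') }}
    restore-keys: |
      ${{ runner.os }}-cargo-
```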

@thomaseizinger

I believe since we merged this, our caches are no longer working, see libp2p/rust-libp2p/actions/runs/4808190866/jobs/8557865839#step:11:91 for example.
The caches are created here: master/.github/workflows/cache-factory.yml
Do we need to run these steps on the self-hosted runners too in order to access these caches?

@galargh Do you also have some input on this?

I looked into this. When restoring a cache, we take not only the cache/restore keys into account but also the cache paths. In the case of Rust, the cache paths would be the cargo home plus the paths to the targets. rust-cache uses absolute paths, which differ between hosted and self-hosted runners. I think the quickest way to fix it is to make sure we use the same paths on the self-hosted runners. Also, everyone would benefit from cache interop. I'll try to look into it tomorrow.

Awesome, thank you!

@thomaseizinger

@galargh More funny timeouts :(

https://github.com/libp2p/rust-libp2p/actions/runs/4829313211/jobs/8604199371?pr=3746#step:3:44

@galargh commented Apr 28, 2023

@galargh More funny timeouts :(

https://github.com/libp2p/rust-libp2p/actions/runs/4829313211/jobs/8604199371?pr=3746#step:3:44

This one is from a GHA hosted runner (a Windows one), so it's likely unrelated to the ones we've been seeing.

BTW, the caches are now interoperable between hosted and self-hosted runners 🥳

@thomaseizinger

BTW, the caches are now interoperable between hosted and self-hosted runners 🥳

Exciting, thank you!

@galargh commented Apr 28, 2023

https://github.com/libp2p/rust-libp2p/actions/runs/4829816994/jobs/8605307310 <- this is a new one

I thought I already disabled auto-upgrades. I'll have a look if there's something else that might be calling apt-get.

edit: I previously disabled unattended updates, but apparently, that's not all. We should also disable periodic package list updates. I should be able to get to it later today before I start my holiday.
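For reference, a sketch of the kind of change this needs, written here as a workflow-style provisioning step; the real change lives in the runner provisioning in pl-strflt/tf-aws-gh-runner and may look different, and the config file name below is arbitrary:

```yaml
- name: Disable APT periodic package list updates
  run: |
    sudo tee /etc/apt/apt.conf.d/99-disable-periodic > /dev/null <<'EOF'
    APT::Periodic::Update-Package-Lists "0";
    APT::Periodic::Unattended-Upgrade "0";
    EOF
```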

@thomaseizinger

libp2p/rust-libp2p/actions/runs/4829816994/jobs/8605307310 <- this is a new one

I thought I already disabled auto-upgrades. I'll have a look if there's something else that might be calling apt-get.

edit: I previously disabled unattended updates, but apparently, that's not all. We should also disable periodic package list updates. I should be able to get to it later today before I start my holiday.

Could it be that the problem on self-hosted runners is that the individual jobs concurrently try to run apt-get install? In any case, we are going to remove this step very soon, so it is not a big problem. We only install protoc for legacy reasons (semver-checking against old packages that still need protoc to build).

@thomaseizinger

With the caches working, our CI is flying along! 9m for all jobs: https://github.com/libp2p/rust-libp2p/actions/runs/4831101435

Amazing.

@galargh commented Apr 28, 2023

Could it be that the problem on self-hosted runners is that the individual jobs concurrently try to run apt-get install? In any case, we are going to remove this step very soon, so it is not a big problem. We only install protoc for legacy reasons (semver-checking against old packages that still need protoc to build).

Not really, they're all running on different instances. I disabled periodic package list updates now - hopefully, it was just that.

Thank you for all your help finding issues and fixing them, we're really making things better for everyone here :) Greatly appreciated 🙇

@thomaseizinger

Could it be that the problem on self-hosted runners is that the individual jobs concurrently try to run apt-get install? In any case, we are going to remove this step very soon, so it is not a big problem. We only install protoc for legacy reasons (semver-checking against old packages that still need protoc to build).

Not really, they're all running on different instances. I disabled periodic package list updates now - hopefully, it was just that.

Great, thank you!

Thank you for all your help finding issues and fixing them, we're really making things better for everyone here :) Greatly appreciated 🙇

Haha, all good! Thank you for being so responsive and tuning the runners!
