ci: run test on self-hosted runners #3782
Conversation
Thanks!
Thanks! A few more comments :)
Thank you for your help, much appreciated :)
I believe that since we merged this, our caches are no longer working, see https://github.com/libp2p/rust-libp2p/actions/runs/4808190866/jobs/8557865839#step:11:91 for example. The caches are created here: https://github.com/libp2p/rust-libp2p/blob/master/.github/workflows/cache-factory.yml Do we need to run these steps on the self-hosted runners too in order to access these caches?
This also seems to cause a much higher rate of random failures; I've just had three timeouts in a row:
@galargh Do you have an idea of how we can fix this? It doesn't seem to be very reliable, unfortunately :(
I can see 2 kinds of errors happening here. The first group is 500s from
The second group is certainly more worrying, as it leads to job timeouts. It looks like
Do you know what exactly happens during
We're not running Rust elsewhere on self-hosted yet, so I haven't investigated this specific issue. We did, however, face other network-related issues. E.g. we were being heavily rate limited by DockerHub because we're running behind a single NAT Gateway, and we saw the official Go Modules Proxy stop responding and eventually kill connections (we weren't able to identify the root cause of this yet). Both of these issues were resolved by creating S3-backed read-through proxies to the respective services.
Updating index means that cargo is downloading the crates.io index, which is essentially this: https://github.com/rust-lang/crates.io-index With Rust 1.68, there is now the sparse-registry protocol, but it is not the default yet. See https://blog.rust-lang.org/2023/03/09/Rust-1.68.0.html#cargos-sparse-protocol. Currently, it is cloning the above Git repository. I'd assume that GitHub itself has tuned that on its own infrastructure, and now that we are using self-hosted runners, we are accessing it from the outside. Let's see if it helps if we set the
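For reference, opting into the sparse protocol on Rust 1.68+ (before it becomes the default) only takes a small bit of cargo config; this is a sketch of one way to do it, not necessarily how we'd wire it into our workflows:

```toml
# .cargo/config.toml — fetch the crates.io index over HTTP range requests
# instead of cloning the full crates.io-index git repository (Rust 1.68+)
[registries.crates-io]
protocol = "sparse"
```

The same thing can be set per-job via the `CARGO_REGISTRIES_CRATES_IO_PROTOCOL=sparse` environment variable, which is usually easier in CI than committing a config file.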
I just realised that we already set this through
I found rust-lang/rust#64248 which led me to rust-lang/cargo#7662. Reading through the comments, I think we could try:
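One option discussed in those cargo threads is telling cargo to shell out to the system `git` binary for network fetches instead of using the built-in libgit2, which has been reported to behave better on flaky networks. A sketch, assuming it isn't already configured elsewhere in our setup:

```toml
# .cargo/config.toml — use the `git` CLI for index/dependency fetches
# rather than the built-in libgit2 implementation
[net]
git-fetch-with-cli = true
```

Equivalently, setting `CARGO_NET_GIT_FETCH_WITH_CLI=true` in the job environment enables the same behaviour without touching the repository.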
This PR enables debug logging on requests from cargo to the registry in the semver-checks action (rust-lang/cargo#7662 (comment)). Hopefully, it will let us debug the network issue reported here: #3782 (comment) Pull-Request: #3838.
I've opened an issue with
@galargh Do you also have some input on this?
I looked into this. When restoring a cache, we take not only the cache/restore key into account but also the cache paths. In the case of Rust, the cache paths would be the cargo home plus the paths to the targets. rust-cache uses absolute paths, which differ between hosted and self-hosted runners. I think the quickest way to fix it would be to make sure we use the same paths on the self-hosted runners. Also, everyone would benefit from cache interop. I'll try to look into it tomorrow.
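As I understand it, the actions/cache "version" is derived from the set of cached paths (plus the compression tool), so hosted and self-hosted runners only share cache entries when those absolute paths match on both. A hypothetical workflow sketch of pinning the paths; the names and versions here are illustrative, not our actual config:

```yaml
env:
  # Pin CARGO_HOME to one absolute path so the derived cache version
  # is identical on hosted and self-hosted runners.
  CARGO_HOME: /home/runner/.cargo

jobs:
  test:
    runs-on: [self-hosted, linux]
    steps:
      - uses: actions/checkout@v3
      - uses: actions/cache@v3
        with:
          # Same absolute paths on every runner type.
          path: |
            /home/runner/.cargo/registry
            ./target
          key: cargo-${{ runner.os }}-${{ hashFiles('**/Cargo.lock') }}
```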
Awesome, thank you!
This one is from a GHA-hosted runner (a Windows one), so it's likely unrelated to the others we've been seeing. BTW, the caches are now interoperable between hosted and self-hosted runners 🥳
Exciting, thank you!
https://github.com/libp2p/rust-libp2p/actions/runs/4829816994/jobs/8605307310 <- this is a new one. I thought I had already disabled auto-upgrades. I'll have a look to see if there's something else that might be calling apt-get. edit: I previously disabled unattended updates, but apparently that's not all. We should also disable periodic package list updates. I should be able to get to it later today before I start my holiday.
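For the record, disabling unattended upgrades alone still leaves the daily `apt-get update` timer active; both knobs live in apt's periodic configuration. A sketch of what that looks like on Ubuntu (the exact file name can vary by distro/image):

```
# /etc/apt/apt.conf.d/20auto-upgrades
APT::Periodic::Update-Package-Lists "0";   # stop the daily apt-get update
APT::Periodic::Unattended-Upgrade "0";     # keep unattended upgrades off
```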
Could it be that the problem on self-hosted runners is that the individual jobs concurrently try to make an
With the caches working, our CI is flying along! 9m for all jobs: https://github.com/libp2p/rust-libp2p/actions/runs/4831101435 Amazing.
Not really, they're all running on different instances. I disabled periodic package list updates now - hopefully, it was just that. Thank you for all your help finding issues and fixing them, we're really making things better for everyone here :) Greatly appreciated 🙇
Great, thank you!
Haha, all good! Thank you for being so responsive and tuning the runners!
Description
This PR moves the Test jobs from the Continuous Integration workflow and the Interoperability Testing job to self-hosted runners.
The former are moved to machines backed by c5.large/m5.large, c5.xlarge/m5.xlarge, or c5.2xlarge AWS instances, while the latter is moved to a c5.4xlarge instance.
The self-hosted runners are set up using https://github.com/pl-strflt/tf-aws-gh-runner.
The jobs are being moved to self-hosted runners because we're exhausting the hosted GHA usage limits.
Notes & open questions
Change checklist