[CI] Don't run E2E tests on self-hosted CUDA in Nightly #14041
The runner seems to be broken; don't run the tests until it's fixed.
I believe someone from @intel/llvm-reviewers-cuda (maybe @npmiller?) has access to the runner, and I expect them to fix it and then revert this PR.
@@ -74,13 +74,6 @@ jobs:
target_devices: opencl:cpu
tests_selector: e2e

- name: Self-hosted CUDA
Instead of removing this code, can we just comment it out?
I think it's better to remove it. If the runner is unrecoverable or nobody is willing to fix it, then there is no reason to keep "dead" commented-out code in the repo.
The Linux kernel and headers were updated on the CUDA runner (I don't know how), which caused the Nvidia driver to fail. I got the following error when trying to install the Nvidia driver for CUDA 12.1: https://forums.developer.nvidia.com/t/linux-6-7-3-545-29-06-550-40-07-error-modpost-gpl-incompatible-module-nvidia-ko-uses-gpl-only-symbol-rcu-read-lock/280908
Instead, as an experiment, I tried installing the CUDA 12.4 libraries and the recommended driver, and it seems to work fine: https://github.com/intel/llvm/actions/runs/9360554942/job/25813680144 (except for the known E2E failure: #13661).
I'll let @npmiller decide if we can keep CUDA 12.4 on the CI. If yes, someone needs to update the docker script (https://github.com/intel/llvm/blob/sycl/devops/containers/ubuntu2204_build.Dockerfile#L1) and disable the failing E2E test.
The diff looks good, but I can't speak to the rationale for the removal. @npmiller, would you like to have a look at this?
Nvidia released the 12.5 dev docker image 5 days ago. I'm trying to build it locally now. If that succeeds, we can go straight to 12.5. I've already checked that 12.5 passes all E2E tests, and with the updated driver that the docker image should have, #13661 is fixed.
Unfortunately, I don't believe any of us have access to the runners, so I don't think we can fix or investigate them. Thanks @uditagarwal97 for having a look. I think the 12.1 docker image should be able to run fine on the 12.4 driver that's on the runner, so if upgrading the runner's driver solves the issues you were seeing, it should be all good even without updating the docker image.
Testing on a runner with a 12.4 driver will result in the test failure described here: #13661 (comment)
Can you please get to the bottom of this so that Codeplay maintains it rather than @uditagarwal97?
The latest nightly failed due to infrastructure issues with the runner again. Resurrecting this PR to remove the faulty tasks until the Codeplay folks get access to the runner and assume ownership of that part of the CI.