Integrate AMD GPU in CI/CD environment #26007
Conversation
The documentation is not available anymore as the PR was closed or merged.
@@ -0,0 +1,29 @@
FROM rocm/pytorch:rocm5.6_ubuntu20.04_py3.8_pytorch_2.0.1
There should be a workflow file that builds this image. So far I don't see it, and I am wondering where transformers-pytorch-latest-amdgpu-push-ci is created/pushed. Do you do this manually somewhere?
This image is built and published by AMD directly on the Docker Hub.
I can add a job in the build-docker-images.yml workflow to build transformers-pytorch-latest-amdgpu-push-ci. I pushed it to our Docker Hub organisation myself 🙂
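For reference, such a job could look roughly like the sketch below. This is illustrative only: the job name, Dockerfile context path, and secret names are assumptions, not taken from the actual build-docker-images.yml workflow.

```yaml
# Hypothetical sketch of a build job for build-docker-images.yml.
# Job name, context path, and secret names are illustrative assumptions.
latest-pytorch-amdgpu-push-ci:
  name: "Latest PyTorch + AMD GPU (push CI)"
  runs-on: ubuntu-latest
  steps:
    - name: Check out the repository
      uses: actions/checkout@v3
    - name: Login to Docker Hub
      uses: docker/login-action@v2
      with:
        username: ${{ secrets.DOCKERHUB_USERNAME }}
        password: ${{ secrets.DOCKERHUB_PASSWORD }}
    - name: Build and push the image
      uses: docker/build-push-action@v4
      with:
        context: ./docker/transformers-pytorch-amd-gpu
        push: true
        tags: huggingface/transformers-pytorch-latest-amdgpu-push-ci
```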
fix dockerfile
Co-authored-by: Felix Marty <felix@hf.co>
AMDGPU CI lives in another workflow
@ydshieh @LysandreJik I think we are in good shape for review and merging. What we did:
What we cannot ensure as of today:
Hi @mfuntowicz Looking at the runs in https://github.com/huggingface/transformers/actions/workflows/self-push-amd.yml, you will see no test job (Model test) is being triggered (as no test is being collected). Also, the Slack report won't work, as the tag is still using …
Apply review suggestions moving from `x-amdgpu` to `x-gpu` introducing `amd-gpu` and `miXXX` labels.
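To illustrate the resulting label scheme: instead of duplicating jobs under `x-amdgpu` names, the `x-gpu` job names are kept and the AMD runners are selected through runner labels. A hypothetical `runs-on` line might look like this (`amd-gpu` and `mi210` are the labels introduced here; the job name and other labels are assumptions):

```yaml
# Illustrative only: selecting an AMD runner via labels rather than
# separate job definitions. `amd-gpu` is the vendor label and `mi210`
# the machine-type label; `self-hosted` and `single-gpu` are assumed.
run_models_gpu:
  runs-on: [self-hosted, single-gpu, amd-gpu, mi210]
```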
🔥
@LysandreJik in case you want to take a final look :-)
Merging now so @mfuntowicz can show progress to the AMD team today.
Ok LGTM
* Add a Dockerfile for PyTorch + ROCm based on official AMD released artifact
* Add a new artifact single-amdgpu testing on main
* Attempt to test the workflow without merging.
* Changed BERT to check if things are triggered
* Meet the dependencies graph on workflow
* Revert BERT changes
* Add check_runners_amdgpu to correctly mount and check availability
* Rename setup to setup_gpu for CUDA and add setup_amdgpu for AMD
* Fix all the needs.setup -> needs.setup_[gpu|amdgpu] dependencies
* Fix setup dependency graph to use check_runner_amdgpu
* Let's do the runner status check only on AMDGPU target
* Update the Dockerfile.amd to put ourselves in / rather than /var/lib
* Restore the whole setup for CUDA too.
* Let's redisable them
* Change BERT to trigger tests
* Restore BERT
* Add torchaudio with rocm 5.6 to AMD Dockerfile (huggingface#26050): fix dockerfile (Co-authored-by: Felix Marty <felix@hf.co>)
* Place AMD GPU tests in a separate workflow (correct branch) (huggingface#26105): AMDGPU CI lives in another workflow
* Fix invalid job name in dependencies.
* Remove tests multi-amdgpu for now.
* Use single-amdgpu
* Use --net=host for now.
* Remote host networking.
* Removed duplicated check_runners_amdgpu step
* Let's tag machine-types with mi210 for now.
* Machine type should be only mi210
* Remove unnecessary push.branches item
* Apply review suggestions moving from `x-amdgpu` to `x-gpu` introducing `amd-gpu` and `miXXX` labels.
* Remove amdgpu from step names.
* finalize
* delete

---------

Co-authored-by: fxmarty <9808326+fxmarty@users.noreply.github.com>
Co-authored-by: Felix Marty <felix@hf.co>
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
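As a sketch of what the `check_runners_amdgpu` availability check described in the list above might boil down to: the job mounts the GPU devices into the container and verifies PyTorch can see them. The image name and device flags below are assumptions based on the standard ROCm device nodes `/dev/kfd` and `/dev/dri`.

```yaml
# Hypothetical sketch of the runner-availability check, assuming the
# standard ROCm device nodes are passed through to the container.
check_runners_amdgpu:
  runs-on: [self-hosted, single-gpu, amd-gpu, mi210]
  container:
    image: huggingface/transformers-pytorch-latest-amdgpu-push-ci
    options: --device /dev/kfd --device /dev/dri
  steps:
    - name: Check the GPU is visible to PyTorch
      # ROCm builds of PyTorch expose the HIP backend through the
      # torch.cuda API, so this check works for AMD GPUs as well.
      run: python3 -c "import torch; assert torch.cuda.is_available()"
```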