Integrate AMD GPU in CI/CD environment #26007

Merged
merged 31 commits into main from ci-amdgpu on Sep 20, 2023

Conversation

mfuntowicz (Member)

No description provided.

HuggingFaceDocBuilderDev commented Sep 6, 2023

The documentation is not available anymore as the PR was closed or merged.

@@ -0,0 +1,29 @@
FROM rocm/pytorch:rocm5.6_ubuntu20.04_py3.8_pytorch_2.0.1
Collaborator

There should be a workflow file that builds this image. So far I don't see it, and I am wondering where transformers-pytorch-latest-amdgpu-push-ci is created and pushed. Do you do this manually somewhere?

Member Author

This image is built and published by AMD directly on the Docker Hub.

I can add a job in the build-docker-images.yml workflow to build transformers-pytorch-latest-amdgpu-push-ci.
It has been pushed to our Docker Hub organisation by myself 🙂
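
For reference, a minimal sketch of what such a job in build-docker-images.yml could look like, assuming the standard docker/build-push-action and a Dockerfile kept under docker/transformers-pytorch-amd-gpu. The job name, paths, secrets, and image tag below are illustrative assumptions, not taken from this PR:

```yaml
  # Hypothetical job sketch; names, paths and secrets are assumptions.
  latest-pytorch-amd:
    name: "Latest PyTorch (AMD) [push-ci]"
    runs-on: ubuntu-latest
    steps:
      - name: Check out code
        uses: actions/checkout@v3

      - name: Login to Docker Hub
        uses: docker/login-action@v3
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_PASSWORD }}

      - name: Build and push
        uses: docker/build-push-action@v5
        with:
          context: ./docker/transformers-pytorch-amd-gpu   # assumed Dockerfile location
          push: true
          tags: huggingface/transformers-pytorch-latest-amdgpu-push-ci
```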

.github/workflows/self-push.yml (outdated review thread, resolved)
fix dockerfile

Co-authored-by: Felix Marty <felix@hf.co>
mfuntowicz (Member Author) commented Sep 15, 2023

@ydshieh @LysandreJik I think we are in good shape for review and merging.

What we did:

  • Added custom runners with the tags docker-gpu, single-gpu, amd-gpu, mi210
  • Provided a custom PyTorch GPU Dockerfile with the AMD (ROCm) dependencies
  • Created a new self-push-amd.yml workflow file for everything related to AMD testing (see the sketch after this list)
  • Validated the workflow against a simple BERT modification

What we cannot ensure as of today:

  • That all the current tests executed on main will be green 😅
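
A minimal sketch of how a test job in self-push-amd.yml could target those runners. The job name, matrix values, and container options below are assumptions for illustration, not the exact contents of the workflow:

```yaml
  run_tests_single_gpu:
    name: Model tests (AMD, single GPU)
    strategy:
      matrix:
        machine_type: [single-gpu]
    # Runner selection via the labels introduced in this PR.
    runs-on: [self-hosted, docker-gpu, amd-gpu, '${{ matrix.machine_type }}', mi210]
    container:
      image: huggingface/transformers-pytorch-latest-amdgpu-push-ci
      # Assumed ROCm device mounts; the actual options may differ.
      options: --device /dev/kfd --device /dev/dri --shm-size "16gb"
    steps:
      - name: Run BERT tests as a smoke check
        run: python3 -m pytest -v tests/models/bert
```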

ydshieh self-assigned this Sep 18, 2023
ydshieh (Collaborator) commented Sep 18, 2023

Hi @mfuntowicz

Looking at the runs in https://github.com/huggingface/transformers/actions/workflows/self-push-amd.yml, you will see that no test job (Model test) is being triggered, as no tests are being collected.

Also, the Slack report won't work, as the tag is still using single-amdgpu instead of single-gpu.
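
In other words, the AMD workflow should reuse the same machine_type values as the CUDA workflow so the reporting tooling can pick the jobs up. A minimal sketch (variable names illustrative, not copied from the workflow):

```yaml
    strategy:
      matrix:
        # The Slack reporting step keys off this value, so it must use the
        # CUDA-style names (single-gpu / multi-gpu), not single-amdgpu.
        machine_type: [single-gpu, multi-gpu]
```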

ydshieh changed the title from "[WIP] Integrate AMDGPU in CI/CD environment" to "Integrate AMD GPU in CI/CD environment" on Sep 20, 2023
ydshieh (Collaborator) left a comment

🔥

ydshieh requested a review from LysandreJik September 20, 2023 09:26
ydshieh (Collaborator) commented Sep 20, 2023

@LysandreJik in case you want to take a final look :-)

ydshieh (Collaborator) commented Sep 20, 2023

Merging now so @mfuntowicz can show progress to the AMD team today.

LysandreJik (Member) left a comment

Ok LGTM

ydshieh merged commit 2d71307 into main Sep 20, 2023
ydshieh deleted the ci-amdgpu branch September 20, 2023 12:48
parambharat pushed a commit to parambharat/transformers that referenced this pull request Sep 26, 2023
* Add a Dockerfile for PyTorch + ROCm based on official AMD released artifact

* Add a new artifact single-amdgpu testing on main

* Attempt to test the workflow without merging.

* Changed BERT to check if things are triggered

* Meet the dependencies graph on workflow

* Revert BERT changes

* Add check_runners_amdgpu to correctly mount and check availability

* Rename setup to setup_gpu for CUDA and add setup_amdgpu for AMD

* Fix all the needs.setup -> needs.setup_[gpu|amdgpu] dependencies

* Fix setup dependency graph to use check_runner_amdgpu

* Let's do the runner status check only on AMDGPU target

* Update the Dockerfile.amd to put ourselves in / rather than /var/lib

* Restore the whole setup for CUDA too.

* Let's redisable them

* Change BERT to trigger tests

* Restore BERT

* Add torchaudio with rocm 5.6 to AMD Dockerfile (huggingface#26050)

fix dockerfile

Co-authored-by: Felix Marty <felix@hf.co>

* Place AMD GPU tests in a separate workflow (correct branch) (huggingface#26105)

AMDGPU CI lives in another workflow

* Fix invalid job name in dependencies.

* Remove tests multi-amdgpu for now.

* Use single-amdgpu

* Use --net=host for now.

* Remove host networking.

* Removed duplicated check_runners_amdgpu step

* Let's tag machine-types with mi210 for now.

* Machine type should be only mi210

* Remove unnecessary push.branches item

* Apply review suggestions moving from `x-amdgpu` to `x-gpu` introducing `amd-gpu` and `miXXX` labels.

* Remove amdgpu from step names.

* finalize

* delete

---------

Co-authored-by: fxmarty <9808326+fxmarty@users.noreply.github.com>
Co-authored-by: Felix Marty <felix@hf.co>
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>