Integrate AMD GPU in CI/CD environment #26007

Merged
merged 31 commits into main from ci-amdgpu on Sep 20, 2023

Conversation

mfuntowicz (Member)

No description provided.

HuggingFaceDocBuilderDev commented Sep 6, 2023

The documentation is not available anymore as the PR was closed or merged.

@@ -0,0 +1,29 @@
FROM rocm/pytorch:rocm5.6_ubuntu20.04_py3.8_pytorch_2.0.1
Collaborator

There should be a workflow file that builds this image. So far I don't see it, and I am wondering where transformers-pytorch-latest-amdgpu-push-ci is created and pushed. Do you do this manually somewhere?

Member Author

This image is built and published by AMD directly on the Docker Hub.

I can add a job in the build-docker-images.yml workflow to build transformers-pytorch-latest-amdgpu-push-ci.
It has been pushed to our Docker Hub organisation by myself 🙂
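
For reference, a minimal sketch of what such a job in build-docker-images.yml could look like, assuming the standard docker/build-push-action and a Dockerfile kept under docker/transformers-pytorch-amd-gpu. The job name, paths, secrets, and image tag below are illustrative assumptions, not taken from this PR:

```yaml
  # Hypothetical job sketch; names, paths and secrets are assumptions.
  latest-pytorch-amd:
    name: "Latest PyTorch (AMD) [push-ci]"
    runs-on: ubuntu-latest
    steps:
      - name: Check out code
        uses: actions/checkout@v3

      - name: Login to Docker Hub
        uses: docker/login-action@v3
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_PASSWORD }}

      - name: Build and push
        uses: docker/build-push-action@v5
        with:
          context: ./docker/transformers-pytorch-amd-gpu   # assumed Dockerfile location
          push: true
          tags: huggingface/transformers-pytorch-latest-amdgpu-push-ci
```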

.github/workflows/self-push.yml (outdated review thread, resolved)
fix dockerfile

Co-authored-by: Felix Marty <felix@hf.co>
mfuntowicz (Member Author) commented Sep 15, 2023

@ydshieh @LysandreJik I think we are in good shape for review and merging.

What we did:

  • Added custom runners with the tags docker-gpu, single-gpu, amd-gpu, mi210
  • Provided a custom PyTorch GPU Dockerfile with the AMD (ROCm) dependencies
  • Created a new self-push-amd.yml workflow file for everything related to AMD testing (see the sketch after this list)
  • Validated the workflow against a simple BERT modification

What we cannot ensure as of today:

  • That all the current tests executed on main will be green 😅
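
A minimal sketch of how a test job in self-push-amd.yml could target those runners. The job name, matrix values, and container options below are assumptions for illustration, not the exact contents of the workflow:

```yaml
  run_tests_single_gpu:
    name: Model tests (AMD, single GPU)
    strategy:
      matrix:
        machine_type: [single-gpu]
    # Runner selection via the labels introduced in this PR.
    runs-on: [self-hosted, docker-gpu, amd-gpu, '${{ matrix.machine_type }}', mi210]
    container:
      image: huggingface/transformers-pytorch-latest-amdgpu-push-ci
      # Assumed ROCm device mounts; the actual options may differ.
      options: --device /dev/kfd --device /dev/dri --shm-size "16gb"
    steps:
      - name: Run BERT tests as a smoke check
        run: python3 -m pytest -v tests/models/bert
```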

ydshieh self-assigned this Sep 18, 2023
ydshieh (Collaborator) commented Sep 18, 2023

Hi @mfuntowicz

Looking at the runs in https://github.com/huggingface/transformers/actions/workflows/self-push-amd.yml, you will see that no test job (Model test) is being triggered, as no tests are being collected.

Also, the Slack report won't work, as the tag is still using single-amdgpu instead of single-gpu.
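
In other words, the AMD workflow should reuse the same machine_type values as the CUDA workflow so the reporting tooling can pick the jobs up. A minimal sketch (variable names illustrative, not copied from the workflow):

```yaml
    strategy:
      matrix:
        # The Slack reporting step keys off this value, so it must use the
        # CUDA-style names (single-gpu / multi-gpu), not single-amdgpu.
        machine_type: [single-gpu, multi-gpu]
```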

ydshieh changed the title from "[WIP] Integrate AMDGPU in CI/CD environment" to "Integrate AMD GPU in CI/CD environment" on Sep 20, 2023
ydshieh (Collaborator) left a comment

🔥

ydshieh requested a review from LysandreJik September 20, 2023 09:26
ydshieh (Collaborator) commented Sep 20, 2023

@LysandreJik in case you want to take a final look :-)

ydshieh (Collaborator) commented Sep 20, 2023

Merging now so @mfuntowicz can show progress to the AMD team today.

LysandreJik (Member) left a comment

Ok LGTM

ydshieh merged commit 2d71307 into main Sep 20, 2023
ydshieh deleted the ci-amdgpu branch September 20, 2023 12:48
parambharat pushed a commit to parambharat/transformers that referenced this pull request Sep 26, 2023
* Add a Dockerfile for PyTorch + ROCm based on official AMD released artifact

* Add a new artifact single-amdgpu testing on main

* Attempt to test the workflow without merging.

* Changed BERT to check if things are triggered

* Meet the dependencies graph on workflow

* Revert BERT changes

* Add check_runners_amdgpu to correctly mount and check availability

* Rename setup to setup_gpu for CUDA and add setup_amdgpu for AMD

* Fix all the needs.setup -> needs.setup_[gpu|amdgpu] dependencies

* Fix setup dependency graph to use check_runner_amdgpu

* Let's do the runner status check only on AMDGPU target

* Update the Dockerfile.amd to put ourselves in / rather than /var/lib

* Restore the whole setup for CUDA too.

* Let's redisable them

* Change BERT to trigger tests

* Restore BERT

* Add torchaudio with rocm 5.6 to AMD Dockerfile (huggingface#26050)

fix dockerfile

Co-authored-by: Felix Marty <felix@hf.co>

* Place AMD GPU tests in a separate workflow (correct branch) (huggingface#26105)

AMDGPU CI lives in another workflow

* Fix invalid job name in dependencies.

* Remove tests multi-amdgpu for now.

* Use single-amdgpu

* Use --net=host for now.

* Remove host networking.

* Removed duplicated check_runners_amdgpu step

* Let's tag machine-types with mi210 for now.

* Machine type should be only mi210

* Remove unnecessary push.branches item

* Apply review suggestions moving from `x-amdgpu` to `x-gpu` introducing `amd-gpu` and `miXXX` labels.

* Remove amdgpu from step names.

* finalize

* delete

---------

Co-authored-by: fxmarty <9808326+fxmarty@users.noreply.github.com>
Co-authored-by: Felix Marty <felix@hf.co>
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>