Introduce TPU pod launcher #815

Closed
wants to merge 29 commits into from

@muellerzr (Collaborator) commented Nov 2, 2022

This is a heavy proof of concept under active development. It is currently waiting on pytorch/xla#4149 to see whether we can push forward, but the PR is up so that the community knows it's being worked on and is almost there :)

Proposed API:

accelerate launch now allows for a configured pod setup through three new params/config items:

  • use_cluster: whether to use a TPU cluster
  • vm: mimics xla_dist's vm argument; a list of individual compute VM names, for when you are not using an instance group (generally not needed)
  • env: a list of environment variables to set on each of the compute VM instances
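
As a rough illustration of how these three items could map onto an xla_dist invocation, here is a minimal sketch. It is not the code in this PR: `build_pod_command` is a hypothetical helper, `my-pod` is a made-up TPU name, and the `--tpu=`, `--vm=` and `--env=` flag spellings are assumed from xla_dist's command-line interface.

```python
from typing import List, Optional, Sequence


def build_pod_command(
    tpu_name: str,
    script: str,
    script_args: Sequence[str] = (),
    vm: Optional[Sequence[str]] = None,
    env: Optional[Sequence[str]] = None,
) -> List[str]:
    """Hypothetical helper: assemble an xla_dist-style command line."""
    cmd = ["python", "-m", "torch_xla.distributed.xla_dist", f"--tpu={tpu_name}"]
    # Assumed: one --vm flag per compute VM name (only when no instance group is used).
    for name in vm or []:
        cmd.append(f"--vm={name}")
    # Assumed: one --env flag per KEY=VALUE pair, set on every compute VM.
    for pair in env or []:
        if "=" not in pair:
            raise ValueError(f"expected KEY=VALUE, got {pair!r}")
        cmd.append(f"--env={pair}")
    # Everything after "--" is the command to run on each VM.
    cmd += ["--", "python", script, *script_args]
    return cmd


if __name__ == "__main__":
    print(" ".join(build_pod_command(
        "my-pod", "myscript.py", ["--arg1"], env=["XLA_USE_BF16=1"]
    )))
```

Returning a list of arguments rather than one string keeps the result usable with `subprocess.run` without shell quoting.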

Currently only non-Docker setups are supported, as GCP doesn't yet support Docker on the larger pods.

To launch a script on a TPU pod, the API will look like this:

Fully configured:

accelerate launch myscript.py --arg1 --arg2 ...

No configuration:

accelerate launch --tpu --use_cluster myscript.py --arg1 --arg2 ...
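
To make the difference between the two invocations concrete, the sketch below parses both forms with `argparse` and falls back to saved configuration when a flag is absent. It is only an illustration under stated assumptions: `parse_launch_args` and `resolve_settings` are hypothetical names, the config is modelled as a plain dict, and none of this is taken from the actual accelerate launcher.

```python
import argparse
from typing import Any, Dict, List


def parse_launch_args(argv: List[str]) -> argparse.Namespace:
    """Hypothetical parser for the two launch forms shown above."""
    parser = argparse.ArgumentParser(prog="launch-sketch")
    parser.add_argument("--tpu", action="store_true")
    parser.add_argument("--use_cluster", action="store_true")
    parser.add_argument("training_script")
    # Everything after the script name is passed through to the script untouched.
    parser.add_argument("training_script_args", nargs=argparse.REMAINDER)
    return parser.parse_args(argv)


def resolve_settings(args: argparse.Namespace, config: Dict[str, Any]) -> Dict[str, Any]:
    """A flag given on the command line wins; otherwise use the saved config value."""
    return {
        "tpu": args.tpu or bool(config.get("tpu", False)),
        "use_cluster": args.use_cluster or bool(config.get("use_cluster", False)),
        "script": args.training_script,
        "script_args": list(args.training_script_args),
    }


if __name__ == "__main__":
    # "Fully configured": the pod settings come from the saved config.
    saved = {"tpu": True, "use_cluster": True}
    print(resolve_settings(parse_launch_args(["myscript.py", "--arg1", "--arg2"]), saved))
    # "No configuration": the pod settings come from the flags.
    flags = ["--tpu", "--use_cluster", "myscript.py", "--arg1", "--arg2"]
    print(resolve_settings(parse_launch_args(flags), {}))
```

Under these assumptions both calls resolve to the same settings, which is the point of the two forms.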

TODO:

Write tests

Closes #501 and closes #471

@muellerzr muellerzr added the enhancement New feature or request label Nov 2, 2022
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@muellerzr muellerzr marked this pull request as ready for review November 15, 2022 21:59
@muellerzr muellerzr changed the title [Do not merge] Introduce TPU pod launcher Introduce TPU pod launcher Nov 15, 2022
@sgugger (Collaborator) left a comment

Very nice work, thanks a lot for working on this!

src/accelerate/commands/config/cluster.py (outdated, resolved)
@pacman100 (Contributor) left a comment

Great Work 🤗! Left a comment.

src/accelerate/commands/config/cluster.py (resolved)
@huggingface huggingface deleted a comment from github-actions bot Dec 11, 2022
@huggingface huggingface deleted a comment from github-actions bot Jan 13, 2023
@huggingface huggingface deleted a comment from github-actions bot Feb 7, 2023
@muellerzr muellerzr closed this Feb 8, 2023
@muellerzr muellerzr deleted the tpu-pod-launch branch March 6, 2023 14:32
Linked issues (successfully merging this pull request may close them):

  • Questions wrt training on TPU Pod
  • Using Accelerate with TPU Pod VM like v3-32
5 participants