Hello accelerate team,
I'm looking to pretrain on a v4-32 TPU pod, using an HF dataset and the HF Trainer. I have no problems running on a single TPU.
I already found issue #501 and its answer https://github.com/huggingface/accelerate/issues/501 , but it's two years old. I successfully installed accelerate and xla on all workers; however, step 2 seems to require the file xla_dist.py, which no longer exists on the xla master branch. What are the steps to train on TPU pods, then? Thanks in advance!
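For context, here is the kind of per-worker entry point I'd expect to run on each TPU VM instead of going through xla_dist.py. This is only a minimal sketch based on my reading of the current torch_xla multiprocessing API (xmp.spawn); the script name train_xla.py and the gcloud --worker=all launch pattern are my assumptions, not documented accelerate steps:

```python
# Minimal per-worker entry point sketch (my assumption of the current
# torch_xla API, not an official accelerate recipe). The same script is
# launched on every TPU VM of the pod, e.g. with:
#   gcloud compute tpus tpu-vm ssh <tpu-name> --zone=<zone> --worker=all \
#       --command="python3 train_xla.py"
import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp


def _mp_fn(index):
    device = xm.xla_device()
    # A trivial collective to confirm all pod workers are wired together:
    # each process contributes a 1, so the sum equals the global world size.
    ones = torch.ones(1, device=device)
    total = xm.all_reduce(xm.REDUCE_SUM, ones)
    xm.mark_step()  # flush the lazy XLA graph
    print(f"ordinal {xm.get_ordinal()}: world size = {int(total.item())}")


if __name__ == "__main__":
    # Under the PJRT runtime, xmp.spawn determines the process count itself.
    xmp.spawn(_mp_fn, args=())
```

If this per-worker pattern is right, I'd assume the real training script (Trainer or accelerate launch) just replaces the body of _mp_fn, but I'd appreciate confirmation of the intended steps.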
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.