[fix] Fix OOM when Megatron loads a large model by having only rank 0 load weights #330
Conversation
Hi @uygnef, nice catch!
What do you mean by "actor attempted to upload a policy"?
Yes, we have trained several times (>5) with the same setup, and it did not happen again. Do you have any suggestions?
So this issue may be related to PP? Are you using VPP?
Based on this issue, I believe it's likely an NCCL bug and not related to PP or VPP. I'm currently using PP, but I haven't used VPP. Here are the setup details:
Problem
Currently, when the Megatron worker loads a model, every rank loads the full checkpoint (ckpt). For large models, this often causes out-of-memory (OOM) errors, since host memory usage scales with the number of ranks per node.
Solution
To address this, we've modified the loading process so that only rank 0 loads the actual model weights from the checkpoint; the other ranks receive them afterwards. This significantly reduces peak memory usage during model loading and prevents OOM issues. A sketch of the pattern is shown below.
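The following is a minimal sketch of the rank-0-only loading pattern, not the PR's actual implementation: it assumes a data-parallel setup where every rank holds a full replica of the weights, and that `torch.distributed` is already initialized. Megatron's tensor/pipeline-sharded case would additionally need per-shard scatter logic. The function name `load_weights_rank0_only` is illustrative.

```python
import torch
import torch.distributed as dist


def load_weights_rank0_only(model: torch.nn.Module, ckpt_path: str) -> None:
    """Load checkpoint weights on rank 0 only, then broadcast to all ranks.

    Assumes dist.init_process_group() has already been called and that
    every rank holds an identically-shaped replica of the model.
    """
    if dist.get_rank() == 0:
        # Only rank 0 materializes the full checkpoint in host memory,
        # so peak RAM no longer scales with ranks-per-node.
        state_dict = torch.load(ckpt_path, map_location="cpu")
        model.load_state_dict(state_dict)

    # Make sure rank 0 has finished loading before anyone reads its weights.
    dist.barrier()

    # Broadcast each parameter tensor from rank 0; non-zero ranks never
    # hold a second full copy of the checkpoint at any point.
    for param in model.parameters():
        dist.broadcast(param.data, src=0)
```

The design trade-off is a one-time broadcast cost at startup in exchange for bounding checkpoint memory to a single rank per job, which is what prevents the OOM described above.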
Test
The solution has been tested on a 4-node H800 cluster, with each node equipped with 800 GB of RAM; it successfully loaded the Megatron Qwen2.5-32B model without encountering OOM errors.