[Chatllama] train chatllama REWARD model using deepspeed, got: RuntimeError: Found dtype Float but expected Half #275
Hi @balcklive, I am not sure if it will solve your problem, but you can try to modify DeepSpeed's config file:
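A plausible sketch of such a change, assuming the point is to disable fp16 so that the whole backward pass stays in float32 (not necessarily the exact snippet the maintainer had in mind):

{
    "train_batch_size": 1,
    "gradient_accumulation_steps": 1,
    "fp16": {
        "enabled": false
    }
}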
Thanks for your issue; let me know if it is working or if you are stuck with the same error.
@PierpaoloSorbellini Hi, thank you for your reply. I tried what you said.
@balcklive thanks for the quick reply. If you are only using one GPU, you can also disable DeepSpeed or try Accelerate (which uses DeepSpeed without having to configure it manually). I am sorry that you are having problems; let us know if you are still having issues.
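For context, a minimal sketch of the plain-Accelerate route mentioned above (generic Accelerate usage with a placeholder model and data, not the ChatLLaMA training entry point):

import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

# Placeholder model and data; only the Accelerate wiring is the point here.
model = torch.nn.Linear(8, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1.5e-4)
data = TensorDataset(torch.randn(16, 8), torch.randn(16, 1))
loader = DataLoader(data, batch_size=1)

accelerator = Accelerator(mixed_precision="no")  # plain fp32 avoids the Half/Float clash
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

for x, y in loader:
    loss = torch.nn.functional.mse_loss(model(x), y)
    accelerator.backward(loss)  # takes the place of model_engine.backward(loss)
    optimizer.step()
    optimizer.zero_grad()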
My batch size is already 1; I can't lower it any further.
Hi @balcklive, can you please send me the model that you used in actor, critic, and reward (just the string used in the config.yaml) and which training procedure fails (actor, RL, or reward)?
It's the REWARD model. As I mentioned in issue #281, my config file specifies the model type as opt-125m, but it is actually a gpt2 model. The DeepSpeed module can compress the model size efficiently; I hope you can fix it.
Hi @balcklive, yes, this problem with half precision should be fixed in #306.
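The usual fix for this class of error (a sketch only, not necessarily what #306 does) is to compute the loss in the dtype the engine trains in, e.g. by casting the targets to the dtype of the model's output:

import torch
import torch.nn.functional as F

# Hypothetical illustration, not the actual #306 patch: under fp16 training the
# predictions come out as Half, while labels are typically loaded as Float.
# Casting the target to the prediction's dtype keeps backward in one dtype.
pred = torch.randn(4, 1, device="cuda", dtype=torch.half, requires_grad=True)
target = torch.randn(4, 1, device="cuda")          # float32 labels

loss = F.mse_loss(pred, target.to(pred.dtype))     # Half vs Half
loss.backward()                                    # no dtype error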
For reference, the config file and error trace from the original report:
{
"train_batch_size": 1,
"gradient_accumulation_steps": 1,
"optimizer": {
"type": "Adam",
"params": {
"lr": 0.00015
}
},
"fp16": {
"enabled": true,
"auto_cast": false,
"loss_scale": 0,
"initial_scale_power": 16,
"loss_scale_window": 1000,
"hysteresis": 2,
"min_loss_scale": 1
},
"zero_optimization": {
"stage": 2,
"offload_optimizer": {
"device": "cpu"
},
"contiguous_gradients": true,
"overlap_comm": true
},
"num_gpus": 1
}
Using /home/ubuntu/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.00028967857360839844 seconds
Start Training the Reward Model
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Traceback (most recent call last):
File "artifacts/main.py", line 54, in
reward_trainer.train()
File "/home/ubuntu/.local/lib/python3.8/site-packages/chatllama/rlhf/reward.py", line 379, in train
self.model_engine.backward(loss)
File "/home/ubuntu/.local/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/ubuntu/.local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1964, in backward
self.optimizer.backward(loss, retain_graph=retain_graph)
File "/home/ubuntu/.local/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 2028, in backward
self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
File "/home/ubuntu/.local/lib/python3.8/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 54, in backward
scaled_loss.backward(retain_graph=retain_graph)
File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/_tensor.py", line 487, in backward
torch.autograd.backward(
File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/autograd/init.py", line 200, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: Found dtype Float but expected Half
Is this a problem of insufficient GPU memory? Any help would be appreciated.
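For what it's worth, the traceback points to a dtype mismatch rather than to memory pressure; a minimal hypothetical reproduction in plain PyTorch (no DeepSpeed, placeholder tensors) is:

import torch
import torch.nn.functional as F

# Hypothetical repro: a Half prediction against a Float target. The forward
# pass promotes silently, but the backward kernel rejects the mixed dtypes.
pred = torch.randn(4, 1, device="cuda", dtype=torch.half, requires_grad=True)
target = torch.randn(4, 1, device="cuda")   # float32

loss = F.mse_loss(pred, target)             # forward succeeds
loss.backward()                             # RuntimeError: Found dtype Float but expected Half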