
[Chatllama] Train chatllama REWARD model using DeepSpeed, got: RuntimeError: Found dtype Float but expected Half #275

Open
balcklive opened this issue Mar 20, 2023 · 7 comments

@balcklive

{
    "train_batch_size": 1,
    "gradient_accumulation_steps": 1,
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 0.00015
        }
    },
    "fp16": {
        "enabled": true,
        "auto_cast": false,
        "loss_scale": 0,
        "initial_scale_power": 16,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu"
        },
        "contiguous_gradients": true,
        "overlap_comm": true
    },
    "num_gpus": 1
}
Using /home/ubuntu/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.00028967857360839844 seconds
Start Training the Reward Model
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Traceback (most recent call last):
File "artifacts/main.py", line 54, in
reward_trainer.train()
File "/home/ubuntu/.local/lib/python3.8/site-packages/chatllama/rlhf/reward.py", line 379, in train
self.model_engine.backward(loss)
File "/home/ubuntu/.local/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/ubuntu/.local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1964, in backward
self.optimizer.backward(loss, retain_graph=retain_graph)
File "/home/ubuntu/.local/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 2028, in backward
self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
File "/home/ubuntu/.local/lib/python3.8/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 54, in backward
scaled_loss.backward(retain_graph=retain_graph)
File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/_tensor.py", line 487, in backward
torch.autograd.backward(
File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/autograd/init.py", line 200, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: Found dtype Float but expected Half

Is this a problem of not having enough GPU memory? Any help would be appreciated.

@PierpaoloSorbellini
Collaborator

Hi @balcklive, I am not sure if it will solve your problem, but you can try to modify the DeepSpeed config file, as sketched below:

  • "enabled": false (under fp16)
    or
  • "auto_cast": true

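For reference, a minimal sketch of how those two suggestions map onto the fp16 block of the config posted above (apply only one of them; the rest of the file is assumed unchanged):

  • Disable mixed precision entirely:

    "fp16": {
        "enabled": false
    }

  • Keep fp16 but let DeepSpeed cast the inputs automatically:

    "fp16": {
        "enabled": true,
        "auto_cast": true,
        "loss_scale": 0,
        "initial_scale_power": 16,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1
    }
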
Thanks for your issue, let me know if it is working or if you are stuck with the same error.

@balcklive
Author

Hi @PierpaoloSorbellini, thank you for your reply. I tried both of your suggestions.

  1. "enabled": false (under fp16):
    The error no longer appeared, but the GPU memory consumption is still high; I still can't train an opt-125m on an NVIDIA 4090.
  2. "auto_cast": true:
    I got another error (a standalone repro sketch follows the traceback below):
    Start Training the Reward Model
    Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
    [2023-03-20 09:28:07,680] [INFO] [scheduler.py:157:check_channel_pruning] Channel pruning is enabled at step 0
    Traceback (most recent call last):
    File "artifacts/main.py", line 54, in
    reward_trainer.train()
    File "/home/ubuntu/.local/lib/python3.8/site-packages/chatllama/rlhf/reward.py", line 356, in train
    est_output = self.model_engine(
    File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
    File "/home/ubuntu/.local/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    ret_val = func(*args, **kwargs)
    File "/home/ubuntu/.local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1832, in forward
    loss = self.module(*inputs, **kwargs)
    File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
    File "<@beartype(chatllama.rlhf.reward.RewardModel.forward) at 0x7fc893482b80>", line 51, in forward
    File "/home/ubuntu/.local/lib/python3.8/site-packages/chatllama/rlhf/reward.py", line 133, in forward
    output = self.model(
    File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
    File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 848, in forward
    inputs_embeds = self.wte(input_ids)
    File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
    File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/nn/modules/sparse.py", line 162, in forward
    return F.embedding(
    File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/nn/functional.py", line 2210, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
    RuntimeError: Expected tensor for argument #1 'indices' to have one of the following scalar types: Long, Int; but got torch.cuda.HalfTensor instead (while checking arguments for embedding)

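For what it's worth, that second traceback suggests the integer input_ids are being cast to fp16 before they reach the embedding layer. A minimal standalone sketch of the constraint (plain PyTorch, not chatllama code; the tensors here are made up for illustration):

import torch
import torch.nn.functional as F

weight = torch.randn(10, 4)             # small embedding table
ids = torch.tensor([1, 2, 3])           # Long indices, as F.embedding expects
print(F.embedding(ids, weight).shape)   # works: torch.Size([3, 4])

# Casting the indices to half precision, as auto_cast appears to do to
# input_ids in the traceback above, reproduces the same class of error:
# "Expected tensor for argument #1 'indices' to have one of the following
#  scalar types: Long, Int; but got ... Half ..."
F.embedding(ids.half(), weight)
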
@PierpaoloSorbellini PierpaoloSorbellini changed the title train chatllama REWARD model using deepspeed ,got:RuntimeError: Found dtype Float but expected Half [Chatllama] train chatllama REWARD model using deepspeed ,got:RuntimeError: Found dtype Float but expected Half Mar 20, 2023
@PierpaoloSorbellini
Collaborator

@balcklive thanks for the quick reply.
We have tested similar settings (i.e. a 3090) without problems for models that small using PR #233; consider lowering your batch size if CUDA runs out of memory. If this happens at very low batch sizes, please send us more information so we can try to replicate the problem.

If you are only using one GPU, you can also disable DeepSpeed or try Accelerate (which uses DeepSpeed without you having to configure it manually).

I am sorry that you are having problems, let us know if you are still having issues.

@balcklive
Author

My batch size is already 1, so I can't lower it any further.
Yes, I am using one GPU, but it has 24 GB of memory. I think that should be enough to run an opt-125m; I can't figure out why this training procedure needs that much GPU memory.
If you think Accelerate could help, I would try it.

@PierpaoloSorbellini
Collaborator

PierpaoloSorbellini commented Mar 22, 2023

Hi @balcklive, can you please send me the models that you used for the actor, critic and reward (just the strings used in the config.yaml) and which training procedure fails (actor, RL or reward)?
I will try to reproduce the setup on the same hardware to see if I can replicate the problem and provide a solution.

@balcklive
Author

It's the REWARD model. As I mentioned in issue #281, my config file specifies the model type as opt-125m, but it is actually a gpt2 model. The DeepSpeed module should be able to compress the model size efficiently; I hope you can fix this.

@PierpaoloSorbellini
Collaborator

Hi @balcklive, yes, this problem with half precision should be fixed in #306.
Keep me posted if you still have the same issue!
