
[Chatllama] Train chatllama REWARD model using DeepSpeed, got: RuntimeError: Found dtype Float but expected Half #275

Open
balcklive opened this issue Mar 20, 2023 · 7 comments

@balcklive

{
    "train_batch_size": 1,
    "gradient_accumulation_steps": 1,
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 0.00015
        }
    },
    "fp16": {
        "enabled": true,
        "auto_cast": false,
        "loss_scale": 0,
        "initial_scale_power": 16,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu"
        },
        "contiguous_gradients": true,
        "overlap_comm": true
    },
    "num_gpus": 1
}
Using /home/ubuntu/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.00028967857360839844 seconds
Start Training the Reward Model
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Traceback (most recent call last):
File "artifacts/main.py", line 54, in
reward_trainer.train()
File "/home/ubuntu/.local/lib/python3.8/site-packages/chatllama/rlhf/reward.py", line 379, in train
self.model_engine.backward(loss)
File "/home/ubuntu/.local/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/ubuntu/.local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1964, in backward
self.optimizer.backward(loss, retain_graph=retain_graph)
File "/home/ubuntu/.local/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 2028, in backward
self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
File "/home/ubuntu/.local/lib/python3.8/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 54, in backward
scaled_loss.backward(retain_graph=retain_graph)
File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/_tensor.py", line 487, in backward
torch.autograd.backward(
File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/autograd/init.py", line 200, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: Found dtype Float but expected Half

Is this a problem of not having enough GPU memory? Any help would be appreciated.

@PierpaoloSorbellini
Collaborator

Hi @balcklive, I am not sure if it will solve your problem, but you can try to modify the DeepSpeed config file, as sketched below:

  • "enabled": false (under fp16)
    or
  • "auto_cast": true

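For reference, a minimal sketch of how those two suggestions map onto the fp16 block of the config posted above (apply only one of them; the rest of the file is assumed unchanged):

  • Disable mixed precision entirely:

    "fp16": {
        "enabled": false
    }

  • Keep fp16 but let DeepSpeed cast the inputs automatically:

    "fp16": {
        "enabled": true,
        "auto_cast": true,
        "loss_scale": 0,
        "initial_scale_power": 16,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1
    }
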
Thanks for your issue, let me know if it is working or if you are stuck with the same error.

@balcklive
Author

Hi @PierpaoloSorbellini, thank you for your reply. I tried both of your suggestions.

  1. "enabled": false (under fp16):
    The error no longer appeared, but the GPU memory consumption is still high; I still can't train an opt-125m on an NVIDIA 4090.
  2. "auto_cast": true:
    I got another error (a standalone repro sketch follows the traceback below):
    Start Training the Reward Model
    Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
    [2023-03-20 09:28:07,680] [INFO] [scheduler.py:157:check_channel_pruning] Channel pruning is enabled at step 0
    Traceback (most recent call last):
    File "artifacts/main.py", line 54, in
    reward_trainer.train()
    File "/home/ubuntu/.local/lib/python3.8/site-packages/chatllama/rlhf/reward.py", line 356, in train
    est_output = self.model_engine(
    File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
    File "/home/ubuntu/.local/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    ret_val = func(*args, **kwargs)
    File "/home/ubuntu/.local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1832, in forward
    loss = self.module(*inputs, **kwargs)
    File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
    File "<@beartype(chatllama.rlhf.reward.RewardModel.forward) at 0x7fc893482b80>", line 51, in forward
    File "/home/ubuntu/.local/lib/python3.8/site-packages/chatllama/rlhf/reward.py", line 133, in forward
    output = self.model(
    File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
    File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 848, in forward
    inputs_embeds = self.wte(input_ids)
    File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
    File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/nn/modules/sparse.py", line 162, in forward
    return F.embedding(
    File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/nn/functional.py", line 2210, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
    RuntimeError: Expected tensor for argument #1 'indices' to have one of the following scalar types: Long, Int; but got torch.cuda.HalfTensor instead (while checking arguments for embedding)

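For what it's worth, that second traceback suggests the integer input_ids are being cast to fp16 before they reach the embedding layer. A minimal standalone sketch of the constraint (plain PyTorch, not chatllama code; the tensors here are made up for illustration):

import torch
import torch.nn.functional as F

weight = torch.randn(10, 4)             # small embedding table
ids = torch.tensor([1, 2, 3])           # Long indices, as F.embedding expects
print(F.embedding(ids, weight).shape)   # works: torch.Size([3, 4])

# Casting the indices to half precision, as auto_cast appears to do to
# input_ids in the traceback above, reproduces the same class of error:
# "Expected tensor for argument #1 'indices' to have one of the following
#  scalar types: Long, Int; but got ... Half ..."
F.embedding(ids.half(), weight)
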
@PierpaoloSorbellini PierpaoloSorbellini changed the title train chatllama REWARD model using deepspeed ,got:RuntimeError: Found dtype Float but expected Half [Chatllama] train chatllama REWARD model using deepspeed ,got:RuntimeError: Found dtype Float but expected Half Mar 20, 2023
@PierpaoloSorbellini
Collaborator

@balcklive thanks for the quick reply.
We have tested similar settings (i.e. a 3090) without problems for models that small using PR #233; consider lowering your batch size if CUDA runs out of memory. If this happens at very low batch sizes, please send us more information so we can try to replicate the problem.

If you are only using one GPU, you can also disable DeepSpeed or try Accelerate (which uses DeepSpeed without you having to configure it manually).

I am sorry that you are having problems, let us know if you are still having issues.

@balcklive
Author

My batch size is already 1, so I can't lower it any further.
Yes, I am using one GPU, but it has 24 GB of memory. I think that should be enough to run an opt-125m; I can't figure out why this training procedure needs that much GPU memory.
If you think Accelerate could help, I would try it.

@PierpaoloSorbellini
Collaborator

PierpaoloSorbellini commented Mar 22, 2023

Hi @balcklive, can you please send me the models that you used for the actor, critic and reward (just the strings used in the config.yaml) and which training procedure fails (actor, RL or reward)?
I will try to reproduce the setup on the same hardware to see if I can replicate the problem and provide a solution.

@balcklive
Author

It's the REWARD model. As I mentioned in issue #281, my config file specifies the model type as opt-125m, but it is actually a gpt2 model. The DeepSpeed module should be able to compress the model size efficiently; I hope you can fix this.

@PierpaoloSorbellini
Collaborator

Hi @balcklive, yes, this problem with half precision should be fixed in #306.
Keep me posted if you still have the same issue!
