[Chatllama] RLHF training for Actor #240
Comments
Hi @Vincent131499. Thank you for your feedback. We are testing the code to solve this problem, which is probably due to the length of the model/data sequences. Once we think we have solved it, we will write back so that you can check whether the problem persists.
@PierpaoloSorbellini
Hi @cokuehuang. Yes, we have found the problem and will release a fix very soon. We are also fixing other issues to get a more stable code base to build new features on. Thanks for your patience; I will let you know when it is released.
Hi @cokuehuang. You can try PR #306, where the problem should be addressed!
When I was training the actor with reinforcement learning, I encountered the following bug:
Current device used :cuda
Start RL Training
Episode: 1 of 100, Timestep: 1 of 8
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [32,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [33,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [34,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [35,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [36,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [37,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [38,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [39,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [40,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [41,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [42,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [43,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [44,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [45,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [46,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [47,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [48,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [49,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [50,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [51,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [52,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [53,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [54,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [55,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [56,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [57,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [58,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [59,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [60,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [61,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [62,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [63,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [32,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [33,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [34,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [35,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [36,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [37,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [38,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [39,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [40,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [41,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [42,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [43,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [44,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [45,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [46,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [47,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [48,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [49,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [50,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [51,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [52,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [53,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [54,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [55,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [56,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [57,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [58,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [59,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [60,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [61,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
    values = self.critic.forward(sequences, sequences_mask)
  File "<@beartype(chatllama.rlhf.reward.RewardModel.forward) at 0x2afb7867aaf0>", line 51, in forward
  File "/mnt/lustre02/jiangsu/aispeech/home/gfl18/.conda/envs/py38-llm/lib/python3.8/site-packages/chatllama/rlhf/reward.py", line 133, in forward
    output = self.model(
  File "/mnt/lustre02/jiangsu/aispeech/home/gfl18/.conda/envs/py38-llm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/lustre02/jiangsu/aispeech/home/gfl18/.conda/envs/py38-llm/lib/python3.8/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 831, in forward
    position_embeds = self.wpe(position_ids)
  File "/mnt/lustre02/jiangsu/aispeech/home/gfl18/.conda/envs/py38-llm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/lustre02/jiangsu/aispeech/home/gfl18/.conda/envs/py38-llm/lib/python3.8/site-packages/torch/nn/modules/sparse.py", line 158, in forward
    return F.embedding(
  File "/mnt/lustre02/jiangsu/aispeech/home/gfl18/.conda/envs/py38-llm/lib/python3.8/site-packages/torch/nn/functional.py", line 2183, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: CUDA error: device-side assert triggered
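A note on reading this traceback: the assertion fires inside the GPT-2 position embedding lookup (`self.wpe(position_ids)`), which means some position id has no row in the embedding table. One plausible cause, consistent with the maintainers' remark about sequence length, is that the critic receives sequences longer than its context window (the stock GPT-2 checkpoints have 1024 positions). The sketch below illustrates that check in plain Python. It is not the chatllama fix; the names `first_invalid_position` and `truncate_left` are hypothetical, and keeping the most recent tokens is only one possible truncation policy.

```python
from typing import List, Optional

# Number of position-embedding rows in the stock GPT-2 checkpoints.
GPT2_N_POSITIONS = 1024


def first_invalid_position(seq_len: int, n_positions: int) -> Optional[int]:
    """Return the first position id with no row in the position table, if any.

    Position ids run 0 .. seq_len - 1, and valid rows are 0 .. n_positions - 1,
    so the first out-of-range id is n_positions itself.
    """
    return n_positions if seq_len > n_positions else None


def truncate_left(token_ids: List[int], n_positions: int) -> List[int]:
    """Keep only the last n_positions tokens (hypothetical policy).

    Assumes n_positions is a positive integer.
    """
    return token_ids[-n_positions:]


if __name__ == "__main__":
    tokens = list(range(1500))  # stand-in for a 1500-token sequence
    print(first_invalid_position(len(tokens), GPT2_N_POSITIONS))
    print(len(truncate_left(tokens, GPT2_N_POSITIONS)))
```

Because CUDA reports device-side asserts asynchronously, rerunning the failing step on CPU usually turns this into a plain `IndexError` that points at the offending lookup.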
My config.yaml:
trainer_config:
  actor_lr: 0.00001
  critic_lr: 0.00001
  actor_eps_clip: 0.2
  critic_eps_clip: 0.2
  beta_s: 0.1
  examples_path: "./datasets/rlhf_training_data.json.repair"
  num_episodes: 100
  max_timesteps: 8
  update_timesteps: 8
  num_examples: 8
  batch_size: 1
  epochs: 1
  update_checkpoint: 8
  checkpoint_folder: "./models/checkpoints"
actor_config:
  model: "facebook/opt-125m"
  model_path: "path-to-model"
  checkpoint_folder: "./models"
  tokenizer_folder: "path-to-tokenizer"
  train_dataset_path: "./datasets/actor_training_data.json"
  validation_dataset_path: null
  froze_embeddings: True
  use_fairscale: False
  max_sequence_length: 2048
  max_tokens: 1024
  temperature: 0.9
  batch_size: 6
  iteration_per_print: 100
  lr: 0.0001
  epochs: 5
  deepspeed_enable: False
  deepspeed_config_path: "path-to-deepspeed-conf"
reward_config:
  model: "gpt2-large"
  model_head_hidden_size: 2048
  model_folder: "./models"
  train_dataset_path: "./datasets/reward_training_data.json"
  validation_dataset_path: null
  batch_size: 1
  epochs: 32
  iteration_per_print: 1
  lr: 0.0001
  deepspeed_enable: False
  deepspeed_config_path: "path-to-deepspeed-conf"
critic_config:
  model: "gpt2-large"
  model_head_hidden_size: 2048
  model_folder: "./models"
  deepspeed_enable: False
  deepspeed_config_path: "path-to-deepspeed-conf"
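One thing that stands out in this config is that the actor's `max_sequence_length` is 2048 while the reward and critic models are `gpt2-large`, whose published checkpoint has 1024 positions. Whether chatllama actually feeds full-length actor sequences to the critic is an assumption here, so the following is only a rough sanity check. The `KNOWN_CONTEXT` table and the `find_context_mismatches` helper are hypothetical and not part of chatllama, and the config is written as a dict literal rather than loaded from the YAML file.

```python
from typing import Dict, List

# Context sizes (position-embedding rows) of the published checkpoints.
KNOWN_CONTEXT: Dict[str, int] = {
    "facebook/opt-125m": 2048,
    "gpt2-large": 1024,
}


def find_context_mismatches(config: dict) -> List[str]:
    """List reward/critic models whose context is shorter than the actor limit."""
    limit = config["actor_config"]["max_sequence_length"]
    problems = []
    for section in ("reward_config", "critic_config"):
        model = config[section]["model"]
        context = KNOWN_CONTEXT.get(model)  # unknown models are skipped
        if context is not None and context < limit:
            problems.append(
                f"{section}: {model} supports {context} positions, "
                f"but actor max_sequence_length is {limit}"
            )
    return problems


if __name__ == "__main__":
    # Only the fields relevant to the check, copied from the config above.
    config = {
        "actor_config": {"model": "facebook/opt-125m", "max_sequence_length": 2048},
        "reward_config": {"model": "gpt2-large"},
        "critic_config": {"model": "gpt2-large"},
    }
    for line in find_context_mismatches(config):
        print(line)
```

If the check reports a mismatch, lowering `max_sequence_length` to the smallest context among the three models is one way to test whether the assertion goes away.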