[Chatllama] RLHF training for Actor #240

Open
Vincent131499 opened this issue Mar 10, 2023 · 4 comments

@Vincent131499

Vincent131499 commented Mar 10, 2023

When I was training the actor with reinforcement learning, I encountered the following bug:
Current device used :cuda
Start RL Training
Episode: 1 of 100, Timestep: 1 of 8
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [32,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [33,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [34,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [35,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [36,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [37,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [38,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [39,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [40,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [41,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [42,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [43,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [44,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [45,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [46,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [47,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [48,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [49,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [50,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [51,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [52,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [53,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [54,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [55,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [56,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [57,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [58,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [59,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [60,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [61,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [62,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [63,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [32,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [33,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [34,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [35,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [36,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [37,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [38,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [39,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [40,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [41,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [42,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [43,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [44,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [45,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [46,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [47,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [48,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [49,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [50,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [51,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [52,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [53,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [54,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [55,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [56,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [57,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [58,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [59,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [60,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [61,0,0] Assertion srcIndex < srcSelectDimSize failed.
    values = self.critic.forward(sequences, sequences_mask)
  File "<@beartype(chatllama.rlhf.reward.RewardModel.forward) at 0x2afb7867aaf0>", line 51, in forward
  File "/mnt/lustre02/jiangsu/aispeech/home/gfl18/.conda/envs/py38-llm/lib/python3.8/site-packages/chatllama/rlhf/reward.py", line 133, in forward
    output = self.model(
  File "/mnt/lustre02/jiangsu/aispeech/home/gfl18/.conda/envs/py38-llm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/lustre02/jiangsu/aispeech/home/gfl18/.conda/envs/py38-llm/lib/python3.8/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 831, in forward
    position_embeds = self.wpe(position_ids)
  File "/mnt/lustre02/jiangsu/aispeech/home/gfl18/.conda/envs/py38-llm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/lustre02/jiangsu/aispeech/home/gfl18/.conda/envs/py38-llm/lib/python3.8/site-packages/torch/nn/modules/sparse.py", line 158, in forward
    return F.embedding(
  File "/mnt/lustre02/jiangsu/aispeech/home/gfl18/.conda/envs/py38-llm/lib/python3.8/site-packages/torch/nn/functional.py", line 2183, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: CUDA error: device-side assert triggered

My config.yaml:
trainer_config:
  actor_lr: 0.00001
  critic_lr: 0.00001
  actor_eps_clip: 0.2
  critic_eps_clip: 0.2
  beta_s: 0.1
  examples_path: "./datasets/rlhf_training_data.json.repair"
  num_episodes: 100
  max_timesteps: 8
  update_timesteps: 8
  num_examples: 8
  batch_size: 1
  epochs: 1
  update_checkpoint: 8
  checkpoint_folder: "./models/checkpoints"

actor_config:
  model: "facebook/opt-125m"
  model_path: "path-to-model"
  checkpoint_folder: "./models"
  tokenizer_folder: "path-to-tokenizer"
  train_dataset_path: "./datasets/actor_training_data.json"
  validation_dataset_path: null
  froze_embeddings: True
  use_fairscale: False
  max_sequence_length: 2048
  max_tokens: 1024
  temperature: 0.9
  batch_size: 6
  iteration_per_print: 100
  lr: 0.0001
  epochs: 5
  deepspeed_enable: False
  deepspeed_config_path: "path-to-deepspeed-conf"

reward_config:
  model: "gpt2-large"
  model_head_hidden_size: 2048
  model_folder: "./models"
  train_dataset_path: "./datasets/reward_training_data.json"
  validation_dataset_path: null
  batch_size: 1
  epochs: 32
  iteration_per_print: 1
  lr: 0.0001
  deepspeed_enable: False
  deepspeed_config_path: "path-to-deepspeed-conf"

critic_config:
  model: "gpt2-large"
  model_head_hidden_size: 2048
  model_folder: "./models"
  deepspeed_enable: False
  deepspeed_config_path: "path-to-deepspeed-conf"

@PierpaoloSorbellini
Collaborator

Hi @Vincent131499. Thank you for your feedback. We are testing the code to solve this problem, which is probably due to the length of the model/data sequences. Once we think we have solved it, we will write back so that you can test whether the problem persists.
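
For anyone wanting to check the sequence-length hypothesis locally, here is a minimal sketch. The traceback ends in GPT-2's position embedding (`self.wpe(position_ids)`), the critic in the config is `gpt2-large`, and the actor's `max_sequence_length` is 2048. Assuming the critic's position table has 1024 entries (the stock GPT-2 `n_positions`; please verify against the loaded model config), a longer sequence would index past the table. The helper names below are hypothetical and are not part of chatllama.

```python
from typing import List, Tuple

# Assumption: stock GPT-2 checkpoints have a 1024-entry position table
# (`n_positions` in the model config). Verify this on the loaded critic.
CRITIC_MAX_POSITIONS = 1024


def exceeds_position_table(seq_len: int, max_positions: int = CRITIC_MAX_POSITIONS) -> bool:
    """True if position ids 0..seq_len-1 would index past the position table."""
    return seq_len > max_positions


def truncate_left(
    token_ids: List[int],
    mask: List[int],
    max_positions: int = CRITIC_MAX_POSITIONS,
) -> Tuple[List[int], List[int]]:
    """Keep only the last `max_positions` tokens and their mask entries."""
    if max_positions <= 0:
        raise ValueError("max_positions must be positive")
    if len(token_ids) != len(mask):
        raise ValueError("token_ids and mask must have the same length")
    return token_ids[-max_positions:], mask[-max_positions:]


# A 2048-token sequence (the actor's max_sequence_length in the config above)
# is too long for a 1024-position critic until it is truncated.
tokens = list(range(2048))
attention = [1] * 2048
print(exceeds_position_table(len(tokens)))
short_tokens, short_mask = truncate_left(tokens, attention)
print(exceeds_position_table(len(short_tokens)))
```

Whether to truncate from the left, truncate from the right, or lower `max_sequence_length` instead is a design choice this sketch does not settle.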

@cokuehuang

@PierpaoloSorbellini
I have the same problem. Has it been solved?

Current device used :cuda
Loading
Start RL Training
Episode: 1 of 100, Timestep: 1 of 32
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [175,0,0], thread: [127,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
...
Traceback (most recent call last):
  File "artifacts/main.py", line 51, in <module>
    rlhf_trainer.train()
  File "/opt/conda/envs/alpa/lib/python3.8/site-packages/chatllama/rlhf/trainer.py", line 655, in train
    ) = self.actorcritic.generate(states, states_mask)
  File "/opt/conda/envs/alpa/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "<@beartype(chatllama.rlhf.trainer.ActorCritic.generate) at 0x7f92593d4f70>", line 51, in generate
  File "/opt/conda/envs/alpa/lib/python3.8/site-packages/chatllama/rlhf/trainer.py", line 144, in generate
    actions, sequence = self.actor.generate(states, state_mask)
  File "<@beartype(chatllama.rlhf.actor.ActorModel.generate) at 0x7f925bd03160>", line 51, in generate
  File "/opt/conda/envs/alpa/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/envs/alpa/lib/python3.8/site-packages/chatllama/rlhf/actor.py", line 163, in generate
    sequences = self.model.generate(
  File "/opt/conda/envs/alpa/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/envs/alpa/lib/python3.8/site-packages/chatllama/llama_model.py", line 529, in generate
    logits = self._forward(input_ids, attention_mask)[:, -1, :]
  File "/opt/conda/envs/alpa/lib/python3.8/site-packages/chatllama/llama_model.py", line 503, in _forward
    h, cache_k, cache_v = layer(
  File "/opt/conda/envs/alpa/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/envs/alpa/lib/python3.8/site-packages/chatllama/llama_model.py", line 401, in forward
    attn, cache_k, cache_v = self.attention.forward(
  File "/opt/conda/envs/alpa/lib/python3.8/site-packages/chatllama/llama_model.py", line 281, in forward
    xq, xk, xv = self.wq(x), self.wk(x), self.wv(x)
  File "/opt/conda/envs/alpa/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/envs/alpa/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
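
A general PyTorch debugging note, not specific to chatllama: CUDA kernels run asynchronously, so after a device-side assert the Python traceback can end at a later call (here `F.linear` and `cublasCreate`) rather than at the indexing op that failed. Setting the `CUDA_LAUNCH_BLOCKING=1` environment variable before CUDA is initialized usually makes the error surface at the offending call. The sketch below also includes a hypothetical pure-Python pre-flight check, `out_of_range_indices`, that mirrors the `srcIndex < srcSelectDimSize` assertion; the sample indices are made up for illustration.

```python
import os
from typing import List, Sequence, Tuple

# Must be set before the first CUDA call; safest is before importing torch.
os.environ.setdefault("CUDA_LAUNCH_BLOCKING", "1")


def out_of_range_indices(indices: Sequence[int], table_size: int) -> List[Tuple[int, int]]:
    """Return (position, value) pairs that an embedding with `table_size` rows would reject."""
    return [(i, v) for i, v in enumerate(indices) if not 0 <= v < table_size]


# Hypothetical position ids checked against a 1024-row table.
bad = out_of_range_indices([0, 5, 1023, 1024, 2047], table_size=1024)
if bad:
    print(f"{len(bad)} index(es) out of range, first: {bad[0]}")
```

Running the same batch on CPU is another common way to get a readable `IndexError` instead of the device-side assert.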

@PierpaoloSorbellini
Collaborator

Hi @cokuehuang. Yes, we have found the problem and will release a fix very soon. We are also fixing other issues so that we have a more stable code base to build new features on. Thanks for your patience; I will let you know when the fix is released.

@PierpaoloSorbellini changed the title from "RLHF training for Actor" to "[Chatllama] RLHF training for Actor" on Mar 14, 2023
@PierpaoloSorbellini
Collaborator

Hi @cokuehuang. You can try PR #306, where the problem should have been addressed!
