
CUDA error 700 at ggml-cuda.cu:6963: an illegal memory access was encountered for finetune using cuda #4212

Closed
jooray opened this issue Nov 25, 2023 · 3 comments

Comments

jooray (Contributor) commented Nov 25, 2023

Expected Behavior

I built a docker image (with #4211 applied) and wanted to run finetune inside it. Llama.cpp otherwise works in docker for me.

Current Behavior

I ended up with CUDA error 700 at ggml-cuda.cu:6963: an illegal memory access was encountered

Environment and Context

I use multiple GPUs (seven RTX 3090s with 24 GB VRAM each). The model does not fit on a single card, so I could not check whether the problem persists with only one device.

I built it like this:

  1. Edit .devops/full-cuda.Dockerfile and change ARG CUDA_VERSION=11.8.0 to match the machine's CUDA version (see the sketch after this list).
  2. Apply "Add finetune option to the docker image." #4211 to enable access to finetune from the docker image.
  3. Build the image:
     docker build -t local/llama.cpp:full-cuda -f .devops/full-cuda.Dockerfile .
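For step 1, a minimal sketch of the edit, assuming the host runs CUDA 12.2 (12.2.0 is only an example value; substitute whatever version the driver reports):

    # check the host's CUDA version, e.g. from the nvidia-smi header line
    nvidia-smi | grep "CUDA Version"

    # point the Dockerfile's base-image ARG at a matching version
    sed -i 's/^ARG CUDA_VERSION=11.8.0$/ARG CUDA_VERSION=12.2.0/' .devops/full-cuda.Dockerfile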

Then run finetune:

docker run --gpus=all --cap-add SYS_RESOURCE -e USE_MLOCK=0 -e CUDA_VISIBLE_DEVICES=1,2,3,4,5,6,7 -v /home/user/llama.cpp/models:/var/model -t local/llama.cpp:full-cuda --finetune \
    --model-base /var/model/NousResearch--Nous-Hermes-Llama2-13b.gguf \
    --checkpoint-in /var/model/chk-in-noushermes-13b-LATEST.gguf \
    --checkpoint-out /var/model/chk-in-noushermes-13b-ITERATION.gguf \
    --lora-out /var/model/lora-noushermes-13b-ITERATION.bin \
    --train-data "/var/model/dataset.txt" \
    --save-every 10 \
    --threads 10 --adam-iter 30 --epochs 1 --batch 8 --ctx 256 \
    --sample-start '<s>' \
    --n-gpu-layers 999 \
    --use-checkpointing

(I tried different CUDA_VISIBLE_DEVICES setups, such as 0,1,2.) The same setup works for inference using main.
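Roughly, the working inference case looks like the following (a sketch, not the exact command; it assumes the full image's tools.sh --run entry point, which forwards the remaining arguments to main, and the prompt and -n value are placeholders):

    docker run --gpus=all -e CUDA_VISIBLE_DEVICES=1,2,3,4,5,6,7 -v /home/user/llama.cpp/models:/var/model -t local/llama.cpp:full-cuda --run \
        -m /var/model/NousResearch--Nous-Hermes-Llama2-13b.gguf \
        -p "Hello, my name is" -n 64 --n-gpu-layers 999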

llama.cpp$ git log | head -1
commit e9c13ff78114af6fc6a4f27cc8dcdda0f3d389fb

The run looks like this:

main: seed: 1700876892
main: model base = '/var/model/NousResearch--Nous-Hermes-Llama2-13b.gguf'
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 3 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6
  Device 2: NVIDIA GeForce RTX 3090, compute capability 8.6
llama_model_loader: loaded meta data with 19 key-value pairs and 363 tensors from /var/model/NousResearch--Nous-Hermes-Llama2-13b.gguf (version GGUF V3 (latest))
llama_model_loader: - tensor    0:                token_embd.weight f16      [  5120, 32032,     1,     1 ]
llama_model_loader: - tensor    1:              blk.0.attn_q.weight f16      [  5120,  5120,     1,     1 ]
[...]
llama_model_loader: - kv  18:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - type  f32:   81 tensors
llama_model_loader: - type  f16:  282 tensors
llm_load_vocab: special tokens definition check successful ( 291/32032 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32032
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 5120
llm_load_print_meta: n_head           = 40
llm_load_print_meta: n_head_kv        = 40
llm_load_print_meta: n_layer          = 40
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 13824
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 13B
llm_load_print_meta: model ftype      = mostly F16
llm_load_print_meta: model params     = 13.02 B
llm_load_print_meta: model size       = 24.25 GiB (16.00 BPW) 
llm_load_print_meta: general.name   = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.13 MiB
llm_load_tensors: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (NVIDIA GeForce RTX 3090) as main device
llm_load_tensors: mem required  =  312.95 MiB
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 43/43 layers to GPU
llm_load_tensors: VRAM used: 24514.39 MiB
...................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 400.00 MiB
llama_new_context_with_model: kv self size  =  400.00 MiB
llama_build_graph: non-view tensors processed: 924/924
llama_new_context_with_model: compute buffer total size = 78.07 MiB
llama_new_context_with_model: VRAM scratch buffer: 75.00 MiB
llama_new_context_with_model: total VRAM used: 24989.40 MiB (model: 24514.39 MiB, context: 475.00 MiB)
main: init model
print_params: n_vocab:   32032
print_params: n_ctx:     256
print_params: n_embd:    5120
print_params: n_ff:      13824
print_params: n_head:    40
print_params: n_head_kv: 40
print_params: n_layer:   40
print_params: norm_rms_eps          : 0.000010
print_params: rope_freq_base        : 10000.000000
print_params: rope_freq_scale       : 1.000000
print_lora_params: n_rank_attention_norm : 1
print_lora_params: n_rank_wq             : 4
print_lora_params: n_rank_wk             : 4
print_lora_params: n_rank_wv             : 4
print_lora_params: n_rank_wo             : 4
print_lora_params: n_rank_ffn_norm       : 1
print_lora_params: n_rank_w1             : 4
print_lora_params: n_rank_w2             : 4
print_lora_params: n_rank_w3             : 4
print_lora_params: n_rank_tok_embeddings : 4
print_lora_params: n_rank_norm           : 1
print_lora_params: n_rank_output         : 4
main: total train_iterations 0
main: seen train_samples     0
main: seen train_tokens      0
main: completed train_epochs 0
main: lora_size = 131432032 bytes (125.3 MB)
main: opt_size  = 196306048 bytes (187.2 MB)
main: opt iter 0
main: input_size = 262414368 bytes (250.3 MB)
main: compute_size = 37813245024 bytes (36061.5 MB)
main: evaluation order = LEFT_TO_RIGHT
main: tokenize training data
tokenize_file: warning: found 144 samples (max length 567) that exceed context length of 256. samples will be cut off.
tokenize_file: warning: found 4691 samples (min length 35) that are shorter than context length of 256.
tokenize_file: total number of samples: 4836
main: number of training tokens: 634519
main: number of unique tokens: 10054
main: train data seems to have changed. restarting shuffled epoch.
main: begin training
main: work_size = 1281928 bytes (1.2 MB)
train_opt_callback: iter=     0 sample=1/4836 sched=0.000000 loss=0.000000 |->

CUDA error 700 at ggml-cuda.cu:6963: an illegal memory access was encountered
current device: 0
jooray (Contributor, Author) commented Nov 25, 2023

I tried with a smaller model and a single GPU, and got:

...
main: number of training tokens: 634519
main: number of unique tokens: 10054
main: train data seems to have changed. restarting shuffled epoch.
main: begin training
main: work_size = 1281928 bytes (1.2 MB)
train_opt_callback: iter=     0 sample=1/4836 sched=0.000000 loss=0.000000 |->
src0->type: 14  dst->type: 0
GGML_ASSERT: ggml-cuda.cu:6193: false

Tachyon5 commented Mar 6, 2024

Finetune doesn't work with CUDA at the moment. It's supposed to dequantize the model weights in the optimizer and, for some reason I didn't quite get to the bottom of, that just doesn't happen. #4724

I suspect that supporting that dequantization might be the issue, but I had to table it. I was hoping someone else could pick it up. I may try again next week if I get time.
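For context, "dequantize the model weights" here just means expanding each quantized block back into plain f32 values before the float optimizer touches them. A toy, self-contained sketch of the idea, using a made-up block format rather than ggml's real quantization types:

    #include <stdint.h>
    #include <stddef.h>

    /* Made-up quantized block: 32 signed 8-bit quants sharing one float scale.
     * Real ggml types (Q8_0, Q4_K, ...) use different layouts, but the basic
     * "scale * quant -> float" expansion is the same idea. */
    typedef struct {
        float  scale;
        int8_t q[32];
    } toy_block_q8;

    /* Expand n blocks into a flat f32 buffer (dst must hold 32 * n floats),
     * which is the representation a float optimizer such as Adam can update. */
    void toy_dequantize(const toy_block_q8 *src, float *dst, size_t n) {
        for (size_t b = 0; b < n; ++b) {
            for (int i = 0; i < 32; ++i) {
                dst[b * 32 + i] = src[b].scale * (float) src[b].q[i];
            }
        }
    }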

github-actions bot added the stale label Apr 6, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.
