
CUDA error 700 at ggml-cuda.cu:6963: an illegal memory access was encountered for finetune using cuda #4212

Closed
jooray opened this issue Nov 25, 2023 · 3 comments

Comments

jooray (Contributor) commented Nov 25, 2023

Expected Behavior

I built a docker image (with #4211 applied) and wanted to run finetune inside it. Llama.cpp otherwise works in docker for me.

Current Behavior

I ended up with CUDA error 700 at ggml-cuda.cu:6963: an illegal memory access was encountered

Environment and Context

I use multiple GPUs (seven RTX 3090s with 24 GB VRAM each). The model does not fit on a single card, so I could not check whether the problem persists with only one device.

I built it like this:

  1. Edit .devops/full-cuda.Dockerfile and change ARG CUDA_VERSION=11.8.0 to match the machine's CUDA version (see the sketch after this list).
  2. Apply "Add finetune option to the docker image." #4211 to enable access to finetune from the docker image.
  3. Build the image:
     docker build -t local/llama.cpp:full-cuda -f .devops/full-cuda.Dockerfile .
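For step 1, a minimal sketch of the edit, assuming the host runs CUDA 12.2 (12.2.0 is only an example value; substitute whatever version the driver reports):

    # check the host's CUDA version, e.g. from the nvidia-smi header line
    nvidia-smi | grep "CUDA Version"

    # point the Dockerfile's base-image ARG at a matching version
    sed -i 's/^ARG CUDA_VERSION=11.8.0$/ARG CUDA_VERSION=12.2.0/' .devops/full-cuda.Dockerfile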

Then run finetune:

docker run --gpus=all --cap-add SYS_RESOURCE -e USE_MLOCK=0 -e CUDA_VISIBLE_DEVICES=1,2,3,4,5,6,7 -v /home/user/llama.cpp/models:/var/model -t local/llama.cpp:full-cuda --finetune \
    --model-base /var/model/NousResearch--Nous-Hermes-Llama2-13b.gguf \
    --checkpoint-in /var/model/chk-in-noushermes-13b-LATEST.gguf \
    --checkpoint-out /var/model/chk-in-noushermes-13b-ITERATION.gguf \
    --lora-out /var/model/lora-noushermes-13b-ITERATION.bin \
    --train-data "/var/model/dataset.txt" \
    --save-every 10 \
    --threads 10 --adam-iter 30 --epochs 1 --batch 8 --ctx 256 \
    --sample-start '<s>' \
    --n-gpu-layers 999 \
    --use-checkpointing

(I tried different CUDA_VISIBLE_DEVICES setups, such as 0,1,2.) The same setup works for inference using main.
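Roughly, the working inference case looks like the following (a sketch, not the exact command; it assumes the full image's tools.sh --run entry point, which forwards the remaining arguments to main, and the prompt and -n value are placeholders):

    docker run --gpus=all -e CUDA_VISIBLE_DEVICES=1,2,3,4,5,6,7 -v /home/user/llama.cpp/models:/var/model -t local/llama.cpp:full-cuda --run \
        -m /var/model/NousResearch--Nous-Hermes-Llama2-13b.gguf \
        -p "Hello, my name is" -n 64 --n-gpu-layers 999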

llama.cpp$ git log | head -1
commit e9c13ff78114af6fc6a4f27cc8dcdda0f3d389fb

The run looks like this:

main: seed: 1700876892
main: model base = '/var/model/NousResearch--Nous-Hermes-Llama2-13b.gguf'
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 3 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6
  Device 2: NVIDIA GeForce RTX 3090, compute capability 8.6
llama_model_loader: loaded meta data with 19 key-value pairs and 363 tensors from /var/model/NousResearch--Nous-Hermes-Llama2-13b.gguf (version GGUF V3 (latest))
llama_model_loader: - tensor    0:                token_embd.weight f16      [  5120, 32032,     1,     1 ]
llama_model_loader: - tensor    1:              blk.0.attn_q.weight f16      [  5120,  5120,     1,     1 ]
[...]
llama_model_loader: - kv  18:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - type  f32:   81 tensors
llama_model_loader: - type  f16:  282 tensors
llm_load_vocab: special tokens definition check successful ( 291/32032 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32032
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 5120
llm_load_print_meta: n_head           = 40
llm_load_print_meta: n_head_kv        = 40
llm_load_print_meta: n_layer          = 40
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 13824
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 13B
llm_load_print_meta: model ftype      = mostly F16
llm_load_print_meta: model params     = 13.02 B
llm_load_print_meta: model size       = 24.25 GiB (16.00 BPW) 
llm_load_print_meta: general.name   = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.13 MiB
llm_load_tensors: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (NVIDIA GeForce RTX 3090) as main device
llm_load_tensors: mem required  =  312.95 MiB
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 43/43 layers to GPU
llm_load_tensors: VRAM used: 24514.39 MiB
...................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 400.00 MiB
llama_new_context_with_model: kv self size  =  400.00 MiB
llama_build_graph: non-view tensors processed: 924/924
llama_new_context_with_model: compute buffer total size = 78.07 MiB
llama_new_context_with_model: VRAM scratch buffer: 75.00 MiB
llama_new_context_with_model: total VRAM used: 24989.40 MiB (model: 24514.39 MiB, context: 475.00 MiB)
main: init model
print_params: n_vocab:   32032
print_params: n_ctx:     256
print_params: n_embd:    5120
print_params: n_ff:      13824
print_params: n_head:    40
print_params: n_head_kv: 40
print_params: n_layer:   40
print_params: norm_rms_eps          : 0.000010
print_params: rope_freq_base        : 10000.000000
print_params: rope_freq_scale       : 1.000000
print_lora_params: n_rank_attention_norm : 1
print_lora_params: n_rank_wq             : 4
print_lora_params: n_rank_wk             : 4
print_lora_params: n_rank_wv             : 4
print_lora_params: n_rank_wo             : 4
print_lora_params: n_rank_ffn_norm       : 1
print_lora_params: n_rank_w1             : 4
print_lora_params: n_rank_w2             : 4
print_lora_params: n_rank_w3             : 4
print_lora_params: n_rank_tok_embeddings : 4
print_lora_params: n_rank_norm           : 1
print_lora_params: n_rank_output         : 4
main: total train_iterations 0
main: seen train_samples     0
main: seen train_tokens      0
main: completed train_epochs 0
main: lora_size = 131432032 bytes (125.3 MB)
main: opt_size  = 196306048 bytes (187.2 MB)
main: opt iter 0
main: input_size = 262414368 bytes (250.3 MB)
main: compute_size = 37813245024 bytes (36061.5 MB)
main: evaluation order = LEFT_TO_RIGHT
main: tokenize training data
tokenize_file: warning: found 144 samples (max length 567) that exceed context length of 256. samples will be cut off.
tokenize_file: warning: found 4691 samples (min length 35) that are shorter than context length of 256.
tokenize_file: total number of samples: 4836
main: number of training tokens: 634519
main: number of unique tokens: 10054
main: train data seems to have changed. restarting shuffled epoch.
main: begin training
main: work_size = 1281928 bytes (1.2 MB)
train_opt_callback: iter=     0 sample=1/4836 sched=0.000000 loss=0.000000 |->

CUDA error 700 at ggml-cuda.cu:6963: an illegal memory access was encountered
current device: 0
jooray (Contributor, Author) commented Nov 25, 2023

I tried with a smaller model and a single GPU, and got:

...
main: number of training tokens: 634519
main: number of unique tokens: 10054
main: train data seems to have changed. restarting shuffled epoch.
main: begin training
main: work_size = 1281928 bytes (1.2 MB)
train_opt_callback: iter=     0 sample=1/4836 sched=0.000000 loss=0.000000 |->
src0->type: 14  dst->type: 0
GGML_ASSERT: ggml-cuda.cu:6193: false

Tachyon5 commented Mar 6, 2024

Finetune doesn't work with CUDA at the moment. It's supposed to dequantize the model weights in the optimizer and, for some reason I didn't quite get to the bottom of, that just doesn't happen. #4724

I suspect that supporting that dequantization might be the issue, but I had to table it. I was hoping someone else could pick it up. I may try again next week if I get time.
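For context, "dequantize the model weights" here just means expanding each quantized block back into plain f32 values before the float optimizer touches them. A toy, self-contained sketch of the idea, using a made-up block format rather than ggml's real quantization types:

    #include <stdint.h>
    #include <stddef.h>

    /* Made-up quantized block: 32 signed 8-bit quants sharing one float scale.
     * Real ggml types (Q8_0, Q4_K, ...) use different layouts, but the basic
     * "scale * quant -> float" expansion is the same idea. */
    typedef struct {
        float  scale;
        int8_t q[32];
    } toy_block_q8;

    /* Expand n blocks into a flat f32 buffer (dst must hold 32 * n floats),
     * which is the representation a float optimizer such as Adam can update. */
    void toy_dequantize(const toy_block_q8 *src, float *dst, size_t n) {
        for (size_t b = 0; b < n; ++b) {
            for (int i = 0; i < 32; ++i) {
                dst[b * 32 + i] = src[b].scale * (float) src[b].q[i];
            }
        }
    }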

github-actions bot added the stale label Apr 6, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.
