Bug: test run on stories15M-q4_0.gguf results in Segmentation fault. #7711

Closed
vt-alt opened this issue Jun 3, 2024 · 3 comments · Fixed by #7640
Labels
bug-unconfirmed, high severity (used to report high severity bugs in llama.cpp: malfunctioning hinders an important workflow)

Comments


vt-alt commented Jun 3, 2024

What happened?

For b3072 on x86-64, running llama-main on stories15M-q4_0.gguf or stories260K.gguf crashes. It also crashes in the test-eval-callback test.

Name and Version

This is for tag b3072 at 549279d; b3012 did not have this problem.

What operating system are you seeing the problem on?

ALT Linux

Relevant log output

llama.cpp (sisyphus)$ gdb --args llama-main -m stories15M-q4_0.gguf -n 400 -p "Once opon a time"
GNU gdb (GDB) 14.1.0.56.d739d4fd457-alt1 (ALT Sisyphus)
Copyright (C) 2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-alt-linux".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from llama-main...
Reading symbols from /usr/lib/debug/usr/bin/llama-main.debug...
(gdb) r
Starting program: /usr/bin/llama-main -m stories15M-q4_0.gguf -n 400 -p Once\ opon\ a\ time
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Log start
main: build = 3072 (alt1.20240603)
main: built with x86_64-alt-linux-gcc (GCC) 13.2.1 20240128 (ALT Sisyphus 13.2.1-alt3) for x86_64-alt-linux
main: seed  = 1717403314
llama_model_loader: loaded meta data with 20 key-value pairs and 57 tensors from stories15M-q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv   1:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv   2:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv   3:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv   4:                       general.architecture str              = llama
llama_model_loader: - kv   5:                               general.name str              = llama
llama_model_loader: - kv   6:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv   7:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv   8:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv   9:          tokenizer.ggml.seperator_token_id u32              = 4294967295
llama_model_loader: - kv  10:            tokenizer.ggml.padding_token_id u32              = 4294967295
llama_model_loader: - kv  11:                       llama.context_length u32              = 128
llama_model_loader: - kv  12:                     llama.embedding_length u32              = 288
llama_model_loader: - kv  13:                  llama.feed_forward_length u32              = 768
llama_model_loader: - kv  14:                 llama.attention.head_count u32              = 6
llama_model_loader: - kv  15:                          llama.block_count u32              = 6
llama_model_loader: - kv  16:                 llama.rope.dimension_count u32              = 48
llama_model_loader: - kv  17:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  18:               general.quantization_version u32              = 2
llama_model_loader: - kv  19:                          general.file_type u32              = 2
llama_model_loader: - type  f32:   13 tensors
llama_model_loader: - type q4_0:   43 tensors
llama_model_loader: - type q8_0:    1 tensors
llm_load_vocab: bad special token: 'tokenizer.ggml.seperator_token_id' = 4294967295d, using default id -1
llm_load_vocab: bad special token: 'tokenizer.ggml.padding_token_id' = 4294967295d, using default id -1
llm_load_vocab: special tokens cache size = 259
llm_load_vocab: token to piece cache size = 0.1684 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 128
llm_load_print_meta: n_embd           = 288
llm_load_print_meta: n_head           = 6
llm_load_print_meta: n_head_kv        = 6
llm_load_print_meta: n_layer          = 6
llm_load_print_meta: n_rot            = 48
llm_load_print_meta: n_embd_head_k    = 48
llm_load_print_meta: n_embd_head_v    = 48
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 288
llm_load_print_meta: n_embd_v_gqa     = 288
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 768
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 128
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 24.41 M
llm_load_print_meta: model size       = 17.50 MiB (6.01 BPW)
llm_load_print_meta: general.name     = llama
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.03 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/7 layers to GPU
llm_load_tensors:        CPU buffer size =    17.50 MiB
.....................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =     3.38 MiB
llama_new_context_with_model: KV self size  =    3.38 MiB, K (f16):    1.69 MiB, V (f16):    1.69 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.12 MiB

Program received signal SIGSEGV, Segmentation fault.
=> 0x55555564c757 <llama_new_context_with_model(llama_model*, llama_context_params)+5943>:      call   *0x28(%rax)
0x000055555564c757 in ggml_backend_buft_supports_backend (backend=0x555555bb3dd0, buft=0x0) at /usr/src/debug/llama.cpp-3072/ggml-backend.c:48
48          return buft->iface.supports_backend(buft, backend);
(gdb)
(gdb) bt
#0  0x000055555564c757 in ggml_backend_buft_supports_backend (backend=0x555555bb3dd0, buft=0x0) at /usr/src/debug/llama.cpp-3072/ggml-backend.c:48
#1  ggml_backend_sched_new (graph_size=8192, parallel=false, n_backends=2, bufts=0x555555bb3d30, backends=0x55555580bcc0)
    at /usr/src/debug/llama.cpp-3072/ggml-backend.c:1750
#2  llama_new_context_with_model (model=<optimized out>, params=...) at /usr/src/debug/llama.cpp-3072/llama.cpp:16504
#3  0x00005555555a6ac2 in llama_init_from_gpt_params (params=...) at /usr/src/debug/llama.cpp-3072/common/common.cpp:1915
#4  0x00005555555814c0 in main (argc=<optimized out>, argv=<optimized out>) at /usr/src/debug/llama.cpp-3072/examples/main/main.cpp:199
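
The backtrace shows ggml_backend_buft_supports_backend being entered with buft=0x0, so the call through buft->iface dereferences a null pointer. Below is a rough, stand-alone C sketch of that pattern (hypothetical type names, not the actual ggml code) together with the kind of defensive null check that would avoid the fault:

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

struct backend;                     /* stand-in for ggml_backend_t */
struct buffer_type;                 /* stand-in for ggml_backend_buffer_type_t */

struct buffer_type_iface {
    bool (*supports_backend)(struct buffer_type *buft, struct backend *backend);
};

struct buffer_type {
    struct buffer_type_iface iface; /* function-pointer table, as in ggml-backend */
};

/* Same shape as ggml-backend.c:48: the buffer type is dereferenced without a NULL check. */
static bool buft_supports_backend(struct buffer_type *buft, struct backend *backend) {
    return buft->iface.supports_backend(buft, backend);      /* SIGSEGV when buft == NULL */
}

/* Defensive variant (illustration only): treat a missing buffer type as "not supported". */
static bool buft_supports_backend_checked(struct buffer_type *buft, struct backend *backend) {
    return buft != NULL && buft->iface.supports_backend(buft, backend);
}

int main(void) {
    struct buffer_type *buft = NULL;   /* the NULL default buffer type seen in frame #0 */
    printf("%d\n", buft_supports_backend_checked(buft, NULL));   /* prints 0 instead of crashing */
    /* buft_supports_backend(buft, NULL); */                     /* would fault exactly as reported */
    return 0;
}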

There is a temporary build log with the test run at the end: https://git.altlinux.org/tasks/350238/build/400/x86_64/log

The crash also occurs on aarch64, and also when compiled without OpenBLAS.

vt-alt added the bug-unconfirmed and high severity labels on Jun 3, 2024

slaren (Collaborator) commented Jun 3, 2024

This is caused by the RPC backend. I believe #7640 will fix it; in the meantime, you can remove it from the build if you are not using it.
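
As a sketch of that workaround, assuming the package is built with CMake and the RPC backend is toggled by the LLAMA_RPC option (the option name is an assumption based on how the upstream build exposed the RPC backend around this release; adjust to your build setup):

    # rebuild with the RPC backend disabled (LLAMA_RPC is assumed to be the relevant option)
    cmake -B build -DLLAMA_RPC=OFF
    cmake --build build --config Release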

vt-alt (Author) commented Jun 3, 2024

Thanks! I was going to enable the RPC backend for the first time (for the package), but it seems it is too early.

vt-alt (Author) commented Jun 3, 2024

Just tested and confirmed that disabling RPC solves the issue.

slaren linked a pull request on Jun 3, 2024 that will close this issue.