
Misc. bug: Launch params (1024, 1, 1) are larger than launch bounds (256) for kernel _ZL12rms_norm_f32ILi1024EEvPKfPfif please add __launch_bounds__ to kernel define or use --gpu-max-threads-per-block recompile program ! #10610

Closed
wangzd0209 opened this issue Dec 1, 2024 · 9 comments

Comments

@wangzd0209

Name and Version

I use ollama to run this model, but something is wrong and it shows the following:

llama_new_context_with_model: graph splits = 2
Launch params (1024, 1, 1) are larger than launch bounds (256) for kernel _ZL12rms_norm_f32ILi1024EEvPKfPfif please add __launch_bounds__ to kernel define or use --gpu-max-threads-per-block recompile program !

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

No response

Problem description & steps to reproduce

I use ollama to run this model, but something is wrong and it shows the following:

llama_new_context_with_model: graph splits = 2
Launch params (1024, 1, 1) are larger than launch bounds (256) for kernel _ZL12rms_norm_f32ILi1024EEvPKfPfif please add __launch_bounds__ to kernel define or use --gpu-max-threads-per-block recompile program !

First Bad Commit

No response

Relevant log output

No response

@JohannesGaessler
Collaborator

This issue needs more information to debug. Please take a look at the "Bug (model use)" template and either re-open the issue using that template or provide the corresponding information here. In particular, please reproduce the issue using llama.cpp only.

@wangzd0209
Author

This issue needs more information to debug. Please take a look at the "Bug (model use)" template and either re-open the issue using that template or provide the corresponding information here. In particular, please reproduce the issue using llama.cpp only.

I use a special GPU called DCU, which you can think of as a variant of gfx906. I changed some code in ggml/src/CMakeLists.txt like this:
(screenshot of the change, not reproduced here)
Could this change, or something else, be causing this issue? And thanks for your help.

@JohannesGaessler
Collaborator

I don't know how to fix this issue.

@qnixsynapse
Contributor

It is failing in the rms_norm_f32 kernel. But which backend, ROCm/HIPBLAS?
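
(For reference, the mangled name in the error message demangles, roughly, to:

$ echo _ZL12rms_norm_f32ILi1024EEvPKfPfif | c++filt
void rms_norm_f32<1024>(float const*, float*, int, float)

i.e. the templated RMS-norm kernel instantiated for 1024 threads per block, which is what exceeds the 256-thread launch bound.)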

@nyl199310

I also encountered the same issue. I built the code with HIPBLAS.
If I set --ngl, it prints the message below and the output is garbled. Is there a way to fix this issue?

root@worker-0:/public/home/scnalnwgla/libs/llama.cpp/build/bin# ./llama-cli -m /public/home/scnalnwgla/Llama-3.2-1B-Instruct.Q4_K_M.gguf -p "You are a helpful assistant" -cnv -ngl 16
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: Device 66a1, compute capability 9.0, VMM: no
build: 4311 (9fdb1243) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_load_model_from_file: using device ROCm0 (Device 66a1) - 15852 MiB free
llama_model_loader: loaded meta data with 29 key-value pairs and 147 tensors from /public/home/scnalnwgla/Llama-3.2-1B-Instruct.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Models
llama_model_loader: - kv   3:                         general.size_label str              = 1.2B
llama_model_loader: - kv   4:                            general.license str              = llama3.2
llama_model_loader: - kv   5:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   6:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   7:                          llama.block_count u32              = 16
llama_model_loader: - kv   8:                       llama.context_length u32              = 131072
llama_model_loader: - kv   9:                     llama.embedding_length u32              = 2048
llama_model_loader: - kv  10:                  llama.feed_forward_length u32              = 8192
llama_model_loader: - kv  11:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv  12:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  13:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  14:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  15:                 llama.attention.key_length u32              = 64
llama_model_loader: - kv  16:               llama.attention.value_length u32              = 64
llama_model_loader: - kv  17:                          general.file_type u32              = 15
llama_model_loader: - kv  18:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  19:                 llama.rope.dimension_count u32              = 64
llama_model_loader: - kv  20:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  21:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  22:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  23:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  24:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  25:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  26:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  27:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  28:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   34 tensors
llama_model_loader: - type q4_K:   96 tensors
llama_model_loader: - type q6_K:   17 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 2048
llm_load_print_meta: n_layer          = 16
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 64
llm_load_print_meta: n_embd_head_v    = 64
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 512
llm_load_print_meta: n_embd_v_gqa     = 512
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 8192
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 1B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 1.24 B
llm_load_print_meta: model size       = 762.81 MiB (5.18 BPW) 
llm_load_print_meta: general.name     = Models
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token        = 128008 '<|eom_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOG token        = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: offloading 16 repeating layers to GPU
llm_load_tensors: offloaded 16/17 layers to GPU
llm_load_tensors:        ROCm0 model buffer size =   557.31 MiB
llm_load_tensors:   CPU_Mapped model buffer size =   762.81 MiB
..............................................................
llama_new_context_with_model: n_seq_max     = 1
llama_new_context_with_model: n_ctx         = 4096
llama_new_context_with_model: n_ctx_per_seq = 4096
llama_new_context_with_model: n_batch       = 2048
llama_new_context_with_model: n_ubatch      = 512
llama_new_context_with_model: flash_attn    = 0
llama_new_context_with_model: freq_base     = 500000.0
llama_new_context_with_model: freq_scale    = 1
llama_new_context_with_model: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init:      ROCm0 KV buffer size =   128.00 MiB
llama_new_context_with_model: KV self size  =  128.00 MiB, K (f16):   64.00 MiB, V (f16):   64.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.49 MiB
llama_new_context_with_model:      ROCm0 compute buffer size =   459.99 MiB
llama_new_context_with_model:  ROCm_Host compute buffer size =    12.01 MiB
llama_new_context_with_model: graph nodes  = 518
llama_new_context_with_model: graph splits = 4 (with bs=512), 3 (with bs=1)
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
Launch params (1024, 1, 1) are larger than launch bounds (256) for kernel _ZL12rms_norm_f32ILi1024EEvPKfPfif please add __launch_bounds__ to kernel define or use --gpu-max-threads-per-block recompile program ! 
main: llama threadpool init, n_threads = 32
main: chat template example:
<|start_header_id|>system<|end_header_id|>

You are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>

Hello<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Hi there<|eot_id|><|start_header_id|>user<|end_header_id|>

How are you?<|eot_id|><|start_header_id|>assistant<|end_header_id|>



system_info: n_threads = 32 (n_threads_batch = 32) / 32 | ROCm : PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | 

main: interactive mode on.
sampler seed: 138504786
sampler params: 
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1
        top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 1

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

system

You are a helpful assistant


> hi
*=C>&C+(>2&!='H.F!>CE*"B#2'E7E%:18&B:-.4;8@C.<,!49$4-@/,#><(G,17<32#)$:E87-D3;--7EE,8B!*7)1$BB92%3,#F&F,0G8+%7B(CG/F>C:6;+=;@8<9<CD7,EEA%*)5C$)4@)-*!F92>C';/-%4)<B"=!C*%%#=-.G;)8==08$81;=5%$E3,)H+/.E/3@!B/<-H8!!4'&80+.&A2:99:(-9'=+E%4+)F=8,068$C6#:+9-6"A*F65"8C:F:&5'="3$98=#9:0<=9'H!0D092E7;)0*"%5&#+2=@//B($0"23/>#!"15$(<D%F4.%@B,C)H,:-=F8&1*E<>G$7B;8H@'A/%F-8H>90'14%30)<+A;(DA:GE30EB0C6):C@@*CC"-0(('.8-,%A)F%5/;9(8:@/!G<7+A@BH79!'6&H/%E%A93H8#1*B'!=G@.9$G6,3B0C/:..)C&B%7A>H9!(:85'!B.3@57)=3E*-6FA"%8C<>*"#"6-<)8BH*)64"D8B.";E:.(31.G/C'%.EC@7'#!42+&="<8E7F;.66;'#17-B>59<FAH840B"1>2:9F
> exit
F=FB6:@B$%("A!<C.2A'=CG$"2C-:""@B@/9&#BD"5=<:H,'*;>E18*!0'H
> 
llama_perf_sampler_print:    sampling time =       6.90 ms /    70 runs   (    0.10 ms per token, 10144.93 tokens per second)
llama_perf_context_print:        load time =    2438.37 ms
llama_perf_context_print: prompt eval time =  178163.45 ms /    33 tokens ( 5398.89 ms per token,     0.19 tokens per second)
llama_perf_context_print:        eval time =   10377.75 ms /   647 runs   (   16.04 ms per token,    62.34 tokens per second)
llama_perf_context_print:       total time =  189030.88 ms /   680 tokens
Interrupted by user

@JohannesGaessler
Collaborator

Please open a new issue and fill out the "model use" template.

@JohannesGaessler
Collaborator

Actually, if you're also using the same special GPU, it will most likely not be possible to make it work unless a developer invests the effort to support it (which is not likely).

@github-actions github-actions bot added the stale label Jan 12, 2025

This issue was closed because it has been inactive for 14 days since being marked as stale.

@fxzjshm
Contributor

fxzjshm commented Feb 3, 2025

For some unknown reason, the DCU SDK people selected 256 as the default launch bound instead of the common value 1024, which breaks some assumptions for ops like argmax_f32 (test-backend-ops fails on ARGMAX with size >= 1024).

Workaround: as indicated by the error, simply add --gpu-max-threads-per-block to the compile flags:

HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" HIPFLAGS=" --gpu-max-threads-per-block=1024 " \
cmake .. -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx906
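
After rebuilding with this flag, the affected ops can be re-checked with the backend test binary; the invocation below is from memory, so adjust the path and op names to your build:

./bin/test-backend-ops test -o ARGMAX
./bin/test-backend-ops test -o GROUP_NORM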

I will report this issue to them; should this be documented in e.g. build.md, in the HIP section?
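
For completeness, the other remedy the error message points at is annotating the kernel itself with __launch_bounds__. A rough HIP sketch of the idea (a simplified stand-in, not the actual ggml source, which reduces with warp shuffles) could look like this:

#include <hip/hip_runtime.h>

// Simplified illustration of where the __launch_bounds__ annotation goes;
// block_size is the template parameter the kernel is instantiated with (1024 here).
template <int block_size>
static __global__ void __launch_bounds__(block_size)
rms_norm_f32(const float * x, float * dst, const int ncols, const float eps) {
    const int row = blockIdx.x;   // one block per row
    const int tid = threadIdx.x;

    // each thread accumulates a partial sum of squares over its slice of the row
    float sum = 0.0f;
    for (int col = tid; col < ncols; col += block_size) {
        const float v = x[row*ncols + col];
        sum += v*v;
    }

    // block-wide tree reduction in shared memory
    __shared__ float buf[block_size];
    buf[tid] = sum;
    __syncthreads();
    for (int offset = block_size/2; offset > 0; offset /= 2) {
        if (tid < offset) {
            buf[tid] += buf[tid + offset];
        }
        __syncthreads();
    }

    // y = x / sqrt(mean(x^2) + eps)
    const float scale = rsqrtf(buf[0]/ncols + eps);
    for (int col = tid; col < ncols; col += block_size) {
        dst[row*ncols + col] = scale * x[row*ncols + col];
    }
}

With the annotation in place, the compiler reserves resources for up to block_size threads per block instead of its 256-thread default, so the (1024, 1, 1) launch no longer trips the runtime check.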

fxzjshm added a commit to fxzjshm/llama.cpp that referenced this issue Feb 3, 2025
Related: ggml-org#10610
Signed-off-by: fxzjshm <fxzjshm@163.com>
fxzjshm added a commit to fxzjshm/llama.cpp that referenced this issue Feb 3, 2025
Some old compilers still use 256. Explicitly set it to 1024 to get correct
result from ops like ARGMAX and GROUP_NORM.

Related: ggml-org#10610, ggml-org#11619
Signed-off-by: fxzjshm <fxzjshm@163.com>