
Misc. bug: Launch params (1024, 1, 1) are larger than launch bounds (256) for kernel _ZL12rms_norm_f32ILi1024EEvPKfPfif please add __launch_bounds__ to kernel define or use --gpu-max-threads-per-block recompile program ! #10610

Closed
wangzd0209 opened this issue Dec 1, 2024 · 9 comments

Comments

@wangzd0209

Name and Version

I use ollama to run this model, but something is wrong and it shows the following:

llama_new_context_with_model: graph splits = 2
Launch params (1024, 1, 1) are larger than launch bounds (256) for kernel _ZL12rms_norm_f32ILi1024EEvPKfPfif please add __launch_bounds__ to kernel define or use --gpu-max-threads-per-block recompile program !

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

No response

Problem description & steps to reproduce

I use ollama to run this model, but something is wrong and it shows the following:

llama_new_context_with_model: graph splits = 2
Launch params (1024, 1, 1) are larger than launch bounds (256) for kernel _ZL12rms_norm_f32ILi1024EEvPKfPfif please add __launch_bounds__ to kernel define or use --gpu-max-threads-per-block recompile program !

First Bad Commit

No response

Relevant log output

No response

@JohannesGaessler
Collaborator

This issue needs more information to debug. Please take a look at the "Bug (model use)" template and either re-open the issue using that template or provide the corresponding information here. In particular, please reproduce the issue using llama.cpp only.

@wangzd0209
Author

This issue needs more information to debug. Please take a look at the "Bug (model use)" template and either re-open the issue using that template or provide the corresponding information here. In particular, please reproduce the issue using llama.cpp only.

I use a special GPU called DCU, which you can think of as a variant of gfx906. I changed some code in ggml/src/CMakeLists.txt like this:
(screenshot of the change, not reproduced here)
Could this change, or something else, be causing this issue? And thanks for your help.

@JohannesGaessler
Collaborator

I don't know how to fix this issue.

@qnixsynapse
Contributor

It is failing in the rms_norm_f32 kernel. But which backend, ROCm/HIPBLAS?
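
(For reference, the mangled name in the error message demangles, roughly, to:

$ echo _ZL12rms_norm_f32ILi1024EEvPKfPfif | c++filt
void rms_norm_f32<1024>(float const*, float*, int, float)

i.e. the templated RMS-norm kernel instantiated for 1024 threads per block, which is what exceeds the 256-thread launch bound.)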

@nyl199310

I also encountered the same issue. I built the code with HIPBLAS.
If I set --ngl, it prints the message below and the output is garbled. Is there a way to fix this issue?

root@worker-0:/public/home/scnalnwgla/libs/llama.cpp/build/bin# ./llama-cli -m /public/home/scnalnwgla/Llama-3.2-1B-Instruct.Q4_K_M.gguf -p "You are a helpful assistant" -cnv -ngl 16
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: Device 66a1, compute capability 9.0, VMM: no
build: 4311 (9fdb1243) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_load_model_from_file: using device ROCm0 (Device 66a1) - 15852 MiB free
llama_model_loader: loaded meta data with 29 key-value pairs and 147 tensors from /public/home/scnalnwgla/Llama-3.2-1B-Instruct.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Models
llama_model_loader: - kv   3:                         general.size_label str              = 1.2B
llama_model_loader: - kv   4:                            general.license str              = llama3.2
llama_model_loader: - kv   5:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   6:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   7:                          llama.block_count u32              = 16
llama_model_loader: - kv   8:                       llama.context_length u32              = 131072
llama_model_loader: - kv   9:                     llama.embedding_length u32              = 2048
llama_model_loader: - kv  10:                  llama.feed_forward_length u32              = 8192
llama_model_loader: - kv  11:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv  12:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  13:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  14:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  15:                 llama.attention.key_length u32              = 64
llama_model_loader: - kv  16:               llama.attention.value_length u32              = 64
llama_model_loader: - kv  17:                          general.file_type u32              = 15
llama_model_loader: - kv  18:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  19:                 llama.rope.dimension_count u32              = 64
llama_model_loader: - kv  20:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  21:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  22:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  23:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  24:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  25:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  26:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  27:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  28:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   34 tensors
llama_model_loader: - type q4_K:   96 tensors
llama_model_loader: - type q6_K:   17 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 2048
llm_load_print_meta: n_layer          = 16
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 64
llm_load_print_meta: n_embd_head_v    = 64
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 512
llm_load_print_meta: n_embd_v_gqa     = 512
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 8192
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 1B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 1.24 B
llm_load_print_meta: model size       = 762.81 MiB (5.18 BPW) 
llm_load_print_meta: general.name     = Models
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token        = 128008 '<|eom_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOG token        = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: offloading 16 repeating layers to GPU
llm_load_tensors: offloaded 16/17 layers to GPU
llm_load_tensors:        ROCm0 model buffer size =   557.31 MiB
llm_load_tensors:   CPU_Mapped model buffer size =   762.81 MiB
..............................................................
llama_new_context_with_model: n_seq_max     = 1
llama_new_context_with_model: n_ctx         = 4096
llama_new_context_with_model: n_ctx_per_seq = 4096
llama_new_context_with_model: n_batch       = 2048
llama_new_context_with_model: n_ubatch      = 512
llama_new_context_with_model: flash_attn    = 0
llama_new_context_with_model: freq_base     = 500000.0
llama_new_context_with_model: freq_scale    = 1
llama_new_context_with_model: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init:      ROCm0 KV buffer size =   128.00 MiB
llama_new_context_with_model: KV self size  =  128.00 MiB, K (f16):   64.00 MiB, V (f16):   64.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.49 MiB
llama_new_context_with_model:      ROCm0 compute buffer size =   459.99 MiB
llama_new_context_with_model:  ROCm_Host compute buffer size =    12.01 MiB
llama_new_context_with_model: graph nodes  = 518
llama_new_context_with_model: graph splits = 4 (with bs=512), 3 (with bs=1)
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
Launch params (1024, 1, 1) are larger than launch bounds (256) for kernel _ZL12rms_norm_f32ILi1024EEvPKfPfif please add __launch_bounds__ to kernel define or use --gpu-max-threads-per-block recompile program ! 
main: llama threadpool init, n_threads = 32
main: chat template example:
<|start_header_id|>system<|end_header_id|>

You are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>

Hello<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Hi there<|eot_id|><|start_header_id|>user<|end_header_id|>

How are you?<|eot_id|><|start_header_id|>assistant<|end_header_id|>



system_info: n_threads = 32 (n_threads_batch = 32) / 32 | ROCm : PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | 

main: interactive mode on.
sampler seed: 138504786
sampler params: 
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1
        top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 1

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

system

You are a helpful assistant


> hi
*=C>&C+(>2&!='H.F!>CE*"B#2'E7E%:18&B:-.4;8@C.<,!49$4-@/,#><(G,17<32#)$:E87-D3;--7EE,8B!*7)1$BB92%3,#F&F,0G8+%7B(CG/F>C:6;+=;@8<9<CD7,EEA%*)5C$)4@)-*!F92>C';/-%4)<B"=!C*%%#=-.G;)8==08$81;=5%$E3,)H+/.E/3@!B/<-H8!!4'&80+.&A2:99:(-9'=+E%4+)F=8,068$C6#:+9-6"A*F65"8C:F:&5'="3$98=#9:0<=9'H!0D092E7;)0*"%5&#+2=@//B($0"23/>#!"15$(<D%F4.%@B,C)H,:-=F8&1*E<>G$7B;8H@'A/%F-8H>90'14%30)<+A;(DA:GE30EB0C6):C@@*CC"-0(('.8-,%A)F%5/;9(8:@/!G<7+A@BH79!'6&H/%E%A93H8#1*B'!=G@.9$G6,3B0C/:..)C&B%7A>H9!(:85'!B.3@57)=3E*-6FA"%8C<>*"#"6-<)8BH*)64"D8B.";E:.(31.G/C'%.EC@7'#!42+&="<8E7F;.66;'#17-B>59<FAH840B"1>2:9F
> exit
F=FB6:@B$%("A!<C.2A'=CG$"2C-:""@B@/9&#BD"5=<:H,'*;>E18*!0'H
> 
llama_perf_sampler_print:    sampling time =       6.90 ms /    70 runs   (    0.10 ms per token, 10144.93 tokens per second)
llama_perf_context_print:        load time =    2438.37 ms
llama_perf_context_print: prompt eval time =  178163.45 ms /    33 tokens ( 5398.89 ms per token,     0.19 tokens per second)
llama_perf_context_print:        eval time =   10377.75 ms /   647 runs   (   16.04 ms per token,    62.34 tokens per second)
llama_perf_context_print:       total time =  189030.88 ms /   680 tokens
Interrupted by user

@JohannesGaessler
Collaborator

Please open a new issue and fill out the "model use" template.

@JohannesGaessler
Collaborator

Actually, if you're also using the same special GPU, it will most likely not be possible to make it work unless a developer invests the effort to support it (which is not likely).

@github-actions github-actions bot added the stale label Jan 12, 2025

This issue was closed because it has been inactive for 14 days since being marked as stale.

@fxzjshm
Contributor

fxzjshm commented Feb 3, 2025

For some unknown reason, the DCU SDK people selected 256 as the default launch bound instead of the common value 1024, which breaks some assumptions for ops like argmax_f32 (test-backend-ops fails on ARGMAX with size >= 1024).

Workaround: as indicated by the error, simply add --gpu-max-threads-per-block to the compile flags:

HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" HIPFLAGS=" --gpu-max-threads-per-block=1024 " \
cmake .. -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx906
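
After rebuilding with this flag, the affected ops can be re-checked with the backend test binary; the invocation below is from memory, so adjust the path and op names to your build:

./bin/test-backend-ops test -o ARGMAX
./bin/test-backend-ops test -o GROUP_NORM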

I will report this issue to them; should this be documented in e.g. build.md, in the HIP section?
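
For completeness, the other remedy the error message points at is annotating the kernel itself with __launch_bounds__. A rough HIP sketch of the idea (a simplified stand-in, not the actual ggml source, which reduces with warp shuffles) could look like this:

#include <hip/hip_runtime.h>

// Simplified illustration of where the __launch_bounds__ annotation goes;
// block_size is the template parameter the kernel is instantiated with (1024 here).
template <int block_size>
static __global__ void __launch_bounds__(block_size)
rms_norm_f32(const float * x, float * dst, const int ncols, const float eps) {
    const int row = blockIdx.x;   // one block per row
    const int tid = threadIdx.x;

    // each thread accumulates a partial sum of squares over its slice of the row
    float sum = 0.0f;
    for (int col = tid; col < ncols; col += block_size) {
        const float v = x[row*ncols + col];
        sum += v*v;
    }

    // block-wide tree reduction in shared memory
    __shared__ float buf[block_size];
    buf[tid] = sum;
    __syncthreads();
    for (int offset = block_size/2; offset > 0; offset /= 2) {
        if (tid < offset) {
            buf[tid] += buf[tid + offset];
        }
        __syncthreads();
    }

    // y = x / sqrt(mean(x^2) + eps)
    const float scale = rsqrtf(buf[0]/ncols + eps);
    for (int col = tid; col < ncols; col += block_size) {
        dst[row*ncols + col] = scale * x[row*ncols + col];
    }
}

With the annotation in place, the compiler reserves resources for up to block_size threads per block instead of its 256-thread default, so the (1024, 1, 1) launch no longer trips the runtime check.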

fxzjshm added a commit to fxzjshm/llama.cpp that referenced this issue Feb 3, 2025
Related: ggml-org#10610
Signed-off-by: fxzjshm <fxzjshm@163.com>
fxzjshm added a commit to fxzjshm/llama.cpp that referenced this issue Feb 3, 2025
Some old compilers still use 256. Explicitly set it to 1024 to get correct
result from ops like ARGMAX and GROUP_NORM.

Related: ggml-org#10610, ggml-org#11619
Signed-off-by: fxzjshm <fxzjshm@163.com>