Bug: Docker ROCm crashes, only works when compiled on bare metal. #8213

Closed
rudiservo opened this issue Jun 29, 2024 · 4 comments
Labels: bug-unconfirmed, critical severity, stale

Comments

rudiservo (Contributor)

What happened?

The Docker image with ROCm 5.6 exits after the graph splits. I also tried building an image with ROCm 5.6, 5.7.1, and 6.1.2.

These last ones give me the error that is in the logs below.

If I compile and run it on bare metal, it works flawlessly.

I have been trying to run it with several versions for the past 7 days.
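For context, the custom images were built roughly like this (a sketch only: the .devops Dockerfile name and the ROCM_VERSION build arg are assumptions and may differ between revisions):

# build a ROCm server image against a specific ROCm base version (hypothetical file name)
docker build -t local/llama.cpp:server-rocm \
    --build-arg ROCM_VERSION=6.1.2 \
    -f .devops/server-rocm.Dockerfile .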

Name and Version

Latest build, always pulled within the last 7 days.

System is Pop!_OS 22.04
ROCm 6.1.2
Kernel 6.9.3

What operating system are you seeing the problem on?

Linux

Relevant log output

llamacpp_1  | INFO [                    main] build info | tid="133799363425664" timestamp=1719689759 build=0 commit="unknown"
llamacpp_1  | INFO [                    main] system info | tid="133799363425664" timestamp=1719689759 n_threads=16 n_threads_batch=-1 total_threads=32 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "
llamacpp_1  | llama_model_loader: loaded meta data with 26 key-value pairs and 291 tensors from /models/Mistral-7B-Instruct-v0.3-Q8_0.gguf (version GGUF V3 (latest))
llamacpp_1  | llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llamacpp_1  | llama_model_loader: - kv   0:                       general.architecture str              = llama
llamacpp_1  | llama_model_loader: - kv   1:                               general.name str              = Mistral-7B-Instruct-v0.3
llamacpp_1  | llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llamacpp_1  | llama_model_loader: - kv   3:                       llama.context_length u32              = 32768
llamacpp_1  | llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llamacpp_1  | llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llamacpp_1  | llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
llamacpp_1  | llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
llamacpp_1  | llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 1000000.000000
llamacpp_1  | llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llamacpp_1  | llama_model_loader: - kv  10:                          general.file_type u32              = 7
llamacpp_1  | llama_model_loader: - kv  11:                           llama.vocab_size u32              = 32768
llamacpp_1  | llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
llamacpp_1  | llama_model_loader: - kv  13:            tokenizer.ggml.add_space_prefix bool             = true
llamacpp_1  | llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = llama
llamacpp_1  | llama_model_loader: - kv  15:                         tokenizer.ggml.pre str              = default
llamacpp_1  | llama_model_loader: - kv  16:                      tokenizer.ggml.tokens arr[str,32768]   = ["<unk>", "<s>", "</s>", "[INST]", "[...
llamacpp_1  | llama_model_loader: - kv  17:                      tokenizer.ggml.scores arr[f32,32768]   = [0.000000, 0.000000, 0.000000, 0.0000...
llamacpp_1  | llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,32768]   = [2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llamacpp_1  | llama_model_loader: - kv  19:                tokenizer.ggml.bos_token_id u32              = 1
llamacpp_1  | llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 2
llamacpp_1  | llama_model_loader: - kv  21:            tokenizer.ggml.unknown_token_id u32              = 0
llamacpp_1  | llama_model_loader: - kv  22:               tokenizer.ggml.add_bos_token bool             = true
llamacpp_1  | llama_model_loader: - kv  23:               tokenizer.ggml.add_eos_token bool             = false
llamacpp_1  | llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {{ bos_token }}{% for message in mess...
llamacpp_1  | llama_model_loader: - kv  25:               general.quantization_version u32              = 2
llamacpp_1  | llama_model_loader: - type  f32:   65 tensors
llamacpp_1  | llama_model_loader: - type q8_0:  226 tensors
llamacpp_1  | llm_load_vocab: special tokens cache size = 1027
llamacpp_1  | llm_load_vocab: token to piece cache size = 0.1731 MB
llamacpp_1  | llm_load_print_meta: format           = GGUF V3 (latest)
llamacpp_1  | llm_load_print_meta: arch             = llama
llamacpp_1  | llm_load_print_meta: vocab type       = SPM
llamacpp_1  | llm_load_print_meta: n_vocab          = 32768
llamacpp_1  | llm_load_print_meta: n_merges         = 0
llamacpp_1  | llm_load_print_meta: n_ctx_train      = 32768
llamacpp_1  | llm_load_print_meta: n_embd           = 4096
llamacpp_1  | llm_load_print_meta: n_head           = 32
llamacpp_1  | llm_load_print_meta: n_head_kv        = 8
llamacpp_1  | llm_load_print_meta: n_layer          = 32
llamacpp_1  | llm_load_print_meta: n_rot            = 128
llamacpp_1  | llm_load_print_meta: n_embd_head_k    = 128
llamacpp_1  | llm_load_print_meta: n_embd_head_v    = 128
llamacpp_1  | llm_load_print_meta: n_gqa            = 4
llamacpp_1  | llm_load_print_meta: n_embd_k_gqa     = 1024
llamacpp_1  | llm_load_print_meta: n_embd_v_gqa     = 1024
llamacpp_1  | llm_load_print_meta: f_norm_eps       = 0.0e+00
llamacpp_1  | llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llamacpp_1  | llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llamacpp_1  | llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llamacpp_1  | llm_load_print_meta: f_logit_scale    = 0.0e+00
llamacpp_1  | llm_load_print_meta: n_ff             = 14336
llamacpp_1  | llm_load_print_meta: n_expert         = 0
llamacpp_1  | llm_load_print_meta: n_expert_used    = 0
llamacpp_1  | llm_load_print_meta: causal attn      = 1
llamacpp_1  | llm_load_print_meta: pooling type     = 0
llamacpp_1  | llm_load_print_meta: rope type        = 0
llamacpp_1  | llm_load_print_meta: rope scaling     = linear
llamacpp_1  | llm_load_print_meta: freq_base_train  = 1000000.0
llamacpp_1  | llm_load_print_meta: freq_scale_train = 1
llamacpp_1  | llm_load_print_meta: n_ctx_orig_yarn  = 32768
llamacpp_1  | llm_load_print_meta: rope_finetuned   = unknown
llamacpp_1  | llm_load_print_meta: ssm_d_conv       = 0
llamacpp_1  | llm_load_print_meta: ssm_d_inner      = 0
llamacpp_1  | llm_load_print_meta: ssm_d_state      = 0
llamacpp_1  | llm_load_print_meta: ssm_dt_rank      = 0
llamacpp_1  | llm_load_print_meta: model type       = 7B
llamacpp_1  | llm_load_print_meta: model ftype      = Q8_0
llamacpp_1  | llm_load_print_meta: model params     = 7.25 B
llamacpp_1  | llm_load_print_meta: model size       = 7.17 GiB (8.50 BPW) 
llamacpp_1  | llm_load_print_meta: general.name     = Mistral-7B-Instruct-v0.3
llamacpp_1  | llm_load_print_meta: BOS token        = 1 '<s>'
llamacpp_1  | llm_load_print_meta: EOS token        = 2 '</s>'
llamacpp_1  | llm_load_print_meta: UNK token        = 0 '<unk>'
llamacpp_1  | llm_load_print_meta: LF token         = 781 '<0x0A>'
llamacpp_1  | llm_load_print_meta: max token length = 48
llamacpp_1  | ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
llamacpp_1  | ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
llamacpp_1  | ggml_cuda_init: found 1 ROCm devices:
llamacpp_1  |   Device 0: Radeon RX 7900 XTX, compute capability 11.0, VMM: no
llamacpp_1  | llm_load_tensors: ggml ctx size =    0.27 MiB
llamacpp_1  | llm_load_tensors: offloading 32 repeating layers to GPU
llamacpp_1  | llm_load_tensors: offloading non-repeating layers to GPU
llamacpp_1  | llm_load_tensors: offloaded 33/33 layers to GPU
llamacpp_1  | llm_load_tensors:      ROCm0 buffer size =  7209.02 MiB
llamacpp_1  | llm_load_tensors:        CPU buffer size =   136.00 MiB
llamacpp_1  | ...................................................................................................
llamacpp_1  | llama_new_context_with_model: n_ctx      = 512
llamacpp_1  | llama_new_context_with_model: n_batch    = 512
llamacpp_1  | llama_new_context_with_model: n_ubatch   = 512
llamacpp_1  | llama_new_context_with_model: flash_attn = 0
llamacpp_1  | llama_new_context_with_model: freq_base  = 1000000.0
llamacpp_1  | llama_new_context_with_model: freq_scale = 1
llamacpp_1  | llama_kv_cache_init:      ROCm0 KV buffer size =    64.00 MiB
llamacpp_1  | llama_new_context_with_model: KV self size  =   64.00 MiB, K (f16):   32.00 MiB, V (f16):   32.00 MiB
llamacpp_1  | llama_new_context_with_model:  ROCm_Host  output buffer size =     0.25 MiB
llamacpp_1  | llama_new_context_with_model:      ROCm0 compute buffer size =    81.00 MiB
llamacpp_1  | llama_new_context_with_model:  ROCm_Host compute buffer size =     9.01 MiB
llamacpp_1  | llama_new_context_with_model: graph nodes  = 1030
llamacpp_1  | llama_new_context_with_model: graph splits = 2
llamacpp_1  | ggml_cuda_compute_forward: RMS_NORM failed
llamacpp_1  | CUDA error: invalid device function
llamacpp_1  |   current device: 0, in function ggml_cuda_compute_forward at ggml/src/ggml-cuda.cu:2285
llamacpp_1  |   err
llamacpp_1  | GGML_ASSERT: ggml/src/ggml-cuda.cu:100: !"CUDA error"
rudiservo added the bug-unconfirmed and critical severity labels on Jun 29, 2024

Arvamer commented Jul 1, 2024

I had a similar error on Arch Linux (ROCm 6.0.2) with an RX 6700 XT, and what helped for me was compiling with AMDGPU_TARGETS=gfx1030. Looking at the Makefile, when AMDGPU_TARGETS is not set, it auto-detects the arch as gfx1031. However, gfx1031 is not officially supported, so I have to set HSA_OVERRIDE_GFX_VERSION=10.3.0, and I guess ROCm doesn't like that llama.cpp was compiled for a "different" GPU arch.
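For anyone else hitting this, a minimal sketch of the workaround (assuming the Make-based HIP build; the flag is LLAMA_HIPBLAS on older trees and GGML_HIPBLAS on newer ones, and the binary name depends on the build):

# compile for the officially supported gfx1030 ISA instead of the auto-detected gfx1031
make GGML_HIPBLAS=1 AMDGPU_TARGETS=gfx1030 -j

# at runtime, tell ROCm to treat the RX 6700 XT (gfx1031) as gfx1030
HSA_OVERRIDE_GFX_VERSION=10.3.0 ./llama-server -m /models/Mistral-7B-Instruct-v0.3-Q8_0.gguf -ngl 99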

rudiservo (Contributor, Author)

@Arvamer Oh... in the Dockerfile in .devops, the ENV variable that is set is GPU_TARGETS, not AMDGPU_TARGETS.

I'm going to try changing it and will report my findings.

rudiservo (Contributor, Author)

Found the issues in the Dockerfile for ROCm.
GPU_TARGETS has to be AMDGPU_TARGETS,

and
ARG ROCM_DOCKER_ARCH is missing the quotation marks.
So it becomes
ARG ROCM_DOCKER_ARCH="\ gfx803 \ gfx900 \ gfx906 \ gfx908 \ gfx90a \ gfx1010 \ gfx1030 \ gfx1100 \ gfx1101 \ gfx1102"
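For illustration, the corrected lines in the ROCm Dockerfile would look roughly like this (a sketch only; whether the targets are consumed via ENV or passed straight to the build command depends on the revision):

ARG ROCM_DOCKER_ARCH="\
    gfx803 \
    gfx900 \
    gfx906 \
    gfx908 \
    gfx90a \
    gfx1010 \
    gfx1030 \
    gfx1100 \
    gfx1101 \
    gfx1102"

# the HIP build reads AMDGPU_TARGETS, not GPU_TARGETS
ENV AMDGPU_TARGETS=${ROCM_DOCKER_ARCH}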
There also need to be two different ROCm image variants, ROCm 5 and ROCm 6.
There is a noticeable performance improvement on ROCm 6.1.2.

gfx803 and gfx900 are not supported, and gfx906 is deprecated on ROCm 6.

Should I make a PR?


This issue was closed because it has been inactive for 14 days since being marked as stale.
