Bug: Docker ROCm crashes, only works when compiled on bare metal. #8213

Closed
rudiservo opened this issue Jun 29, 2024 · 4 comments
Labels: bug-unconfirmed, critical severity, stale

Comments

rudiservo (Contributor)

What happened?

The Docker image with ROCm 5.6 exits after the graph splits. I also tried building an image with ROCm 5.6, 5.7.1, and 6.1.2.

These last ones give me the error that is in the logs below.

If I compile and run it on bare metal, it works flawlessly.

I have been trying to run it with several versions for the past 7 days.
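For context, the custom images were built roughly like this (a sketch only: the .devops Dockerfile name and the ROCM_VERSION build arg are assumptions and may differ between revisions):

# build a ROCm server image against a specific ROCm base version (hypothetical file name)
docker build -t local/llama.cpp:server-rocm \
    --build-arg ROCM_VERSION=6.1.2 \
    -f .devops/server-rocm.Dockerfile .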

Name and Version

Latest build, always pulled within the last 7 days.

System is Pop!_OS 22.04
ROCm 6.1.2
Kernel 6.9.3

What operating system are you seeing the problem on?

Linux

Relevant log output

llamacpp_1  | INFO [                    main] build info | tid="133799363425664" timestamp=1719689759 build=0 commit="unknown"
llamacpp_1  | INFO [                    main] system info | tid="133799363425664" timestamp=1719689759 n_threads=16 n_threads_batch=-1 total_threads=32 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "
llamacpp_1  | llama_model_loader: loaded meta data with 26 key-value pairs and 291 tensors from /models/Mistral-7B-Instruct-v0.3-Q8_0.gguf (version GGUF V3 (latest))
llamacpp_1  | llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llamacpp_1  | llama_model_loader: - kv   0:                       general.architecture str              = llama
llamacpp_1  | llama_model_loader: - kv   1:                               general.name str              = Mistral-7B-Instruct-v0.3
llamacpp_1  | llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llamacpp_1  | llama_model_loader: - kv   3:                       llama.context_length u32              = 32768
llamacpp_1  | llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llamacpp_1  | llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llamacpp_1  | llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
llamacpp_1  | llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
llamacpp_1  | llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 1000000.000000
llamacpp_1  | llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llamacpp_1  | llama_model_loader: - kv  10:                          general.file_type u32              = 7
llamacpp_1  | llama_model_loader: - kv  11:                           llama.vocab_size u32              = 32768
llamacpp_1  | llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
llamacpp_1  | llama_model_loader: - kv  13:            tokenizer.ggml.add_space_prefix bool             = true
llamacpp_1  | llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = llama
llamacpp_1  | llama_model_loader: - kv  15:                         tokenizer.ggml.pre str              = default
llamacpp_1  | llama_model_loader: - kv  16:                      tokenizer.ggml.tokens arr[str,32768]   = ["<unk>", "<s>", "</s>", "[INST]", "[...
llamacpp_1  | llama_model_loader: - kv  17:                      tokenizer.ggml.scores arr[f32,32768]   = [0.000000, 0.000000, 0.000000, 0.0000...
llamacpp_1  | llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,32768]   = [2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llamacpp_1  | llama_model_loader: - kv  19:                tokenizer.ggml.bos_token_id u32              = 1
llamacpp_1  | llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 2
llamacpp_1  | llama_model_loader: - kv  21:            tokenizer.ggml.unknown_token_id u32              = 0
llamacpp_1  | llama_model_loader: - kv  22:               tokenizer.ggml.add_bos_token bool             = true
llamacpp_1  | llama_model_loader: - kv  23:               tokenizer.ggml.add_eos_token bool             = false
llamacpp_1  | llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {{ bos_token }}{% for message in mess...
llamacpp_1  | llama_model_loader: - kv  25:               general.quantization_version u32              = 2
llamacpp_1  | llama_model_loader: - type  f32:   65 tensors
llamacpp_1  | llama_model_loader: - type q8_0:  226 tensors
llamacpp_1  | llm_load_vocab: special tokens cache size = 1027
llamacpp_1  | llm_load_vocab: token to piece cache size = 0.1731 MB
llamacpp_1  | llm_load_print_meta: format           = GGUF V3 (latest)
llamacpp_1  | llm_load_print_meta: arch             = llama
llamacpp_1  | llm_load_print_meta: vocab type       = SPM
llamacpp_1  | llm_load_print_meta: n_vocab          = 32768
llamacpp_1  | llm_load_print_meta: n_merges         = 0
llamacpp_1  | llm_load_print_meta: n_ctx_train      = 32768
llamacpp_1  | llm_load_print_meta: n_embd           = 4096
llamacpp_1  | llm_load_print_meta: n_head           = 32
llamacpp_1  | llm_load_print_meta: n_head_kv        = 8
llamacpp_1  | llm_load_print_meta: n_layer          = 32
llamacpp_1  | llm_load_print_meta: n_rot            = 128
llamacpp_1  | llm_load_print_meta: n_embd_head_k    = 128
llamacpp_1  | llm_load_print_meta: n_embd_head_v    = 128
llamacpp_1  | llm_load_print_meta: n_gqa            = 4
llamacpp_1  | llm_load_print_meta: n_embd_k_gqa     = 1024
llamacpp_1  | llm_load_print_meta: n_embd_v_gqa     = 1024
llamacpp_1  | llm_load_print_meta: f_norm_eps       = 0.0e+00
llamacpp_1  | llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llamacpp_1  | llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llamacpp_1  | llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llamacpp_1  | llm_load_print_meta: f_logit_scale    = 0.0e+00
llamacpp_1  | llm_load_print_meta: n_ff             = 14336
llamacpp_1  | llm_load_print_meta: n_expert         = 0
llamacpp_1  | llm_load_print_meta: n_expert_used    = 0
llamacpp_1  | llm_load_print_meta: causal attn      = 1
llamacpp_1  | llm_load_print_meta: pooling type     = 0
llamacpp_1  | llm_load_print_meta: rope type        = 0
llamacpp_1  | llm_load_print_meta: rope scaling     = linear
llamacpp_1  | llm_load_print_meta: freq_base_train  = 1000000.0
llamacpp_1  | llm_load_print_meta: freq_scale_train = 1
llamacpp_1  | llm_load_print_meta: n_ctx_orig_yarn  = 32768
llamacpp_1  | llm_load_print_meta: rope_finetuned   = unknown
llamacpp_1  | llm_load_print_meta: ssm_d_conv       = 0
llamacpp_1  | llm_load_print_meta: ssm_d_inner      = 0
llamacpp_1  | llm_load_print_meta: ssm_d_state      = 0
llamacpp_1  | llm_load_print_meta: ssm_dt_rank      = 0
llamacpp_1  | llm_load_print_meta: model type       = 7B
llamacpp_1  | llm_load_print_meta: model ftype      = Q8_0
llamacpp_1  | llm_load_print_meta: model params     = 7.25 B
llamacpp_1  | llm_load_print_meta: model size       = 7.17 GiB (8.50 BPW) 
llamacpp_1  | llm_load_print_meta: general.name     = Mistral-7B-Instruct-v0.3
llamacpp_1  | llm_load_print_meta: BOS token        = 1 '<s>'
llamacpp_1  | llm_load_print_meta: EOS token        = 2 '</s>'
llamacpp_1  | llm_load_print_meta: UNK token        = 0 '<unk>'
llamacpp_1  | llm_load_print_meta: LF token         = 781 '<0x0A>'
llamacpp_1  | llm_load_print_meta: max token length = 48
llamacpp_1  | ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
llamacpp_1  | ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
llamacpp_1  | ggml_cuda_init: found 1 ROCm devices:
llamacpp_1  |   Device 0: Radeon RX 7900 XTX, compute capability 11.0, VMM: no
llamacpp_1  | llm_load_tensors: ggml ctx size =    0.27 MiB
llamacpp_1  | llm_load_tensors: offloading 32 repeating layers to GPU
llamacpp_1  | llm_load_tensors: offloading non-repeating layers to GPU
llamacpp_1  | llm_load_tensors: offloaded 33/33 layers to GPU
llamacpp_1  | llm_load_tensors:      ROCm0 buffer size =  7209.02 MiB
llamacpp_1  | llm_load_tensors:        CPU buffer size =   136.00 MiB
llamacpp_1  | ...................................................................................................
llamacpp_1  | llama_new_context_with_model: n_ctx      = 512
llamacpp_1  | llama_new_context_with_model: n_batch    = 512
llamacpp_1  | llama_new_context_with_model: n_ubatch   = 512
llamacpp_1  | llama_new_context_with_model: flash_attn = 0
llamacpp_1  | llama_new_context_with_model: freq_base  = 1000000.0
llamacpp_1  | llama_new_context_with_model: freq_scale = 1
llamacpp_1  | llama_kv_cache_init:      ROCm0 KV buffer size =    64.00 MiB
llamacpp_1  | llama_new_context_with_model: KV self size  =   64.00 MiB, K (f16):   32.00 MiB, V (f16):   32.00 MiB
llamacpp_1  | llama_new_context_with_model:  ROCm_Host  output buffer size =     0.25 MiB
llamacpp_1  | llama_new_context_with_model:      ROCm0 compute buffer size =    81.00 MiB
llamacpp_1  | llama_new_context_with_model:  ROCm_Host compute buffer size =     9.01 MiB
llamacpp_1  | llama_new_context_with_model: graph nodes  = 1030
llamacpp_1  | llama_new_context_with_model: graph splits = 2
llamacpp_1  | ggml_cuda_compute_forward: RMS_NORM failed
llamacpp_1  | CUDA error: invalid device function
llamacpp_1  |   current device: 0, in function ggml_cuda_compute_forward at ggml/src/ggml-cuda.cu:2285
llamacpp_1  |   err
llamacpp_1  | GGML_ASSERT: ggml/src/ggml-cuda.cu:100: !"CUDA error"
rudiservo added the bug-unconfirmed and critical severity labels on Jun 29, 2024

Arvamer commented Jul 1, 2024

I had a similar error on Arch Linux (ROCm 6.0.2) with an RX 6700 XT, and what helped for me was compiling with AMDGPU_TARGETS=gfx1030. Looking at the Makefile, when AMDGPU_TARGETS is not set, it auto-detects the arch as gfx1031. However, gfx1031 is not officially supported, so I have to set HSA_OVERRIDE_GFX_VERSION=10.3.0, and I guess ROCm doesn't like that llama.cpp was compiled for a "different" GPU arch.
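For anyone else hitting this, a minimal sketch of the workaround (assuming the Make-based HIP build; the flag is LLAMA_HIPBLAS on older trees and GGML_HIPBLAS on newer ones, and the binary name depends on the build):

# compile for the officially supported gfx1030 ISA instead of the auto-detected gfx1031
make GGML_HIPBLAS=1 AMDGPU_TARGETS=gfx1030 -j

# at runtime, tell ROCm to treat the RX 6700 XT (gfx1031) as gfx1030
HSA_OVERRIDE_GFX_VERSION=10.3.0 ./llama-server -m /models/Mistral-7B-Instruct-v0.3-Q8_0.gguf -ngl 99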

rudiservo (Contributor, Author)

@Arvamer Oh... in the Dockerfile in .devops, the ENV variable that is set is GPU_TARGETS, not AMDGPU_TARGETS.

I'm going to try changing it and will report my findings.

rudiservo (Contributor, Author)

Found the issues in the Dockerfile for ROCm.
GPU_TARGETS has to be AMDGPU_TARGETS,

and
ARG ROCM_DOCKER_ARCH is missing the quotation marks.
So it becomes
ARG ROCM_DOCKER_ARCH="\ gfx803 \ gfx900 \ gfx906 \ gfx908 \ gfx90a \ gfx1010 \ gfx1030 \ gfx1100 \ gfx1101 \ gfx1102"
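For illustration, the corrected lines in the ROCm Dockerfile would look roughly like this (a sketch only; whether the targets are consumed via ENV or passed straight to the build command depends on the revision):

ARG ROCM_DOCKER_ARCH="\
    gfx803 \
    gfx900 \
    gfx906 \
    gfx908 \
    gfx90a \
    gfx1010 \
    gfx1030 \
    gfx1100 \
    gfx1101 \
    gfx1102"

# the HIP build reads AMDGPU_TARGETS, not GPU_TARGETS
ENV AMDGPU_TARGETS=${ROCM_DOCKER_ARCH}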
There also need to be two different ROCm image variants, ROCm 5 and ROCm 6.
There is a noticeable performance improvement on ROCm 6.1.2.

gfx803 and gfx900 are not supported, and gfx906 is deprecated on ROCm 6.

Should I make a PR?


This issue was closed because it has been inactive for 14 days since being marked as stale.
