Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

llama_generate_text: error: unable to load model #14

Open
TechnicalParadox opened this issue May 24, 2024 · 5 comments
Open

llama_generate_text: error: unable to load model #14

TechnicalParadox opened this issue May 24, 2024 · 5 comments

Comments

@TechnicalParadox
Copy link

I point the file to the .gguf but only receive this output when attempting to generate text

@Adriankhl
Copy link
Owner

Adriankhl commented May 24, 2024

Hi, can you run godot on a terminal and paste the terminal output here, there should be a bit more information.

@TechnicalParadox
Copy link
Author

Meant to make this an issue under the addon github but this is the console output. It actually works fine with the CPU build of the addon but the vulkan build fails to load the model.

`Vulkan API 1.3.277 - Forward Mobile - Using Vulkan Device #0: NVIDIA - NVIDIA GeForce RTX 4080 Laptop GPU

test1
D:/Godot_v4.2.2-stable_win64.exe/projects/SpaceRoguelike/assets/Meta-Llama-3-8B-Instruct-Q8_0.gguf
llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from D:/Godot_v4.2.2-stable_win64.exe/proj)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = Meta-Llama-3-8B-Instruct-imx
llama_model_loader: - kv 2: llama.block_count u32 = 32
llama_model_loader: - kv 3: llama.context_length u32 = 8192
llama_model_loader: - kv 4: llama.embedding_length u32 = 4096
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 6: llama.attention.head_count u32 = 32
llama_model_loader: - kv 7: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 8: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 7
llama_model_loader: - kv 11: llama.vocab_size u32 = 128256
llama_model_loader: - kv 12: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 14: tokenizer.ggml.tokens arr[str,128256] = ["!", """, "#", "$", "%", .
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1,.
llama_model_loader: - kv 16: tokenizer.ggml.merges arr[str,280147] = ["─á ─á", "─á ─á─á─á", "─á─.
llama_model_loader: - kv 17: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 18: tokenizer.ggml.eos_token_id u32 = 128001
llama_model_loader: - kv 19: tokenizer.chat_template str = {% set loop_messages = mess.
llama_model_loader: - kv 20: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q8_0: 226 tensors
llm_load_vocab: missing pre-tokenizer type, using: 'default'
llm_load_vocab:
llm_load_vocab: ************************************
llm_load_vocab: GENERATION QUALITY WILL BE DEGRADED!
llm_load_vocab: CONSIDER REGENERATING THE MODEL
llm_load_vocab: ************************************
llm_load_vocab:
llm_load_vocab: special tokens definition check successful ( 256/128256 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: n_ctx_train = 8192
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 8192
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 8B
llm_load_print_meta: model ftype = Q8_0
llm_load_print_meta: model params = 8.03 B
llm_load_print_meta: model size = 7.95 GiB (8.50 BPW)
llm_load_print_meta: general.name = Meta-Llama-3-8B-Instruct-imatrix
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128001 '<|end_of_text|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
ggml_vulkan: Found 2 Vulkan devices:
Vulkan0: NVIDIA GeForce RTX 4080 Laptop GPU | uma: 0 | fp16: 1 | warp size: 32
Vulkan1: Microsoft Direct3D12 (NVIDIA GeForce RTX 4080 Laptop GPU) | uma: 0 | fp16: 1 | warp size: 32
llama_model_load: error loading model: vk::Device::createComputePipeline: ErrorOutOfHostMemory
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model 'D:/Godot_v4.2.2-stable_win64.exe/projects/SpaceRoguelike/asset'
Full generation:llama_generate_text: error: unable to load model
Godot Engine v4.2.2.stable.official.15073afe3 - https://godotengine.org
Vulkan API 1.3.277 - Forward Mobile - Using Vulkan Device #0: NVIDIA - NVIDIA GeForce RTX 4080 Laptop GPU

test1
D:/llms/Meta-Llama-3-8B-Instruct-Q8_0.gguf
llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from D:/llms/Meta-Llama-3-8B-Instruct-Q8_0)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = Meta-Llama-3-8B-Instruct-imx
llama_model_loader: - kv 2: llama.block_count u32 = 32
llama_model_loader: - kv 3: llama.context_length u32 = 8192
llama_model_loader: - kv 4: llama.embedding_length u32 = 4096
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 6: llama.attention.head_count u32 = 32
llama_model_loader: - kv 7: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 8: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 7
llama_model_loader: - kv 11: llama.vocab_size u32 = 128256
llama_model_loader: - kv 12: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 14: tokenizer.ggml.tokens arr[str,128256] = ["!", """, "#", "$", "%", .
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1,.
llama_model_loader: - kv 16: tokenizer.ggml.merges arr[str,280147] = ["─á ─á", "─á ─á─á─á", "─á─.
llama_model_loader: - kv 17: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 18: tokenizer.ggml.eos_token_id u32 = 128001
llama_model_loader: - kv 19: tokenizer.chat_template str = {% set loop_messages = mess.
llama_model_loader: - kv 20: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q8_0: 226 tensors
llm_load_vocab: missing pre-tokenizer type, using: 'default'
llm_load_vocab:
llm_load_vocab: ************************************
llm_load_vocab: GENERATION QUALITY WILL BE DEGRADED!
llm_load_vocab: CONSIDER REGENERATING THE MODEL
llm_load_vocab: ************************************
llm_load_vocab:
llm_load_vocab: special tokens definition check successful ( 256/128256 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: n_ctx_train = 8192
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 8192
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 8B
llm_load_print_meta: model ftype = Q8_0
llm_load_print_meta: model params = 8.03 B
llm_load_print_meta: model size = 7.95 GiB (8.50 BPW)
llm_load_print_meta: general.name = Meta-Llama-3-8B-Instruct-imatrix
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128001 '<|end_of_text|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
ggml_vulkan: Found 2 Vulkan devices:
Vulkan0: NVIDIA GeForce RTX 4080 Laptop GPU | uma: 0 | fp16: 1 | warp size: 32
Vulkan1: Microsoft Direct3D12 (NVIDIA GeForce RTX 4080 Laptop GPU) | uma: 0 | fp16: 1 | warp size: 32
llama_model_load: error loading model: vk::Device::createComputePipeline: ErrorOutOfHostMemory
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model 'D:/llms/Meta-Llama-3-8B-Instruct-Q8_0.gguf'
Full generation:llama_generate_text: error: unable to load model
`

@TechnicalParadox
Copy link
Author

GPU Has 12GB of VRAM so it shouldn't be out of memory. I also tried with the 5gb model, same issue

@Adriankhl Adriankhl transferred this issue from Adriankhl/godot-llm-template May 27, 2024
@Adriankhl
Copy link
Owner

Adriankhl commented May 27, 2024

@TechnicalParadox I have transferred the issue to this addon repo.

This is very likely to be an upstream bug (2 vulkan devices for the same gpu). Can you try this new build godot_windows_release.zip, set Split Mode to None (in gdscript $Llama.split_mode = 0) and try setting Main GPU to either 0 or 1 (in gdscript $Llama.main_gpu = 1) to see if it works?

Be aware that the should_output_bos and should_output_eos now become a single should_output_special, it may produce some errors if you are using the released llm template.

@Adriankhl Adriankhl reopened this May 27, 2024
@TechnicalParadox
Copy link
Author

The new build works with split mode set to none and the main gpu at default of 0. Setting main gpu to 1 fails to load model. Both vulkan devices still show in the command prompt. Thank you! Much faster than CPU generation, once it loads onto the gpu

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants