
Support importing GGUF files #1187

Open
richardanaya opened this issue Jan 29, 2024 · 9 comments

Labels
feature The feature request

Comments

@richardanaya commented Jan 29, 2024

I apologize if this seems too far-fetched, but it seemed in line with how ONNX generation works.

@antimora (Collaborator)

If GGUF contains the model graph information, then we can use burn-import's ONNX facility. In burn-import, we convert the ONNX graph to an IR (intermediate representation) (see this doc). So it would be possible to convert the model graph to IR and generate source code + weights.

If GGUF contains only weights, we can go the burn-import PyTorch route, where we only load the weights.

@antimora (Collaborator) commented Jan 29, 2024

From my brief research, the GGUF format contains metadata + tensor weights. This aligns with the burn-import PyTorch route rather than burn-import/ONNX. It means the model needs to be constructed in Burn first, and the weights then loaded into it.

Here is one Rust lib to parse GGUF files: https://github.com/Jimexist/gguf
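
For reference, a minimal sketch of what the existing burn-import PyTorch route looks like, which a GGUF route would presumably mirror. `MyModel` is a hypothetical toy module defined here for illustration, "weights.pt" is a placeholder path, and the exact `Recorder::load` signature may differ between Burn versions:

```rust
use burn::nn::{Linear, LinearConfig};
use burn::prelude::*;
use burn::record::{FullPrecisionSettings, Recorder};
use burn_import::pytorch::PyTorchFileRecorder;

type MyBackend = burn::backend::NdArray<f32>;

// The architecture must be defined in Burn first; only the weights are imported.
#[derive(Module, Debug)]
struct MyModel<B: Backend> {
    fc: Linear<B>,
}

impl<B: Backend> MyModel<B> {
    fn init(device: &B::Device) -> Self {
        Self {
            fc: LinearConfig::new(8, 8).init(device),
        }
    }
}

fn main() {
    let device = Default::default();

    // Read a PyTorch state dict (weights only, no graph) into a Burn record.
    // Tensor names in the file must match the Burn module's field names.
    let record: MyModelRecord<MyBackend> = PyTorchFileRecorder::<FullPrecisionSettings>::default()
        .load("weights.pt".into(), &device)
        .expect("failed to read weights");

    // Load the imported weights into the Burn-defined model.
    let _model = MyModel::<MyBackend>::init(&device).load_record(record);
}
```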

@antimora (Collaborator)

GGUF spec: ggerganov/ggml#302

@antimora (Collaborator)

Parser in Rust: https://github.com/Jimexist/gguf

@antimora antimora changed the title Support generating burn models from GGUF files? Support importing GGUF files Mar 28, 2024
@antimora antimora added the feature The feature request label Mar 29, 2024
@leflambeur commented Jan 18, 2025

Hi, it has been about a year since this was last updated. Since then, pre-existing models on HF typically come in GGUF format when quantised, or Safetensors format when not.

I think it would be useful for people new to the space to understand how Burn can be leveraged with these formats, as they seem to be the most common formats available when starting out.

Specifically, importing quantised GGUF models, as I couldn't see much about it in the docs.

Candle is okay for this, but its model support is a little spotty for quantised models, which are the ones most accessible to people with fewer resources.

I saw in #1323 that some pieces were added for reconstructing config files, but I am wondering about simply ingesting a GGUF model and using it with Burn directly, similar to the import options for ONNX or PyTorch, without people needing to reverse engineer what GGUF is doing under the hood with little guidance.

GGUF's single-file format seems like an ideal target for Burn's use case to me, and the format is much more universally accessible, similar to ONNX on paper.

I am happy to contribute docs; I just need a bit of direction to start testing with the current capabilities, or an indication that it is even possible.

Edit: ref to the Candle issue I am seeing with Mistral-Nemo quantizations:

huggingface/candle#2727

@antimora (Collaborator)

I'll be happy to assist if you decide to submit a PR.

We can leverage Candle's GGUF reader, similar to the PyTorch pt reader, and use the existing burn-import infrastructure. It should be somewhat easier now that PyTorch pt import works.
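
As a rough illustration of that direction, here is a minimal sketch (not existing burn-import code) of reading GGUF metadata and tensor descriptors with Candle's reader; the field and method names are my reading of candle_core's gguf_file module and should be double-checked:

```rust
use candle_core::quantized::gguf_file;
use std::fs::File;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Placeholder path to a GGUF file.
    let mut file = File::open("model.gguf")?;

    // Parses the GGUF header: metadata key/values plus per-tensor descriptors
    // (name, shape, quantized dtype, offset), without reading the tensor data.
    let content = gguf_file::Content::read(&mut file)?;

    for (key, value) in &content.metadata {
        println!("{key} = {value:?}");
    }
    for (name, info) in &content.tensor_infos {
        println!("{name}: shape={:?} dtype={:?}", info.shape, info.ggml_dtype);
    }
    Ok(())
}
```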

@leflambeur commented Jan 20, 2025

I actually made a start last night using candle_core::quantized::gguf_file::Content, as this is decidedly quicker for building out the metadata and also gets you the layer/tensor structures and weights without loading the whole model. From there, I figured you could infer details like MQA vs GQA vs MHA from attention.head_count and attention.head_count_kv, and then map a consistent set of Burn modules or blocks (I am still figuring out which ones are appropriate) to the layers described in the GGUF spec with the correct weights (all of which are consistently named in GGUF, mostly), without needing to do too much more.

Example names from the gguf spec that could be mapped:

tok_embd
attn_norm
attn_k
attn_q
attn_v
attn_output
ffn_norm
ffn_gate
ffn_up
ffn_down
output_norm
output

I am very new to Rust, so it's taking me a bit of time to figure out how to transform the format Content creates. Rather than exposing things directly as u32 or String, everything is wrapped as U32(VALUE) first, and transforming those values and mapping them to the right places to create Burn modules etc. is taking a bit of time and effort.
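
For what it's worth, a small sketch of unwrapping those wrapped values via the to_u32()-style accessors that Candle's Value type appears to provide; get_u32 is a hypothetical helper, and the accessor names should be verified against the candle_core version in use:

```rust
use candle_core::quantized::gguf_file::Content;

// Hypothetical helper: pull a required u32 out of the GGUF metadata,
// e.g. "llama.attention.head_count" stored as U32(32).
fn get_u32(content: &Content, key: &str) -> candle_core::Result<u32> {
    content
        .metadata
        .get(key)
        .ok_or_else(|| candle_core::Error::Msg(format!("missing metadata key: {key}")))?
        .to_u32()
}

fn main() -> candle_core::Result<()> {
    let mut file = std::fs::File::open("model.gguf")?;
    let content = Content::read(&mut file)?;

    // Plain values, ready to map onto Burn module configs.
    let head_count = get_u32(&content, "llama.attention.head_count")? as usize;
    let kv_heads = get_u32(&content, "llama.attention.head_count_kv")? as usize;
    println!("heads={head_count}, kv heads={kv_heads}");
    Ok(())
}
```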

@leflambeur

When I say "stored", here is an example of the key/value structure it uses:

"llama.attention.head_count_kv": U32(8)
"llama.context_length": U32(1024000)
"llama.attention.key_length": U32(128)
"llama.block_count": U32(40)
"general.size_label": String("12B")
"general.file_type": U32(7)
"general.type": String("model")
"llama.attention.value_length": U32(128)
"llama.attention.layer_norm_rms_epsilon": F32(1e-5)
"general.version": String("2407")
"llama.rope.dimension_count": U32(128)
"llama.vocab_size": U32(131072)
"llama.rope.freq_base": F32(1000000.0)
"llama.attention.head_count": U32(32)
"llama.embedding_length": U32(5120)
"llama.feed_forward_length": U32(14336)
"general.quantization_version": U32(2)

Rather than say:

llama.attention.head_count_kv: 8
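
To make the mapping idea concrete, a minimal sketch (my own illustration, not an existing Burn or Candle API) of inferring the attention flavour from those two counts; with the values above (head_count = 32, head_count_kv = 8) it reports GQA:

```rust
#[derive(Debug, PartialEq)]
enum AttentionKind {
    /// Multi-head attention: every query head has its own key/value head.
    Mha,
    /// Grouped-query attention: several query heads share each key/value head.
    Gqa,
    /// Multi-query attention: all query heads share a single key/value head.
    Mqa,
}

fn attention_kind(head_count: u32, head_count_kv: u32) -> AttentionKind {
    if head_count_kv == 1 {
        AttentionKind::Mqa
    } else if head_count_kv == head_count {
        AttentionKind::Mha
    } else {
        AttentionKind::Gqa
    }
}

fn main() {
    // Values from the metadata dump above.
    assert_eq!(attention_kind(32, 8), AttentionKind::Gqa);
    println!("looks like GQA with {} query heads per kv head", 32 / 8);
}
```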

@leflambeur

I actually haven't used Burn at all until now, I only learnt the details of transformer architecture after posting my original comment two days ago, and I started with Rust 3-4 weeks ago. So I will try my best, but I apologise in advance if I can't see it through.

It's partly my motivation for commenting: as someone new to the whole space, GGUF is really all I see, and I would love to make it more accessible to those of us who want to get started. From what I can tell, Burn is well placed for doing that (I also love that you have built-in WGPU support). My ambition for learning Rust here comes from years ago doing a lot of embedded work; I have a load of RPi Picos and various other devices lying around, so I love that you have the demo for them, and I think your approach is fantastic for my goals.

Most of my career until now has been more DevOps oriented, and even then I have been more on the infrastructure and networking side than development, so I am out of my depth but trying.

I can figure out most things on my own, but any general pointers are always welcome; I will try to figure it out.
