
Add support for DBRX models: dbrx-base and dbrx-instruct #6344

Closed
4 tasks done
maziyarpanahi opened this issue Mar 27, 2024 · 36 comments · Fixed by #6515
Labels
enhancement (New feature or request) · model (Model specific)

Comments

@maziyarpanahi

maziyarpanahi commented Mar 27, 2024

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Feature Description

Databricks just released two new models, DBRX Base and DBRX Instruct. They use their own architecture:

{
  "architectures": [
    "DbrxForCausalLM"
  ],
  "attn_config": {
    "clip_qkv": 8,
    "kv_n_heads": 8,
    "model_type": "",
    "rope_theta": 500000
  },
  "auto_map": {
    "AutoConfig": "configuration_dbrx.DbrxConfig",
    "AutoModelForCausalLM": "modeling_dbrx.DbrxForCausalLM"
  },
  "d_model": 6144,
  "emb_pdrop": 0.0,
  "ffn_config": {
    "ffn_hidden_size": 10752,
    "model_type": "",
    "moe_jitter_eps": 0,
    "moe_loss_weight": 0.05,
    "moe_num_experts": 16,
    "moe_top_k": 4
  },
  "initializer_range": 0.02,
  "max_seq_len": 32768,
  "model_type": "dbrx",
  "n_heads": 48,
  "n_layers": 40,
  "output_router_logits": false,
  "resid_pdrop": 0.0,
  "router_aux_loss_coef": 0.05,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.38.2",
  "use_cache": true,
  "vocab_size": 100352
}
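
For orientation, here is a minimal sketch (assuming a local copy of the config.json above; the file path is illustrative) of reading that config and deriving the quantities a conversion script would care about:

import json

# Path is an assumption for illustration; adjust to wherever config.json lives.
with open("dbrx-instruct/config.json") as f:
    cfg = json.load(f)

d_model  = cfg["d_model"]                        # 6144
n_heads  = cfg["n_heads"]                        # 48
kv_heads = cfg["attn_config"]["kv_n_heads"]      # 8 -> grouped-query attention
experts  = cfg["ffn_config"]["moe_num_experts"]  # 16
top_k    = cfg["ffn_config"]["moe_top_k"]        # 4

print("head dim:", d_model // n_heads)                  # 128
print("query heads per KV head:", n_heads // kv_heads)  # 6
print(f"MoE routing: {top_k} of {experts} experts per token")
print("context length:", cfg["max_seq_len"])            # 32768
print("rope theta:", cfg["attn_config"]["rope_theta"])  # 500000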

Motivation

These models outperform predecessors such as Llama-2 and Mixtral (even though they are larger), and the community can really benefit from these two models and from the fine-tuned models that will follow.

https://huggingface.co/databricks/dbrx-instruct

Possible Implementation

If you have an idea as to how it can be implemented, please write a detailed description. Feel free to give links to external sources or share visuals that might be helpful to understand the details better.

python llama.cpp/convert-hf-to-gguf.py

Traceback (most recent call last):
  File "/llama.cpp/convert-hf-to-gguf.py", line 2099, in <module>
    main()
  File "/llama.cpp/convert-hf-to-gguf.py", line 2079, in main
    model_class = Model.from_model_architecture(hparams["architectures"][0])
  File "/llama.cpp/convert-hf-to-gguf.py", line 215, in from_model_architecture
    raise NotImplementedError(f'Architecture {arch!r} not supported!') from None
NotImplementedError: Architecture 'DbrxForCausalLM' not supported!

python llama.cpp/convert.py

  File "/llama.cpp/convert.py", line 1486, in <module>
    main()
  File "/llama.cpp/convert.py", line 1422, in main
    model_plus = load_some_model(args.model)
  File "/llama.cpp/convert.py", line 1291, in load_some_model
    model_plus = merge_multifile_models(models_plus)
  File "/llama.cpp/convert.py", line 747, in merge_multifile_models
    model = merge_sharded([mp.model for mp in models_plus])
  File "/llama.cpp/convert.py", line 726, in merge_sharded
    return {name: convert(name) for name in names}
  File "/llama.cpp/convert.py", line 726, in <dictcomp>
    return {name: convert(name) for name in names}
  File "/llama.cpp/convert.py", line 701, in convert
    lazy_tensors: list[LazyTensor] = [model[name] for model in models]
  File "/llama.cpp/convert.py", line 701, in <listcomp>
    lazy_tensors: list[LazyTensor] = [model[name] for model in models]
KeyError: 'transformer.blocks.0.ffn.experts.mlp.w1'

DBRX is a mixture-of-experts model in which each FFN is divided into 16 experts, of which only 4 are activated at any given time. It is built on MegaBlocks:
https://github.com/databricks/megablocks
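
To make the routing concrete, here is a toy PyTorch sketch of top-4-of-16 expert routing; it is not DBRX's actual modeling code, just the general MoE FFN pattern described above:

import torch

def moe_ffn(x, router_w, experts, top_k=4):
    # x: (tokens, d_model); router_w: (d_model, n_experts);
    # experts: list of callables, each mapping (n, d_model) -> (n, d_model).
    logits = x @ router_w                                # (tokens, n_experts)
    weights, chosen = torch.topk(logits, top_k, dim=-1)  # pick the 4 best experts per token
    weights = torch.softmax(weights, dim=-1)             # renormalize over the chosen experts
    out = torch.zeros_like(x)
    for slot in range(top_k):
        for e, expert in enumerate(experts):
            mask = chosen[:, slot] == e                  # tokens whose slot-th choice is expert e
            if mask.any():
                out[mask] += weights[mask, slot, None] * expert(x[mask])
    return out

# Toy usage: 16 tiny experts, 4 active per token.
experts = [torch.nn.Linear(8, 8) for _ in range(16)]
y = moe_ffn(torch.randn(5, 8), torch.randn(8, 16), experts, top_k=4)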

maziyarpanahi added the enhancement (New feature or request) label Mar 27, 2024
@abhi-mosaic

Hi! DBRX researcher here, happy to help out however I can!

The architecture is quite similar to Mixtral, which is already supported in this framework. The modeling source code for DBRX is available on the HF Hub here: https://huggingface.co/databricks/dbrx-instruct/blob/main/modeling_dbrx.py

The main differences vs. Mixtral as far as I can tell:

Please let me know if you have any questions!

@abhi-mosaic

abhi-mosaic commented Mar 27, 2024

The model is ~132B params so I think the expected memory usage is:
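
The numbers from this comment did not survive extraction here, but as a rough back-of-envelope sketch (weight storage only for ~132B parameters, ignoring KV cache and runtime overhead):

params = 132e9  # ~132B total parameters

for label, bits in [("bf16/fp16", 16), ("8-bit", 8), ("4-bit", 4), ("2-bit", 2)]:
    gib = params * bits / 8 / 2**30
    print(f"{label:>9}: ~{gib:.0f} GiB for the weights alone")

# Roughly: bf16 ~246 GiB, 8-bit ~123 GiB, 4-bit ~61 GiB, 2-bit ~31 GiB,
# which is consistent with the ~66 GB 4-bit and ~70 GB MLX figures mentioned later in the thread.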

@ggerganov
Owner

@abhi-mosaic Thanks for the pointers. We do split the experts into separate tensors at the moment, but that is something we planned to change: #6082

Seems like now is the time to do that.

@maziyarpanahi
Author

Thanks @abhi-mosaic for all the complete and detailed explanations.

@ggerganov I have a big server, so I can test any PR from 16-bit all the way down to 2-bit. (I already have the model downloaded and ready.)

@moshemalawach

Same here: I have big servers with very fast and plentiful RAM channels, so I can try all the sizes on CPU.

@sirus20x6

Put me in coach, I'm ready to play, today.


@veryvanya

happy to test on my server

@simsim314

simsim314 commented Mar 28, 2024

@abhi-mosaic While the llama.cpp folks are working on supporting 16 experts instead of 8, I was thinking of quantizing to 4-bit with the native Hugging Face bitsandbytes, but I am still getting an error.
P.S. This would enable many people with much less computational power to run the model in about 66 GB, i.e. on a single H100 or A100 instead of the current 4.

@abhi-mosaic

@simsim314 take a look at this comment, I think someone found a workaround (a rough sketch of the remapping idea follows the link below):

  • remap the safetensors weight files so they look like separate linear weights, 1 per expert
  • edit the modeling source code to use a List[nn.Linear]
  • the new model should then work with bitsandbytes

https://huggingface.co/databricks/dbrx-instruct/discussions/10#660566f14f41c0c7c0e54ab9
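
Here is a rough sketch of that remapping idea; the fused layout (experts stacked along the first dimension) and the tensor name from the KeyError earlier in this thread are assumptions about the checkpoint, not the exact DBRX format:

import torch
from torch import nn

def split_fused_experts(fused: torch.Tensor, n_experts: int = 16) -> nn.ModuleList:
    # Assumes `fused` stacks per-expert weights along dim 0, e.g. a
    # (n_experts * ffn_hidden, d_model) tensor such as
    # transformer.blocks.0.ffn.experts.mlp.w1 in the traceback above.
    chunks = fused.chunk(n_experts, dim=0)  # one (ffn_hidden, d_model) block per expert
    linears = nn.ModuleList()
    for w in chunks:
        lin = nn.Linear(w.shape[1], w.shape[0], bias=False)  # nn.Linear stores weight as (out, in)
        with torch.no_grad():
            lin.weight.copy_(w)
        linears.append(lin)
    return linears  # a List[nn.Linear]-style module that bitsandbytes can quantize per expert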

@peterhgruber

The model is ~132B params so I think the expected memory usage is:

Not quite ... they say only 36B parameters are "active on any input", as it is a mixture of experts model.

@wrapss

wrapss commented Mar 29, 2024

Not quite ... they say only 36B parameters are "active on any input", as it is a mixture of experts model.

But the entire model still needs to be loaded into memory, even if most of the parameters are not activated.

@MohamedAliRashad

I have the model downloaded on my server; if something is added, I can help with testing.

@RodriMora

I have the model downloaded too and can help with testing.


@nkeilar

nkeilar commented Apr 2, 2024

I have a dual 3090 setup and am interested in a 2-bit quant to see if it will fit in 48 GB VRAM; I could also test with CPU layers offloaded, as I am running a 14900KS. Eric was able to get The Professor, a 155-billion-parameter model, running on a dual 3090.

@ehartford

I'll be very excited to see this working

@ehartford

Is anyone actively working on this issue? If not, I can work my network to try to find someone.

@slaren
Collaborator

slaren commented Apr 2, 2024

MoE models will need to be exported with the experts fused into a single tensor after #6387, so it may be better to wait until that is merged before adding new MoE models (it should be soon).
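
For intuition, here is a minimal sketch of what "experts fused into a single tensor" means on the conversion side; the function name is illustrative rather than the actual #6387 code, and the DBRX shapes are taken from the config above:

import torch

def fuse_expert_weights(per_expert: list[torch.Tensor]) -> torch.Tensor:
    # Stack n_experts matrices of shape (ffn_hidden, d_model) into one
    # (n_experts, ffn_hidden, d_model) tensor, so the GGUF file stores a single
    # tensor per projection instead of one tensor per expert.
    return torch.stack(per_expert, dim=0)

# Toy shapes; for DBRX this would be 16 experts of (10752, 6144) per projection.
fused = fuse_expert_weights([torch.zeros(8, 4) for _ in range(16)])
print(fused.shape)  # torch.Size([16, 8, 4])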

@maziyarpanahi
Author

MoE models will need to be exported with the experts fused into a single tensor after #6387, so it may be better to wait until that is merged before adding new MoE models (it should be soon).

Many thanks for the ETA and explanation. I actually have a couple of MoE models made with MergeKit that behave badly when quantized to GGUF, and I am hoping this can fix that as well.

That said, I am going to test that PR to see how it works so far. Thanks again.

@maziyarpanahi
Author

@ggerganov @slaren I can see the PRs are merged, thank you so much for your work.

I have pulled the changes from master, but I still get KeyError: 'transformer.blocks.0.ffn.experts.mlp.w1' from convert.py and Architecture 'DbrxForCausalLM' not supported! from convert-hf-to-gguf.py.

Will the MoE support for DBRX be added in another PR?

@ggerganov
Owner

DBRX requires a convert script (convert-hf-to-gguf.py) + graph implementation as usual. See #6074 as an example of what needs to be done for DBRX
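
As a starting point, here is a minimal sketch of mapping the DBRX hyperparameters above onto GGUF metadata using the gguf Python package (assuming pip install gguf; whether these exact writer calls match your installed version is an assumption, and the real convert script additionally handles tensor remapping and the tokenizer):

import gguf

# Metadata-only sketch: the values come from the DBRX config.json shown earlier.
writer = gguf.GGUFWriter("dbrx-metadata-only.gguf", "dbrx")
writer.add_context_length(32768)
writer.add_embedding_length(6144)
writer.add_block_count(40)
writer.add_feed_forward_length(10752)
writer.add_head_count(48)
writer.add_head_count_kv(8)
writer.add_rope_freq_base(500000)
writer.add_expert_count(16)
writer.add_expert_used_count(4)
writer.write_header_to_file()
writer.write_kv_data_to_file()
writer.close()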

@maziyarpanahi
Author

DBRX requires a convert script (convert-hf-to-gguf.py) + graph implementation as usual. See #6074 as an example of what needs to be done for DBRX

Thank you, I'll see if I can have a look at the Qwen MoE PR and make one for DBRX if I am not beaten to it.

@phymbert
Collaborator

phymbert commented Apr 5, 2024

Is someone actively working on this? Any help needed?

@ehartford

In the meantime, if you are on a Mac there is https://huggingface.co/mlx-community/dbrx-instruct-4bit

@nkeilar

nkeilar commented Apr 5, 2024

In the meantime, if you are on a Mac there is https://huggingface.co/mlx-community/dbrx-instruct-4bit

@ehartford Looks like about 70 GB of unified memory. What do you think the memory requirements would be on CUDA at 2-bit? My sense is that a larger model at a lower bit rate seems like a good trade-off. Thanks in advance for your insights.

@KnutJaegersberg

There are already 2-bit ExLlama (EXL2) weights:
https://huggingface.co/turboderp/dbrx-instruct-exl2

@ehartford

On a VRAM-constrained GPU deployment, I'd go with EXL2.

@phymbert
Collaborator

phymbert commented Apr 6, 2024

@ggerganov or @slaren it looks like DBRX has a special tokenizer:

Are we currently supporting this somehow?
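
For reference, a small sketch of inspecting the DBRX tokenizer from the HF repo (assuming transformers and tiktoken are installed and you have access to the gated databricks/dbrx-instruct repo; the repo ships a custom tiktoken-backed tokenizer, hence trust_remote_code):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("databricks/dbrx-instruct", trust_remote_code=True)

print(type(tok).__name__)          # the custom tokenizer class defined in the repo
print(tok.vocab_size)              # compare against vocab_size = 100352 in config.json
ids = tok.encode("Hello, DBRX!")
print(ids, "->", tok.decode(ids))  # round-trip to compare against llama.cpp tokenization later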

phymbert added a commit that referenced this issue Apr 6, 2024
@maziyarpanahi
Author

maziyarpanahi commented Apr 6, 2024

@ggerganov or @slaren it looks like DBRX has a special tokenizer:

Are we currently supporting this somehow?

Many thanks for starting this and having a branch for it. I got badly stuck on that tiktoken tokenization! I just don't know how to make a custom tokenizer work in llama.cpp. (I'll contribute to your PR if you need any testing.)

FYI: https://github.com/ggerganov/llama.cpp/compare/hp/model/support-dbrx

phymbert added a commit that referenced this issue Apr 6, 2024
@phymbert
Collaborator

phymbert commented Apr 6, 2024

@ggerganov or @slaren it looks like DBRX has a special tokenizer:

Are we currently supporting this somehow?

Many thanks for starting this and having a branch for it. I got badly stuck on that tiktoken tokenization! I just don't know how to make a custom tokenizer work in llama.cpp. (I'll contribute to your PR if you need any testing.)

FYI: https://github.com/ggerganov/llama.cpp/compare/hp/model/support-dbrx

Yes, I don't know how our tokenizer will behave at the moment. We will see whether I am able to reach the draft PR step. Thanks.

@phymbert
Collaborator

phymbert commented Apr 6, 2024

DBRX License clarification for GGUF

@maziyarpanahi @ggerganov As I have done the conversion to GGUF (not tested yet), I am wondering what the exact conditions are to meet the DBRX license.

Can we upload the GGUF quants to HF, and if yes, how? I see a few approaches, but I am not a lawyer:

  1. Set the HF model license to databricks-open-model-license (other), as on the original model
  2. Set another open-source license (I am not sure whether that is allowed) and attach a NOTICE file with "DBRX is provided under and subject to the Databricks Open Model License, Copyright © Databricks, Inc. All rights reserved."
  3. Do not distribute on HF

I have some concerns especially about:

  • Any additional or different terms and conditions you impose must not conflict with the terms of this Agreement and in the event of a conflict, the terms and conditions of this Agreement shall govern over any such additional or different terms and conditions.
  • You will not use DBRX or DBRX Derivatives or any Output to improve any other large language model (excluding DBRX or DBRX Derivatives).

We probably need help from Databricks, @abhi-mosaic?

@ehartford

You just copy the original license, exactly as they did it in their model card.

@phymbert
Collaborator

phymbert commented Apr 6, 2024

You just copy the original license, exactly as they did it in their model card.

@ggerganov please confirm I can upload it to ggml-org with the above?

@maziyarpanahi
Author

DBRX License clarification for GGUF

@maziyarpanahi @ggerganov As I have done the conversion to GGUF (not tested yet), I am wondering what the exact conditions are to meet the DBRX license.

Can we upload the GGUF quants to HF, and if yes, how? I see a few approaches, but I am not a lawyer:

  1. Set the HF model license to databricks-open-model-license (other), as on the original model
  2. Set another open-source license (I am not sure whether that is allowed) and attach a NOTICE file with "DBRX is provided under and subject to the Databricks Open Model License, Copyright © Databricks, Inc. All rights reserved."
  3. Do not distribute on HF

I have some concerns especially about:

  • Any additional or different terms and conditions you impose must not conflict with the terms of this Agreement and in the event of a conflict, the terms and conditions of this Agreement shall govern over any such additional or different terms and conditions.
  • You will not use DBRX or DBRX Derivatives or any Output to improve any other large language model (excluding DBRX or DBRX Derivatives).

We probably need help from Databricks, @abhi-mosaic?

Thanks @phymbert for your work.

  • Do you have a PR ready so I can also test it locally? (I have a GPTQ of the base, so I can compare their quality)
  • For the license, I also recommend what @ehartford suggested: keeping the original license is totally fine and should be enough. (They restrict using DBRX to improve other LLMs; quantizing won't fall under that clause.)

@phymbert
Collaborator

phymbert commented Apr 7, 2024

Do you have a PR ready so I can also test it locally?

@ggerganov
Owner

You just copy the original license, exactly as they did it in their model card.

@ggerganov please confirm I can upload it to ggml-org with the above?

No need to upload it; in ggml-org we only want to have models that are used by the CI or for other kinds of test/demo purposes.

@phymbert
Collaborator

phymbert commented Apr 8, 2024

No need to upload it; in ggml-org we only want to have models that are used by the CI or for other kinds of test/demo purposes.

Noted, deleted

phymbert added the model (Model specific) label Apr 8, 2024
phymbert added a commit that referenced this issue Apr 13, 2024
* model: dbrx convert to gguf
#6344

* llama: support dbrx
#6344

* doc: dbrx: add the model as supported

* scripts: get-wikitext-2 add unzip

* llama: increase maximum experts allowed

* llama: factorize moe graph implementation between grok, mixtral and dbrx


---------

Co-authored-by: Megha Agarwal <16129366+megha95@users.noreply.github.com>
tybalex pushed a commit to rubra-ai/tools.cpp that referenced this issue Apr 17, 2024