Add support for DBRX models: dbrx-base and dbrx-instruct #6344
Comments
Hi! DBRX researcher here, happy to help out however I can! The architecture is quite similar to Mixtral, which is already supported in this framework. The modeling source code for DBRX is available on the HF Hub here: https://huggingface.co/databricks/dbrx-instruct/blob/main/modeling_dbrx.py The main differences vs. Mixtral as far as I can tell:
Please let me know if you have any questions!
The model is ~132B params so I think the expected memory usage is roughly:
* fp16: ~264 GB
* 8-bit: ~132 GB
* 4-bit: ~66 GB
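That estimate is just parameter count times bytes per weight. A minimal sketch of the arithmetic, assuming the ~132B figure from the thread and ignoring KV cache, activations, and runtime overhead:

```python
# Minimal sketch: weights-only memory estimate for a ~132B-parameter model.
# Ignores KV cache, activations, and any runtime overhead.

N_PARAMS = 132e9  # DBRX total parameter count, from the thread

for bits in (16, 8, 4, 2):
    gb = N_PARAMS * bits / 8 / 1e9
    print(f"{bits:>2}-bit: ~{gb:.0f} GB")
```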
@abhi-mosaic Thanks for the pointers. We do split the experts into separate tensors at the moment, but it is something that we planned to change: #6082 Seems like now is the time to do that.
Thanks @abhi-mosaic for all the complete and detailed explanations. @ggerganov I have a big server, I can test any PR from 16-bit all the way down to 2-bit. (I have the model already downloaded and ready)
Same, got big servers with very fast and plentiful RAM channels, so I can try all the sizes on CPU.
Happy to test on my server.
@abhi-mosaic While the llama.cpp guys are working on handling 16 experts instead of 8, I was thinking of quantizing to 4-bit with the native Hugging Face BitsAndBytes, but I'm still getting an error.
@simsim314 take a look at this comment, I think someone found a workaround:
https://huggingface.co/databricks/dbrx-instruct/discussions/10#660566f14f41c0c7c0e54ab9
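For anyone trying that route, here is a minimal sketch of 4-bit loading with bitsandbytes through transformers; the quantization settings below are assumptions, not values taken from the linked workaround:

```python
# Hedged sketch: load dbrx-instruct in 4-bit with bitsandbytes via transformers.
# trust_remote_code is needed because DBRX ships custom modeling code on the Hub.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # assumed; "fp4" is the other option
    bnb_4bit_compute_dtype=torch.bfloat16,  # assumed compute dtype
)

tokenizer = AutoTokenizer.from_pretrained("databricks/dbrx-instruct", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "databricks/dbrx-instruct",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
```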
Not quite ... they say only 36B parameters are "active on any input", as it is a mixture-of-experts model.
But the entire model still needs to be loaded into memory even if not all parameters are activated.
I have the model downloaded on my server; if something is added, I can help with testing.
I have a dual 3090 setup and am interested in a 2-bit quant to see if it will fit in 48 GB VRAM. I could also test with CPU layers offloaded since I'm running a 14900KS. Eric was able to get The Professor, a 155-billion-parameter model, running on a dual 3090.
I'll be very excited to see this working.
Is anyone actively working on this issue? If not, I can work my network to try to find someone.
MoE models will need to be exported with the experts fused into a single tensor after #6387, so it may be better to wait until that is merged before adding new MoE models (it should be soon).
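For intuition, "fused into a single tensor" means stacking the per-expert FFN weights along a new leading dimension instead of storing one tensor per expert. A rough sketch with DBRX-like shapes (dimensions assumed; names are illustrative, not the actual GGUF tensor names):

```python
# Hedged sketch: fuse per-expert FFN weights into one stacked tensor.
# Shapes are DBRX-like assumptions; names are not the actual GGUF tensor names.
import torch

n_expert, d_model, d_ff = 16, 6144, 10752
# meta device: track shapes only, allocate no memory
experts = [torch.empty(d_ff, d_model, device="meta") for _ in range(n_expert)]

fused = torch.stack(experts, dim=0)  # one tensor of shape (n_expert, d_ff, d_model)
print(fused.shape)                   # torch.Size([16, 10752, 6144])
```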
Many thanks for the ETA and explanation. I actually have a couple of MoE models made with MergeKit that behave badly when quantized to GGUF, and I am hoping this can fix that too. That said, I am going to test that PR to see how it works so far. Thanks again.
@ggerganov @slaren I can see the PRs are merged, thank you so much for your work. I have pulled the changes from master, but I still get an error. Will the MoE support for DBRX be added in another PR?
DBRX requires a convert script ( |
Thank you, I'll see if I can have a look at the Qwen MoE PR and make one for DBRX if I am not beaten to it.
Is someone actively working on this? Any help needed?
In the meantime, if you are on a Mac there is https://huggingface.co/mlx-community/dbrx-instruct-4bit
@ehartford Looks like about 70 GB of unified memory. What do you think we could expect the memory requirements to be on CUDA at 2-bit? My sense is that a larger model at a lower bit rate seems like a good trade-off. Thanks in advance for your insights.
There are already 2-bit exllama weights.
On a VRAM-constrained GPU deployment, I'd go with exl2.
@ggerganov or @slaren, it looks like DBRX has a special tokenizer. Are we currently supporting this somehow?
Many thanks for starting this and having a branch for it. I got badly stuck on that tiktoken tokenization! I just don't know how to make a custom tokenization work in llama.cpp. (I'll contribute to your PR if you need any testing.) FYI: https://github.com/ggerganov/llama.cpp/compare/hp/model/support-dbrx
Yes, I don't know how our tokenizer will behave at the moment. We will see if I am able to reach the draft PR step. Thanks.
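For reference, a minimal way to poke at the DBRX tokenizer from Python; it is exposed through the usual transformers interface with trust_remote_code, and the sample text is arbitrary:

```python
# Hedged sketch: exercise the tiktoken-based DBRX tokenizer via transformers.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("databricks/dbrx-instruct", trust_remote_code=True)

ids = tok.encode("Hello, DBRX!")  # arbitrary sample text
print(ids)
print(tok.decode(ids))
```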
DBRX license clarification for GGUF: @maziyarpanahi @ggerganov As I have done the conversion to GGUF, can we upload the GGUF quants to HF, and if yes, how? I see a few approaches, but I am not a lawyer.
I have some concerns, especially about the license terms.
Probably need help from Databricks, @abhi-mosaic?
You just copy the original license exactly as they did in their model card.
@ggerganov please confirm I can upload it on …
Thanks @phymbert for your work.
No need to upload it - in …
Noted, deleted |
* model: dbrx convert to gguf #6344
* llama: support dbrx #6344
* doc: dbrx: add the model as supported
* scripts: get-wikitext-2 add unzip
* llama: increase maximum experts allowed
* llama: factorize moe graph implementation between grok, mixtral and dbrx

Co-authored-by: Megha Agarwal <16129366+megha95@users.noreply.github.com>
Prerequisites
Please answer the following questions for yourself before submitting an issue.
Feature Description
Databricks just released 2 new models called DBRX (base and instruct). They have their own architecture:
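A quick, hedged way to check what architecture the Hub checkpoint declares, without downloading the weights (the exact field values are not reproduced here):

```python
# Hedged sketch: inspect the DBRX config from the Hub without downloading weights.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("databricks/dbrx-instruct", trust_remote_code=True)
print(cfg.architectures)  # the custom DBRX model class declared by the checkpoint
print(cfg)                # full config: number of experts, hidden sizes, etc.
```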
Motivation
These models are superior to predecessors like Llama-2 or Mixtral (even though they are larger), so the community can really benefit from these two and from the fine-tuned models that will come after.
https://huggingface.co/databricks/dbrx-instruct
Possible Implementation
If you have an idea as to how it can be implemented, please write a detailed description. Feel free to give links to external sources or share visuals that might be helpful to understand the details better.
python llama.cpp/convert-hf-to-gguf.py
python llama.cpp/convert.py