Add support for DeepSeek V3 #11049
Conversation
src/llama.cpp (outdated)

```cpp
// add experts selection bias - introduced in DeepSeek V3
ggml_tensor * selection_probs = probs;
if (expert_weights_b != nullptr) {
    selection_probs = ggml_add(ctx, probs, expert_weights_b);
    cb(selection_probs, "ffn_moe_sigm_biased", il);
}
```
Can be simplified to:
```diff
 // add experts selection bias - introduced in DeepSeek V3
-ggml_tensor * selection_probs = probs;
 if (expert_weights_b != nullptr) {
-    selection_probs = ggml_add(ctx, probs, expert_weights_b);
-    cb(selection_probs, "ffn_moe_sigm_biased", il);
+    probs = ggml_add(ctx, probs, expert_weights_b);
+    cb(probs, "ffn_moe_sigm_b", il);
 }
```
I'm afraid this won't work correctly, as the original unmodified weights are still needed for multiplication with the experts output at the end of the function. Biased weights are used only for expert selection. See the DeepSeek V3 technical report:

> Note that the bias term is only used for routing. The gating value, which will be multiplied with the FFN output, is still derived from the original affinity score.

Edit: I'm going to add a comment in the code to make this clear.
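To make the distinction concrete, here is a minimal sketch (my illustration of the data flow, not the PR code itself) of how the two tensors are used downstream: the top-k selection runs on the biased scores, while the unbiased `probs` later weight the expert outputs.

```cpp
// sketch only: biased probs drive routing, original probs drive gating
ggml_tensor * selection_probs = probs;
if (expert_weights_b != nullptr) {
    // the bias shifts the routing scores only
    selection_probs = ggml_add(ctx, probs, expert_weights_b);
}

// top-k expert selection runs on the *biased* scores
ggml_tensor * selected_experts = ggml_top_k(ctx, selection_probs, n_expert_used); // [n_expert_used, n_tokens]

// the gating weights multiplied with the expert outputs come from the
// *original* probs, per the DeepSeek V3 technical report
ggml_tensor * weights = ggml_get_rows(ctx,
        ggml_reshape_3d(ctx, probs, 1, n_expert, n_tokens), selected_experts); // [1, n_expert_used, n_tokens]
```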
I see - I missed this.
gguf-py/gguf/constants.py (outdated)

```diff
@@ -312,6 +314,7 @@ class MODEL_TENSOR(IntEnum):
     FFN_GATE_SHEXP = auto()
     FFN_DOWN_SHEXP = auto()
     FFN_UP_SHEXP = auto()
+    FFN_EXPERT_WEIGHTS_B = auto()
```
For more consistency in the names, let's change the `EXPERT` to `EXP`. Also, it seems that `PROBS` is a better name, since this is a bias for the computed expert probabilities:
```diff
-    FFN_EXPERT_WEIGHTS_B = auto()
+    FFN_EXP_PROBS_B = auto()
```
Done
gguf-py/gguf/constants.py (outdated)

```diff
@@ -496,6 +499,7 @@ class MODEL_TENSOR(IntEnum):
     MODEL_TENSOR.FFN_GATE_EXP: "blk.{bid}.ffn_gate_exps",
     MODEL_TENSOR.FFN_DOWN_EXP: "blk.{bid}.ffn_down_exps",
     MODEL_TENSOR.FFN_UP_EXP: "blk.{bid}.ffn_up_exps",
+    MODEL_TENSOR.FFN_EXPERT_WEIGHTS_B: "blk.{bid}.expert_weights_b",
```
```diff
-    MODEL_TENSOR.FFN_EXPERT_WEIGHTS_B: "blk.{bid}.expert_weights_b",
+    MODEL_TENSOR.FFN_EXP_PROBS_B: "blk.{bid}.exp_probs_b",
```
Done
src/llama.cpp (outdated)

```diff
@@ -2912,6 +2934,7 @@ struct llama_layer {
     struct ggml_tensor * ffn_down_b = nullptr; // b2
     struct ggml_tensor * ffn_up_b   = nullptr; // b3
     struct ggml_tensor * ffn_act    = nullptr;
+    struct ggml_tensor * ffn_expert_weights_bias = nullptr;
```
```diff
-    struct ggml_tensor * ffn_expert_weights_bias = nullptr;
+    struct ggml_tensor * ffn_exp_probs_b = nullptr;
```
Done
src/llama.cpp (outdated)

```cpp
ggml_tensor * probs = nullptr;
switch (gating_op) {
    case LLM_EXPERT_GATING_FUNC_SOFTMAX:
        {
            probs = ggml_soft_max(ctx, logits); // [n_expert, n_tokens]
            cb(probs, "ffn_moe_probs", il);
        } break;
    case LLM_EXPERT_GATING_FUNC_SIGMOID:
        {
            probs = ggml_sigmoid(ctx, logits); // [n_expert, n_tokens]
            cb(probs, "ffn_moe_sigm", il);
        } break;
    default:
        GGML_ABORT("fatal error");
}
```
Don't set names here:
```diff
 ggml_tensor * probs = nullptr;
 switch (gating_op) {
     case LLM_EXPERT_GATING_FUNC_SOFTMAX:
         {
             probs = ggml_soft_max(ctx, logits); // [n_expert, n_tokens]
-            cb(probs, "ffn_moe_probs", il);
         } break;
     case LLM_EXPERT_GATING_FUNC_SIGMOID:
         {
             probs = ggml_sigmoid(ctx, logits); // [n_expert, n_tokens]
-            cb(probs, "ffn_moe_sigm", il);
         } break;
     default:
         GGML_ABORT("fatal error");
 }
```
Instead, after applying the probs bias, call `cb(probs, "ffn_moe_probs", il);` for the final `probs` result.
I moved the name setting after the switch, but I kept it separate from the biased probs for the reasons mentioned earlier.
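So the resolved shape is roughly the following (a sketch of the outcome, not the literal merged code; the biased-tensor name is illustrative):

```cpp
// sketch: name-setting happens once, after the switch ...
ggml_tensor * probs = nullptr;
switch (gating_op) {
    case LLM_EXPERT_GATING_FUNC_SOFTMAX: probs = ggml_soft_max(ctx, logits); break; // [n_expert, n_tokens]
    case LLM_EXPERT_GATING_FUNC_SIGMOID: probs = ggml_sigmoid (ctx, logits); break; // [n_expert, n_tokens]
    default: GGML_ABORT("fatal error");
}
cb(probs, "ffn_moe_probs", il);

// ... while the routing bias still goes into a separate tensor, since the
// unbiased probs are needed later for weighting the expert outputs
ggml_tensor * selection_probs = probs;
if (expert_weights_b != nullptr) {
    selection_probs = ggml_add(ctx, probs, expert_weights_b);
    cb(selection_probs, "ffn_moe_probs_biased", il); // name here is illustrative
}
```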
@ggerganov I extended your "collapsed" regex workaround with `\p{M}` and `\p{S}` - DeepSeek V3 has these in its pre-tokenizer regex. Take a look if it looks sane when you have a moment. I checked with `test-tokenizer-0`, and tokenization of `wiki.test.raw` now matches the original.
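For readers following along: `std::regex` has no `\p{...}` class support, so the "collapsed" workaround, as I understand it, maps every codepoint of a supported Unicode category onto a single stand-in codepoint and rewrites the pattern to match those stand-ins instead. A rough sketch of the idea only - the stand-in values and the helper named in the comment are made up for illustration:

```cpp
#include <cstdint>

// Hypothetical stand-ins: one private-use codepoint per collapsed category,
// so a \p-free regex engine can still match category-based patterns.
enum collapsed_cat : uint32_t {
    COLLAPSED_LETTER = 0xE000, // \p{L}
    COLLAPSED_NUMBER = 0xE001, // \p{N}
    COLLAPSED_PUNCT  = 0xE002, // \p{P}
    COLLAPSED_MARK   = 0xE003, // \p{M} - accent marks, added for DeepSeek V3
    COLLAPSED_SYMBOL = 0xE004, // \p{S} - symbols, added for DeepSeek V3
};

// unicode_cpt_category() is a placeholder for whatever category lookup the
// tokenizer uses; each input codepoint collapses to its category stand-in
// before the regex engine sees the text.
```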
@ggerganov Also, since you merged #10902 I had to put the `expert_gating_func` enum in a file included in both …
Let's place it in … We can merge after you move the …
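For context, the enum being moved looks like this (constant names taken from the diffs above; the numeric values are my assumption about how the gating function is stored in the GGUF metadata):

```cpp
// gating function used to compute expert probabilities from router logits
enum llm_expert_gating_func_type {
    LLM_EXPERT_GATING_FUNC_SOFTMAX = 1, // DeepSeek V2 and most MoE models
    LLM_EXPERT_GATING_FUNC_SIGMOID = 2, // DeepSeek V3
};
```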
* convert : extend DEEPSEEK2 model architecture to support DeepseekV3ForCausalLM by adding EXPERT_WEIGHTS_NORM and EXPERT_GATING_FUNC model parameters and FFN_EXP_PROBS_B tensor type
* vocab : add DeepSeek V3 pre-tokenizer regexes
* unicode : handle ACCENT_MARK and SYMBOL categories in regex
* llama : add DeepSeek V3 chat template, handle new model parameters and tensor types

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
Related to #11141. While DeepSeek V3 support has been added, there appears to be an ongoing issue specifically with the ROCm backend. When attempting to run DeepSeek models (both V2 and V3) with ROCm: …

This behavior is consistent across both DeepSeek V2 and V3 models. Would appreciate it if this ROCm-specific issue could be investigated.
@emuchogu Does it happen even with DeepSeek-V2-Lite?
Yes. Same behavior with deepseek-v2-16b-lite-chat-q4_K_M.
This PR adds support for the recently released DeepSeek V3 model (MoE, 671B).
The model is architecturally very similar to DeepSeek V2; there are only minor changes in the expert weights calculation.
Summary of changes (a condensed sketch of how the pieces fit together follows at the end):

* new `expert_weights_norm` model parameter indicating whether expert weights shall be normalized or not - they were not normalized in DeepSeek V2, but they are in DeepSeek V3,
* new `expert_gating_func` model parameter corresponding to an enum value indicating the function used to calculate expert probs - usually it's softmax, but DeepSeek V3 uses sigmoid for this purpose,
* new ~~`expert_weights_b`~~ `exp_probs_b` tensor type containing expert weights bias tensors - DeepSeek V3 introduced a bias term added to the calculated expert probs; the biased probs are the input to the top-k expert selection process,
* updated `llm_build_moe_ffn()` API and implementation to handle the mentioned differences.

Note: DeepSeek V3 also introduced multi-token prediction (MTP), but I decided to skip this feature for now. The MTP layer is ignored during model conversion and is not present in the resulting GGUF file.
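Putting the three model-level changes together, the expert-weight path ends up looking roughly like this (a condensed sketch under the names used above; the surrounding shapes and the actual expert matmuls are omitted):

```cpp
// simplified sketch of the DeepSeek V3 changes to the MoE expert-weight path
ggml_tensor * logits = ggml_mul_mat(ctx, gate_inp, cur); // [n_expert, n_tokens]

// (1) expert_gating_func: softmax (DeepSeek V2) vs sigmoid (DeepSeek V3)
ggml_tensor * probs = gating_op == LLM_EXPERT_GATING_FUNC_SIGMOID
        ? ggml_sigmoid (ctx, logits)
        : ggml_soft_max(ctx, logits);

// (2) exp_probs_b: bias applied to the routing scores only
ggml_tensor * selection_probs = probs;
if (exp_probs_b != nullptr) {
    selection_probs = ggml_add(ctx, probs, exp_probs_b);
}

// top-k selection uses the biased scores; gating uses the original probs
ggml_tensor * selected_experts = ggml_top_k(ctx, selection_probs, n_expert_used); // [n_expert_used, n_tokens]
ggml_tensor * weights = ggml_get_rows(ctx,
        ggml_reshape_3d(ctx, probs, 1, n_expert, n_tokens), selected_experts);

// (3) expert_weights_norm: DeepSeek V3 renormalizes the selected weights to sum to 1
if (expert_weights_norm) {
    weights = ggml_reshape_2d(ctx, weights, n_expert_used, n_tokens);
    ggml_tensor * weights_sum = ggml_sum_rows(ctx, weights); // [1, n_tokens]
    weights = ggml_div(ctx, weights, weights_sum);
    weights = ggml_reshape_3d(ctx, weights, 1, n_expert_used, n_tokens);
}
```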