Speculative Decoding is slower than expected on A100 #3649
Comments
Thank you for the detailed report - very useful information!
I'll try to do the same test today and see if I can find the bottleneck.
I might be missing something, but I think there is an error in the numbers plugged into the equation, because I did some more testing on a V100 16GB GPU using the same models:

import numpy as np
# case 0, alpha = 0.8, gamma = 5
sd_tg = 619.4
st_tg = 52.5
st_pp = 121.7 # pp 5
s_avg = (sd_tg + st_pp)/2
a = 0.80
c = st_tg/s_avg
g = 5
speed = (1 - a**(g + 1))/((1 - a)*(c*g + 1))
print(speed)
# case 0, alpha = 0.875, gamma = 8
sd_tg = 625
st_tg = 52.5
st_pp = 194.6 # pp 8
s_avg = (sd_tg + st_pp)/2
a = 0.875
c = st_tg/s_avg
g = 8
speed = (1 - a**(g + 1))/((1 - a)*(c*g + 1))
print(speed)

Running the speculative example with --draft 5:

make -j && ./bin/speculative -ngl 1000 -ngld 100 -m /mnt/llama.cpp/models/open-llama/7B-v2/ggml-model-f16.gguf -md /mnt/llama.cpp/models/llama-160m/ggml-model-f16.gguf -p "${prompt}" -e --temp -1 -n 256 --repeat-last-n 0 --repeat-penalty 1.0 --draft 5 -np 1
encoded 58 tokens in 0.111 seconds, speed: 521.414 t/s
decoded 261 tokens in 2.967 seconds, speed: 87.954 t/s
n_draft = 5
n_predict = 261
n_drafted = 260
n_accept = 208
accept = 80.000%
draft:
llama_print_timings: load time = 125.62 ms
llama_print_timings: sample time = 9.58 ms / 260 runs ( 0.04 ms per token, 27128.55 tokens per second)
llama_print_timings: prompt eval time = 10.95 ms / 58 tokens ( 0.19 ms per token, 5296.80 tokens per second)
llama_print_timings: eval time = 480.14 ms / 261 runs ( 1.84 ms per token, 543.59 tokens per second)
llama_print_timings: total time = 3079.20 ms
target:
llama_print_timings: load time = 2827.23 ms
llama_print_timings: sample time = 9.84 ms / 261 runs ( 0.04 ms per token, 26513.61 tokens per second)
llama_print_timings: prompt eval time = 2387.92 ms / 369 tokens ( 6.47 ms per token, 154.53 tokens per second)
llama_print_timings: eval time = 19.31 ms / 1 runs ( 19.31 ms per token, 51.80 tokens per second)
llama_print_timings: total time = 3219.78 ms

Running the speculative example with --draft 8:

make -j && ./bin/speculative -ngl 1000 -ngld 100 -m /mnt/llama.cpp/models/open-llama/7B-v2/ggml-model-f16.gguf -md /mnt/llama.cpp/models/llama-160m/ggml-model-f16.gguf -p "${prompt}" -e --temp -1 -n 256 --repeat-last-n 0 --repeat-penalty 1.0 --draft 8 -np 1
encoded 58 tokens in 0.110 seconds, speed: 525.272 t/s
decoded 257 tokens in 2.083 seconds, speed: 123.394 t/s
n_draft = 8
n_predict = 257
n_drafted = 256
n_accept = 224
accept = 87.500%
draft:
llama_print_timings: load time = 125.64 ms
llama_print_timings: sample time = 9.42 ms / 256 runs ( 0.04 ms per token, 27190.65 tokens per second)
llama_print_timings: prompt eval time = 10.97 ms / 58 tokens ( 0.19 ms per token, 5286.66 tokens per second)
llama_print_timings: eval time = 446.25 ms / 257 runs ( 1.74 ms per token, 575.91 tokens per second)
llama_print_timings: total time = 2193.53 ms
target:
llama_print_timings: load time = 2770.13 ms
llama_print_timings: sample time = 11.02 ms / 257 runs ( 0.04 ms per token, 23323.35 tokens per second)
llama_print_timings: prompt eval time = 1543.14 ms / 345 tokens ( 4.47 ms per token, 223.57 tokens per second)
llama_print_timings: eval time = 19.56 ms / 1 runs ( 19.56 ms per token, 51.12 tokens per second)
llama_print_timings: total time = 2334.47 ms

I am using the branch in #3624 with the following changes:

--- a/examples/speculative/speculative.cpp
+++ b/examples/speculative/speculative.cpp
@@ -174,7 +174,7 @@ int main(int argc, char ** argv) {
continue;
}
- if (i_dft < (int) drafts[s].tokens.size() && id == drafts[s].tokens[i_dft]) {
+ if (i_dft < (int) drafts[s].tokens.size() - 1) {
LOG("the sampled target token matches the %dth drafted token of sequence %d (%d, '%s') - accepted\n", i_dft, s, id, token_str.c_str());
s_keep = s;
@@ -273,11 +273,11 @@ int main(int argc, char ** argv) {
}
// TODO: make this configurable
- if (cur_p[0].p < 0.4) {
- LOG("stopping drafting for seq %3d, probability too low: %.3f < 2*%.3f\n", s, cur_p[0].p, cur_p[1].p);
- drafts[s].drafting = false;
- continue;
- }
+ //if (cur_p[0].p < 0.4) {
+ // LOG("stopping drafting for seq %3d, probability too low: %.3f < 2*%.3f\n", s, cur_p[0].p, cur_p[1].p);
+ // drafts[s].drafting = false;
+ // continue;
+ //}
std::vector<int> sa(1, s);

To determine the draft and target model speeds used in the equation, I ran llama-bench:

# draft model text-generation bench
./bin/llama-bench -m /mnt/llama.cpp/models/llama-160m/ggml-model-f16.gguf -p 0 -n 128 -ngl 99
Device 0: Tesla V100-PCIE-16GB, compute capability 7.0
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------- | ---------------: |
| llama ?B mostly F16 | 309.82 MiB | 162.42 M | CUDA | 99 | tg 128 | 619.43 ± 20.03 |

# target model PP and TG bench
./bin/llama-bench -m /mnt/llama.cpp/models/open-llama/7B-v2/ggml-model-f16.gguf -p 1,2,3,4,5,6,7,8,64,128,256,512 -n 128 -ngl 99
Device 0: Tesla V100-PCIE-16GB, compute capability 7.0
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------- | ---------------: |
| llama 7B mostly F16 | 12.55 GiB | 6.74 B | CUDA | 99 | pp 1 | 30.72 ± 3.88 |
| llama 7B mostly F16 | 12.55 GiB | 6.74 B | CUDA | 99 | pp 2 | 51.22 ± 0.43 |
| llama 7B mostly F16 | 12.55 GiB | 6.74 B | CUDA | 99 | pp 3 | 75.92 ± 0.64 |
| llama 7B mostly F16 | 12.55 GiB | 6.74 B | CUDA | 99 | pp 4 | 101.66 ± 0.29 |
| llama 7B mostly F16 | 12.55 GiB | 6.74 B | CUDA | 99 | pp 5 | 121.74 ± 0.55 |
| llama 7B mostly F16 | 12.55 GiB | 6.74 B | CUDA | 99 | pp 6 | 146.37 ± 0.35 |
| llama 7B mostly F16 | 12.55 GiB | 6.74 B | CUDA | 99 | pp 7 | 163.65 ± 0.46 |
| llama 7B mostly F16 | 12.55 GiB | 6.74 B | CUDA | 99 | pp 8 | 194.58 ± 0.72 |
| llama 7B mostly F16 | 12.55 GiB | 6.74 B | CUDA | 99 | pp 64 | 899.08 ± 23.64 |
| llama 7B mostly F16 | 12.55 GiB | 6.74 B | CUDA | 99 | pp 128 | 1708.83 ± 56.78 |
| llama 7B mostly F16 | 12.55 GiB | 6.74 B | CUDA | 99 | pp 256 | 2515.20 ± 10.46 |
| llama 7B mostly F16 | 12.55 GiB | 6.74 B | CUDA | 99 | pp 512 | 2866.48 ± 1.45 |
| llama 7B mostly F16 | 12.55 GiB | 6.74 B | CUDA | 99 | tg 128 | 52.49 ± 0.13 |

My understanding is that c should be computed against the average of the draft tg speed and the target pp speed at the draft size (the s_avg in the script above). The theoretical results computed this way are as follows:

python3 speed.py
2.1594861448542773
2.7629832057356403

The observed speedups, by comparison, are roughly 87.95/52.49 ≈ 1.68x for draft 5 and 123.39/52.49 ≈ 2.35x for draft 8.
Thanks for the correction - yeah, I plugged in the wrong numbers; your calculation is correct. I will also try to benchmark on V100 today or tomorrow and will let you know the numbers. Thanks for the detailed experiment!
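For reference, here is a quick re-evaluation of the A100 numbers quoted in the original issue, using c*gamma in the denominator (as in the speed.py script above) instead of c*alpha. This is only a sketch reusing values already stated in the issue (c ≈ 76/669, alpha = 0.44, gamma = 5), not a new measurement:

# corrected expected speedup for the A100 case described in the issue
a = 0.44        # token acceptance rate
g = 5           # tokens proposed per step
c = 76 / 669    # draft/target per-token cost ratio (~0.11)
speedup = (1 - a**(g + 1)) / ((1 - a) * (c * g + 1))
print(speedup)  # ~1.13x, much lower than the 1.69x obtained with c*alpha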
Hi - when the token acceptance rate is ~60%, I'm still confused about the performance on A100.
# p - probability to accept a token
#
# probability to accept M tokens from a draft of N:
#
# M probability
# 0 (1-p)
# 1 (1-p)*p
# 2 (1-p)*p^2
# ...
# N-1 (1-p)*p^(N-1)
# N p^N
#
# expectation:
#
# E[X] = 0*(1-p) + 1*p*(1-p) + 2*p^2*(1-p) + 3*p^3*(1-p) + ... + (N-1)*p^(N-1)*(1-p) + N*p^N
#
import numpy as np
import sys
N = int(sys.argv[1])
p = float(sys.argv[2])
print("N = ", N)
print("p = ", p)
E = 0
for i in range(N):
    E += i * p**i * (1 - p)
E += N * p**N
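# note: E has the closed form p*(1 - p**N)/(1 - p),
# e.g. N = 8, p = 0.95 gives ~6.4, matching the loop above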
print("E = ", round(E, 2), " (", round(100*(E/N), 2), "% )")

So for a draft size of 8, you can use p = 0.95 to get roughly 80% acceptance:

$ python3 expect.py 8 0.95
N = 8
p = 0.95
E = 6.4 ( 79.94 % )

Here is the diff on speculative.cpp that I used:

--- a/examples/speculative/speculative.cpp
+++ b/examples/speculative/speculative.cpp
@@ -8,6 +8,10 @@
#include <string>
#include <vector>
+static float frand() {
+ return (float) rand() / RAND_MAX;
+}
+
struct seq_draft {
bool active = false;
bool drafting = false;
@@ -37,7 +41,7 @@ int main(int argc, char ** argv) {
const int n_seq_dft = params.n_parallel;
// TODO: make this configurable
- const float p_accept = 0.80f;
+ const float p_accept = -1.0f; // always draft n_draft tokens
const float p_split = 0.10f;
#ifndef LOG_DISABLE_LOGS
@@ -178,7 +182,7 @@ int main(int argc, char ** argv) {
continue;
}
- if (i_dft < (int) drafts[s].tokens.size() && id == drafts[s].tokens[i_dft]) {
+ if (i_dft < (int) drafts[s].tokens.size() && frand() < 0.95) { // use the python script to find the value that will give you the desired acceptance rate
LOG("the sampled target token matches the %dth drafted token of sequence %d (%d, '%s') - accepted\n", i_dft, s, id, token_str.c_str());
s_keep = s;

Alternatively, you can do what I did in my previous comment - simply accept the first 80% of the drafted tokens:

const int n_seq_dft = params.n_parallel;
// TODO: make this configurable
- const float p_accept = 0.80f;
+ const float p_accept = -1.0f; // always draft n_draft tokens
const float p_split = 0.10f;
#ifndef LOG_DISABLE_LOGS
@@ -178,7 +182,7 @@ int main(int argc, char ** argv) {
continue;
}
- if (i_dft < (int) drafts[s].tokens.size() && id == drafts[s].tokens[i_dft]) {
+ if (i_dft < 0.8*(int) drafts[s].tokens.size()) {
LOG("the sampled target token matches the %dth drafted token of sequence %d (%d, '%s') - accepted\n", i_dft, s, id, token_str.c_str());
s_keep = s;
With #3749 now merged, the batched decoding performance for F16 models has been significantly improved. A few speculative decoding tests on A100 from today achieve 2-3x speed-up using Codellama 34B Target + Codellama 7B Q4_0 Draft; here are some examples: https://twitter.com/ggerganov/status/1716727296269193702

@LiuXiaoxuanPKU Let us know if you attempt more A100 experiments, and make sure to use the latest version of llama.cpp.
Hi @ggerganov, I have some more data for you. I tried to speed up llama-2 70b with either 13b or 7b as the draft model - in both cases to no avail.

llama-13b-chat as draft model:

./speculative -m ../../../llama-2-70b-chat.Q6_K.gguf --threads 1 --n-gpu-layers 999 -md ../../../llama-2-13b-chat.Q8_0.gguf --n-gpu-layers-draft 999 -n 500 --prompt "<s>[INST]\nTell me about Joe Biden. [/INST] " --draft 8 --ctx-size 4096
<s>[INST]\nTell me about Joe Biden. [/INST] Joe Biden is the 46th President of the United States. He was born on November 20, 1942, in Scranton, Pennsylvania. He served as Vice President under Barack Obama from 2009 to 2017 and was elected President in 2020.
Biden earned a bachelor's degree from the University of Delaware and a law degree from Syracuse University. Before entering politics, he worked as a lawyer and served on the Senate staff. In 1970, he was elected to the New Castle County Council, and in 1972, he was elected to the United States Senate, where he served for six terms until 2009.
During his time in the Senate, Biden focused on issues related to criminal justice, foreign policy, and the rights of people with disabilities. He also served as chair of the Senate Foreign Relations Committee and was a strong advocate for the Violence Against Women Act.
In 2008, Biden was chosen by Barack Obama as his running mate in the presidential election. They won the election and served two terms together, during which time Biden focused on issues related to foreign policy and national security.
After leaving office, Biden established the Biden Foundation, a nonprofit organization focused on issues related to education, LGBTQ rights, and the prevention of sexual assault. He also began teaching at the University of Pennsylvania and authored several books.
In 2019, Biden announced his candidacy for the 2020 presidential election. He ran as a moderate Democrat, focusing on issues related to healthcare, education, and the economy. He won the nomination and went on to defeat incumbent President Donald Trump in the general election.
Biden's presidency has been marked by several significant accomplishments, including the passage of the American Rescue Plan, a $1.9 trillion stimulus package aimed at addressing the COVID-19 pandemic and economic downturn. He has also taken executive action to address climate change, expand access to healthcare, and protect the rights of LGBTQ individuals.
Biden has also pursued a number of foreign policy initiatives, including the
encoded 20 tokens in 0.607 seconds, speed: 32.959 t/s
decoded 503 tokens in 33.435 seconds, speed: 15.044 t/s
n_draft = 8
n_predict = 503
n_drafted = 471
n_accept = 399
accept = 84.713%
draft:
llama_print_timings: load time = 5963.01 ms
llama_print_timings: sample time = 1645.45 ms / 1 runs ( 1645.45 ms per token, 0.61 tokens per second)
llama_print_timings: prompt eval time = 67.36 ms / 20 tokens ( 3.37 ms per token, 296.92 tokens per second)
llama_print_timings: eval time = 9664.61 ms / 575 runs ( 16.81 ms per token, 59.50 tokens per second)
llama_print_timings: total time = 34042.43 ms
target:
llama_print_timings: load time = 16015.05 ms
llama_print_timings: sample time = 275.97 ms / 503 runs ( 0.55 ms per token, 1822.64 tokens per second)
llama_print_timings: prompt eval time = 21001.56 ms / 578 tokens ( 36.33 ms per token, 27.52 tokens per second)
llama_print_timings: eval time = 1092.52 ms / 16 runs ( 68.28 ms per token, 14.65 tokens per second)

llama-7b-chat as draft model:

./speculative -m ../../../llama-2-70b-chat.Q6_K.gguf --threads 1 --n-gpu-layers 999 -md ../../../llama-2-7b-chat.Q8_0.gguf --n-gpu-layers-draft 999 -n 500 --prompt "<s>[INST]\nTell me about Joe Biden. [/INST] " --draft 8 --ctx-size 4096
<s>[INST]\nTell me about Joe Biden. [/INST] Joe Biden is the 46th President of the United States. He was born on November 20, 1942, in Scranton, Pennsylvania. Biden served as Vice President under Barack Obama from 2009 to 2017 and represented Delaware in the United States Senate from 1973 to 2009. He is a member of the Democratic Party.
Biden graduated from the University of Delaware and Syracuse University College of Law. Before entering politics, he worked as a lawyer and served on the Senate staff. In 1972, Biden was elected to the New Castle County Council, and in 1970, he ran for the United States Senate, but he lost to incumbent Senator J. Caleb Boggs.
Biden was first elected to the Senate in 1972, at the age of 29, making him the youngest person to be elected to the Senate at the time. He served in the Senate for six terms, becoming one of the longest-serving Senators in American history. During his time in the Senate, Biden focused on issues related to criminal justice, foreign policy, and the rights of people with disabilities.
In 2008, Biden was chosen by Barack Obama as his running mate in the presidential election. They won the election, and Biden served as Vice President from 2009 to 2017. As Vice President, Biden focused on issues related to foreign policy, national security, and the economy.
In 2015, Biden announced that he would not run for President in the 2016 election, but he remained a prominent figure in the Democratic Party. In 2019, he announced his candidacy for the 2020 presidential election, and he won the nomination at the 2020 Democratic National Convention. Biden went on to defeat incumbent President Donald Trump in the general election, becoming the oldest person to be elected President of the United States at the age of 78.
Biden's presidency has focused on issues such as COVID-19 pandemic response, economic recovery, and addressing climate change. He has also taken steps to reform the immigration system,
encoded 20 tokens in 0.595 seconds, speed: 33.617 t/s
decoded 503 tokens in 30.448 seconds, speed: 16.520 t/s
n_draft = 8
n_predict = 503
n_drafted = 426
n_accept = 384
accept = 90.141%
draft:
llama_print_timings: load time = 3155.88 ms
llama_print_timings: sample time = 1575.41 ms / 1 runs ( 1575.41 ms per token, 0.63 tokens per second)
llama_print_timings: prompt eval time = 36.19 ms / 20 tokens ( 1.81 ms per token, 552.72 tokens per second)
llama_print_timings: eval time = 6042.37 ms / 545 runs ( 11.09 ms per token, 90.20 tokens per second)
llama_print_timings: total time = 31043.23 ms
target:
llama_print_timings: load time = 15269.15 ms
llama_print_timings: sample time = 266.29 ms / 503 runs ( 0.53 ms per token, 1888.89 tokens per second)
llama_print_timings: prompt eval time = 21196.78 ms / 540 tokens ( 39.25 ms per token, 25.48 tokens per second)
llama_print_timings: eval time = 1635.22 ms / 24 runs ( 68.13 ms per token, 14.68 tokens per second)
llama_print_timings: total time = 34226.27 ms

This is running the latest llama.cpp as of now. I also tried adjusting other settings, to no avail.
Here is how to read the numbers:
Therefore we have:
So this should give the expected total speed for this setup.

You don't see a significant speedup likely because the drafts are evaluated in very small batches. Maybe try to reduce the acceptance threshold:

llama.cpp/examples/speculative/speculative.cpp, lines 42 to 43 in 207b519
It's currently hardcoded, so you will have to edit the code and recompile. Maybe try something lower than the current value.

Btw, thanks for looking into this. Definitely let us know your observations. I'm interested in this technique, and the current example can probably be improved in many ways.
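To make the small-batch point concrete, here is a rough estimate for the 70B Q6_K + 13B Q8_0 run above, reusing the formula and averaging convention from the earlier V100 comment. The mapping of the posted timings onto sd_tg/st_tg/st_pp is my own reading of the logs, so treat this as an approximation rather than a measurement:

sd_tg = 59.50   # draft eval speed, t/s (from the draft timings)
st_tg = 14.65   # target single-token eval speed, t/s
st_pp = 27.52   # target speed when evaluating the drafted batches, t/s
a = 0.847       # observed acceptance rate
g = 8           # draft size

s_avg = (sd_tg + st_pp) / 2
c = st_tg / s_avg
speedup = (1 - a**(g + 1)) / ((1 - a) * (c * g + 1))
print(speedup)  # ~1.37x expected at best with such slow batched evaluation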
Could you also post the result from this command:

LLAMA_CUBLAS=1 make -j && ./batched-bench ../../../llama-2-70b-chat.Q6_K.gguf 4096 1 99 1 512 128 1,2,3,4,5,6,7,8,16,32,64
Thanks for explaining the numbers to me!
@ggerganov I use cmake. Can you give me the equivalent? And I will run it.
It's not really important; it just enables parallel builds (so it makes compiling faster, the results are exactly the same). The cmake build step is essentially the same, just with the parallel-build flag.
@ggerganov @KerfuffleV2 Here's the result of the batched-bench:
llama_print_timings: load time = 16006.36 ms
Hm, yeah - the batched decoding performance for quantum models such as Q6_K is not great.
@ggerganov
I think we will invest more effort in improving the quantum batched decoding performance after we finish some improvements to the GPU interface that are currently being developed. The solution will likely require implementing custom kernels and ops, so it will need a deeper understanding of the CUDA implementation and will probably involve some significant changes. As I mentioned, you can test speculative sampling using an F16 target model with any quantum draft model and see how it performs. This would at least give confidence that the implemented strategy is working as it should (which is the main problem discussed in this issue), and later, when the quantum batched decoding is improved, similar gains would be expected.
Hi @ggerganov, I did some tests with the speculative example, and some quantizations appear to be fine when running 72% of the model in VRAM. Running this setup in the speculative example with a 70B Q3_K_S, offloading 57 layers to a 3090, I get a 1.3x speedup on all chat formats. This is the same speedup factor I am getting with pure CPU speculative sampling on this model (1.3x, where 1.5 t/s goes to 2 t/s). It's a general speedup and shouldn't be limited to coding examples; it works for instruct/chat formats.

I also tried exllamav2's speculative sampling examples and sampling parameters. These give a speedup of mostly 1.5x on chat responses; when running the Quicksort code example, I get 2-3x. I also played with the 7B fp16 Medusa model via their command-line interface with default settings. This gave a consistent speedup of mostly 2x on chat responses compared to the original transformers.
This issue was closed because it has been inactive for 14 days since being marked as stale.
Thanks for the great project! I am benchmarking the performance of llama.cpp with speculative decoding.
When I benchmark on Mac M1 chip, the results look great: speculative decoding increases the speed from ~12 tokens/s to ~16 tokens/s.
However, the performance is not very good on A100. Concretely, the target model decodes at roughly 76 tokens/s and the draft model at roughly 669 tokens/s.
I am using greedy decoding and disabling all the heuristics (fixed n_draft, always propose n_draft tokens, and avoid early stopping). My execution cmd is:

When the token acceptance rate is 0.44, speculative decoding is actually slower (notice 50 tokens/s < 75 tokens/s).
However, based on the original speculative decoding paper, the expected speedup is

(1 - alpha^(gamma+1)) / [(1 - alpha) * (c*gamma + 1)]

where alpha is the token acceptance rate, gamma is the number of tokens proposed each step, and c is the ratio between the execution times of the draft and target models. In the example above, c is roughly 76/669 = 0.11. Plugging in the numbers above, the expected speedup should be (1-0.44^6)/[(1-0.44)*(0.11*0.44+1)] = 1.69x.
However, the benchmarking results show that it's actually 50/76 = 0.66x.

To debug this, I set the token acceptance rate to 100% by removing the id == draft_id[i_dft] check here. After doing this, I observe that the speed is ~90 tokens/s, which gives a 90/76 = 1.18x speedup. However, this is much smaller than the calculation with the formula above (using 0.99 as the token acceptance rate instead of 1): (1-0.99^6)/[(1-0.99)*(0.11*0.99+1)] = 5.27x.
I wonder which part of the speculative decoding might be causing the big overhead; any comments are highly appreciated! Thanks!