Speculative Decoding is slower than expected on A100 #3649
Comments
Thank you for the detailed report - very useful information!
I'll try to do the same test today and see if I can find the bottleneck.
I might be missing something, but I think there is an error in the numbers plugged into the equation, because I did some more testing on a V100 16GB GPU using the same models:

import numpy as np
# case 0, alpha = 0.8, gamma = 5
sd_tg = 619.4
st_tg = 52.5
st_pp = 121.7 # pp 5
s_avg = (sd_tg + st_pp)/2
a = 0.80
c = st_tg/s_avg
g = 5
speed = (1 - a**(g + 1))/((1 - a)*(c*g + 1))
print(speed)
# case 0, alpha = 0.875, gamma = 8
sd_tg = 625
st_tg = 52.5
st_pp = 194.6 # pp 8
s_avg = (sd_tg + st_pp)/2
a = 0.875
c = st_tg/s_avg
g = 8
speed = (1 - a**(g + 1))/((1 - a)*(c*g + 1))
print(speed)

Running the speculative example with --draft 5:

make -j && ./bin/speculative -ngl 1000 -ngld 100 -m /mnt/llama.cpp/models/open-llama/7B-v2/ggml-model-f16.gguf -md /mnt/llama.cpp/models/llama-160m/ggml-model-f16.gguf -p "${prompt}" -e --temp -1 -n 256 --repeat-last-n 0 --repeat-penalty 1.0 --draft 5 -np 1
encoded 58 tokens in 0.111 seconds, speed: 521.414 t/s
decoded 261 tokens in 2.967 seconds, speed: 87.954 t/s
n_draft = 5
n_predict = 261
n_drafted = 260
n_accept = 208
accept = 80.000%
draft:
llama_print_timings: load time = 125.62 ms
llama_print_timings: sample time = 9.58 ms / 260 runs ( 0.04 ms per token, 27128.55 tokens per second)
llama_print_timings: prompt eval time = 10.95 ms / 58 tokens ( 0.19 ms per token, 5296.80 tokens per second)
llama_print_timings: eval time = 480.14 ms / 261 runs ( 1.84 ms per token, 543.59 tokens per second)
llama_print_timings: total time = 3079.20 ms
target:
llama_print_timings: load time = 2827.23 ms
llama_print_timings: sample time = 9.84 ms / 261 runs ( 0.04 ms per token, 26513.61 tokens per second)
llama_print_timings: prompt eval time = 2387.92 ms / 369 tokens ( 6.47 ms per token, 154.53 tokens per second)
llama_print_timings: eval time = 19.31 ms / 1 runs ( 19.31 ms per token, 51.80 tokens per second)
llama_print_timings: total time = 3219.78 ms

Running the speculative example with --draft 8:

make -j && ./bin/speculative -ngl 1000 -ngld 100 -m /mnt/llama.cpp/models/open-llama/7B-v2/ggml-model-f16.gguf -md /mnt/llama.cpp/models/llama-160m/ggml-model-f16.gguf -p "${prompt}" -e --temp -1 -n 256 --repeat-last-n 0 --repeat-penalty 1.0 --draft 8 -np 1
encoded 58 tokens in 0.110 seconds, speed: 525.272 t/s
decoded 257 tokens in 2.083 seconds, speed: 123.394 t/s
n_draft = 8
n_predict = 257
n_drafted = 256
n_accept = 224
accept = 87.500%
draft:
llama_print_timings: load time = 125.64 ms
llama_print_timings: sample time = 9.42 ms / 256 runs ( 0.04 ms per token, 27190.65 tokens per second)
llama_print_timings: prompt eval time = 10.97 ms / 58 tokens ( 0.19 ms per token, 5286.66 tokens per second)
llama_print_timings: eval time = 446.25 ms / 257 runs ( 1.74 ms per token, 575.91 tokens per second)
llama_print_timings: total time = 2193.53 ms
target:
llama_print_timings: load time = 2770.13 ms
llama_print_timings: sample time = 11.02 ms / 257 runs ( 0.04 ms per token, 23323.35 tokens per second)
llama_print_timings: prompt eval time = 1543.14 ms / 345 tokens ( 4.47 ms per token, 223.57 tokens per second)
llama_print_timings: eval time = 19.56 ms / 1 runs ( 19.56 ms per token, 51.12 tokens per second)
llama_print_timings: total time = 2334.47 ms

I am using the branch in #3624 with the following changes:

--- a/examples/speculative/speculative.cpp
+++ b/examples/speculative/speculative.cpp
@@ -174,7 +174,7 @@ int main(int argc, char ** argv) {
continue;
}
- if (i_dft < (int) drafts[s].tokens.size() && id == drafts[s].tokens[i_dft]) {
+ if (i_dft < (int) drafts[s].tokens.size() - 1) {
LOG("the sampled target token matches the %dth drafted token of sequence %d (%d, '%s') - accepted\n", i_dft, s, id, token_str.c_str());
s_keep = s;
@@ -273,11 +273,11 @@ int main(int argc, char ** argv) {
}
// TODO: make this configurable
- if (cur_p[0].p < 0.4) {
- LOG("stopping drafting for seq %3d, probability too low: %.3f < 2*%.3f\n", s, cur_p[0].p, cur_p[1].p);
- drafts[s].drafting = false;
- continue;
- }
+ //if (cur_p[0].p < 0.4) {
+ // LOG("stopping drafting for seq %3d, probability too low: %.3f < 2*%.3f\n", s, cur_p[0].p, cur_p[1].p);
+ // drafts[s].drafting = false;
+ // continue;
+ //}
std::vector<int> sa(1, s);

To determine the draft and target model speeds used in the equation, I ran llama-bench:

# draft model text-generation bench
./bin/llama-bench -m /mnt/llama.cpp/models/llama-160m/ggml-model-f16.gguf -p 0 -n 128 -ngl 99
Device 0: Tesla V100-PCIE-16GB, compute capability 7.0
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------- | ---------------: |
| llama ?B mostly F16 | 309.82 MiB | 162.42 M | CUDA | 99 | tg 128 | 619.43 ± 20.03 |

# target model PP and TG bench
./bin/llama-bench -m /mnt/llama.cpp/models/open-llama/7B-v2/ggml-model-f16.gguf -p 1,2,3,4,5,6,7,8,64,128,256,512 -n 128 -ngl 99
Device 0: Tesla V100-PCIE-16GB, compute capability 7.0
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------- | ---------------: |
| llama 7B mostly F16 | 12.55 GiB | 6.74 B | CUDA | 99 | pp 1 | 30.72 ± 3.88 |
| llama 7B mostly F16 | 12.55 GiB | 6.74 B | CUDA | 99 | pp 2 | 51.22 ± 0.43 |
| llama 7B mostly F16 | 12.55 GiB | 6.74 B | CUDA | 99 | pp 3 | 75.92 ± 0.64 |
| llama 7B mostly F16 | 12.55 GiB | 6.74 B | CUDA | 99 | pp 4 | 101.66 ± 0.29 |
| llama 7B mostly F16 | 12.55 GiB | 6.74 B | CUDA | 99 | pp 5 | 121.74 ± 0.55 |
| llama 7B mostly F16 | 12.55 GiB | 6.74 B | CUDA | 99 | pp 6 | 146.37 ± 0.35 |
| llama 7B mostly F16 | 12.55 GiB | 6.74 B | CUDA | 99 | pp 7 | 163.65 ± 0.46 |
| llama 7B mostly F16 | 12.55 GiB | 6.74 B | CUDA | 99 | pp 8 | 194.58 ± 0.72 |
| llama 7B mostly F16 | 12.55 GiB | 6.74 B | CUDA | 99 | pp 64 | 899.08 ± 23.64 |
| llama 7B mostly F16 | 12.55 GiB | 6.74 B | CUDA | 99 | pp 128 | 1708.83 ± 56.78 |
| llama 7B mostly F16 | 12.55 GiB | 6.74 B | CUDA | 99 | pp 256 | 2515.20 ± 10.46 |
| llama 7B mostly F16 | 12.55 GiB | 6.74 B | CUDA | 99 | pp 512 | 2866.48 ± 1.45 |
| llama 7B mostly F16 | 12.55 GiB | 6.74 B | CUDA | 99 | tg 128 | 52.49 ± 0.13 |

My understanding is that c should be computed against the average of the draft tg speed and the target pp speed at the draft size (the s_avg in the script above). The theoretical results computed this way are as follows:

python3 speed.py
2.1594861448542773
2.7629832057356403

The observed speedups, by comparison, are roughly 87.95/52.49 ≈ 1.68x for draft 5 and 123.39/52.49 ≈ 2.35x for draft 8.
Thanks for the correction - yeah, I plugged in the wrong numbers; your calculation is correct. I will also try to benchmark on V100 today or tomorrow and will let you know the numbers. Thanks for the detailed experiment!
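For reference, here is a quick re-evaluation of the A100 numbers quoted in the original issue, using c*gamma in the denominator (as in the speed.py script above) instead of c*alpha. This is only a sketch reusing values already stated in the issue (c ≈ 76/669, alpha = 0.44, gamma = 5), not a new measurement:

# corrected expected speedup for the A100 case described in the issue
a = 0.44        # token acceptance rate
g = 5           # tokens proposed per step
c = 76 / 669    # draft/target per-token cost ratio (~0.11)
speedup = (1 - a**(g + 1)) / ((1 - a) * (c * g + 1))
print(speedup)  # ~1.13x, much lower than the 1.69x obtained with c*alpha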
Hi - when the token acceptance rate is ~60%, I'm still confused about the performance on A100.
# p - probability to accept a token
#
# probability to accept M tokens from a draft of N:
#
# M probability
# 0 (1-p)
# 1 (1-p)*p
# 2 (1-p)*p^2
# ...
# N-1 (1-p)*p^(N-1)
# N p^N
#
# expectation:
#
# E[X] = 0*(1-p) + 1*p*(1-p) + 2*p^2*(1-p) + 3*p^3*(1-p) + ... + (N-1)*p^(N-1)*(1-p) + N*p^N
#
import numpy as np
import sys
N = int(sys.argv[1])
p = float(sys.argv[2])
print("N = ", N)
print("p = ", p)
E = 0
for i in range(N):
    E += i * p**i * (1 - p)
E += N * p**N
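# note: E has the closed form p*(1 - p**N)/(1 - p),
# e.g. N = 8, p = 0.95 gives ~6.4, matching the loop above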
print("E = ", round(E, 2), " (", round(100*(E/N), 2), "% )")

So for a draft size of 8, you can use p = 0.95 to get roughly 80% acceptance:

$ python3 expect.py 8 0.95
N = 8
p = 0.95
E = 6.4 ( 79.94 % )

Here is the diff on speculative.cpp that I used:

--- a/examples/speculative/speculative.cpp
+++ b/examples/speculative/speculative.cpp
@@ -8,6 +8,10 @@
#include <string>
#include <vector>
+static float frand() {
+ return (float) rand() / RAND_MAX;
+}
+
struct seq_draft {
bool active = false;
bool drafting = false;
@@ -37,7 +41,7 @@ int main(int argc, char ** argv) {
const int n_seq_dft = params.n_parallel;
// TODO: make this configurable
- const float p_accept = 0.80f;
+ const float p_accept = -1.0f; // always draft n_draft tokens
const float p_split = 0.10f;
#ifndef LOG_DISABLE_LOGS
@@ -178,7 +182,7 @@ int main(int argc, char ** argv) {
continue;
}
- if (i_dft < (int) drafts[s].tokens.size() && id == drafts[s].tokens[i_dft]) {
+ if (i_dft < (int) drafts[s].tokens.size() && frand() < 0.95) { // use the python script to find the value that will give you the desired acceptance rate
LOG("the sampled target token matches the %dth drafted token of sequence %d (%d, '%s') - accepted\n", i_dft, s, id, token_str.c_str());
s_keep = s;

Alternatively, you can do what I did in my previous comment - simply accept the first 80% of the drafted tokens:

const int n_seq_dft = params.n_parallel;
// TODO: make this configurable
- const float p_accept = 0.80f;
+ const float p_accept = -1.0f; // always draft n_draft tokens
const float p_split = 0.10f;
#ifndef LOG_DISABLE_LOGS
@@ -178,7 +182,7 @@ int main(int argc, char ** argv) {
continue;
}
- if (i_dft < (int) drafts[s].tokens.size() && id == drafts[s].tokens[i_dft]) {
+ if (i_dft < 0.8*(int) drafts[s].tokens.size()) {
LOG("the sampled target token matches the %dth drafted token of sequence %d (%d, '%s') - accepted\n", i_dft, s, id, token_str.c_str());
s_keep = s;
With #3749 now merged, the batched decoding performance for F16 models has been significantly improved. A few speculative decoding tests on A100 from today achieve 2-3x speed-up using Codellama 34B Target + Codellama 7B Q4_0 Draft; here are some examples: https://twitter.com/ggerganov/status/1716727296269193702

@LiuXiaoxuanPKU Let us know if you attempt more A100 experiments, and make sure to use the latest version of llama.cpp.
Hi @ggerganov, I have some more data for you. I tried to speed up llama-2 70b with either 13b or 7b as the draft model - in both cases to no avail.

llama-13b-chat as draft model:

./speculative -m ../../../llama-2-70b-chat.Q6_K.gguf --threads 1 --n-gpu-layers 999 -md ../../../llama-2-13b-chat.Q8_0.gguf --n-gpu-layers-draft 999 -n 500 --prompt "<s>[INST]\nTell me about Joe Biden. [/INST] " --draft 8 --ctx-size 4096
<s>[INST]\nTell me about Joe Biden. [/INST] Joe Biden is the 46th President of the United States. He was born on November 20, 1942, in Scranton, Pennsylvania. He served as Vice President under Barack Obama from 2009 to 2017 and was elected President in 2020.
Biden earned a bachelor's degree from the University of Delaware and a law degree from Syracuse University. Before entering politics, he worked as a lawyer and served on the Senate staff. In 1970, he was elected to the New Castle County Council, and in 1972, he was elected to the United States Senate, where he served for six terms until 2009.
During his time in the Senate, Biden focused on issues related to criminal justice, foreign policy, and the rights of people with disabilities. He also served as chair of the Senate Foreign Relations Committee and was a strong advocate for the Violence Against Women Act.
In 2008, Biden was chosen by Barack Obama as his running mate in the presidential election. They won the election and served two terms together, during which time Biden focused on issues related to foreign policy and national security.
After leaving office, Biden established the Biden Foundation, a nonprofit organization focused on issues related to education, LGBTQ rights, and the prevention of sexual assault. He also began teaching at the University of Pennsylvania and authored several books.
In 2019, Biden announced his candidacy for the 2020 presidential election. He ran as a moderate Democrat, focusing on issues related to healthcare, education, and the economy. He won the nomination and went on to defeat incumbent President Donald Trump in the general election.
Biden's presidency has been marked by several significant accomplishments, including the passage of the American Rescue Plan, a $1.9 trillion stimulus package aimed at addressing the COVID-19 pandemic and economic downturn. He has also taken executive action to address climate change, expand access to healthcare, and protect the rights of LGBTQ individuals.
Biden has also pursued a number of foreign policy initiatives, including the
encoded 20 tokens in 0.607 seconds, speed: 32.959 t/s
decoded 503 tokens in 33.435 seconds, speed: 15.044 t/s
n_draft = 8
n_predict = 503
n_drafted = 471
n_accept = 399
accept = 84.713%
draft:
llama_print_timings: load time = 5963.01 ms
llama_print_timings: sample time = 1645.45 ms / 1 runs ( 1645.45 ms per token, 0.61 tokens per second)
llama_print_timings: prompt eval time = 67.36 ms / 20 tokens ( 3.37 ms per token, 296.92 tokens per second)
llama_print_timings: eval time = 9664.61 ms / 575 runs ( 16.81 ms per token, 59.50 tokens per second)
llama_print_timings: total time = 34042.43 ms
target:
llama_print_timings: load time = 16015.05 ms
llama_print_timings: sample time = 275.97 ms / 503 runs ( 0.55 ms per token, 1822.64 tokens per second)
llama_print_timings: prompt eval time = 21001.56 ms / 578 tokens ( 36.33 ms per token, 27.52 tokens per second)
llama_print_timings: eval time = 1092.52 ms / 16 runs ( 68.28 ms per token, 14.65 tokens per second)

llama-7b-chat as draft model:

./speculative -m ../../../llama-2-70b-chat.Q6_K.gguf --threads 1 --n-gpu-layers 999 -md ../../../llama-2-7b-chat.Q8_0.gguf --n-gpu-layers-draft 999 -n 500 --prompt "<s>[INST]\nTell me about Joe Biden. [/INST] " --draft 8 --ctx-size 4096
<s>[INST]\nTell me about Joe Biden. [/INST] Joe Biden is the 46th President of the United States. He was born on November 20, 1942, in Scranton, Pennsylvania. Biden served as Vice President under Barack Obama from 2009 to 2017 and represented Delaware in the United States Senate from 1973 to 2009. He is a member of the Democratic Party.
Biden graduated from the University of Delaware and Syracuse University College of Law. Before entering politics, he worked as a lawyer and served on the Senate staff. In 1972, Biden was elected to the New Castle County Council, and in 1970, he ran for the United States Senate, but he lost to incumbent Senator J. Caleb Boggs.
Biden was first elected to the Senate in 1972, at the age of 29, making him the youngest person to be elected to the Senate at the time. He served in the Senate for six terms, becoming one of the longest-serving Senators in American history. During his time in the Senate, Biden focused on issues related to criminal justice, foreign policy, and the rights of people with disabilities.
In 2008, Biden was chosen by Barack Obama as his running mate in the presidential election. They won the election, and Biden served as Vice President from 2009 to 2017. As Vice President, Biden focused on issues related to foreign policy, national security, and the economy.
In 2015, Biden announced that he would not run for President in the 2016 election, but he remained a prominent figure in the Democratic Party. In 2019, he announced his candidacy for the 2020 presidential election, and he won the nomination at the 2020 Democratic National Convention. Biden went on to defeat incumbent President Donald Trump in the general election, becoming the oldest person to be elected President of the United States at the age of 78.
Biden's presidency has focused on issues such as COVID-19 pandemic response, economic recovery, and addressing climate change. He has also taken steps to reform the immigration system,
encoded 20 tokens in 0.595 seconds, speed: 33.617 t/s
decoded 503 tokens in 30.448 seconds, speed: 16.520 t/s
n_draft = 8
n_predict = 503
n_drafted = 426
n_accept = 384
accept = 90.141%
draft:
llama_print_timings: load time = 3155.88 ms
llama_print_timings: sample time = 1575.41 ms / 1 runs ( 1575.41 ms per token, 0.63 tokens per second)
llama_print_timings: prompt eval time = 36.19 ms / 20 tokens ( 1.81 ms per token, 552.72 tokens per second)
llama_print_timings: eval time = 6042.37 ms / 545 runs ( 11.09 ms per token, 90.20 tokens per second)
llama_print_timings: total time = 31043.23 ms
target:
llama_print_timings: load time = 15269.15 ms
llama_print_timings: sample time = 266.29 ms / 503 runs ( 0.53 ms per token, 1888.89 tokens per second)
llama_print_timings: prompt eval time = 21196.78 ms / 540 tokens ( 39.25 ms per token, 25.48 tokens per second)
llama_print_timings: eval time = 1635.22 ms / 24 runs ( 68.13 ms per token, 14.68 tokens per second)
llama_print_timings: total time = 34226.27 ms

This is running the latest llama.cpp as of now. I also tried adjusting other settings, to no avail.
Here is how to read the numbers:
Therefore we have:
So this should give the expected total speed for this setup.

You don't see a significant speedup likely because the drafts are evaluated in very small batches. Maybe try to reduce the acceptance threshold:

llama.cpp/examples/speculative/speculative.cpp, lines 42 to 43 in 207b519
It's currently hardcoded, so you will have to edit the code and recompile. Maybe try something lower than the current value.

Btw, thanks for looking into this. Definitely let us know your observations. I'm interested in this technique, and the current example can probably be improved in many ways.
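To make the small-batch point concrete, here is a rough estimate for the 70B Q6_K + 13B Q8_0 run above, reusing the formula and averaging convention from the earlier V100 comment. The mapping of the posted timings onto sd_tg/st_tg/st_pp is my own reading of the logs, so treat this as an approximation rather than a measurement:

sd_tg = 59.50   # draft eval speed, t/s (from the draft timings)
st_tg = 14.65   # target single-token eval speed, t/s
st_pp = 27.52   # target speed when evaluating the drafted batches, t/s
a = 0.847       # observed acceptance rate
g = 8           # draft size

s_avg = (sd_tg + st_pp) / 2
c = st_tg / s_avg
speedup = (1 - a**(g + 1)) / ((1 - a) * (c * g + 1))
print(speedup)  # ~1.37x expected at best with such slow batched evaluation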
Could you also post the result from this command:

LLAMA_CUBLAS=1 make -j && ./batched-bench ../../../llama-2-70b-chat.Q6_K.gguf 4096 1 99 1 512 128 1,2,3,4,5,6,7,8,16,32,64
Thanks for explaining the numbers to me!
@ggerganov I use cmake. Can you give me the equivalent? And I will run it.
It's not really important; it just enables parallel builds (so it makes compiling faster, the results are exactly the same). The cmake build step is essentially the same, just with the parallel-build flag.
@ggerganov @KerfuffleV2 Here's the result of the batched-bench:
llama_print_timings: load time = 16006.36 ms
Hm, yeah - the batched decoding performance for quantum models such as Q6_K is not great.
@ggerganov
I think we will invest more effort in improving the quantum batched decoding performance after we finish some improvements to the GPU interface that are currently being developed. The solution will likely require implementing custom kernels and ops, so it will need a deeper understanding of the CUDA implementation and will probably involve some significant changes. As I mentioned, you can test speculative sampling using an F16 target model with any quantum draft model and see how it performs. This would at least give confidence that the implemented strategy is working as it should (which is the main problem discussed in this issue), and later, when the quantum batched decoding is improved, similar gains would be expected.
Hi @ggerganov, I did some tests with the speculative example, and some quantizations appear to be fine when running 72% of the model in VRAM. Running this setup in the speculative example with a 70B Q3_K_S, offloading 57 layers to a 3090, I get a 1.3x speedup on all chat formats. This is the same speedup factor I am getting with pure CPU speculative sampling on this model (1.3x, where 1.5 t/s goes to 2 t/s). It's a general speedup and shouldn't be limited to coding examples; it works for instruct/chat formats.

I also tried exllamav2's speculative sampling examples and sampling parameters. These give a speedup of mostly 1.5x on chat responses; when running the Quicksort code example, I get 2-3x. I also played with the 7B fp16 Medusa model via their command-line interface with default settings. This gave a consistent speedup of mostly 2x on chat responses compared to the original transformers.
This issue was closed because it has been inactive for 14 days since being marked as stale.
Thanks for the great project! I am benchmarking the performance of llama.cpp with speculative decoding.
When I benchmark on Mac M1 chip, the results look great: speculative decoding increases the speed from ~12 tokens/s to ~16 tokens/s.
However, the performance is not very good on A100. Concretely, the target model decodes at roughly 76 tokens/s and the draft model at roughly 669 tokens/s.
I am using greedy decoding and disabling all the heuristics (fixed n_draft, always propose n_draft tokens, and avoid early stopping). My execution cmd is:

When the token acceptance rate is 0.44, speculative decoding is actually slower (notice 50 tokens/s < 75 tokens/s).
However, based on the original speculative decoding paper, the expected speedup is

(1 - alpha^(gamma+1)) / [(1 - alpha) * (c*gamma + 1)]

where alpha is the token acceptance rate, gamma is the number of tokens proposed each step, and c is the ratio between the execution times of the draft and target models. In the example above, c is roughly 76/669 = 0.11. Plugging in the numbers above, the expected speedup should be (1-0.44^6)/[(1-0.44)*(0.11*0.44+1)] = 1.69x.
However, the benchmarking results show that it's actually 50/76 = 0.66x.

To debug this, I set the token acceptance rate to 100% by removing the id == draft_id[i_dft] check here. After doing this, I observe that the speed is ~90 tokens/s, which gives a 90/76 = 1.18x speedup. However, this is much smaller than the calculation with the formula above (using 0.99 as the token acceptance rate instead of 1): (1-0.99^6)/[(1-0.99)*(0.11*0.99+1)] = 5.27x.
I wonder which part of the speculative decoding might be causing the big overhead; any comments are highly appreciated! Thanks!