Add heuristic algorithm for speculative #3006
Conversation
examples/speculative/speculative.cpp
Outdated
} else {
    n_draft -= 1;
    LOG("drafted token rejected, n_draft = %d\n", n_draft);
}
Why does n_draft go up by 2 when all drafted tokens are accepted, but decrease by 1 when a drafted token is rejected? Is there a more efficient algorithm to handle this? The current approach seems similar to a simplified version of the TCP Friendly Rate Control algorithm.
That's pretty much borrowed from Hugging Face's code. We could fine-tune it by tweaking some parameters. Since getting all tokens right is challenging, it seems reasonable to bump up n_draft by 2 when everything aligns and decrease it by 1 otherwise.
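The rule described above can be sketched as a small standalone function. This is only an illustration of the "+2 on full acceptance, -1 otherwise" idea; the name `update_n_draft_basic` and its parameters are hypothetical and do not appear in speculative.cpp.

```cpp
// Hypothetical helper (not part of llama.cpp): returns the draft length
// to use for the next round of speculation.
//   n_draft      - draft length requested in the round that just finished
//   n_drafted    - number of tokens the draft model actually produced
//   all_accepted - true if the target model accepted every drafted token
int update_n_draft_basic(int n_draft, int n_drafted, bool all_accepted) {
    if (n_drafted > 0 && all_accepted) {
        return n_draft + 2; // reward: speculate further next time
    }
    return n_draft - 1;     // penalty: speculate a little less
}
```

For example, `update_n_draft_basic(16, 16, true)` yields 18 and `update_n_draft_basic(16, 16, false)` yields 15. Note that this basic form has no lower bound on the result.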
Yes, this is concerning - it should not happen, so probably a bug somewhere.
I don't understand - currently the entire sequence of drafted tokens is submitted as a single batch to the target model.
I can't reproduce the issue. Can you send the logs where you observe different output between speculation and regular sampling?
examples/speculative/speculative.cpp
Outdated
if (drafted.size() > 0 && all_accepted) {
    n_draft += 2;
    LOG("all drafted tokens accepted, n_draft = %d\n", n_draft);
} else {
    n_draft -= 1;
    LOG("drafted token rejected, n_draft = %d\n", n_draft);
}
n_draft should be constrained to not go below 2 for example and
Here is what I got after restricting the minimum n_draft to 2:
Outputs
// Dijkstra algorithm in C++ (4 spaces indentation + detailed comments) + sample usage:
// 1. Add all nodes to the graph, and add edges between them with distances/weights.
// 2. Call dijkstra(start_node) to get shortest paths from start_node to all other nodes.
// It returns a map of <node, distance>.
// 3. Use path_exists(node) to check if there is a path between start_node and node.
// 4. Use get_shortest_distance(node) to get the shortest distance from start_node to node.
// If no path exists, it returns -1.
// 5. Use reconstruct_path(node) to reconstruct the shortest path between start_node and node.
// It returns a vector of nodes that forms the path (including both start_node and node).
// If no path exists, it returns an empty vector.
#include <iostream>
#include <vector>
#include <map>
#include <set>
#include <queue>
using namespace std;
class Graph {
public:
struct Edge {
int node;
int distance;
};
private:
// adjacency list of the graph (to store all edges)
vector<vector<Edge>> adj_list;
// map to keep track of visited nodes during dijkstra()
map<int, bool> visited;
// map to keep track of distances from start node to other nodes during dijkstra()
map<int, int> distance;
public:
Graph(int n) {
adj_list.resize(n);
}
void add_edge(int src, int dest, int dist) {
Edge edge = {dest, dist};
adj_list[src].push_back(edge);
}
map<int, int> dijkstra(int start) {
// initialize all distances as -1 (inf). 0 is the starting node.
for (int i = 0; i < adj_list.size(); ++i) {
distance[i] = -1;
}
distance[start] = 0;
// create a priority queue to get the minimum distance node,
// visited helps avoid processing same node again
priority_queue<pair<int, int>, vector<pair<int, int>>, greater<pair<int, int>>> pq;
pq.push({0, start});
while (!pq.empty()) {
// get the node with minimum distance from priority queue
auto top = pq.top();
int u = top.second;
pq.pop();
if (visited[u]) continue; // skip already visited nodes
// mark the node as visited
visited[u] = true;
// process all neighbours of the current node 'u'
for (auto edge : adj_list[u]) {
int v = edge.node;
int dist_v = distance[u] + edge.distance; // distance from 'u' to 'v'
// check if there is shorted path to 'v' through 'u'.
if (distance[v] == -1 || distance[v] > dist_v) {
// update the distance of 'v' only if it is not in visited,
// or can be reached with shorter distance from 'u'.
distance[v] = dist_v;
pq.push({dist_v, v}); // add 'v' to priority queue
}
}
}
return distance;
}
bool path_exists(int dest) {
// return true if the destination node is in visited map.
return visited[dest];
}
int get_shortest_distance(int dest) {
// return the shortest distance from start to destination node.
return distance[dest];
}
vector<int> reconstruct_path(int dest) {
// vector to store the path
vector<int> path;
if (!visited[dest]) return path; // no path exists from start node to destination node.
for (int v = dest; v != 0; v = distance[v]) {
path.push_back(v);
}
reverse(path.begin(), path.end());
return path;
}
};
int main() {
// create a graph with 9 nodes (0 to 8)
Graph g(9);
g.add_edge(0, 1, 4);
g.add_edge(0, 7, 8);
g.add_edge(1, 2, 8);
g.add_edge(1, 7, 11);
g.add_edge(2, 3, 7);
g.add_edge(2, 8, 2);
g.add_edge(2, 5, 4);
g.add_edge(3, 4, 9);
g.add_edge(3, 5, 14);
g.add_edge(4, 5, 10);
g.add_edge(5, 6, 2);
g.add_edge(6, 7, 1);
g.add_edge(6, 8, 6);
g.add_edge(7, 8, 7);
// call dijkstra() with start node as 0
auto distance = g.dijkstra(0);
for (int i = 1; i < distance.size(); ++i) {
cout << "Distance from 0 to " << i << ": ";
if (!g.path_exists(i)) cout << "No path exists.";
else cout << g.get_shortest_distance(i);
cout << endl;
}
// print the shortest path from 0 to 8
auto path = g.reconstruct_path(8);
for (int node : path) {
cout << node << " ";
}
cout << endl;
return 0;
}
encoded 24 tokens in 0.418 seconds, speed: 57.356 t/s
decoded 1546 tokens in 109.633 seconds, speed: 14.102 t/s
n_draft = 75
n_predict = 1546
n_drafted = 2261
n_accept = 1263
accept = 55.860%
draft:
llama_print_timings: load time = 725.46 ms
llama_print_timings: sample time = 3855.13 ms / 1 runs ( 3855.13 ms per token, 0.26 tokens per second)
llama_print_timings: prompt eval time = 101.84 ms / 24 tokens ( 4.24 ms per token, 235.67 tokens per second)
llama_print_timings: eval time = 47541.13 ms / 2529 runs ( 18.80 ms per token, 53.20 tokens per second)
llama_print_timings: total time = 110050.98 ms
target:
llama_print_timings: load time = 2122.55 ms
llama_print_timings: sample time = 501.15 ms / 1546 runs ( 0.32 ms per token, 3084.92 tokens per second)
llama_print_timings: prompt eval time = 54614.28 ms / 2495 tokens ( 21.89 ms per token, 45.68 tokens per second)
llama_print_timings: eval time = 2831.66 ms / 72 runs ( 39.33 ms per token, 25.43 tokens per second)
llama_print_timings: total time = 110779.95 ms
Full log: speculative.139912996806656.log
It shows that n_draft never goes below 10 in this case.
For comparison, this run does not use the heuristic algorithm:
Outputs
// Dijkstra algorithm in C++ (4 spaces indentation + detailed comments) + sample usage:
// 1. Add all nodes to the graph, and add edges between them with distances/weights.
// 2. Call dijkstra(start_node) to get shortest paths from start_node to all other nodes.
// It returns a map of <node, distance>.
// 3. Use path_exists(node) to check if there is a path between start_node and node.
// 4. Use get_shortest_distance(node) to get the shortest distance from start_node to node.
// If no path exists, it returns -1.
// 5. Use reconstruct_path(node) to get the shortest path from start_node to node.
// It returns a vector of nodes that make up the path.
#include <iostream>
#include <vector>
#include <map>
#include <set>
#include <queue>
using namespace std;
class Graph {
public:
struct Edge {
int node, distance;
};
// Adds a directed edge between "from" and "to" with the given "distance".
void add_edge(int from, int to, int distance) {
edges[from].push_back({to, distance});
}
// Returns true if there is an edge between "from" and "to", false otherwise.
bool has_edge(int from, int to) const {
for (const Edge& e : edges.at(from)) {
if (e.node == to) return true;
}
return false;
}
// Returns the distance between "from" and "to". If there is no edge, returns -1.
int get_distance(int from, int to) const {
for (const Edge& e : edges.at(from)) {
if (e.node == to) return e.distance;
}
return -1;
}
// Returns a map of <node, distance> representing the shortest paths from "start_node" to all other nodes.
map<int, int> dijkstra(int start_node) const {
priority_queue<pair<int, int>, vector<pair<int, int>>, greater<pair<int, int>>> pq; // (distance, node)
map<int, bool> visited;
map<int, int> distances; // <node, distance>
pq.push({0, start_node});
distances[start_node] = 0;
while (!pq.empty()) {
auto top = pq.top();
int node = top.second;
int distance = top.first;
pq.pop();
if (visited[node]) continue; // already visited this node
visited[node] = true;
// update distances of neighbors
for (const Edge& edge : edges.at(node)) {
int neighbor_node = edge.node;
int new_distance = distance + edge.distance;
if (!distances.count(neighbor_node) || distances[neighbor_node] > new_distance) {
pq.push({new_distance, neighbor_node});
distances[neighbor_node] = new_distance;
}
}
}
return distances;
}
// Returns true if there is a path between "start_node" and "node", false otherwise.
bool path_exists(int start_node, int node) const {
map<int, bool> visited;
queue<int> q;
q.push(start_node);
while (!q.empty()) {
int current = q.front();
q.pop();
if (current == node) return true;
visited[current] = true;
for (const Edge& edge : edges.at(current)) {
int neighbor_node = edge.node;
if (!visited[neighbor_node]) q.push(neighbor_node);
}
}
return false;
}
// Returns the shortest distance from "start_node" to "node". If there is no path, returns -1.
int get_shortest_distance(int start_node, int node) const {
map<int, bool> visited;
queue<pair<int, int>> q; // (distance, node)
q.push({0, start_node});
while (!q.empty()) {
auto top = q.front();
int distance = top.first;
int current = top.second;
q.pop();
if (current == node) return distance;
visited[current] = true;
for (const Edge& edge : edges.at(current)) {
int neighbor_node = edge.node;
int new_distance = distance + edge.distance;
if (!visited[neighbor_node]) q.push({new_distance, neighbor_node});
}
}
return -1;
}
// Returns the shortest path from "start_node" to "node". If there is no path, returns an empty vector.
vector<int> reconstruct_path(int start_node, int node) const {
map<int, bool> visited;
queue<pair<int, int>> q; // (distance, node)
q.push({0, start_node});
// parents[i] is the parent of i in the shortest path from start_node to i.
map<int, int> parents;
while (!q.empty()) {
auto top = q.front();
int distance = top.first;
int current = top.second;
q.pop();
if (current == node) break;
visited[current] = true;
for (const Edge& edge : edges.at(current)) {
int neighbor_node = edge.node;
int new_distance = distance + edge.distance;
if (!visited[neighbor_node]) {
q.push({new_distance, neighbor_node});
parents[neighbor_node] = current;
}
}
}
vector<int> path;
for (int n = node; n != start_node; n = parents.at(n)) {
path.push_back(n);
}
path.push_back(start_node);
reverse(path.begin(), path.end());
return path;
}
private:
// Map of <node, list of edges> representing the graph.
map<int, vector<Edge>> edges;
};
int main() {
Graph g;
g.add_edge(0, 1, 2);
g.add_edge(0, 3, 4);
g.add_edge(1, 2, 5);
g.add_edge(1, 3, 6);
g.add_edge(2, 3, 7);
g.add_edge(2, 4, 8);
g.add_edge(3, 4, 9);
map<int, int> distances = g.dijkstra(0);
for (auto [node, distance] : distances) {
cout << "Distance from 0 to " << node << ": " << distance << endl;
}
cout << boolalpha;
cout << "Path exists between 0 and 4: " << g.path_exists(0, 4) << endl;
cout << "Shortest distance from 0 to 4: " << g.get_shortest_distance(0, 4) << endl;
vector<int> path = g.reconstruct_path(0, 4);
for (int node : path) {
cout << node << " ";
}
cout << endl;
}
encoded 24 tokens in 0.419 seconds, speed: 57.252 t/s
decoded 2071 tokens in 126.799 seconds, speed: 16.333 t/s
n_draft = 16
n_predict = 2071
n_drafted = 2400
n_accept = 1749
accept = 72.875%
draft:
llama_print_timings: load time = 723.91 ms
llama_print_timings: sample time = 4027.20 ms / 1 runs ( 4027.20 ms per token, 0.25 tokens per second)
llama_print_timings: prompt eval time = 101.86 ms / 24 tokens ( 4.24 ms per token, 235.61 tokens per second)
llama_print_timings: eval time = 51436.25 ms / 2638 runs ( 19.50 ms per token, 51.29 tokens per second)
llama_print_timings: total time = 127217.59 ms
target:
llama_print_timings: load time = 2103.87 ms
llama_print_timings: sample time = 687.69 ms / 2071 runs ( 0.33 ms per token, 3011.53 tokens per second)
llama_print_timings: prompt eval time = 67933.80 ms / 2687 tokens ( 25.28 ms per token, 39.55 tokens per second)
llama_print_timings: eval time = 2321.62 ms / 58 runs ( 40.03 ms per token, 24.98 tokens per second)
llama_print_timings: total time = 127945.07 ms
Full log: speculative.140657253044224.log
The models I am using are: codellama-7b.Q4_K_M.gguf and codellama-34b.Q4_K_M.gguf.
Also, can you try using --temp -1 instead of --top_k 1 and see if the problem persists?
I am using the cuBLAS backend, and I got the same output after changing --top_k 1 to --temp -1.
With heuristic: speculative-a.140629255983104.log
Without heuristic: speculative-b.140469514129408.log
Thanks for clarifying! I initially thought the current setup created a sequence using
I should mention that as of right now the CUDA code is very much not optimized for the use of draft models so it may be better to benchmark the performance using other backends. Also I would intuitively assume that given enough optimization there will be no benefit to using a draft shorter than 8 tokens because that is the minimum length to fully utilize tensor cores (as of right now tensor cores are not used at all).
I did the following experiment: Run # Q4_0 7B
# batch sizes: 16, 32, 64, 128, 256, 512
# Metal:
[1]4.3263,[2]4.8290,[3]5.4475,[4]6.0514,[5]6.1813,[6]6.0808,[7]6.2560,[8]6.3670,[9]6.7256,[10]6.9356
[1]4.3263,[2]4.8291,[3]5.4476,[4]6.0515,[5]6.1814,[6]6.0809,[7]6.2560,[8]6.3670,[9]6.7256,[10]6.9356
[1]4.3261,[2]4.8290,[3]5.4475,[4]6.0514,[5]6.1813,[6]6.0808,[7]6.2560,[8]6.3669,[9]6.7256,[10]6.9356
[1]4.3263,[2]4.8291,[3]5.4476,[4]6.0515,[5]6.1814,[6]6.0809,[7]6.2561,[8]6.3670,[9]6.7256,[10]6.9356
[1]4.3263,[2]4.8290,[3]5.4476,[4]6.0515,[5]6.1814,[6]6.0809,[7]6.2560,[8]6.3670,[9]6.7256,[10]6.9356
[1]4.3264,[2]4.8291,[3]5.4476,[4]6.0515,[5]6.1814,[6]6.0809,[7]6.2561,[8]6.3670,[9]6.7256,[10]6.9356
# CPU (M2, LLAMA_ACCELERATE=OFF):
[1]4.3233,[2]4.8256,[3]5.4456,[4]6.0456,[5]6.1772,[6]6.0762 # SIMD is off for n_batch = 16 (ggml_vec_dot_f16)
[1]4.3214,[2]4.8286,[3]5.4463,[4]6.0497,[5]6.1802,[6]6.0800
[1]4.3214,[2]4.8286,[3]5.4463,[4]6.0497,[5]6.1802,[6]6.0800
[1]4.3214,[2]4.8286,[3]5.4463,[4]6.0497,[5]6.1802,[6]6.0800
[1]4.3214,[2]4.8286,[3]5.4463,[4]6.0497,[5]6.1802,[6]6.0800
[1]4.3214,[2]4.8286,[3]5.4463,[4]6.0497,[5]6.1802,[6]6.0800
# CPU (M2, LLAMA_ACCELERATE=ON):
[1]4.3233,[2]4.8256,[3]5.4456,[4]6.0456,[5]6.1772
[1]4.3256,[2]4.8287,[3]5.4475,[4]6.0515,[5]6.1813
[1]4.3258,[2]4.8288,[3]5.4475,[4]6.0515,[5]6.1813
[1]4.3253,[2]4.8284,[3]5.4470,[4]6.0511,[5]6.1810
[1]4.3256,[2]4.8286,[3]5.4472,[4]6.0511,[5]6.1810
[1]4.3257,[2]4.8286,[3]5.4471,[4]6.0511,[5]6.1810
# CUDA:
[1]4.3283,[2]4.8268,[3]5.4451,[4]6.0526,[5]6.1871,[6]6.0874,[7]6.2609,[8]6.3685,[9]6.7238
[1]4.3329,[2]4.8348,[3]5.4534,[4]6.0545,[5]6.1855,[6]6.0867,[7]6.2617,[8]6.3744,[9]6.7305
[1]4.3303,[2]4.8109,[3]5.4355,[4]6.0431,[5]6.1755,[6]6.0727,[7]6.2414,[8]6.3526,[9]6.7111
[1]4.3264,[2]4.8292,[3]5.4521,[4]6.0559,[5]6.1865,[6]6.0894,[7]6.2580,[8]6.3652,[9]6.7194
[1]4.3666,[2]4.8513,[3]5.4581,[4]6.0586,[5]6.1911,[6]6.0899,[7]6.2577,[8]6.3674,[9]6.7188
[1]4.3307,[2]4.8364,[3]5.4609,[4]6.0671,[5]6.1965,[6]6.0940,[7]6.2651,[8]6.3749,[9]6.7282
The CUDA results are much more sensitive to the batch size. Any ideas why that is? I'm also a bit surprised that the CPU results are not identical across batch sizes. In any case, this explains the effect that @leng-yue observes with CUDA speculative decoding.
I don't know why the results vary depending on batch size. Unless I'm missing something the values for different tokens should not be interacting with each other.
I guess this op makes the graph not invariant to the batch size, since the accumulated values in the dot products depends on the Line 2430 in bd33e5a
Is this the only source of the variation and can we improve it somehow?
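One general mechanism that is consistent with the guess above (though not a confirmed diagnosis of this code) is that floating-point addition is not associative: if the batch size changes how a reduction is split into partial sums, the rounded result can change too. The sketch below is a toy illustration in single precision; `chunked_sum` is a hypothetical helper, not a ggml function.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: sums x by first accumulating partial sums over
// blocks of `chunk` elements and then adding the partial sums together.
// Different chunk sizes mean a different order of float additions.
float chunked_sum(const std::vector<float>& x, std::size_t chunk) {
    float total = 0.0f;
    for (std::size_t start = 0; start < x.size(); start += chunk) {
        const std::size_t end = std::min(x.size(), start + chunk);
        float partial = 0.0f;
        for (std::size_t i = start; i < end; ++i) {
            partial += x[i];
        }
        total += partial;
    }
    return total;
}
```

With `x = {1e8f, 1.0f, -1e8f, 1.0f}` and strict IEEE-754 single-precision arithmetic (no fast-math), `chunked_sum(x, 4)` gives 1 while `chunked_sum(x, 2)` gives 0, because 1e8f + 1.0f rounds back to 1e8f. The inputs are contrived to make the effect large; in a real model the differences are tiny per operation but can still flip a greedy argmax.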
The tokens used depend on context size, not batch size.
CPU results are actually invariant to the batch size. I just forgot to disable The question now is why the GPU results are not invariant.
After changing to the CPU backend and disabling LLAMA_ACCELERATE (it likely doesn't work for me because I'm using Linux, not macOS), I still encountered varying results.
If you limit
It's still non-identical. speculative-e.140190271293248.log
        n_draft = std::max(2, n_draft - 1);
    }
}
I've refactored the implementation to be more contained.
Also, we were rewarding the draft even when it hasn't sampled all n_draft tokens, which does not seem correct.
For example, let's say n_draft is 16 currently, but the draft samples just 3 tokens because the "low-probability" check has been triggered. Even if all 3 tokens were accepted, we should not reward the draft model, because this is just a small part of what we asked it to do.
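A minimal sketch of the refined rule described here, under stated assumptions: the draft is rewarded only when it produced the full n_draft tokens and all of them were accepted, and the result is clamped to a minimum of 2. The function name, its parameters, and the choice to leave n_draft unchanged for a short but fully accepted draft are assumptions for illustration, not the merged implementation.

```cpp
#include <algorithm>

// Hypothetical helper (not the merged code): returns the next draft length.
//   n_draft     - draft length requested this round
//   n_drafted   - tokens the draft model actually produced (<= n_draft)
//   n_accepted  - drafted tokens accepted by the target model
//   n_draft_min - lower bound on the draft length (2 in the discussion)
int update_n_draft_refined(int n_draft, int n_drafted, int n_accepted,
                           int n_draft_min = 2) {
    const bool full_draft   = (n_drafted == n_draft);
    const bool all_accepted = (n_drafted > 0 && n_accepted == n_drafted);

    if (full_draft && all_accepted) {
        return n_draft + 2; // the drafter did everything it was asked to do
    }
    if (all_accepted) {
        return n_draft;     // assumption: short draft, neither reward nor penalty
    }
    return std::max(n_draft_min, n_draft - 1); // a drafted token was rejected
}
```

With the example from the comment, `update_n_draft_refined(16, 3, 3)` leaves n_draft at 16 instead of raising it to 18.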
Regarding the reproducibility - we will study this more in #3014
My guess is that this behavior would occur even without the heuristic - probably it's just less likely to happen for some reason when the heuristic is disabled.
In my earlier implementation, rewards were given only when all tokens were accepted. So if only 3 out of 16 tokens are accepted, the n_tokens value would be decreased by 1.
Not exactly. This check does not guarantee you have n_draft tokens accepted:
llama.cpp/examples/speculative/speculative.cpp
Lines 146 to 148 in 98230ef
if (i_dft == (int) drafted.size()) {
    all_accepted = true;
}
The reason is that drafted.size() <= n_draft due to another heuristic of not drafting more tokens if the drafter becomes "unsure":
llama.cpp/examples/speculative/speculative.cpp
Lines 192 to 196 in 98230ef
// too low probability, stop drafting
if (cur_p.data[0].p < 2*cur_p.data[1].p) {
    break;
}
So in the majority of cases, when all_accepted == true, you would have accepted fewer than n_draft tokens. That's why your n_draft would increase so much, even beyond 70 in some cases.
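The runaway growth can be reproduced with a toy simulation. The sketch below assumes the drafter stops at the first position where the top candidate is less than twice as likely as the runner-up (mirroring the quoted check), that the token at that position is not drafted, and that the target accepts every drafted token. Both function names and the probability values are made up for illustration.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical: top2[i] holds (p_best, p_second) at draft position i.
// Returns how many tokens get drafted before the early-stop check fires.
int count_drafted(const std::vector<std::pair<float, float>>& top2, int n_draft) {
    int n = 0;
    for (std::size_t i = 0; i < top2.size() && n < n_draft; ++i) {
        if (top2[i].first < 2 * top2[i].second) {
            break; // too low probability, stop drafting
        }
        ++n;
    }
    return n;
}

// Applies the basic "+2 / -1" rule for `rounds` rounds, assuming every
// drafted token is accepted by the target (an assumption of this sketch).
int simulate_basic_rule(int n_draft, int rounds,
                        const std::vector<std::pair<float, float>>& top2) {
    for (int r = 0; r < rounds; ++r) {
        const int n_drafted = count_drafted(top2, n_draft);
        if (n_drafted > 0) {
            n_draft += 2; // rewarded even though n_drafted < n_draft
        } else {
            n_draft -= 1;
        }
    }
    return n_draft;
}
```

With `top2 = {{0.9f, 0.05f}, {0.8f, 0.1f}, {0.7f, 0.2f}, {0.4f, 0.3f}}` the drafter only ever produces 3 tokens, yet starting from n_draft = 16 the basic rule reaches 76 after 30 rounds.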
Got it.
Should you be expecting the exact same results in the first place? If I understand the paper correctly they only guarantee equivalent results within hardware numerics. This is a direct quote from the paper:
The non-identical output issue is being discussed in #3014; I don't think it is an issue with the heuristic algorithm.
Yes, if our results are different for different batch sizes, we cannot expect 100% reproducible results when using speculative decoding with greedy sampling, because the target model processes the sequence in different batch sizes based on what the draft model provides. The only source of numerical differences for different batch sizes that I currently see (and I'm still not 100% sure about it) is the
Let's move the discussion to #3014
If I understand the previous discussion correctly the issue is that when limiting the draft length the results change. My point is that this is to be expected independently from any potential bugs in the code that change results depending on batch size. @charliexchen can you weigh in on this?
Assuming magically precise and deterministic hardware, Spec Sampling should be identical for greedy target models (this is true even if the drafter is stochastic!). However, if the target model is stochastic then it will deterministically give a different result to normal sampling since it processes the pseudorandom numbers differently. In practice, 8-16bit numerics are annoying and things will likely diverge after a few tokens. If you have any numeric deltas between batches, that will show up for both spec sampling and normal sampling.
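The claim about greedy targets can be checked on toy models. The sketch below is an assumption-laden illustration, not the llama.cpp implementation: models are plain functions from a token context to the next (argmax) token, the target is queried one token at a time instead of in a batch, and the bonus token a real implementation gets after a fully accepted draft is omitted. All names are hypothetical.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

using Tokens = std::vector<int>;
// A deterministic "greedy model": maps a context to its argmax next token.
using GreedyModel = std::function<int(const Tokens&)>;

// Plain greedy decoding with the target model only.
Tokens greedy_decode(const GreedyModel& target, Tokens ctx, int n_new) {
    for (int i = 0; i < n_new; ++i) {
        ctx.push_back(target(ctx));
    }
    return ctx;
}

// Greedy speculative decoding; requires n_draft >= 1.
Tokens speculative_decode(const GreedyModel& target, const GreedyModel& draft,
                          Tokens ctx, int n_new, int n_draft) {
    const std::size_t n_final = ctx.size() + static_cast<std::size_t>(n_new);
    while (ctx.size() < n_final) {
        // 1. The draft model proposes n_draft tokens after the current context.
        Tokens proposal = ctx;
        for (int i = 0; i < n_draft; ++i) {
            proposal.push_back(draft(proposal));
        }
        // 2. The target verifies them in order. The token it emits is always
        //    its own greedy choice; a mismatch only ends the current round.
        for (int i = 0; i < n_draft && ctx.size() < n_final; ++i) {
            const int want = target(ctx);
            const int got  = proposal[ctx.size()];
            ctx.push_back(want);
            if (got != want) {
                break; // drafted token rejected, discard the rest of the draft
            }
        }
    }
    return ctx;
}
```

Because every emitted token is the target's own greedy choice for the current context, the output matches `greedy_decode` for any draft model and any n_draft, as long as the target function is exactly deterministic. The numerical effects discussed in this thread are precisely what breaks that last assumption in practice.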
Should we merge this PR then?
* Add heuristic algo for speculative
* Constrain minimum n_draft to 2
* speculative : improve heuristic impl
* speculative : be more rewarding upon guessing max drafted tokens
* speculative : fix typos
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Based on Hugging Face's assisted generation blog, we've implemented a simple heuristic to determine the number of draft tokens. Specifically, if all draft tokens are accepted, we increase n_draft_token by 2; otherwise, we decrease it by 1. Check out some examples using the original 3 samples from issue #2926.
target model: Code Llama 34B Q4_K_M
draft model: Code Llama 7B Q4_K_M
device: 2x3090, NVLINK, cublas
In Example 2, you'll see that varying n_draft produces different output tokens. This suggests a potential issue, which you can find detailed here:
llama.cpp/examples/speculative/speculative.cpp
Line 196 in e4386f4
To my mind, the optimal approach would be to batch the drafts in groups (e.g., a batch size of 16) and feed them all at once to the target model, rather than concatenating a single draft with the original prompt and feeding it as a sequence.