
Add heuristic algorithm for speculative #3006

Merged
ggerganov merged 6 commits into ggerganov:master from add-speculative-heuristic
Sep 14, 2023

Conversation

leng-yue
Contributor

@leng-yue leng-yue commented Sep 4, 2023

Based on Hugging Face's assisted generation blog, we've implemented a simple heuristic to determine the number of draft tokens. Specifically, if all draft tokens are accepted, we increase n_draft_token by 2; otherwise, we decrease it by 1. Check out some examples using the original 3 samples from issue #2926.

target model: Code Llama 34B Q4_K_M
draft model: Code Llama 7B Q4_K_M
device: 2x3090, NVLINK, cublas

# example 0
./speculative \
-m models/codellama-34b.Q4_K_M.gguf \
-md models/codellama-7b.Q4_K_M.gguf \
-p "// Quick-sort implementation in C (4 spaces indentation + detailed comments) and sample usage:\n\n#include" \
-e -ngl 1000 -t 4 -n 256 -c 4096 -s 8 --top_k 1 --draft 16

## With heuristic
n_draft   = 53
n_predict = 266
n_drafted = 245
n_accept  = 231
accept    = 94.286%

decoded  266 tokens in   11.990 seconds, speed:   22.185 t/s

## Without heuristic
n_draft   = 16
n_predict = 266
n_drafted = 254
n_accept  = 223
accept    = 87.795%

decoded  266 tokens in   13.234 seconds, speed:   20.099 t/s

# example 1
./speculative \
-m models/codellama-34b.Q4_K_M.gguf \
-md models/codellama-7b.Q4_K_M.gguf \
-p "// Dijkstra algorithm in C++ (4 spaces indentation + detailed comments) + sample usage:\n\n" \
-e -ngl 1000 -t 4 -n 4096 -c 4096 -s 20 --top_k 1 --draft 16

## With heuristic
n_draft   = 75
n_predict = 1546
n_drafted = 2261
n_accept  = 1263
accept    = 55.860%
decoded 1546 tokens in  109.811 seconds, speed:   14.079 t/s

## Without heuristic
n_draft   = 16
n_predict = 2071
n_drafted = 2400
n_accept  = 1749
accept    = 72.875%

decoded 2071 tokens in  126.508 seconds, speed:   16.370 t/s

# example 2
./speculative \
-m models/codellama-34b.Q4_K_M.gguf \
-md models/codellama-7b.Q4_K_M.gguf \
-p "# Dijkstra's shortest path algorithm in Python (4 spaces indentation) + complexity analysis:\n\n" \
-e -ngl 1000 -t 4 -n 512 -c 4096 -s 20 --top_k 1 --draft 16

## With heuristic
n_draft   = 38
n_predict = 219
n_drafted = 203
n_accept  = 193
accept    = 95.074%

decoded  219 tokens in    9.449 seconds, speed:   23.177 t/s

## Without heuristic
n_draft   = 16
n_predict = 219
n_drafted = 229
n_accept  = 187
accept    = 81.659%

decoded  219 tokens in   11.285 seconds, speed:   19.406 t/s
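
The accept rates and relative speeds can be recomputed from the figures reported above. A minimal sketch of that arithmetic follows; the `Run` struct and the helper names are illustrative only and are not part of llama.cpp.

```cpp
#include <cstdio>

// One benchmark run as reported above (field names are illustrative).
struct Run {
    int    n_drafted; // tokens proposed by the draft model
    int    n_accept;  // proposed tokens accepted by the target model
    double speed;     // decoded tokens per second
};

// Fraction of drafted tokens that the target model accepted.
double accept_rate(const Run & r) {
    return static_cast<double>(r.n_accept) / r.n_drafted;
}

// Speed of `a` relative to `b` (a value above 1 means `a` is faster).
double speedup(const Run & a, const Run & b) {
    return a.speed / b.speed;
}

// Print accept rate and relative speed for the three examples above.
void print_summary() {
    const Run with_h[3]    = {{245, 231, 22.185}, {2261, 1263, 14.079}, {203, 193, 23.177}};
    const Run without_h[3] = {{254, 223, 20.099}, {2400, 1749, 16.370}, {229, 187, 19.406}};
    for (int i = 0; i < 3; ++i) {
        std::printf("example %d: accept %.3f%% vs %.3f%%, relative speed x%.3f\n", i,
                    100.0 * accept_rate(with_h[i]), 100.0 * accept_rate(without_h[i]),
                    speedup(with_h[i], without_h[i]));
    }
}
```

With these numbers the heuristic is roughly 10% faster in example 0 and roughly 19% faster in example 2, but roughly 14% slower in example 1, where the accept rate drops from 72.875% to 55.860%.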

In Example 2, you'll see that varying n_draft produces different output tokens. This suggests a potential issue, which you can find detailed here:

llama_eval(ctx_tgt, drafted.data(), drafted.size(), n_past_tgt, params.n_threads);

To my mind, the optimal approach would be to batch the drafts in groups (e.g., a batch size of 16) and feed them all at once to the target model, rather than concatenating a single draft with the original prompt and feeding it as a sequence.

} else {
    n_draft -= 1;
    LOG("drafted token rejected, n_draft = %d\n", n_draft);
}
Contributor

@bobqianic bobqianic Sep 4, 2023

Why does n_draft go up by 2 when all drafted tokens are accepted, but decrease by 1 when a drafted token is rejected? Is there a more efficient algorithm to handle this? The current approach seems similar to a simplified version of the TCP Friendly Rate Control algorithm.

Contributor Author

@leng-yue leng-yue Sep 4, 2023

That's pretty much borrowed from Hugging Face's code. We could fine-tune it by tweaking some parameters. Since getting all tokens right is challenging, it seems reasonable to bump up n_draft by 2 when everything aligns and decrease it by 1 otherwise.
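
To make the asymmetry concrete, here is a minimal sketch of the update rule as described in this thread (+2 when every drafted token is accepted, -1 otherwise). The function names and the `min_draft` floor are assumptions for illustration; the snippet under review has no floor. If a draft is fully accepted with probability p, the expected change per step is 2p - (1 - p) = 3p - 1, so n_draft grows on average whenever p exceeds 1/3.

```cpp
#include <algorithm>

// One step of the heuristic: reward a fully accepted draft, penalize anything else.
// `min_draft` is a hypothetical floor added for this sketch.
int update_n_draft(int n_draft, bool all_accepted, int min_draft) {
    if (all_accepted) {
        return n_draft + 2;
    }
    return std::max(min_draft, n_draft - 1);
}

// Expected change of n_draft per step (ignoring the floor) when a draft
// is fully accepted with probability p.
double expected_drift(double p) {
    return 2.0 * p - (1.0 - p); // equals 3p - 1, which is zero at p = 1/3
}
```

The +2/-1 split therefore biases the draft length upward unless full acceptance is fairly rare, which is one way to read the large final n_draft values in the logs.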

@ggerganov
Owner

In Example 2, you'll see that varying n_draft produces different output tokens.

Yes, this is concerning - it should not happen, so probably a bug somewhere.

To my mind, the optimal approach would be to batch the drafts in groups (e.g., a batch size of 16) and feed them all at once to the target model, rather than concatenating a single draft with the original prompt and feeding it as a sequence.

I don't understand - currently the entire sequence of drafted tokens is submitted as a single batch to the target model.

Owner

@ggerganov ggerganov left a comment

I can't reproduce the issue.

Can you send the logs where you observe the different output between speculation and regular sampling?

Comment on lines 162 to 168
if (drafted.size() > 0 && all_accepted) {
    n_draft += 2;
    LOG("all drafted tokens accepted, n_draft = %d\n", n_draft);
} else {
    n_draft -= 1;
    LOG("drafted token rejected, n_draft = %d\n", n_draft);
}
Owner

n_draft should be constrained to not go below 2 for example and

Contributor Author

Here is what I got after restricting the minimum n_draft to 2:

Outputs
 // Dijkstra algorithm in C++ (4 spaces indentation + detailed comments) + sample usage:

// 1. Add all nodes to the graph, and add edges between them with distances/weights.
// 2. Call dijkstra(start_node) to get shortest paths from start_node to all other nodes.
//    It returns a map of <node, distance>.
// 3. Use path_exists(node) to check if there is a path between start_node and node.
// 4. Use get_shortest_distance(node) to get the shortest distance from start_node to node.
//    If no path exists, it returns -1.
// 5. Use reconstruct_path(node) to reconstruct the shortest path between start_node and node.
//    It returns a vector of nodes that forms the path (including both start_node and node).
//    If no path exists, it returns an empty vector.

#include <iostream>
#include <vector>
#include <map>
#include <set>
#include <queue>
using namespace std;

class Graph {
public:
        struct Edge {
                int node;
                int distance;
        };

private:
        // adjacency list of the graph (to store all edges)
        vector<vector<Edge>> adj_list;

        // map to keep track of visited nodes during dijkstra()
        map<int, bool> visited;

        // map to keep track of distances from start node to other nodes during dijkstra()
        map<int, int> distance;

public:
        Graph(int n) {
                adj_list.resize(n);
        }

        void add_edge(int src, int dest, int dist) {
                Edge edge = {dest, dist};
                adj_list[src].push_back(edge);
        }

        map<int, int> dijkstra(int start) {
                // initialize all distances as -1 (inf). 0 is the starting node.
                for (int i = 0; i < adj_list.size(); ++i) {
                        distance[i] = -1;
                }
                distance[start] = 0;

                // create a priority queue to get the minimum distance node,
                // visited helps avoid processing same node again
                priority_queue<pair<int, int>, vector<pair<int, int>>, greater<pair<int, int>>> pq;
                pq.push({0, start});

                while (!pq.empty()) {
                        // get the node with minimum distance from priority queue
                        auto top = pq.top();
                        int u = top.second;
                        pq.pop();

                        if (visited[u]) continue; // skip already visited nodes

                        // mark the node as visited
                        visited[u] = true;

                        // process all neighbours of the current node 'u'
                        for (auto edge : adj_list[u]) {
                                int v = edge.node;
                                int dist_v = distance[u] + edge.distance; // distance from 'u' to 'v'

                                // check if there is shorted path to 'v' through 'u'.
                                if (distance[v] == -1 || distance[v] > dist_v) {
                                        // update the distance of 'v' only if it is not in visited,
                                        // or can be reached with shorter distance from 'u'.
                                        distance[v] = dist_v;
                                        pq.push({dist_v, v}); // add 'v' to priority queue
                                }
                        }
                }

                return distance;
        }

        bool path_exists(int dest) {
                // return true if the destination node is in visited map.
                return visited[dest];
        }

        int get_shortest_distance(int dest) {
                // return the shortest distance from start to destination node.
                return distance[dest];
        }

        vector<int> reconstruct_path(int dest) {
                // vector to store the path
                vector<int> path;

                if (!visited[dest]) return path; // no path exists from start node to destination node.

                for (int v = dest; v != 0; v = distance[v]) {
                        path.push_back(v);
                }
                reverse(path.begin(), path.end());

                return path;
        }
};

int main() {
        // create a graph with 9 nodes (0 to 8)
        Graph g(9);

        g.add_edge(0, 1, 4);
        g.add_edge(0, 7, 8);
        g.add_edge(1, 2, 8);
        g.add_edge(1, 7, 11);
        g.add_edge(2, 3, 7);
        g.add_edge(2, 8, 2);
        g.add_edge(2, 5, 4);
        g.add_edge(3, 4, 9);
        g.add_edge(3, 5, 14);
        g.add_edge(4, 5, 10);
        g.add_edge(5, 6, 2);
        g.add_edge(6, 7, 1);
        g.add_edge(6, 8, 6);
        g.add_edge(7, 8, 7);

        // call dijkstra() with start node as 0
        auto distance = g.dijkstra(0);

        for (int i = 1; i < distance.size(); ++i) {
                cout << "Distance from 0 to " << i << ": ";
                if (!g.path_exists(i)) cout << "No path exists.";
                else cout << g.get_shortest_distance(i);
                cout << endl;
        }

        // print the shortest path from 0 to 8
        auto path = g.reconstruct_path(8);
        for (int node : path) {
                cout << node << " ";
        }
        cout << endl;

        return 0;
}

encoded   24 tokens in    0.418 seconds, speed:   57.356 t/s
decoded 1546 tokens in  109.633 seconds, speed:   14.102 t/s

n_draft   = 75
n_predict = 1546
n_drafted = 2261
n_accept  = 1263
accept    = 55.860%

draft:

llama_print_timings:        load time =   725.46 ms
llama_print_timings:      sample time =  3855.13 ms /     1 runs   ( 3855.13 ms per token,     0.26 tokens per second)
llama_print_timings: prompt eval time =   101.84 ms /    24 tokens (    4.24 ms per token,   235.67 tokens per second)
llama_print_timings:        eval time = 47541.13 ms /  2529 runs   (   18.80 ms per token,    53.20 tokens per second)
llama_print_timings:       total time = 110050.98 ms

target:

llama_print_timings:        load time =  2122.55 ms
llama_print_timings:      sample time =   501.15 ms /  1546 runs   (    0.32 ms per token,  3084.92 tokens per second)
llama_print_timings: prompt eval time = 54614.28 ms /  2495 tokens (   21.89 ms per token,    45.68 tokens per second)
llama_print_timings:        eval time =  2831.66 ms /    72 runs   (   39.33 ms per token,    25.43 tokens per second)
llama_print_timings:       total time = 110779.95 ms

Full log: speculative.139912996806656.log
It shows that n_draft never goes under 10 in this case.


As a comparison, this one doesn't include the heuristic algorithm:

Outputs
 // Dijkstra algorithm in C++ (4 spaces indentation + detailed comments) + sample usage:

// 1. Add all nodes to the graph, and add edges between them with distances/weights.
// 2. Call dijkstra(start_node) to get shortest paths from start_node to all other nodes.
//    It returns a map of <node, distance>.
// 3. Use path_exists(node) to check if there is a path between start_node and node.
// 4. Use get_shortest_distance(node) to get the shortest distance from start_node to node.
//    If no path exists, it returns -1.
// 5. Use reconstruct_path(node) to get the shortest path from start_node to node.
//    It returns a vector of nodes that make up the path.

#include <iostream>
#include <vector>
#include <map>
#include <set>
#include <queue>
using namespace std;

class Graph {
public:
        struct Edge {
                int node, distance;
        };

        // Adds a directed edge between "from" and "to" with the given "distance".
        void add_edge(int from, int to, int distance) {
                edges[from].push_back({to, distance});
        }

        // Returns true if there is an edge between "from" and "to", false otherwise.
        bool has_edge(int from, int to) const {
                for (const Edge& e : edges.at(from)) {
                        if (e.node == to) return true;
                }
                return false;
        }

        // Returns the distance between "from" and "to". If there is no edge, returns -1.
        int get_distance(int from, int to) const {
                for (const Edge& e : edges.at(from)) {
                        if (e.node == to) return e.distance;
                }
                return -1;
        }

        // Returns a map of <node, distance> representing the shortest paths from "start_node" to all other nodes.
        map<int, int> dijkstra(int start_node) const {
                priority_queue<pair<int, int>, vector<pair<int, int>>, greater<pair<int, int>>> pq; // (distance, node)
                map<int, bool> visited;
                map<int, int> distances; // <node, distance>

                pq.push({0, start_node});
                distances[start_node] = 0;

                while (!pq.empty()) {
                        auto top = pq.top();
                        int node = top.second;
                        int distance = top.first;
                        pq.pop();

                        if (visited[node]) continue; // already visited this node
                        visited[node] = true;

                        // update distances of neighbors
                        for (const Edge& edge : edges.at(node)) {
                                int neighbor_node = edge.node;
                                int new_distance = distance + edge.distance;
                                if (!distances.count(neighbor_node) || distances[neighbor_node] > new_distance) {
                                        pq.push({new_distance, neighbor_node});
                                        distances[neighbor_node] = new_distance;
                                }
                        }
                }

                return distances;
        }

        // Returns true if there is a path between "start_node" and "node", false otherwise.
        bool path_exists(int start_node, int node) const {
                map<int, bool> visited;
                queue<int> q;
                q.push(start_node);

                while (!q.empty()) {
                        int current = q.front();
                        q.pop();

                        if (current == node) return true;
                        visited[current] = true;

                        for (const Edge& edge : edges.at(current)) {
                                int neighbor_node = edge.node;
                                if (!visited[neighbor_node]) q.push(neighbor_node);
                        }
                }

                return false;
        }

        // Returns the shortest distance from "start_node" to "node". If there is no path, returns -1.
        int get_shortest_distance(int start_node, int node) const {
                map<int, bool> visited;
                queue<pair<int, int>> q; // (distance, node)
                q.push({0, start_node});

                while (!q.empty()) {
                        auto top = q.front();
                        int distance = top.first;
                        int current = top.second;
                        q.pop();

                        if (current == node) return distance;
                        visited[current] = true;

                        for (const Edge& edge : edges.at(current)) {
                                int neighbor_node = edge.node;
                                int new_distance = distance + edge.distance;
                                if (!visited[neighbor_node]) q.push({new_distance, neighbor_node});
                        }
                }

                return -1;
        }

        // Returns the shortest path from "start_node" to "node". If there is no path, returns an empty vector.
        vector<int> reconstruct_path(int start_node, int node) const {
                map<int, bool> visited;
                queue<pair<int, int>> q; // (distance, node)
                q.push({0, start_node});

                // parents[i] is the parent of i in the shortest path from start_node to i.
                map<int, int> parents;

                while (!q.empty()) {
                        auto top = q.front();
                        int distance = top.first;
                        int current = top.second;
                        q.pop();

                        if (current == node) break;
                        visited[current] = true;

                        for (const Edge& edge : edges.at(current)) {
                                int neighbor_node = edge.node;
                                int new_distance = distance + edge.distance;
                                if (!visited[neighbor_node]) {
                                        q.push({new_distance, neighbor_node});
                                        parents[neighbor_node] = current;
                                }
                        }
                }

                vector<int> path;
                for (int n = node; n != start_node; n = parents.at(n)) {
                        path.push_back(n);
                }
                path.push_back(start_node);
                reverse(path.begin(), path.end());
                return path;
        }

private:
        // Map of <node, list of edges> representing the graph.
        map<int, vector<Edge>> edges;
};

int main() {
        Graph g;
        g.add_edge(0, 1, 2);
        g.add_edge(0, 3, 4);
        g.add_edge(1, 2, 5);
        g.add_edge(1, 3, 6);
        g.add_edge(2, 3, 7);
        g.add_edge(2, 4, 8);
        g.add_edge(3, 4, 9);

        map<int, int> distances = g.dijkstra(0);
        for (auto [node, distance] : distances) {
                cout << "Distance from 0 to " << node << ": " << distance << endl;
        }

        cout << boolalpha;
        cout << "Path exists between 0 and 4: " << g.path_exists(0, 4) << endl;
        cout << "Shortest distance from 0 to 4: " << g.get_shortest_distance(0, 4) << endl;
        vector<int> path = g.reconstruct_path(0, 4);
        for (int node : path) {
                cout << node << " ";
        }
        cout << endl;
}


encoded   24 tokens in    0.419 seconds, speed:   57.252 t/s
decoded 2071 tokens in  126.799 seconds, speed:   16.333 t/s

n_draft   = 16
n_predict = 2071
n_drafted = 2400
n_accept  = 1749
accept    = 72.875%

draft:

llama_print_timings:        load time =   723.91 ms
llama_print_timings:      sample time =  4027.20 ms /     1 runs   ( 4027.20 ms per token,     0.25 tokens per second)
llama_print_timings: prompt eval time =   101.86 ms /    24 tokens (    4.24 ms per token,   235.61 tokens per second)
llama_print_timings:        eval time = 51436.25 ms /  2638 runs   (   19.50 ms per token,    51.29 tokens per second)
llama_print_timings:       total time = 127217.59 ms

target:

llama_print_timings:        load time =  2103.87 ms
llama_print_timings:      sample time =   687.69 ms /  2071 runs   (    0.33 ms per token,  3011.53 tokens per second)
llama_print_timings: prompt eval time = 67933.80 ms /  2687 tokens (   25.28 ms per token,    39.55 tokens per second)
llama_print_timings:        eval time =  2321.62 ms /    58 runs   (   40.03 ms per token,    24.98 tokens per second)
llama_print_timings:       total time = 127945.07 ms

Full log: speculative.140657253044224.log

Contributor Author

The models I am using are: codellama-7b.Q4_K_M.gguf and codellama-34b.Q4_K_M.gguf.

Owner

@ggerganov ggerganov Sep 4, 2023

Huh, very strange. After ~140 tokens of identical results, the target model samples a different token with 100% probability:

image

What GPU backend do you use? Is this CUDA?

Edit: the 100% probability is actually expected since we are doing --top_k 1 sampling

Owner

Also, can you try using --temp -1 instead of --top_k 1 and see if the problem persists?

Contributor Author

I am using the cuBLAS backend, and I got the same output after changing --top_k 1 to --temp -1.
With heuristic: speculative-a.140629255983104.log
Without heuristic: speculative-b.140469514129408.log

@leng-yue
Contributor Author

leng-yue commented Sep 4, 2023

I don't understand - currently the entire sequence of drafted tokens is submitted as a single batch to the target model.

Thanks for clarifying! I initially thought the current setup created a sequence using N_PREV + N_DRAFT, rather than running batches of N_PREV, N_PREV + 1, N_PREV + 2, ..., N_PREV + N_DRAFT - 1 in parallel.
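
A toy sketch of what submitting the draft as a single batch amounts to under greedy sampling: one target pass over the drafted tokens yields the target's own next-token choice at every position, so every prefix is checked at once and the accepted tokens are the longest matching prefix. Everything here is hypothetical (`target_preds` stands in for the per-position argmax of the target logits); it is not llama.cpp code.

```cpp
#include <cstddef>
#include <vector>

// target_preds[i] is the target model's greedy choice for draft position i,
// given the prompt plus drafted[0..i-1]; a single batched pass produces all of them.
// Returns how many drafted tokens are accepted (length of the longest matching prefix).
std::size_t count_accepted(const std::vector<int> & drafted,
                           const std::vector<int> & target_preds) {
    std::size_t n = 0;
    while (n < drafted.size() && n < target_preds.size() && drafted[n] == target_preds[n]) {
        ++n;
    }
    return n;
}
```

For instance, with drafted tokens {1, 2, 3, 4} and target choices {1, 2, 9, 4}, two tokens are accepted and the target's own token at the first mismatch would replace the third.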

@JohannesGaessler
Collaborator

I should mention that as of right now the CUDA code is very much not optimized for the use of draft models so it may be better to benchmark the performance using other backends. Also I would intuitively assume that given enough optimization there will be no benefit to using a draft shorter than 8 tokens because that is the minimum length to fully utilize tensor cores (as of right now tensor cores are not used at all).

@ggerganov
Owner

ggerganov commented Sep 4, 2023

I did the following experiment:

Run perplexity with the same input, but changing the batch size via the -b parameter.
Here are the results for the first few iterations on different backends:

# Q4_0 7B
# batch sizes: 16, 32, 64, 128, 256, 512

# Metal:

[1]4.3263,[2]4.8290,[3]5.4475,[4]6.0514,[5]6.1813,[6]6.0808,[7]6.2560,[8]6.3670,[9]6.7256,[10]6.9356
[1]4.3263,[2]4.8291,[3]5.4476,[4]6.0515,[5]6.1814,[6]6.0809,[7]6.2560,[8]6.3670,[9]6.7256,[10]6.9356
[1]4.3261,[2]4.8290,[3]5.4475,[4]6.0514,[5]6.1813,[6]6.0808,[7]6.2560,[8]6.3669,[9]6.7256,[10]6.9356
[1]4.3263,[2]4.8291,[3]5.4476,[4]6.0515,[5]6.1814,[6]6.0809,[7]6.2561,[8]6.3670,[9]6.7256,[10]6.9356
[1]4.3263,[2]4.8290,[3]5.4476,[4]6.0515,[5]6.1814,[6]6.0809,[7]6.2560,[8]6.3670,[9]6.7256,[10]6.9356
[1]4.3264,[2]4.8291,[3]5.4476,[4]6.0515,[5]6.1814,[6]6.0809,[7]6.2561,[8]6.3670,[9]6.7256,[10]6.9356

# CPU (M2, LLAMA_ACCELERATE=OFF):

[1]4.3233,[2]4.8256,[3]5.4456,[4]6.0456,[5]6.1772,[6]6.0762  # SIMD is off for n_batch = 16 (ggml_vec_dot_f16)
[1]4.3214,[2]4.8286,[3]5.4463,[4]6.0497,[5]6.1802,[6]6.0800
[1]4.3214,[2]4.8286,[3]5.4463,[4]6.0497,[5]6.1802,[6]6.0800
[1]4.3214,[2]4.8286,[3]5.4463,[4]6.0497,[5]6.1802,[6]6.0800
[1]4.3214,[2]4.8286,[3]5.4463,[4]6.0497,[5]6.1802,[6]6.0800
[1]4.3214,[2]4.8286,[3]5.4463,[4]6.0497,[5]6.1802,[6]6.0800

# CPU (M2, LLAMA_ACCELERATE=ON):

[1]4.3233,[2]4.8256,[3]5.4456,[4]6.0456,[5]6.1772
[1]4.3256,[2]4.8287,[3]5.4475,[4]6.0515,[5]6.1813
[1]4.3258,[2]4.8288,[3]5.4475,[4]6.0515,[5]6.1813
[1]4.3253,[2]4.8284,[3]5.4470,[4]6.0511,[5]6.1810
[1]4.3256,[2]4.8286,[3]5.4472,[4]6.0511,[5]6.1810
[1]4.3257,[2]4.8286,[3]5.4471,[4]6.0511,[5]6.1810

# CUDA:

[1]4.3283,[2]4.8268,[3]5.4451,[4]6.0526,[5]6.1871,[6]6.0874,[7]6.2609,[8]6.3685,[9]6.7238
[1]4.3329,[2]4.8348,[3]5.4534,[4]6.0545,[5]6.1855,[6]6.0867,[7]6.2617,[8]6.3744,[9]6.7305
[1]4.3303,[2]4.8109,[3]5.4355,[4]6.0431,[5]6.1755,[6]6.0727,[7]6.2414,[8]6.3526,[9]6.7111
[1]4.3264,[2]4.8292,[3]5.4521,[4]6.0559,[5]6.1865,[6]6.0894,[7]6.2580,[8]6.3652,[9]6.7194
[1]4.3666,[2]4.8513,[3]5.4581,[4]6.0586,[5]6.1911,[6]6.0899,[7]6.2577,[8]6.3674,[9]6.7188
[1]4.3307,[2]4.8364,[3]5.4609,[4]6.0671,[5]6.1965,[6]6.0940,[7]6.2651,[8]6.3749,[9]6.7282

The CUDA results are much more sensitive to the batch size. Any ideas why that is?

I'm also a bit surprised that the CPU results are not identical across batch size.
Where is this variation coming from?

In any case, this explains the effect that @leng-yue observes with CUDA speculative decoding.

@JohannesGaessler
Collaborator

I don't know why the results vary depending on batch size. Unless I'm missing something the values for different tokens should not be interacting with each other.

@ggerganov
Owner

I guess this op makes the graph not invariant to the batch size, since the number of values accumulated in the dot products depends on n_batch:

struct ggml_tensor * KQV = ggml_mul_mat(ctx0, V, KQ_soft_max);

Is this the only source of the variation and can we improve it somehow?
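
Floating-point addition is not associative, so accumulating the same values in a different grouping can round differently. A self-contained sketch of that effect follows; it is plain C++ unrelated to ggml's actual kernels, and the chunking is only a stand-in for how a backend might split an accumulation.

```cpp
#include <cstddef>
#include <vector>

// Sum `v` by first accumulating chunks of `chunk` elements, then adding the partial sums.
// Mathematically every chunk size gives the same result; in float arithmetic it may not.
// `chunk` must be at least 1.
float chunked_sum(const std::vector<float> & v, std::size_t chunk) {
    float total = 0.0f;
    for (std::size_t i = 0; i < v.size(); i += chunk) {
        float partial = 0.0f;
        for (std::size_t j = i; j < v.size() && j < i + chunk; ++j) {
            partial += v[j];
        }
        total += partial;
    }
    return total;
}
```

Under IEEE-754 single-precision evaluation, the input {1e8f, 1.0f, -1e8f, 1.0f} sums to 1 with a chunk size of 1 but to 0 with a chunk size of 2, because 1e8f + 1.0f rounds back to 1e8f.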

@JohannesGaessler
Collaborator

JohannesGaessler commented Sep 4, 2023

Wait, now that I think about it the difference in perplexity may not come from a difference in logits after all. In the perplexity binary the tokens that are used for the calculation depend on batch size (because for the first few tokens there is no context to infer the tokens from). So to conclude that the batch size changes results we need to first control for that.

The tokens used depend on context size, not batch size.

@ggerganov
Owner

CPU results are actually invariant to the batch size. I just forgot to disable LLAMA_ACCELERATE, which makes ggml fall back to CBLAS for different sets of ops depending on the batch size. Building with LLAMA_ACCELERATE=OFF results in identical perplexity.

The question now is why the GPU results are not invariant.
Will open a separate issue later to continue the discussion there.

@leng-yue
Contributor Author

leng-yue commented Sep 4, 2023

After changing to the CPU backend and disabling LLAMA_ACCELERATE (it likely doesn't work for me because I'm using Linux, not MacOS), I still encountered varying results.
With heuristic: speculative-c.140400117237568.log
Without heuristic: speculative-d.140252638918464.log

@ggerganov
Owner

After changing to the CPU backend

If you limit n_draft to not go above 30, can you confirm the CPU results are identical with and without heuristic?

@leng-yue
Contributor Author

leng-yue commented Sep 5, 2023

It's still non-identical. speculative-e.140190271293248.log

n_draft = std::max(2, n_draft - 1);
}
}

Owner

@leng-yue

I've refactored the implementation to be more contained.
Also, we were rewarding the draft even when it hasn't sampled all n_draft tokens, which does not seem correct.

For example, let's say n_draft is 16 currently, but the draft samples just 3 tokens because the "low-probability" check has been triggered. Even if all 3 tokens were accepted, we should not reward the draft model, because this is just a small part of what we asked it to do.

Regarding the reproducibility - we will study this more in #3014
My guess is that this behavior would occur even without the heuristic - probably it's just less likely to happen for some reason when the heuristic is disabled.

Contributor Author

In my earlier implementation, rewards were given only when all tokens were accepted. So if only 3 out of 16 tokens are accepted, the n_tokens value would be decreased by 1.

Owner

Not exactly. This check does not guarantee you have n_draft tokens accepted:

if (i_dft == (int) drafted.size()) {
    all_accepted = true;
}

The reason is that drafted.size() <= n_draft, due to another heuristic that stops drafting more tokens if the drafter becomes "unsure":

// too low probability, stop drafting
if (cur_p.data[0].p < 2*cur_p.data[1].p) {
    break;
}

So in the majority of cases, when all_accepted == true, you would have accepted fewer than n_draft tokens. That's why your n_draft would increase so much, even beyond 70 in some cases.
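
A sketch of the distinction drawn here, using hypothetical helper names: the drafter may stop early when its top candidate is less than twice as likely as the runner-up, so the number of drafted tokens can be smaller than n_draft, and a stricter reward condition would require a full-length draft as well as full acceptance. Leaving n_draft unchanged for a short but fully accepted draft is an assumption of this sketch; it illustrates the reasoning only and is not the refactored code from this PR.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Top-2 probabilities of the draft model at one step (illustrative type).
struct Top2 { float p0; float p1; };

// Number of tokens the drafter emits: it stops early once it becomes "unsure".
std::size_t draft_length(const std::vector<Top2> & steps, std::size_t n_draft) {
    std::size_t n = 0;
    for (; n < steps.size() && n < n_draft; ++n) {
        if (steps[n].p0 < 2 * steps[n].p1) {
            break; // too low probability, stop drafting
        }
    }
    return n;
}

// Penalize an actual rejection (floor of 2), reward only a full-length and fully
// accepted draft, and otherwise leave n_draft unchanged (assumption of this sketch).
int update_n_draft_strict(int n_draft, std::size_t n_drafted, std::size_t n_accepted) {
    if (n_accepted < n_drafted) {
        return std::max(2, n_draft - 1);
    }
    if (n_drafted == static_cast<std::size_t>(n_draft)) {
        return n_draft + 2;
    }
    return n_draft;
}
```

With n_draft at 16 and only 3 tokens drafted and accepted, this variant keeps n_draft at 16, whereas the original condition would have raised it to 18.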

Contributor Author

Got it.

@JohannesGaessler
Collaborator

After changing to the CPU backend and disabling LLAMA_ACCELERATE (it likely doesn't work for me because I'm using Linux, not MacOS), I still encountered varying results.

Should you be expecting the exact same results in the first place? If I understand the paper correctly they only guarantee equivalent results within hardware numerics. This is a direct quote from the paper:

Even with greedy sampling, a single token deviating due to numerics could result in two sequences
diverging wildly. Since pseudo-random seeds are processed differently between ArS and SpS, and
because the different computation graphs lead to different numerics, we cannot not expect identical
outputs. However, we expect the samples to come from the same distribution within numerics and
we empirically verify this by evaluating these benchmarks.

@leng-yue
Contributor Author

leng-yue commented Sep 5, 2023

The non-identical output issue is being discussed in #3014; I don't think it's caused by the heuristic algorithm.

@ggerganov
Owner

Yes, if our results are different for different batch sizes, we cannot expect 100% reproducible results when using speculative decoding with greedy sampling, because the target model processes the sequence in different batch sizes based on what the draft model provides.

The only source of numerical differences for different batch sizes that I currently see (and I'm still not 100% sure about it) is the KQV op, because we are accumulating n_batch floats in the dot products, while in all other ops the number of elements being accumulated does not depend on n_batch. So I expect small variations due to this, and even less variation with an F32 cache. However, the GPU results seem a bit more variable than the expected numerical variation, so I think we should look more into it and understand where this is coming from.

Let's move the discussion to #3014

@JohannesGaessler
Collaborator

If I understand the previous discussion correctly the issue is that when limiting the draft length the results change. My point is that this is to be expected independently from any potential bugs in the code that change results depending on batch size. @charliexchen can you weigh in on this?

@charliexchen

charliexchen commented Sep 5, 2023

Assuming magically precise and deterministic hardware, Spec Sampling should be identical for greedy target models (this is true even if the drafter is stochastic!). However, if the target model is stochastic then it will deterministically give a different result to normal sampling since it processes the pseudorandom numbers differently.

In practice, 8-16bit numerics are annoying and things will likely diverge after a few tokens. If you have any numeric deltas between batches, that will show up for both spec sampling and normal sampling.
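
A small illustration of why a tiny numeric delta matters under greedy sampling: when two logits are nearly tied, a very small perturbation can change the argmax, and every later token is then conditioned on a different prefix. The helper below is a generic argmax written for this example, not a llama.cpp function.

```cpp
#include <cstddef>
#include <vector>

// Greedy sampling: index of the largest logit (the first one wins on exact ties).
// Expects a non-empty vector.
std::size_t greedy_pick(const std::vector<float> & logits) {
    std::size_t best = 0;
    for (std::size_t i = 1; i < logits.size(); ++i) {
        if (logits[i] > logits[best]) {
            best = i;
        }
    }
    return best;
}
```

With made-up logits {1.0000f, 1.0001f, 0.5f} the pick is index 1; shifting the first logit by only 0.0002 to get {1.0002f, 1.0001f, 0.5f} moves the pick to index 0.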

@leng-yue
Contributor Author

Should we merge this PR then?

@ggerganov ggerganov merged commit 35f7304 into ggerganov:master Sep 14, 2023
@leng-yue leng-yue deleted the add-speculative-heuristic branch September 15, 2023 04:46
pkrmf pushed a commit to morlockstudios-com/llama.cpp that referenced this pull request Sep 26, 2023
* Add heuristic algo for speculative

* Constrain minimum n_draft to 2

* speculative : improve heuristic impl

* speculative : be more rewarding upon guessing max drafted tokens

* speculative : fix typos

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>