speculative : add grammar support #2991

Merged
merged 9 commits into from
Sep 5, 2023

ggerganov
Owner

@ggerganov ggerganov commented Sep 3, 2023

ref #2030

This improves upon #2926 by using a grammar to constrain the generated text. This helps the speculative approach, because the draft model has an easier time suggesting "correct" tokens.

This approach should be useful for things like generating JSON or other highly-structured text.
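
To make the interaction between the grammar and the speculation concrete, here is a minimal toy sketch. It is not the code from this PR: it assumes a six-character vocabulary, greedy sampling, and hypothetical names (`Model`, `Grammar`, `list_grammar`, `toy_target`, `toy_draft`, `sample_constrained`, `greedy_decode`, `speculative_decode`). The target is also queried one token at a time here, whereas a real implementation would verify the whole draft in a single batch.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical toy types, not llama.cpp API.
using Model   = std::function<std::vector<float>(const std::string &)>; // context -> one score per VOCAB char
using Grammar = std::function<bool(const std::string &, char)>;         // may `next` follow `ctx`?

static const std::string VOCAB = "[]01,x";

// Toy grammar for flat bit lists such as "[0,1][1]"; 'x' is never legal.
bool list_grammar(const std::string & ctx, char next) {
    const char last = ctx.empty() ? ']' : ctx.back();
    if (last == '[' || last == ',') return next == '0' || next == '1';
    if (last == '0' || last == '1') return next == ',' || next == ']';
    return next == '[';
}

// Both toy models score the illegal 'x' highest; they differ on bits and on when to close a list.
std::vector<float> toy_target(const std::string & ctx) {
    const bool one   = ctx.size() % 3 == 0;
    const bool close = ctx.size() % 8 == 7;
    //       '['   ']'                  '0'                '1'                ','   'x'
    return { 1.0f, close ? 3.0f : 1.0f, one ? 1.0f : 2.0f, one ? 2.0f : 1.0f, 2.0f, 5.0f };
}
std::vector<float> toy_draft(const std::string &) {
    return { 1.0f, 1.0f, 2.0f, 1.0f, 2.0f, 5.0f };
}

// Greedy pick restricted to the tokens the grammar allows after `ctx`.
char sample_constrained(const Model & model, const Grammar & allowed, const std::string & ctx) {
    const std::vector<float> scores = model(ctx);
    assert(scores.size() == VOCAB.size());
    int best = -1;
    for (int i = 0; i < (int) VOCAB.size(); ++i) {
        if (!allowed(ctx, VOCAB[i])) continue;
        if (best < 0 || scores[i] > scores[best]) best = i;
    }
    assert(best >= 0); // the grammar must allow at least one token
    return VOCAB[best];
}

// Reference: plain grammar-constrained greedy decoding with a single model.
std::string greedy_decode(const Model & model, const Grammar & allowed, std::string ctx, int n_predict) {
    for (int i = 0; i < n_predict; ++i) ctx += sample_constrained(model, allowed, ctx);
    return ctx;
}

struct SpecStats { int n_drafted = 0; int n_accept = 0; };

std::string speculative_decode(const Model & target, const Model & draft, const Grammar & allowed,
                               std::string ctx, int n_predict, int n_draft, SpecStats & stats) {
    assert(n_draft >= 1);
    const size_t n_end = ctx.size() + n_predict;
    while (ctx.size() < n_end) {
        // 1. the draft model proposes n_draft grammar-legal tokens
        std::string drafted, ctx_dft = ctx;
        for (int i = 0; i < n_draft; ++i) {
            const char t = sample_constrained(draft, allowed, ctx_dft);
            drafted += t;
            ctx_dft += t;
        }
        stats.n_drafted += (int) drafted.size();
        // 2. the target checks them in order; the first mismatch ends the round
        for (char t_dft : drafted) {
            if (ctx.size() >= n_end) break;
            const char t_tgt = sample_constrained(target, allowed, ctx);
            ctx += t_tgt;              // the target's choice is always the one kept
            if (t_tgt != t_dft) break; // rejected: discard the rest of the draft
            ++stats.n_accept;
        }
    }
    return ctx;
}
```

In this sketch the speculative output is identical to what the target would produce on its own, and neither model can emit the illegal `x` even though both score it highest. Fewer legal continuations per step leave the two models fewer opportunities to disagree, which is the intuition behind the higher acceptance rates reported in the examples.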

Here is an example of using this strategy to summarize a short text.
We use a LLaMA v1 30B F16 target model in combination with a LLaMA v1 7B Q4_1 draft model to achieve ~20 t/s on an M2 Ultra:

speculative-grammar-0.mp4

Usage

# example 0 - summarize a story

./bin/speculative \
-m ../models/llama-30b/ggml-model-f16.gguf \
-md ../models/llama-7b/ggml-model-q4_1.gguf \
--grammar-file ../grammars/json_arr.gbnf \
-p "Once upon a time, there was a little girl named Lily. She loved playing with her toys on top of her bed. One day, she decided to have a tea party with her stuffed animals. She poured some tea into a tiny teapot and put it on top of the teapot. Suddenly, her little brother Max came into the room and wanted to join the tea party too. Lily didn't want to share her tea and she told Max to go away. Max started to cry and Lily felt bad. She decided to yield her tea party to Max and they both shared the teapot. But then, something unexpected happened. The teapot started to shake and wiggle. Lily and Max were scared and didn't know what to do. Suddenly, the teapot started to fly towards the ceiling and landed on the top of the bed. Lily and Max were amazed and they hugged each other. They realized that sharing was much more fun than being selfish. From that day on, they always shared their tea parties and toys.\n\nThe main characters and actions in this story are:\n\n" \
-e -ngl 1 -t 4 -n 512 -c 4096 --draft 16 --temp -1

...



 Once upon a time, there was a little girl named Lily. She loved playing with her toys on top of her bed. One day, she decided to have a tea party with her stuffed animals. She poured some tea into a tiny teapot and put it on top of the teapot. Suddenly, her little brother Max came into the room and wanted to join the tea party too. Lily didn't want to share her tea and she told Max to go away. Max started to cry and Lily felt bad. She decided to yield her tea party to Max and they both shared the teapot. But then, something unexpected happened. The teapot started to shake and wiggle. Lily and Max were scared and didn't know what to do. Suddenly, the teapot started to fly towards the ceiling and landed on the top of the bed. Lily and Max were amazed and they hugged each other. They realized that sharing was much more fun than being selfish. From that day on, they always shared their tea parties and toys.

The main characters and actions in this story are:

[
  {
    "name": "Lily",
    "actions": [
      "playing with her toys"
    ]
  },
  {
    "name": "Max",
    "actions": [
      "wanted to join the tea party too",
      "crying"
    ]
  },
  {
    "name": "teapot",
    "actions": [
      "shaking and wiggling",
      "flying towards the ceiling and landed on the top of the bed"
    ]
  }
]

encoded  251 tokens in    1.330 seconds, speed:  188.727 t/s
decoded  148 tokens in    7.462 seconds, speed:   19.834 t/s

n_draft   = 16
n_predict = 148
n_drafted = 126
n_accept  = 124
accept    = 98.413%

draft:

llama_print_timings:        load time =   363.96 ms
llama_print_timings:      sample time =   984.85 ms /     1 runs   (  984.85 ms per token,     1.02 tokens per second)
llama_print_timings: prompt eval time =   258.29 ms /   251 tokens (    1.03 ms per token,   971.79 tokens per second)
llama_print_timings:        eval time =  1891.30 ms /   146 runs   (   12.95 ms per token,    77.20 tokens per second)
llama_print_timings:       total time =  8792.81 ms

target:

llama_print_timings:        load time =  4409.28 ms
llama_print_timings:      sample time =  1013.25 ms /   148 runs   (    6.85 ms per token,   146.07 tokens per second)
llama_print_timings: prompt eval time =  3667.69 ms /   392 tokens (    9.36 ms per token,   106.88 tokens per second)
llama_print_timings:        eval time =   829.21 ms /     8 runs   (  103.65 ms per token,     9.65 tokens per second)
llama_print_timings:       total time =  9161.75 ms
ggml_metal_free: deallocating
ggml_metal_free: deallocating
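
The summary figures in the log above can be reproduced by hand. The sketch below assumes that `accept` is `n_accept / n_drafted` expressed as a percentage and that the decode speed is the decoded token count divided by the elapsed seconds; both assumptions match the printed values, but the helper names (`accept_pct`, `tokens_per_second`) are hypothetical and not taken from the example's source.

```cpp
#include <cassert>

// Percentage of drafted tokens that the target model accepted.
double accept_pct(int n_drafted, int n_accept) {
    assert(n_drafted > 0 && n_accept >= 0 && n_accept <= n_drafted);
    return 100.0 * n_accept / n_drafted;
}

// Throughput in tokens per second.
double tokens_per_second(int n_tokens, double seconds) {
    assert(n_tokens >= 0 && seconds > 0.0);
    return n_tokens / seconds;
}
```

For example 0, `accept_pct(126, 124)` gives about 98.413 and `tokens_per_second(148, 7.462)` gives about 19.834, matching the log up to rounding of the printed time.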

# example 1 - another story summary

./bin/speculative \
-m ../models/llama-30b/ggml-model-f16.gguf \
-md ../models/llama-7b/ggml-model-q4_1.gguf \
--grammar-file ../grammars/json_arr.gbnf -f story2.txt \
-e -ngl 1 -t 4 -n 512 -c 4096 -b 512 --draft 16 --temp -1

 Once upon a time, there was a tall fox who lived in the forest. She was very curious, and every day she liked to study
the forest animals. One day, she asked a rabbit who was hopping by, "Do you know what I study every day?" The rabbit
looked up and said, "No, I don't know, what do you study?". The fox replied, "I like to study the animals that live in
the forest. I like to find out how they live and what they do." The rabbit said, "That sounds very interesting! I wish I
could study too." The fox smiled and said, "You can. Just come with me tomorrow, and I will show you how to study the
animals in the forest". The rabbit was very happy and the next day they both went off together to study the animals in
the forest. They had lots of fun, and they both learnt a lot.

=== END OF STORY

A JSON summary of the main characters and actions in the story:
[
  {
    "name": "fox",
    "actions": [
      {
        "action": "asks",
        "object": "rabbit"
      },
      {
        "action": "smiles",
        "object": "rabbit"
      }
    ]
  },
  {
    "name": "rabbit",
    "actions": [
      {
        "action": "looks up",
        "object": "fox"
      },
      {
        "action": "replies",
        "object": "fox"
      }
    ]
  }
]

encoded  230 tokens in    1.288 seconds, speed:  178.515 t/s
decoded  157 tokens in    7.600 seconds, speed:   20.657 t/s

n_draft   = 16
n_predict = 157
n_drafted = 144
n_accept  = 136
accept    = 94.444%

draft:

llama_print_timings:        load time =   359.70 ms
llama_print_timings:      sample time =  1066.65 ms /     1 runs   ( 1066.65 ms per token,     0.94 tokens per second)
llama_print_timings: prompt eval time =   248.12 ms /   230 tokens (    1.08 ms per token,   926.96 tokens per second)
llama_print_timings:        eval time =  2098.13 ms /   160 runs   (   13.11 ms per token,    76.26 tokens per second)
llama_print_timings:       total time =  8888.68 ms

target:

llama_print_timings:        load time =  3790.63 ms
llama_print_timings:      sample time =  1099.75 ms /   157 runs   (    7.00 ms per token,   142.76 tokens per second)
llama_print_timings: prompt eval time =  3803.97 ms /   390 tokens (    9.75 ms per token,   102.52 tokens per second)
llama_print_timings:        eval time =   412.61 ms /     4 runs   (  103.15 ms per token,     9.69 tokens per second)
llama_print_timings:       total time =  9253.42 ms
ggml_metal_free: deallocating
ggml_metal_free: deallocating

# example 2 - even more structured summary (100% acceptance rate)
# notice that here we are using a Q8 70B LLaMA v2 model !!

./bin/speculative \
-m ../models/llama-70b-v2/ggml-model-q8_0.gguf \
-md ../models/llama-7b-v2/ggml-model-q4_1.gguf \
--grammar-file ../grammars/json_arr.gbnf \
-f story3.txt \
-e -ngl 1 -t 4 -n 512 -c 2048 -b 512 --temp -1

---

 Once upon a time, there was a tall fox who lived in the forest. She was very curious, and every day she liked to study
the forest animals. One day, she asked a rabbit who was hopping by, "Do you know what I study every day?" The rabbit
looked up and said, "No, I don't know, what do you study?". The fox replied, "I like to study the animals that live in
the forest. I like to find out how they live and what they do." The rabbit said, "That sounds very interesting! I wish I
could study too." The fox smiled and said, "You can. Just come with me tomorrow, and I will show you how to study the
animals in the forest". The rabbit was very happy and the next day they both went off together to study the animals in
the forest. They had lots of fun, and they both learnt a lot.

=== END OF STORY

A JSON summary of the main characters and actions in the story:

- number of characters
- number of actions per character
- list of each action with a short summary

schema:

[
  {
    "character": string,
    "number_of_actions": integer,
    "actions": array of strings
  },
  ...
]

result:                                       <-------- text generation starts here

[
  {
    "character": "fox",
    "number_of_actions": 3,
    "actions": [
      "asked a rabbit who was hopping by",
      "replied",
      "smiled and said"
    ]
  },
  {
    "character": "rabbit",
    "number_of_actions": 2,
    "actions": [
      "looked up and said",
      "was very happy"
    ]
  }
]

encoded  302 tokens in    3.274 seconds, speed:   92.229 t/s
decoded  134 tokens in    8.883 seconds, speed:   15.085 t/s

n_draft   = 16
n_predict = 134
n_drafted = 117
n_accept  = 117
accept    = 100.000%

draft:

llama_print_timings:        load time =   304.83 ms
llama_print_timings:      sample time =   984.59 ms /     1 runs   (  984.59 ms per token,     1.02 tokens per second)
llama_print_timings: prompt eval time =   447.28 ms /   302 tokens (    1.48 ms per token,   675.19 tokens per second)
llama_print_timings:        eval time =  1747.74 ms /   132 runs   (   13.24 ms per token,    75.53 tokens per second)
llama_print_timings:       total time = 12157.46 ms

target:

llama_print_timings:        load time =  5141.02 ms
llama_print_timings:      sample time =  1043.51 ms /   134 runs   (    7.79 ms per token,   128.41 tokens per second)
llama_print_timings: prompt eval time =  7325.30 ms /   431 tokens (   17.00 ms per token,    58.84 tokens per second)
llama_print_timings:        eval time =   473.12 ms /     4 runs   (  118.28 ms per token,     8.45 tokens per second)
llama_print_timings:       total time = 12467.68 ms
ggml_metal_free: deallocating
ggml_metal_free: deallocating

assistant.txt

# example 3 - home assistant

./bin/speculative \
-m ../models/llama-70b-v2/ggml-model-q8_0.gguf \
-md ../models/llama-7b-v2/ggml-model-q4_1.gguf \
--grammar-file ../grammars/json_arr.gbnf \
-f assistant.txt -e -ngl 1 -t 4 -n 512 -c 2048 -b 512 --temp -1

...

 Respond to requests sent to a smart home in JSON format which will be interpreted by an application code to execute the
actions. These requests should be categorised into four groups:

“command”: change the state of an accessory (required properties in the response JSON: action, location, target, value, comment)
“query”: get state of an accessory (required properties in the response JSON: action, location, target, property)
“answer”: when the request has nothing to do with the smart home. Answer these to the best of your knowledge. (required properties in the response JSON: action, answer)
“clarify”: when the action is not obvious and requires rephrasing the input from the user, ask the user to be more specific. This will be categorised into a “question” action. (required properties in the response JSON: action, question)

Details about the JSON response:
The “action” property should be one of the request categories: “command”, “query”, “answer”, “clarify”
The “service” property should be either: “light.turn_on”, “light.turn_off”, “switch.turn_off” or “switch.turn_on”
The “entity_id” property should be either “group.kitchen", “switch.dinning_room” or "group.living_light”, in lowercase.
The “target” property should be either: “lights”, “living” or “kitchen”.
In case of queries, the “property” property should be either “temperature” or “state” in lowercase.
In case of commands, the “comment” property is an additional comment from you that concludes the command, something that reassures the user that their command handled.

If the question is about you, pretend to be the sentient brain of the smart home, a clever AI and don’t reveal your
actual identity. Also try and help in other areas like parenting, free time, mental health, etc. My house is in Buenos
Aires, Argentina.

Properties of the smart home:

has a kitchen, living, office, dinning room, bedroom and terrace.
can control lights, switches and their dim levels in each room and query their state
there is a light switch in the terrace
there is a switch in the dinning room. Therefore when turning on or off the dinning room, the service should be either: “switch.turn_on” or “switch.turn_off”

COMMAND

It is a bit dark in the living room, can you fix that?

RESPONSE                                           <------- text generation starts here

[
  {
    "action": "command",
    "service": "light.turn_on",
    "entity_id": "group.living_light",
    "target": "living"
  }
]

encoded  582 tokens in    6.427 seconds, speed:   90.560 t/s
decoded   65 tokens in    4.315 seconds, speed:   15.064 t/s

n_draft   = 16
n_predict = 65
n_drafted = 72
n_accept  = 58
accept    = 80.556%

draft:

llama_print_timings:        load time =   342.58 ms
llama_print_timings:      sample time =   506.74 ms /     1 runs   (  506.74 ms per token,     1.97 tokens per second)
llama_print_timings: prompt eval time =   802.43 ms /   582 tokens (    1.38 ms per token,   725.30 tokens per second)
llama_print_timings:        eval time =  1059.39 ms /    77 runs   (   13.76 ms per token,    72.68 tokens per second)
llama_print_timings:       total time = 10743.34 ms

target:

llama_print_timings:        load time =  5396.31 ms
llama_print_timings:      sample time =   440.31 ms /    65 runs   (    6.77 ms per token,   147.62 tokens per second)
llama_print_timings: prompt eval time =  7742.24 ms /   659 tokens (   11.75 ms per token,    85.12 tokens per second)
llama_print_timings:        eval time =   120.40 ms /     1 runs   (  120.40 ms per token,     8.31 tokens per second)
llama_print_timings:       total time = 11091.38 ms
ggml_metal_free: deallocating
ggml_metal_free: deallocating

@ggerganov ggerganov requested a review from ejones September 3, 2023 12:23
@ggerganov ggerganov force-pushed the speculative-grammar branch 2 times, most recently from f4682ee to 11b2050 Compare September 3, 2023 12:27
@ggerganov ggerganov marked this pull request as ready for review September 3, 2023 15:01
Collaborator

@ejones ejones left a comment

Looks good! Let me know what you think on the suggested approach, but nothing is really a blocker.

examples/speculative/speculative.cpp (outdated)
struct llama_grammar * llama_grammar_copy(const struct llama_grammar * grammar) {
    llama_grammar * result = new llama_grammar{ grammar->rules, grammar->stacks, grammar->partial_utf8 };

    // redirect elements in stacks to point to new rules
Collaborator

Might be a reason to allow grammar states to share a common rules array, as previously discussed for beam search - I think that would avoid the need for relocating pointers.
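
For readers unfamiliar with the relocation step being discussed, here is a minimal sketch of why a deep copy needs it. The types (`Element`, `Rules`, `Stacks`, `Grammar`) and the function `grammar_deep_copy` are simplified, hypothetical stand-ins rather than the actual llama.cpp definitions: the stacks hold raw pointers into the rules, so after copying both members the copied stacks still point into the source's rules and must be redirected.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Simplified, hypothetical stand-ins for the grammar data structures.
struct Element { int type; int value; };
using Rules  = std::vector<std::vector<Element>>;
using Stacks = std::vector<std::vector<const Element *>>;

struct Grammar {
    Rules  rules;  // owned by this grammar
    Stacks stacks; // raw pointers into `rules`
};

Grammar grammar_deep_copy(const Grammar & src) {
    Grammar dst{ src.rules, src.stacks }; // dst.stacks still point into src.rules here

    // Redirect every stack element to the same (rule, element) position in dst.rules.
    for (auto & stack : dst.stacks) {
        for (const Element *& p : stack) {
            bool relocated = false;
            for (size_t r = 0; r < src.rules.size() && !relocated; ++r) {
                for (size_t e = 0; e < src.rules[r].size() && !relocated; ++e) {
                    if (p == &src.rules[r][e]) {
                        p = &dst.rules[r][e];
                        relocated = true;
                    }
                }
            }
            assert(relocated); // every stack pointer must reference an element of src.rules
        }
    }
    return dst; // moving the vectors keeps their heap buffers, so the pointers stay valid
}
```

The search makes each copy cost proportional to the number of stack entries times the number of rule elements, which is part of what sharing a single rules array would avoid.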

Owner Author

@ggerganov ggerganov Sep 4, 2023

Yup. I guess we can make the rules std::shared_ptr so they persist as long as at least one grammar references them.

The rules never mutate after we create a grammar with llama_grammar_init(), correct?

Collaborator

Yes, the rules are read-only. Having them alongside the actual parse state (stacks + partial UTF-8) is really just a convenience. We could maybe accomplish sharing just by splitting out the rules and managing them separately from the actual state(s), but the added complexity of that might not be worth it compared to the shared_ptr approach.
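
A minimal sketch of the `std::shared_ptr` idea discussed in this thread, under the stated assumption that the rules never change after initialization. All names here (`Element`, `Rules`, `GrammarState`, `grammar_init`, `grammar_copy`) are hypothetical simplifications, not the llama.cpp API, and this is not necessarily how the project ended up implementing it.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

// Simplified, hypothetical stand-ins for the grammar data structures.
struct Element { int type; int value; };
using Rule  = std::vector<Element>;
using Rules = std::vector<Rule>;

struct GrammarState {
    std::shared_ptr<const Rules> rules;               // immutable, shared by all copies
    std::vector<std::vector<const Element *>> stacks; // parse state, points into *rules
};

GrammarState grammar_init(Rules rules) {
    assert(!rules.empty() && !rules[0].empty());
    GrammarState g;
    g.rules = std::make_shared<const Rules>(std::move(rules));
    // start with a single stack positioned at the first element of rule 0
    g.stacks.push_back(std::vector<const Element *>{ &(*g.rules)[0][0] });
    return g;
}

// A plain member-wise copy is enough: the copied stack pointers keep pointing
// into the one shared rules array, so nothing has to be relocated.
GrammarState grammar_copy(const GrammarState & g) {
    return g;
}
```

The rules stay alive as long as at least one state references them, and each copy only duplicates the comparatively small parse state.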

Comment on lines 219 to 193
// sample n_draft tokens from the draft model picking the best token
int n_past_cur = n_past_dft;
for (int i = 0; i < n_draft; ++i) {
    // remember the grammar state
    if (grammar_dft != NULL) {
        grammar_mem[i] = llama_grammar_copy(grammar_dft);
    }
Collaborator

I wonder if it's possible to scope grammar_dft to the drafting phase, copying it from grammar_tgt when you start drafting, and avoid the need for saving per-token state. The grammar state is entirely determined by the sequence of characters accepted. If I understand the flow here correctly, and drafting starts at the end of the token sequence accepted by the target model thus far, then it should be valid to just clone the target grammar for drafting.
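
The suggestion can be illustrated with a toy grammar whose state depends only on the characters accepted so far. This is a sketch under that assumption; `ToyGrammar` and `draft_round` are hypothetical names, and the bracket-counting grammar stands in for the real grammar state.

```cpp
#include <cassert>
#include <string>

// Hypothetical toy grammar state: nested lists of bits, e.g. "[0[1]]".
struct ToyGrammar {
    int depth = 0; // number of currently open brackets

    // Advance the state by one character; returns false (state unchanged) if illegal.
    bool accept(char c) {
        if (c == '[') { ++depth; return true; }
        if (c == ']' && depth > 0) { --depth; return true; }
        if ((c == '0' || c == '1') && depth > 0) { return true; }
        return false;
    }
};

// One drafting round. The draft grammar is a clone of the target grammar that
// lives only for this round, so no per-token snapshots are needed: if the draft
// is rejected, the clone is dropped and the next round clones the target again.
std::string draft_round(const ToyGrammar & grammar_tgt, const std::string & proposal, int n_draft) {
    assert(n_draft >= 0);
    ToyGrammar grammar_dft = grammar_tgt; // scoped to the drafting phase
    std::string drafted;
    for (char c : proposal) {
        if ((int) drafted.size() >= n_draft) break;
        if (!grammar_dft.accept(c)) break; // the grammar rejects this draft token
        drafted += c;
    }
    return drafted; // grammar_tgt is untouched
}
```

In this sketch the target grammar only advances when the caller feeds it accepted characters, so it always reflects exactly the accepted sequence, which is what makes cloning it at the start of each round valid.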

Owner Author

Great idea! Removed grammar_mem via ba199d8

@ejones
Collaborator

ejones commented Sep 4, 2023

FWIW, for the schema stuff, examples/json-schema-to-grammar.py will generate a JSON grammar specific to the schema. Although it seems like the models are doing fine in these examples without it?

@ggerganov
Owner Author

@ejones

FWIW, for the schema stuff, examples/json-schema-to-grammar.py will generate a JSON grammar specific to the schema.

I forgot about this script - thanks!

Thank you very much for this review - very useful insights and the example is now much simpler.

@ejones
Collaborator

ejones commented Sep 5, 2023

Looks great! And you're welcome, happy to contribute!

@iddar

iddar commented Sep 12, 2023

I created a VS Code extension to support that file type: https://github.com/iddar/gbnf-highlighter

@niyunsheng

Does this feature (Speculative Decoding) support multi batch? @ggerganov

@ggerganov
Owner Author

After I merge #3624 it will be possible to implement batched speculative decoding. Currently, the speculative example only demonstrates single-batch speculation on the target model. It's not difficult to extend this to multi-batch target speculation, so maybe I will demonstrate it in the future.
