Force response format and bias responses by regex #1397

Conversation

AutonomicPerfectionist
Contributor

AutonomicPerfectionist commented May 10, 2023

This PR implements the idea from https://github.com/r2d4/rellm, allowing users to set a regex that the model must adhere to in its response. I also added the ability to bias completions matching a regex, to make particular responses more or less likely.

Combined, these two options improve accuracy for LangChain- or AutoGPT-style prompts that require machine-readable responses in a particular format. I tested with Vicuna 1.1 q4_0 on a variation of LangChain's knowledge-graph triple-extraction prompt:

You are a networked intelligence helping a human track knowledge triples about all relevant people, things, concepts, etc. and integrating them with your knowledge stored within your weights as well as that stored in a knowledge graph. Extract all of the knowledge triples from the last line of conversation. A knowledge triple is a clause that contains a subject, a predicate, and an object. The subject is the entity being described, the predicate is the property of the subject that is being described, and the object is the value of the property.

EXAMPLE
Conversation history:
Person #1: Did you hear aliens landed in Area 51?
AI: No, I didn't hear that. What do you know about Area 51?
Person #1: It's a secret military base in Nevada.
AI: What do you know about Nevada?
Last line of conversation:
Person #1: It's a state in the US. It's also the number 1 producer of gold in the US.

Output: (Nevada, is a, state)<|>(Nevada, is in, US)<|>(Nevada, is the number 1 producer of, gold)
END OF EXAMPLE

EXAMPLE
Conversation history:
Person #1: Hello.
AI: Hi! How are you?
Person #1: I'm good. How are you?
AI: I'm good too.
Last line of conversation:
Person #1: I'm going to the store.

Output: NONE
END OF EXAMPLE

EXAMPLE
Conversation history:
Person #1: What do you know about Descartes?
AI: Descartes was a French philosopher, mathematician, and scientist who lived in the 17th century.
Person #1: The Descartes I'm referring to is a standup comedian and interior designer from Montreal.
AI: Oh yes, He is a comedian and an interior designer. He has been in the industry for 30 years. His favorite food is baked bean pie.
Last line of conversation:
Person #1: Oh huh. I know Descartes likes to drive antique scooters and play the mandolin.
Output: (Descartes, likes to drive, antique scooters)<|>(Descartes, plays, mandolin)
END OF EXAMPLE

Conversation history (for reference only):
Person #1: I have a cat named Sunny

Last line of conversation (for extraction):
Human: He is orange and soft
Output:

Executing with the following command line (no regex biasing):

./main -m models/ggml-vicuna-7b-1.1-q4_0.bin --threads 6 --color -c 2048 --temp 0.7 --repeat_penalty 1.0 -n -1 -f prompts/extract-knowledge-triples.txt

This outputs decent results some of the time, but usually the results are mangled with extra lines (fixable, obviously), extra elements ((Sunny, is, an orange, cat)), or are nonsensical ((Sunny, is, a cat named)). Manipulating the temperature and other parameters usually ends up making the issues worse or making the model so conservative that it only ever outputs NONE, at least for me. I'm not that experienced with model tuning, so maybe I'm missing something. Regardless, the following command, which enables regex biasing, ensures properly formatted output every time:

./main -m models/ggml-vicuna-7b-1.1-q4_0.bin --threads 6 --color -c 2048 --temp 0.2 --repeat_penalty 1.0 -n -1 -f prompts/extract-knowledge-triples.txt --allowed-response-regex "(?:(?:\([a-z A-Z 0-9]*, [a-z A-Z 0-9]*, [a-z A-Z 0-9]*\))(?:<\|>\([a-z A-Z 0-9]*, [a-z A-Z 0-9]*, [a-z A-Z 0-9]*\))*)|NONE" --response-bias-regex "NONE" --response-bias-value -4.5

This command forces the model to generate according to the pattern (<subject>, <predicate>, <object>), or NONE if no knowledge-graph triple makes sense, with NONE biased by a value of -4.5. The temperature needed to be lowered (otherwise nonsense was generated), but the results were astonishingly good. Using the same prompt from above, I got the following triples every time: (Sunny, is, a cat)<|>(Sunny, is, orange)<|>(Sunny, is, soft). Modifying the prompt so the last few lines read:

Conversation history (for reference only):
Person #1: I have a cat named Sunny.
AI: I like cats! What color is he?
Person #1: He's orange.
AI: Orange cats are very nice looking. Do you have any other cats?
Person #1: I have another cat named Alvin.
AI: What color is he?

Last line of conversation (for extraction):
Human: He is blonde.
Output:

This resulted in the correct triples (Sunny, is, orange)<|>(Alvin, is, blonde). Without the regex biasing, no combination of parameters generated those triples; usually, the model would keep repeating (Sunny, is an, orange cat) at low temperatures, and at higher temperatures it would spit out NONE. Setting the repeat penalty to anything other than 1.0 resulted in it straying from the format.

This was mainly an experiment of mine, but I figured I'd make a PR in case it was interesting. Unfortunately, I needed partial regex matching, which is not in the C++ standard library, so I had to add a compile option to use Boost. That alone probably removes its viability for this project based on the design philosophies I've seen so far, so I understand if it's rejected.

The design can probably be streamlined, since the "allowed response regex" is really just another form of biasing. If this interests people, I would probably make the regex bias options repeatable and remove the allowed response regex, so that they behave like the --logit-bias option.

I have not tested this, but I think regex biasing would also allow users to suppress "As an AI language model"-style responses without needing an uncensored model.
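
For reference, the mechanism can be sketched roughly like this (a simplified illustration of the idea rather than the exact code on this branch; the helper name is made up, but match_partial is Boost.Regex's real partial-match flag):

    // Simplified sketch of the idea (not the exact code on this branch):
    // a candidate token is acceptable only if the response generated so far,
    // plus that token's text, can still be extended into a full match of the
    // allowed-response regex. Boost.Regex's match_partial flag reports exactly
    // those "could still become a match" prefixes.
    #include <boost/regex.hpp>
    #include <string>

    bool token_keeps_response_valid(const std::string & response_so_far,
                                    const std::string & token_text,
                                    const boost::regex & allowed_response) {
        const std::string candidate = response_so_far + token_text;
        boost::match_results<std::string::const_iterator> what;
        return boost::regex_match(candidate.begin(), candidate.end(), what,
                                  allowed_response,
                                  boost::match_default | boost::match_partial);
    }

Tokens failing this check would be banned outright for --allowed-response-regex; the same kind of check against --response-bias-regex decides which tokens have --response-bias-value added to their logits.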

EDIT:
Drastically lowering top_k to around 5-10 and further restricting my regex (I had forgotten the start-of-line assertion, so it was still allowed to generate parentheses inside the triplet) allowed me to increase the temperature significantly. Interestingly, this lets the model sometimes generate additional triplets for information inferred from the conversation, such as (Alvin, is owned by, Person1) or (Person1, has, two cats). At temperature 2.5 it was creating perhaps superfluous, but nonetheless correct, triplets: (Orange, is, a color).

@slaren
Collaborator

slaren commented May 10, 2023

Nice job, this looks very interesting. Would it be possible to replace the Boost dependency with the standard library's std::regex?

@AutonomicPerfectionist
Contributor Author

It would be, but there would be a loss of functionality as a result. The standard library does not support the match_partial feature that I am using (at least as far as I can tell), so regexes would need to be written from the perspective of a partial completion instead of a full response. Personally, I couldn't come up with a single regex that would work like my example without requiring match_partial. Maybe a set of regexes would work, though.
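
For illustration, writing a pattern "from the perspective of a partial completion" means spelling out every legal in-progress state as its own alternative. A toy version for a single (subject, predicate, object) triple (ignoring NONE and the <|> separator) might look something like this, and it only gets worse for the full format:

    // Toy illustration of the std::regex workaround: without match_partial,
    // every legal prefix of the target format has to be spelled out as its
    // own alternative so an in-progress completion is still accepted.
    // Full format (one triple): \([a-zA-Z0-9 ]+, [a-zA-Z0-9 ]+, [a-zA-Z0-9 ]+\)
    #include <regex>

    static const std::regex partial_triple(
        R"(^\([a-zA-Z0-9 ]*$)"                                    // "(" or "(subject..."
        R"(|^\([a-zA-Z0-9 ]+,$)"                                  // "(subject,"
        R"(|^\([a-zA-Z0-9 ]+, [a-zA-Z0-9 ]*$)"                    // "(subject, pred..."
        R"(|^\([a-zA-Z0-9 ]+, [a-zA-Z0-9 ]+,$)"                   // "(subject, pred,"
        R"(|^\([a-zA-Z0-9 ]+, [a-zA-Z0-9 ]+, [a-zA-Z0-9 ]*$)"     // "(s, p, obj..."
        R"(|^\([a-zA-Z0-9 ]+, [a-zA-Z0-9 ]+, [a-zA-Z0-9 ]+\)$)"); // complete triple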

@ggerganov
Owner

This is a very interesting concept that I will be happy to see advanced further.
Boost is definitely a no-go, so alternative implementations have to be considered.

How is the "sample time" affected by these regex matches?

Here is a specific application that I am very interested in: constraining the generation to only the set of valid chess moves given the state of the current board (i.e. the positions of the pieces). The chess moves can be generated in short algebraic notation (e.g. Be5) or as normal speech (e.g. "bishop to e five", "knight takes", "castle", etc.). An optimized solution to this task could be integrated with Whisper's speech-to-text transcription to achieve a high-quality voice-controlled chess app.
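
(As a rough sketch of the chess constraint: a pattern like the one below would restrict output to the shape of short algebraic notation, but legality against the actual board would still have to come from generating the pattern per position, or from the FSM-style approach discussed further down in this thread.)

    // Illustration only: restricts output to strings shaped like short
    // algebraic notation ("Be5", "exd5", "Nxf3+", "O-O", "e8=Q#", ...).
    // It knows nothing about the board state, so legality would have to be
    // enforced separately (per-position pattern generation or a state machine).
    #include <regex>

    static const std::regex san_move(
        R"(^(O-O(-O)?|[KQRBN]?[a-h]?[1-8]?x?[a-h][1-8](=[QRBN])?)[+#]?$)");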

@AutonomicPerfectionist
Contributor Author

With some fiddling I did manage to get the C++ standard library regex implementation to show similar results; my regex ended up being gigantic, with multiple alternatives for each in-progress piece of the response, but someone with more time could probably make a better one.

As for performance, I'm not at my desktop right now, so I can't get you hard numbers for a bit, but from what I remember the milliseconds per run were not significantly impacted.

@slaren
Collaborator

slaren commented May 11, 2023

It may be better to move this code to llama.cpp as part of the llama_sample_* APIs, something like llama_sample_regex_penalty or whatever you want to call it. Then measure the time it takes and add it to ctx->t_sample_us. Testing 32000 regular expressions may have a significant impact on the sampling time, and currently it is not being measured.

@AutonomicPerfectionist
Contributor Author

Honestly, I was not aware of that API; I will definitely do that as soon as I have time.

@AutonomicPerfectionist
Contributor Author

If it is a performance issue, I think I have some ideas to help with that too. Most of those tokens will never match the required format, so "banning" them before the first iteration would significantly reduce the number of regex matches needed per iteration. I think I also saw an option somewhere in the regex documentation to optimize for matching performance versus regex-creation performance.

@ejones
Collaborator

ejones commented May 11, 2023

This is super cool. As maybe a future direction, I've been wondering if things like repetition penalty, logit bias, this, and reverse prompt/stop words can all be generalized as something like an FSM from tokens to biases (reverse prompt could be something like forcing EOS as a terminal state after the string). Not sure what form that would take.

@AutonomicPerfectionist
Contributor Author

AutonomicPerfectionist commented May 11, 2023

This is super cool. As maybe a future direction, I've been wondering if things like repetition penalty, logit bias, this, and reverse prompt/stop words can all be generalized as something like an FSM from tokens to biases (reverse prompt could be something like forcing EOS as a terminal state after the string). Not sure what form that would take.

I actually had the same idea last night about unifying them under an FSM. I didn't put much thought into it, though; that's probably above my experience level.

Thinking about it, though, an FSM-based biasing engine would allow more complicated response biasing than is possible even with regular expressions. I'm thinking something like code generation would be improved by biasing the next token to only be valid syntax in the target language, which would be basically impossible with the current systems but doable with an FSM.
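
A minimal sketch of what such a unified interface could look like (entirely hypothetical; none of these names exist in llama.cpp):

    // Entirely hypothetical sketch of a unified biasing interface: every
    // constraint (repetition penalty, logit bias, a regex/format constraint,
    // reverse prompt / stop words) becomes a state machine that is advanced
    // by accepted tokens and queried for per-token logit adjustments.
    #include "llama.h"
    #include <unordered_map>

    struct token_bias_fsm {
        virtual ~token_bias_fsm() = default;
        // Advance the machine's state with the token that was actually sampled.
        virtual void accept(llama_token token) = 0;
        // Additive logit biases for the next sampling step; -INFINITY bans a
        // token, and tokens absent from the map are left untouched.
        virtual std::unordered_map<llama_token, float> next_biases() = 0;
    };

A --logit-bias entry is then a one-state machine, a regex or grammar constraint is an automaton compiled from the pattern, and a reverse prompt is a machine whose terminal state forces EOS.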

@AutonomicPerfectionist
Contributor Author

It may be better to move this code to llama.cpp as part of the llama_sample_* APIs, something like llama_sample_regex_penalty or whatever you want to call it. Then measure the time it takes and add it to ctx->t_sample_us. Testing 32000 regular expressions may have a significant impact on the sampling time, and currently it is not being measured.

I put the regex testing in a llama_sample_regex_bias() function and, as expected, performance was awful. The total sample time was 3.1 ms per run with regex biasing versus only 0.2 ms per run without it. Implementing my earlier performance idea by chopping off the candidates that don't match a simple "allowed tokens" regex ([\(\), a-zA-Z0-9]*) definitely helped, but it's still terrible: with that fix, the total sample time was 2.06 ms per run.
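
For reference, the shape of that function is roughly the following (a simplified approximation of the branch rather than the exact code; llama_token_data_array, llama_token_to_str, and ggml_time_us are the existing llama.cpp/ggml APIs of this period):

    // Simplified sketch of llama_sample_regex_bias (an approximation of the
    // branch, not the exact code). The cheap "allowed characters" prefilter
    // skips the expensive partial match for tokens that can never appear in
    // a valid response.
    #include "llama.h"
    #include "ggml.h"
    #include <boost/regex.hpp>
    #include <cstdint>
    #include <string>

    void llama_sample_regex_bias(struct llama_context * ctx,
                                 llama_token_data_array * candidates,
                                 const std::string & response_so_far,
                                 const boost::regex & allowed_chars,    // e.g. [\(\), a-zA-Z0-9]*
                                 const boost::regex & allowed_response,
                                 float bias) {
        const int64_t t_start_sample_us = ggml_time_us();

        for (size_t i = 0; i < candidates->size; i++) {
            const std::string tok = llama_token_to_str(ctx, candidates->data[i].id);

            // Prefilter: the token contains characters that can never occur in
            // a valid response, so don't bother with the partial match.
            if (!boost::regex_match(tok, allowed_chars)) {
                candidates->data[i].logit += bias;
                continue;
            }

            // Would the response still be extendable to a full match with this token?
            const std::string candidate = response_so_far + tok;
            boost::match_results<std::string::const_iterator> what;
            if (!boost::regex_match(candidate.begin(), candidate.end(), what,
                                    allowed_response,
                                    boost::match_default | boost::match_partial)) {
                candidates->data[i].logit += bias;
            }
        }

        // Inside llama.cpp proper, the elapsed time would be accumulated into
        // ctx->t_sample_us here, as suggested above.
        (void) (ggml_time_us() - t_start_sample_us);
    }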

@slaren
Collaborator

slaren commented May 11, 2023

Total sample time was 3.1 ms per run with regex biasing

That doesn't look too bad, actually; it seems very reasonable and won't affect the overall token generation time too much.

@bakkot
Contributor

bakkot commented May 11, 2023

A more general form would be for the params object to take a function pointer, float penaltyForToken(std::string partial_completion, char* token). Then you could use a regex (or multiple regexes), like you have currently, but also other rules for constraining or biasing sampling, like a JSON schema.
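
A sketch of that shape (hypothetical, and using std::function rather than a raw function pointer):

    // Hypothetical sketch of the suggested hook: the sampling parameters carry
    // a callback, and regex biasing becomes just one possible implementation
    // (a JSON-schema or grammar checker being another).
    #include <functional>
    #include <string>

    struct sampling_hook_params {
        // Additive logit penalty/bias for token_text, given the response so far.
        std::function<float(const std::string & partial_completion,
                            const std::string & token_text)> penalty_for_token;
    };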

@JHawkley

My two cents is that providing a mechanism to control the AI so that it follows strict formatting might become standard in every AI service before too long.

However, I don't think llama.cpp should actually provide an implementation of this in the form of a sampler or templating engine; it should provide only the primitives necessary for another program to implement such features.

The services that a generative text AI inference library should probably provide are just:

  1. Convert a string into tokens.
  2. Convert tokens into embeddings (for vector databases).
  3. Convert tokens back into strings.
  4. Sample new tokens, given a context.
  5. Determine how surprised the AI is if some given tokens came next in a given context.

We don't currently provide item 5 in the main API, but we do HAVE it: that's essentially perplexity. Some implementation of perplexity could be moved into the main API, since it's actually quite useful for creating controlled AI services.

Basically, the idea would be that instead of having the AI continue the context on its own, we could show the AI a variety of possible continuations and rank them so we could then make a decision programmatically, but still informed by the AI.

As an example, I was trying out kobold.cpp's multi-user chat functionality the other day. It looks like all it does is randomly pick one of the available characters for the AI to respond as and stick a prompt in for it, regardless of whether or not it makes sense that the selected character would reply to the user's last message.

For instance, if the context is:

Below is a chat between a human, Albert Einstein, and Socrates. The two historical figures are played by an AI that converses with the human and responds as these figures appropriately and in-character.

Human: What would happen if I fell into a black hole?

...you would expect Einstein to have more to say about black holes than Socrates, who wouldn't even know what those are! The AI model should be able to tell you which of \nAlbert Einstein: or \nSocrates: it would prefer as the next text. If that could be measured, you could programmatically determine which character should speak next and just add it to the context to prompt the AI.

This isn't quite the same thing as using a regular expression, but if you had a service that provided this information, you could create an AI-powered generative grammar that strictly controls how the context can evolve. Here is how the above AI's chat turn might be encoded in such a grammar.

root -> "{chatSequence}"
chatSequence -> "{chatPrompt}{chatTerminator}"
chatSequence -> "{chatPrompt}{chatPrompt}{chatTerminator}"
chatPrompt -> "\n{aiUser} {chatContent}"
aiUser -> "Albert Einstein:"
aiUser -> "Socrates:"
chatContent -> GenerateTokensWithAI(minTokens: 1, maxTokens: 60)
chatTerminator(discard=true) -> "\nHuman:"

The places where the same identifier appears twice are places where the AI can be probed to see which path it would have preferred to see come next. The AI is also constrained to give only one OR two in-character replies before it is forced to give control back to the human and end its turn.
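
The "probe which path the model prefers" step could be sketched against the llama.h API of this period roughly as follows (illustrative only; it assumes the context has already been evaluated so that llama_get_logits() refers to the next position, and error handling is omitted):

    // Illustration of "item 5": how surprised is the model by a candidate
    // continuation (e.g. "\nAlbert Einstein:" vs "\nSocrates:")? The summed
    // log-probability of the continuation's tokens gives a ranking score;
    // higher means the model prefers that path.
    #include "llama.h"
    #include <algorithm>
    #include <cmath>
    #include <string>
    #include <vector>

    static float score_continuation(llama_context * ctx, int n_past,
                                    const std::string & text, int n_threads) {
        std::vector<llama_token> toks(64);
        const int n = llama_tokenize(ctx, text.c_str(), toks.data(), (int) toks.size(), false);

        float logprob = 0.0f;
        for (int i = 0; i < n; i++) {
            const float * logits  = llama_get_logits(ctx);
            const int     n_vocab = llama_n_vocab(ctx);

            // log-softmax of the candidate token's logit = log p(token | context)
            float max_l = logits[0];
            for (int j = 1; j < n_vocab; j++) max_l = std::max(max_l, logits[j]);
            double sum = 0.0;
            for (int j = 0; j < n_vocab; j++) sum += std::exp(logits[j] - max_l);
            logprob += (logits[toks[i]] - max_l) - (float) std::log(sum);

            // Feed the token so the next iteration scores the one after it.
            llama_eval(ctx, &toks[i], 1, n_past + i, n_threads);
        }
        return logprob;
    }

Comparing several continuations this way would require restoring the context between calls (e.g. by re-evaluating the prompt or saving and restoring the KV-cache state), since llama_eval advances it.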

@AutonomicPerfectionist
Contributor Author

@JHawkley that's a fascinating idea and sounds like the right way forward in my opinion.

@slaren I've conducted more experiments with my regex biasing implementation, and I've found that with more complex regular expressions, the performance impact grows exponentially with completion length. The impact is so large that on my Ryzen 5 5600G, with a fairly complex regex, anything more than 4 triplets (see the example above) slowed generation to a crawl, with sample times jumping to 70-90 ms per run. Shorter completions, i.e. around 2 triplets, still had a sample time of around 20-30 ms per run. I've identified the cause of the poor performance as the standard library's regex implementation; simply switching to Boost, without even using the partial-match feature, more than halved the performance impact. My conclusion is that this PR, unfortunately, is not fit for llama.cpp proper, but if a more generalized biasing framework is built into the API, users could still reap the benefits.

@howard0su
Collaborator

Suggest looking at this project to see if we can avoid regex and abstract the FSM:
https://lexy.foonathan.net/

@deep-pipeline

@AutonomicPerfectionist @ejones I've been thinking about how one builds elegant FSMs that are in effect run by an LLM, so I was interested in this PR. But I wonder whether (given how much is going on!) you've seen https://lmql.ai, in particular its output-constraints and answer-templates features?

It seems to me that some of what's being done in that project overlaps heavily with what you're aiming for with the technique of culling outputs using regex, but perhaps with some interesting subtleties (e.g. the ability to extract percentage probabilities of semantic match to different template outputs rather than simply culling strings with a regex; we probably ought to be keeping matches in a semantic, high-dimensional space rather than dropping down to flat Unicode).

I'm not sure if there's a way to run LMQL on models running on ggml infrastructure, but that would certainly be really useful, so I thought I would bring the project to your attention.

@ejones
Collaborator

ejones commented May 23, 2023

@deep-pipeline LMQL looks really cool! Their model serving process approach for local models would likely translate to llama.cpp.

@deep-pipeline

@AutonomicPerfectionist I see you've closed this without it being merged. Has this work been folded in elsewhere (e.g. are you anticipating the functionality will be folded into the work @ejones is doing on regex bias)? Just wanting you to know your work was/is interesting to others.

@AutonomicPerfectionist
Contributor Author

@deep-pipeline Yes, I believe the above-linked PR is a much better solution than this one. As I mentioned in a previous comment, the current implementation suffers from severe performance issues, and rectifying them would require switching to a different design altogether. The grammar-based approach is significantly more elegant and appears to be performant. I would have helped contribute to that PR, but unfortunately, I'm swamped with schoolwork. I will, however, keep my branch available for those who are interested in its implementation. I may have forgotten to push some additional changes I made during performance testing; I will push those for anyone interested once I have a bit of free time.
