Force response format and bias responses by regex #1397
Conversation
Nice job, this looks very interesting. Would it be possible to replace the Boost dependency with the standard library `<regex>`?
It would be, but as a result there would be a loss of functionality. The standard library does not support the partial matching feature this relies on.
This is a very interesting concept that I will be happy to see advanced further. How is the "sample time" affected by these regex matches? Here is a specific application that I am very interested in: constrain the generation only to a set of valid chess moves on a board given the state of the current board (i.e. the position of the pieces). The chess moves can be generated in short algebraic notation (e.g. Be5) or as normal speech (e.g. "bishop to e five", "knight takes", "castle", etc.). An optimized solution to this task could be integrated into Whisper's speech-to-text transcription to achieve a high-quality voice-controlled chess app.
With some fiddling I did manage to get the C++ standard library regex implementation to show similar results; my regex ended up being gigantic, with multiple alternatives for each in-progress piece of the response, but someone with more time could probably make a better one. As for performance, I'm not at my desktop right now so I can't get you hard numbers for a bit, but from what I remember the milliseconds per run were not significantly impacted.
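To make that trick concrete, here is a minimal sketch of the prefix-alternatives idea (an illustration only, not the actual regex used above): since `std::regex` has no partial-match mode, every "in progress" shape of a single triplet is listed explicitly as its own alternative.

```cpp
// Sketch: emulating partial matching with std::regex by enumerating prefix
// alternatives of the target format (a single "(a, b, c)" triplet here).
#include <iostream>
#include <regex>
#include <string>

int main() {
    const std::regex allowed(
        R"(^\($)"                               // just an opening paren so far
        R"(|^\([^,)]*$)"                        // subject still being written
        R"(|^\([^,)]+, [^,)]*$)"                // predicate still being written
        R"(|^\([^,)]+, [^,)]+, [^,)]*$)"        // object still being written
        R"(|^\([^,)]+, [^,)]+, [^,)]+\)$)");    // completed triplet

    for (const std::string candidate : {"(", "(Sunny, is, ", "(Sunny, is, a cat)", "Sunny is"}) {
        const bool ok = std::regex_match(candidate, allowed);
        std::cout << candidate << " -> " << (ok ? "allowed" : "banned") << "\n";
    }
}
```

The downside is exactly what is described above: the alternative list grows with the complexity of the format, which is why proper partial matching (as in Boost) is much more pleasant.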
It may be better to move this code to llama.cpp as part of the sampling API.
Honestly, I was not aware of that API; I will definitely do that as soon as I have time.
If it is a performance issue, I think I might have some ideas to help with that too. Most of those tokens will never match the required format, so "banning" them before the first iteration would significantly reduce the number of regex matches needed per iteration. I think I also saw an option somewhere in the regex documentation to optimize for matching performance versus regex creation performance.
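A rough sketch of what that up-front ban could look like (purely an assumption about the approach; the character set and helper names are made up for illustration):

```cpp
// Sketch of the "ban up front" idea: precompute once which vocabulary tokens
// could ever appear in a response matching the format, and permanently exclude
// the rest so the per-iteration regex check only runs over the survivors.
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper: true if token_text contains only characters the target
// format can ever produce (letters, spaces and "(),<>|" in this example).
static bool could_appear_in_format(const std::string & token_text) {
    static const std::unordered_set<char> alphabet = [] {
        std::unordered_set<char> s{ ' ', '(', ')', ',', '<', '>', '|' };
        for (char c = 'a'; c <= 'z'; ++c) s.insert(c);
        for (char c = 'A'; c <= 'Z'; ++c) s.insert(c);
        return s;
    }();
    for (char c : token_text) {
        if (!alphabet.count(c)) return false;
    }
    return true;
}

// Build the list of token ids that remain candidates; everything else can be
// given a -inf logit once, before generation starts.
std::vector<int> prefilter_vocab(const std::vector<std::string> & vocab) {
    std::vector<int> allowed;
    for (int id = 0; id < (int) vocab.size(); ++id) {
        if (could_appear_in_format(vocab[id])) {
            allowed.push_back(id);
        }
    }
    return allowed;
}
```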
This is super cool. As maybe a future direction, I've been wondering if things like repetition penalty, logit bias, this, and reverse prompt/stop words can all be generalized as something like an FSM from tokens to biases (reverse prompt could be something like forcing EOS as a terminal state after the string). Not sure what form that would take.
I actually had the same idea last night about unifying them under an FSM. I didn't put much thought into it though; that's probably above my experience. Thinking about it more, an FSM-based biasing engine would allow more complicated response biasing than is possible even with regular expressions. I'm thinking something like code generation would be improved by biasing the next token to only be valid syntax in the target language, which would be basically impossible with the current systems but doable with an FSM.
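For the sake of discussion, here is a bare-bones sketch of what a token-to-bias FSM could look like (hypothetical; none of this exists in llama.cpp, and the structure is only one of many possible encodings):

```cpp
// Very rough sketch of an "FSM from tokens to biases": transitions define which
// tokens are legal in each state, per-state biases nudge the sampler, and
// terminal states force EOS (generalizing reverse prompt / stop words).
#include <cstdint>
#include <map>
#include <vector>

struct FsmBias {
    // transition[state][token_id] -> next state; missing entries mean "banned".
    std::vector<std::map<int32_t, int>> transition;
    // bias[next_state] is added to the logit of every token leading there.
    std::vector<float> bias;
    std::vector<bool>  is_terminal;

    int state = 0;

    // Adjust a logit vector before sampling.
    void apply(std::vector<float> & logits, int32_t eos_token) const {
        const auto & row = transition[state];
        for (int32_t id = 0; id < (int32_t) logits.size(); ++id) {
            auto it = row.find(id);
            if (it == row.end()) {
                logits[id] = -1e30f;            // token would leave the language: ban it
            } else {
                logits[id] += bias[it->second]; // nudge toward or away from that branch
            }
        }
        if (is_terminal[state]) {
            logits[eos_token] = 1e30f;          // reverse-prompt style: force EOS
        }
    }

    // Advance after a token has actually been sampled.
    void accept(int32_t token_id) {
        state = transition[state].at(token_id);
    }
};
```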
I put the regex testing in a
That doesn't look too bad actually; it seems very reasonable and won't affect the overall token generation time too much.
A more general form would be for the
My two cents is that providing a mechanism to control the AI so that it follows a strict format might become standard in every AI service before too long. However, I don't think llama.cpp should actually provide an implementation of this in the form of a sampler or templating engine or something; it should provide only the special something that is necessary for another program to implement such features. The services that a generative text AI inference library should probably provide are just:
We don't currently provide item 5 in the API of llama.cpp.

Basically, the idea would be that instead of having the AI continue the context on its own, we could show the AI a variety of possible continuations and rank them, so we could then make a decision programmatically, but still informed by the AI. As an example, I was trying out kobold.cpp's multi-user chat functionality the other day. It looks like all it does is randomly pick one of the available characters the AI should respond as and stick a prompt in for it, regardless of whether or not it makes sense that the selected character would reply to the user's last message. For instance, if the context is:
...you would expect that Einstein would have more to say about black holes than Socrates, who wouldn't even know what those are! The AI model should be able to tell you which of the two characters is the more likely one to respond.

This isn't quite the same thing as using a regular expression, but if you had a service that provided this information, you could create an AI-powered generative grammar that strictly controls how the context can evolve. Here is how the above AI chat turn might be encoded in such a grammar:
The places where the same identifier appears twice are places where the AI can be probed to see which path it would have preferred to see come next. The AI is also constrained to give only one OR two in-character replies before it is forced to give control back to the human and end its turn.
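A minimal sketch of the "rank the possible continuations" service described above, with the model's scoring left abstract (the `ScoreFn` interface is assumed for illustration and is not part of any existing API):

```cpp
// Sketch: let the calling program, not the sampler, decide how the context
// evolves, but informed by the model. ScoreFn should return the model's
// log-probability of `continuation` given `context`, e.g. by summing
// per-token log-probs from a single evaluation.
#include <functional>
#include <string>
#include <vector>

using ScoreFn = std::function<double(const std::string & context,
                                     const std::string & continuation)>;

// Returns the continuation the model considers most likely.
std::string pick_continuation(const ScoreFn & score,
                              const std::string & context,
                              const std::vector<std::string> & options) {
    const std::string * best = nullptr;
    double best_score = 0.0;
    for (const auto & opt : options) {
        const double s = score(context, opt);
        if (best == nullptr || s > best_score) {
            best = &opt;
            best_score = s;
        }
    }
    return best ? *best : std::string();
}

// Usage in the chat example: options = {"Einstein: ", "Socrates: "} and the
// context is the conversation so far; the winner decides who replies next.
```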
@JHawkley that's a fascinating idea and sounds like the right way forward in my opinion. @slaren I've conducted more experiments with my regex biasing implementation, and I've found that with more complex regular expressions, the performance impact grows exponentially with completion length. The impact is so large that on my Ryzen 5 5600G, with a pretty complex regex, anything more than 4 triplets (see example above) slowed generation to a crawl, with sample times jumping to 70-90 ms per run. Shorter completions, i.e. around 2 triplets, still had a sample time of around 20-30 ms per run. I've identified the cause of the poor performance as the standard library's regex implementation; simply switching to Boost, without even using the partial match feature, more than halved the performance impact. My conclusion is that this PR, unfortunately, is not fit for llama.cpp proper, but if a more generalized biasing framework is built into the API, users could still reap the benefits.
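For anyone who wants to reproduce a comparison like this, a tiny benchmark sketch along these lines might be a starting point (not the original measurement setup; the pattern and timings are illustrative only and will vary by machine and standard-library implementation):

```cpp
// Sketch: time std::regex vs boost::regex on repeated checks of a partial
// response against the triplet format.
#include <boost/regex.hpp>
#include <chrono>
#include <iostream>
#include <regex>
#include <string>

template <typename Fn>
static double time_ms(Fn && fn, int iters) {
    const auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) fn();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count() / iters;
}

int main() {
    const std::string text = "(Sunny, is, a cat)<|>(Sunny, is, orange)<|>(Sunny, is, ";
    const std::string pat  = R"((\([^,)]+, [^,)]+, [^,)]+\)(<\|>)?)+)";

    const std::regex   std_re(pat);
    const boost::regex boost_re(pat);

    std::cout << "std::regex   " << time_ms([&] {
        volatile bool r = std::regex_search(text, std_re); (void) r;
    }, 1000) << " ms/iter\n";

    std::cout << "boost::regex " << time_ms([&] {
        volatile bool r = boost::regex_search(text, boost_re); (void) r;
    }, 1000) << " ms/iter\n";
}
```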
Suggest looking at this project to see if we can avoid regex and abstract the FSM.
@AutonomicPerfectionist @ejones I've been thinking about how one builds elegant FSMs that are in effect run by an LLM, so I was interested in this PR. I wonder if (given how much is going on!) you've seen https://lmql.ai, and in particular its output constraints and answer templates features? It seems to me that some of what's being done in that project overlaps a lot with what you're aiming for with the technique of culling outputs using regex, but perhaps with some interesting subtleties (e.g. the ability to extract percentage probabilities of semantic match to different template outputs, rather than simply culling strings by regex - I mean we probably ought to be keeping matches in semantic high-dimensional space rather than dropping down to flat Unicode). I'm not sure if there's a way to run LMQL on models run on ggml infrastructure code, but that would certainly be really useful, so I thought I would bring the project to your attention.
@deep-pipeline LMQL looks really cool! Their model serving process approach for local models would likely translate to llama.cpp as well.
@AutonomicPerfectionist I see you've closed this without it being merged - has this work been folded in elsewhere (e.g. are you anticipating the functionality will be folded into the work @ejones is doing on regex bias)? Just wanted you to know your work was/is interesting to others.
@deep-pipeline yes, I believe the above-linked PR is a much better solution than this one. As I mentioned in a previous comment, the current implementation suffers from severe performance issues, and rectifying them would require switching to a different design altogether. The grammar-based approach is significantly more elegant and appears to be performant. I would have helped contribute to that PR, but unfortunately I'm swamped with schoolwork. I will, however, keep my branch available for those who are interested in its implementation. I might have forgotten to push some additional changes I made during performance testing; I will push those for anyone interested once I have a bit of free time.
This PR implements the idea from https://github.com/r2d4/rellm, allowing users to set a regex that the model must adhere to in its response. I also added the ability to instead bias completions matching a regex, to make responses more or less likely.
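Roughly, the mechanism can be pictured like this (a condensed sketch assuming Boost partial matching, not the PR's actual code; the names and structure here are illustrative):

```cpp
// Sketch: before sampling each token, test what the response would look like
// with each candidate token appended, ban candidates that can no longer match
// the required format, and add a bias for each user-supplied regex that the
// candidate still matches (fully or partially).
#include <boost/regex.hpp>
#include <cstddef>
#include <string>
#include <vector>

struct RegexBias {
    boost::regex pattern;
    float        bias;   // e.g. -4.5 to discourage matching completions
};

static bool still_viable(const std::string & text, const boost::regex & pattern) {
    boost::smatch m;
    // match_partial lets the call succeed when `text` is a prefix of a possible match.
    return boost::regex_match(text, m, pattern,
                              boost::match_default | boost::match_partial);
}

void apply_regex_biases(std::vector<float> &             logits,
                        const std::vector<std::string> & token_text,  // id -> piece
                        const std::string &              response_so_far,
                        const boost::regex *             required,    // may be null
                        const std::vector<RegexBias> &   biases) {
    for (std::size_t id = 0; id < logits.size(); ++id) {
        const std::string candidate = response_so_far + token_text[id];
        if (required && !still_viable(candidate, *required)) {
            logits[id] = -1e30f;   // token would break the required format
            continue;
        }
        for (const auto & rb : biases) {
            if (still_viable(candidate, rb.pattern)) {
                logits[id] += rb.bias;
            }
        }
    }
}
```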
Combined, these two options improve accuracy for LangChain or AutoGPT-style prompts that require machine-readable responses in a particular format. I tested with Vicuna 1.1 q4_0 on a variation of LangChain's knowledge graph triple extraction prompt:
Executing with the following command line (no regex biasing):

Outputs decent results some of the time, but usually the results are mangled with extra lines (fixable, obviously), extra elements (`(Sunny, is, an orange, cat)`), or are nonsensical (`(Sunny, is, a cat named)`). Manipulating the temperature and other parameters usually ends up making the issues worse, or making the model so conservative it only ever outputs `NONE`, at least for me. I'm not that experienced with model tuning, so maybe I'm missing something. Regardless, the following command that enables regex biasing ensures properly-formatted output every time:

This command forces the model to generate according to the pattern `(<subject>, <predicate>, <object>)`, or `NONE` if no knowledge graph triple makes sense, with `NONE` being biased with a value of -4.5. The temperature needed to be lowered (otherwise nonsense was generated), but the results were astonishingly good. Using the same prompt from above, I got the following triples every time: `(Sunny, is, a cat)<|>(Sunny, is, orange)<|>(Sunny, is, soft)`.
Modifying the prompt so the last few lines read:

Resulted in correct triples of `(Sunny, is, orange)<|>(Alvin, is, blonde)`. Without the regex biasing, no combination of parameters generated those triples; usually, the model would keep repeating `(Sunny, is an, orange cat)` at low temperatures, and at higher ones it would spit out `NONE`. Setting repeat penalty to anything other than 1.0 resulted in it straying from the format.

This was mainly an experiment of mine, but I figured I'd make a PR in case it was interesting. Unfortunately, I needed partial regex matching, which is not in the C++ standard library, so I had to add a compile option to use Boost. That alone probably removes its viability for this project based on the design philosophies I've seen so far, so I understand if it's rejected.
The design can probably be streamlined, since the "allowed response regex" is really just another form of biasing. If this interests people, I would probably make the regex bias options repeatable and remove the allowed response regex, so that it behaves like the `--logit-bias` option.

I have not tested this, but I think regex biasing would also allow users to remove "As an AI language model"-style responses without needing an uncensored model.

EDIT:
Drastically lowering `top_k` to 5-10-ish and further restricting my regex (I forgot the start-of-line assertion, so it was still allowed to generate parens inside the triplet) allowed me to increase the temperature significantly. Interestingly, this allows the model to sometimes generate additional triplets for information inferred from the conversation, such as `(Alvin, is owned by, Person1)` or `(Person1, has, two cats)`. At temperature 2.5, it was creating perhaps superfluous triplets, but correct nonetheless: `(Orange, is, a color)`.