Segfault when using simple self-referential grammar on M2 Mac #3801
Comments
I haven't tried self-referential grammars yet - probably this is not supported atm. Pinging @ejones
In case it's interesting, I've written up a bit more about the context here: https://github.com/jonastemplestein/tree_sitter_grammar_to_gbnf - mostly so I can pick up where I left off if/when support for such grammars is added to llama.cpp. Thanks so much for looking into this <3
Yeah, the issue here is left recursion: a rule whose body's first term leads back to the rule itself, so the parser can loop without ever consuming input. In other words, there can be recursion, just not in the first term of a rule body. So this grammar works, for instance:
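(A sketch of such a grammar, reconstructed around the `binary-operator` rule quoted in the observations further down, since the original snippet was elided here; the recursive reference appears only after a leading `integer`, so the parser always consumes input before recursing.)

```
root ::= expression
expression ::= integer | binary-operator
# the recursion sits after the leading integer, so no rule starts with itself
binary-operator ::= integer ("+" | "-") expression
integer ::= [0-9]+
```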
In this case, the fix is moving the recursive reference out of the first position of the rule body. Now, I haven't looked into tree sitter, and I don't know how easy it is to produce or convert grammars in this way automatically.
Oh amazing, thank you so much. I can just insert a whitespace symbol on the left and the toy grammar works:
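(A sketch of that change, continuing the reconstruction above; the key point is that `ws` must consume at least one character, so `binary-operator` no longer begins with a path back to itself.)

```
root ::= expression
expression ::= integer | binary-operator
# ws matches at least one character, which breaks the left recursion
ws ::= [ \t\n]+
binary-operator ::= ws expression ("+" | "-") expression
integer ::= [0-9]+
```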
I've prepended whitespace symbols to various definitions in my Elixir grammar and I can now get to the input prompt! Exciting! Unfortunately, llama.cpp still doesn't generate Elixir code: it just keeps giving either single-word answers or printing an infinite series of newlines.

Each subsequent token that llama.cpp generates also takes longer than the one before, which suggests to me that the whole output string is re-parsed against the grammar for every token, as opposed to tagging the output as we go along and storing that state. Is that true? If so, we are probably some way off from using formal grammars to force llama.cpp to output only valid code in high-level programming languages.

In case you're interested, you can find the full Elixir grammar that causes this behaviour in this repo (elixir_no_left_recursion.gbnf), alongside some more notes.
It is not. We do indeed just store the parse state as we go along. I'd have to look into this. One thing that comes to mind: we store all possible parse states, which is reasonable for deterministic or mostly deterministic grammars. If the grammar is ambiguous or nondeterministic in certain ways, this could cause a compounding effect; I don't know off the top of my head whether that's the case here.

Another thing: there are definitely certain pathological cases that cause slow token sampling, which I haven't had a chance to look into. But when I've seen that, it's been more of a sudden thing than a slow accumulation.
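As a hypothetical illustration of that compounding effect (not taken from this thread): in the grammar below, a run of spaces between two words can be split between the trailing `ws` of one `item` and the leading `ws` of the next in several ways, and the sampler has to keep a parse stack alive for each split.

```
# hypothetical example: n spaces between words allow n+1 different splits
# across the two adjacent ws rules, so live parse stacks accumulate
root ::= item+
item ::= ws word ws
ws ::= [ ]*
word ::= [a-z]+
```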
This is something we've observed with whitespace in grammars, and generally we have to limit the amount of whitespace the model is allowed to add.
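One way to do that, as a sketch rather than anything prescribed in this thread, is to bound whitespace explicitly instead of allowing unbounded runs:

```
# at most one space or newline between tokens; an endless run of
# newlines is simply not derivable from this rule
ws ::= [ \n]?
```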
Note that the model isn't aware of the grammar constraints, so it will get confused if the grammar diverges greatly from what it's intending to do. Looking at your sample repo, it looks like this instruct model is trying to write explanations, which are sort of being forced into Elixir code. Maybe you need to incorporate a code block syntax (e.g. triple-backtick fences) into the grammar.

Furthermore, once you commit to grammar constraints, your grammar basically needs to describe the full language, or you need to explain to the model what subset you're targeting. Again, the model will think it can do any Elixir-y thing, and if the grammar doesn't recognize that, the generation could go off the rails. This tension between the constraints and the completeness of the grammar is what I see as the bigger barrier to applying them to high-level languages.

The latest models are also pretty good at generating code, I think? Depending on the use case, it's not clear how much value is added by specifying the programming language grammar. It may make more sense to adopt an approach like "generate a code block and I will take whatever's inside".
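A minimal sketch of that last approach, assuming a Markdown-style fenced block; the body rule is deliberately crude and just forbids backticks inside the block.

````
# sketch: force exactly one fenced Elixir block; take whatever is inside
root ::= "```elixir\n" code "\n```"
code ::= [^`]*
````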
Thank you so much for the thoughtful response; this gives me some avenues to explore. I can remove the whitespace from the grammar and will play around with code blocks and different prompts. If I get that working, I'll try to debug the increasing slowdown over time with some better test cases.
Regarding the need for this, I agree that for large models that have seen a lot of Elixir in training, you can just ask for a code block. But for very small models - in particular ones that haven't been fine-tuned on Elixir, because the base models often haven't been exposed to Elixir nearly as much as Python, JavaScript, etc. - grammar constraints could still add a lot of value.
One last thought on this - I hear mention of the idea of custom WASM samplers. Is this being worked on? If so, the easiest way to accomplish this for many programming languages at once might be to build a kind of tree-sitter WASM module that exposes an interface llama.cpp can query to limit generated tokens to only grammatically legal ones. (Apologies for hijacking the thread from the original issue - let me know if I should drop this idea elsewhere 🙏)
This issue was closed because it has been inactive for 14 days since being marked as stale.
Hey folks, first of all massive thanks for this amazing project and in particular grammar support - it honestly feels like magic <3
I was inspired by this tweet that @ggerganov retweeted to try converting tree sitter grammars to GBNF form. After some hiccups, I had a script that created grammars that could be parsed by llama.cpp, but unfortunately I'm now getting segmentation faults.
I've isolated it to this simple example grammar:
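(A plausible minimal version of that grammar, reconstructed around the `binary-operator` rule shown in the observations below: `binary-operator` starts by referring back to `expression`, which can immediately be `binary-operator` again, so the parser's stack expands forever without consuming any input.)

```
root ::= expression
expression ::= binary-operator | integer
# left-recursive: the first term of binary-operator leads straight back to itself
binary-operator ::= expression ("+" | "-") expression
integer ::= [0-9]+
```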
Here's the segfault. I've included the full arguments and shortened output of llama.cpp:
A few other observations:

- The grammar parses and runs fine if I remove the self-reference, e.g. with:

```
binary-operator ::= integer ("+" | "-") integer
```
Any help would be much appreciated. It'd be really cool if we could force llama.cpp to output any programming language that tree sitter supports, but most of those grammars contain this sort of self-referential symbol.