-
Notifications
You must be signed in to change notification settings - Fork 334
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor: GGUF metadata tokenizer #389
Refactor: GGUF metadata tokenizer #389
Conversation
Proper way to opt-out of the dead code warnings is annotate the test module as purely for testing.
Remove the repetitive noise to common functions under test. This also addresses a bug fix for the encode test case where the upstream encoder/decoder calls flip the meaning of the `bool` for handling special tokens.
Retrieving metadata items from their hashmap `Value` enum into primitive types with error handling is very verbose and noisy. Use traits to abstract all that away. This could also benefit usage in models `from_gguf()` methods. Meanwhile the unigram tokenizer has unified the special token handling at the end by keeping `unk` as a `u32` and only casting it to `usize` when actually needed.
The special token strings are also being created in the tokenizer now. A bit awkward, but unclear why only `unk` was an option, presumably the `bos` and `eos` may also need similar treatment to `unk`?
Code Metrics Report=============================================================================== Language Files Lines Code Comments Blanks =============================================================================== Dockerfile 1 34 25 0 9 Happy 1 442 369 0 73 JSON 9 21 21 0 0 Python 24 864 731 25 108 TOML 15 403 365 1 37 ------------------------------------------------------------------------------- Jupyter Notebooks 1 0 0 0 0 |- Markdown 1 60 30 22 8 |- Python 1 96 87 1 8 (Total) 156 117 23 16 ------------------------------------------------------------------------------- Markdown 16 1057 0 783 274 |- BASH 6 203 190 0 13 |- Python 6 121 110 0 11 |- Rust 3 185 172 9 4 (Total) 1566 472 792 302 ------------------------------------------------------------------------------- Rust 94 29779 27098 437 2244 |- Markdown 48 479 0 463 16 (Total) 30258 27098 900 2260 =============================================================================== Total 162 32600 28609 1246 2745 =============================================================================== |
Hi @polarathene thanks. If I roll back the changes such that #380 was effectively never merged, would that work for you? I did it locally already so I can just push it up. |
Ok, I reverted and there is just 1 simple merge conflict now :) |
Thanks! Much appreciated 🙏 Sorry about this, I'll try to better communicate when I'm working on a refactor in future 😓 I hope I didn't waste any of your time. You still had some nice changes I think, such as better restructuring of the files (something I've not given too much thought to). |
No problem! In the future, if you let me know in Discord, I'll see it immediately so we can coordinate better.
Maybe you can do something similar? A key point is that each GGUF model has an |
That's no longer necessary if you take the approach I did here with
I don't know if we need to beyond the I'll take a look after some rest, past midnight here 😅 Failing test is for the last encode test case: HF tokenizer has |
I think this should be fine. Maybe we can call it
No rush! Maybe it did indeed surface that bug, perhaps we aren't adding the bos token to the tokenizer? |
This is a great thing to have. As far as my research, there isn't currently a crate that does this! Not sure if it's an error in this quant, but it's listed as Given how prolific this hugging face user is I imagine people are going to try their quants and run into the same issue I had (only support for llama). Should I reach out to them or is it something that will be handled by this library? |
@ShelbyJenkins thank you!
We do the conversion in this codebase, so I can add support for the |
@polarathene I took a look at it, and I think that the GGUF one is actually wrong here. It is not prepending a BOS token where one should not be. Regardless, GGUF tokenizers are documented to be less accurate than the HF ones, and the GGML docs themselves recommend using the HF one. I have not seen any effect of this error, as the chat template adds that anyway. This can, however, be an issue if you are not using a model with a chat template. I also did a cursory search for an API through As another note, I plan on merging this before #397 (which #388 will need for testing) so I am excited to merge! |
I'm not quite sure how to? Is there no information in the GGUF to ensure the correct tokenizer is used? Even with the improved tokenizer support after unigram, is the GGUF still lacking the information to use the appropriate tokenizer?
Sorry for the delay, I'll get onto applying the requested changes. I'm still not sure how to proceed with the test case issue though. |
Yes, the GGUF tokenizer specification is not as precise as the HF one, and we even use different tokenizer types.
No problem! Given that we actually found a mistake in the GGUF tokenizer, I think we should go back to only testing if special tokens = false just to get the cases to pass. This will just ensure the body of the text is the same, and while it is not ideal, that inaccuracy is coming from the GGUF tokenizer. |
This is a partial change, the method will be changed over in subsequent commits.
For the quantized models to leverage. Additionally changes helper methods over to `anyhow::Error`.
Thanks, I'll look into using a cache. The Windows test is a bit flaky for this reason.
Sounds good. I'll probably resolve it quickly, as I will bundle the changes in the next PR. |
Similar to the enum workaround. As the upstream builder is awkward to use, an alternative one is implemented to improve the DX. The enum conversion to upstream wrapper types is handled in the builder now, simplifying usage in a tokenizer method.
This comment was marked as resolved.
This comment was marked as resolved.
This test fails presently. It is due to the mismatch of the HF tokenizer vs GGUF tokenizer used.
27c6cdf
to
4b8d775
Compare
This special character looks very similar to `_` but it is not. This was a mistake I introduced when converting to local enums approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Context for recent commits.
Should be good to merge now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the changes and comments! Can you please hardcode the test passage so that we have a consistent input? It does not need to be excessively long, just enough to test the capabilities of the tokenizer.
fn get_test_passage() -> String { | ||
// TODO: Why is it necessary to depend on this for a multi-line test string? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we can hardcode this? It should be easy.
Sure. I haven't gone over the contents of this For now I'll just go with a single input based on some content from there. |
Added context for this particular configuration.
@EricLBuehler your recent merge was based off your earlier PR work that you since reverted, you merged that and now there's a bunch of conflicts for changes unrelated to Phi3 vision 😓 I'm not even able to resolve those via the web UI this time apparently (not that I particularly would want to): Potentially, revert the merge, and fix the history there? That PR was 184 commits long though, it looks there were other changes like dropping some docs (Adding models, CLI docs), so I don't know what else was done that was/wasn't intentional 🤷♂️ This is the commit for removing the two docs files, but there is no context in the commit, only a title "Overhaul docs": d7a7c3c |
Ok, sounds good. Is this PR ready for merge? |
Yes? You asked for the test to be updated and I addressed that roughly 11 hours ago. But now there is the conflicts. |
Ok, thanks for confirming. I'll undo the merge of that PR so that this can merge cleanly. Sorry for the inconvenience! |
Thank you! |
Just to confirm, was the removal of those mentioned doc pages intentional? Seemed like useful information, just a bit outdated? |
Sorry, which docs pages were removed? |
I referenced them at the end of the comment above. "Adding Models" / "CLI". I then said if that was unintentional that there may have been other mishaps, but there was quite a bit of changes in the PR you merged. |
Oh I see. I removed the "Adding models" in favor of a comment in the README contributing section because of the rate of development regarding that section, and the "CLI" document was prone to being outdated. If you know of a better way to autogenerate the CLI docs that would be great, I think we could do it in CI. |
Yeah, you could probably have a template file to produce the markdown content? Or switch to a docs generator that can import content from separate sources. Then for the CI part, you would just get the output of the CLI help and if it doesn't match the importable snippet, it'd update that and generate the page again. Another way with CI is to have that sort of behaviour triggered on an automated schedule rather than commit triggers. That can then create a PR with the update, I've not done that before myself but I'm aware of something like that 🤔 |
Ok, I'll look into that. But before starting work on this, do you see static CLI docs as a useful tool when users can simply download and run with --help? That was my rationale. |
It was useful as an insight when I first came across If there's no additional context though, it's probably not that meaningful :) I had a look over the deleted document, it doesn't appear to provide any value beyond an overview to reference without the CLI available. I wouldn't worry about restoring it 👍 |
Great. I did a refactor of the README yesterday, do you see any parts that are confusing or hard to find? |
This was my progress on a refactor, but I didn't communicate this effort well with maintainers.
from_gguf()
methods in model files that could have used the approach I proposed here (a simpleTryFrom
on a struct of metadata fields). I was considering a derive macro to simplify that further, but it might have been overkill for the project 😅The merged alternative was rather extensive in changes, I don't have the motivation to go through all that. Some structs were renamed, methods changed, the gguf tokenizer file moved along with other changes that'll bring conflicts for rebasing.
If this PR were updated, it would simplify the
from_gguf()
metadata parameter extraction:mistral.rs/mistralrs-core/src/models/quantized_llama.rs
Lines 274 to 287 in 89dea1b
But without the derive proc macro approach, due to all the
usize
and defaults, this PR may not be that worthwhile of an improvement.The related tests changeset is probably still worth having. It fixes a bug with the encoder test case being misconfigured. It also removes the need for the
dead_code
annotations.Each commit has a commit message for additional context.