Implement gpt2 (BPE) GGUF tokenizer conversion #397
Conversation
Code Metrics Report
===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 Dockerfile              1           34           25            0            9
 Happy                   1          442          369            0           73
 JSON                    9           21           21            0            0
 Python                 27          995          848           29          118
 TOML                   16          430          390            1           39
-------------------------------------------------------------------------------
 Jupyter Notebooks       1            0            0            0            0
 |- Markdown             1           60           30           22            8
 |- Python               1           96           87            1            8
 (Total)                            156          117           23           16
-------------------------------------------------------------------------------
 Markdown               16         1091            0          809          282
 |- BASH                 5          100           97            0            3
 |- Python               6          122          110            0           12
 |- Rust                 2           80           72            3            5
 (Total)                           1393          279          812          302
-------------------------------------------------------------------------------
 Rust                  109        33197        30074          549         2574
 |- Markdown            55          627           13          581           33
 (Total)                          33824        30087         1130         2607
===============================================================================
 Total                 181        36210        31727         1388         3095
===============================================================================
tokenizer.add_special_tokens(&[AddedToken::from(tokens[bos as usize].clone(), true)]);
tokenizer.add_special_tokens(&[AddedToken::from(tokens[eos as usize].clone(), true)]);
Just curious: BPE has support for setting `unk` in its builder variant; is it not relevant here for some reason? (I know very little about these things)
@polarathene the GGUF file I am testing with (QuantFactory/Meta-Llama-3-8B-Instruct-GGUF) does not have an `unk` token in the metadata, so I left it out here.
Sure, but what about when they do? I assume that's possible, since the tokenizer builder for BPE does support setting `unk`. There is no check for this, so if a model did define one it would just be silently ignored, introducing a bug?
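For what it's worth, here is a minimal sketch of the kind of check I mean, assuming the `tokenizers` crate's `BpeBuilder` and an optional `unk` id read from the GGUF metadata (`tokenizer.ggml.unknown_token_id`); the function shape is hypothetical, not the PR's actual code:

```rust
use std::collections::HashMap;

use tokenizers::models::bpe::{BpeBuilder, BPE};

// Hypothetical sketch: set the unknown token on the BPE builder only when
// the GGUF metadata actually provides `tokenizer.ggml.unknown_token_id`,
// instead of silently dropping it.
fn build_bpe(
    vocab: HashMap<String, u32>,
    merges: Vec<(String, String)>,
    tokens: &[String],
    unk: Option<u32>, // assumed to come from the GGUF metadata, if present
) -> BPE {
    let mut builder: BpeBuilder = BPE::builder().vocab_and_merges(vocab, merges);
    if let Some(unk) = unk {
        builder = builder.unk_token(tokens[unk as usize].clone());
    }
    builder.build().expect("failed to build BPE model")
}
```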
.map(|merge| {
    let split: (&str, &str) = merge
        .splitn(2, ' ')
        .collect_tuple()
        .expect("Failed to convert split into 2-tuple");
    (split.0.to_string(), split.1.to_string())
This splits a string like "I like rust" into ("I", "like rust")? I haven't seen any examples where you'd have a space in the input, but I assume this is referencing an existing implementation somewhere already 😅 (not doubting your work, just curious). I came across this article that mentions white-space splitting at the end, noting it is not suitable for languages like Chinese.
No problem, happy to explain. When you have a `merges` field, in all the examples I have looked at, the two halves of each merge are encoded together in a single string, separated by a space; a space is a safe delimiter because the BPE tokenizer itself inserts spaces, so there is no space token. The key point is that each element of the table is a merge pair. This code would split the merge representation "he llo" into the pair ("he", "llo"). I looked at the article you linked, and it also mentions the pair construction of merges.
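To make that concrete, here is a small self-contained sketch (the helper name is hypothetical; standard library only) showing that `splitn(2, ' ')` splits at the first space only:

```rust
// Hypothetical standalone helper mirroring the splitting logic above:
// `splitn(2, ' ')` yields at most two pieces, split at the first space.
fn split_merge(merge: &str) -> (String, String) {
    let mut parts = merge.splitn(2, ' ');
    let left = parts.next().expect("empty merge entry");
    let right = parts.next().expect("merge entry has no space");
    (left.to_string(), right.to_string())
}

fn main() {
    // A merge entry is a single pair, e.g. "he llo" -> ("he", "llo").
    assert_eq!(split_merge("he llo"), ("he".to_string(), "llo".to_string()));
    // With more than one space, everything after the first space stays in
    // the right half, which is the case the question asked about.
    assert_eq!(
        split_merge("I like rust"),
        ("I".to_string(), "like rust".to_string())
    );
}
```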
Co-authored-by: Brennan Kinney <5098581+polarathene@users.noreply.github.com>
* Bump version to 0.1.17
* Fix version bump
* Add readmes
* Fix typos
* ISQ support for phi3v
* Document it
* Use new slice_assign
* Fix dead image links
* Fix kv head usage
* Fix rope weights
* Clippy