
Token healing (under 40 LOC) #28346

Closed · ahmed-moubtahij opened this issue Jan 4, 2024 · 5 comments · Fixed by #30081
Labels
Feature request Request for a new feature

Comments

ahmed-moubtahij (Contributor) commented Jan 4, 2024

Feature request

Token healing rectifies the token boundary bias in greedy tokenization. It does this by trimming and regrowing the prompt to better align with the model's tokenizer, thus enhancing generation quality. The improvement is clearest with completion models.

Token boundary bias is a silent performance killer that doesn't seem to be very well known. It has a clear impact on completion quality, though I'm not sure where it would fit as a transformers feature.

A more thorough explanation of the problem: The Art of Prompt Design: Prompt Boundaries and Token Healing | by Scott Lundberg.

Motivation

Given a completion prompt ending in a partial URL that stops at a colon, the model may have seen the expected completion :// as a single token during training. However, the prompt's tail token : tells it that the next token is unlikely to be //, so it generates a wrong completion. Such errors compound in auto-regressive language models.
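The boundary bias can be illustrated with a toy greedy tokenizer over a hypothetical vocabulary (this is a sketch, not the actual transformers tokenizer): the full URL contains :// as one token, but a prompt cut off after the colon ends with a bare :, a boundary the model rarely saw followed by // in training.

```python
# Toy greedy longest-match tokenizer over a hypothetical vocabulary,
# illustrating token boundary bias; not the real transformers tokenizer.
VOCAB = {"http", ":", "/", "//", "://", "example", ".", "com"}

def tokenize(text):
    """Greedily match the longest vocabulary entry at each position."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest match first
            if text[i:j] in VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"untokenizable character at {i}: {text[i]!r}")
    return tokens

# Training data contains full URLs, so '://' is learned as ONE token:
print(tokenize("http://example.com"))  # ['http', '://', 'example', '.', 'com']

# A prompt cut off after the colon ends with ':' instead, a token the
# model has rarely seen followed by '//':
print(tokenize("http:"))               # ['http', ':']
```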

Your contribution

My implementation (under 40 LOC): https://github.com/Ayenem/TokenHealer/tree/main

ahmed-moubtahij changed the title from "Token healing" to "Token healing (under 40 LOC)" on Jan 4, 2024
ArthurZucker added the "Feature request" label on Jan 5, 2024
ArthurZucker (Collaborator) commented

FYI @gante

gante (Member) commented Jan 10, 2024

@Ayenem that's a really cool use of sequence_bias! I had read about token healing in the past, and never thought about it when I added sequence_bias.

Would you be up for adding it to transformers? I think we could include it inside generate (e.g. here), passing a simple token_healing=True flag. What do you think?
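The idea of riding on sequence_bias can be sketched in plain Python (the helper name heal_prompt and the toy vocabulary are hypothetical, not the transformers API): trim the prompt's tail token, then bias the first generated token toward vocabulary entries that extend the trimmed text.

```python
# Hypothetical sketch of token healing on top of a sequence_bias-style
# mechanism; heal_prompt and VOCAB are illustrative, not the real API.
VOCAB = {"http": 0, ":": 1, "//": 2, "://": 3, "example": 4}

def heal_prompt(tokens):
    """Trim the prompt's tail token and return (shortened prompt, bias dict).

    The bias dict maps single-token sequences to a large positive bias,
    mirroring transformers' sequence_bias format ({(token_id,): float}),
    so generation regrows the tail with any token that extends it.
    """
    tail = tokens[-1]
    candidates = {tok: tid for tok, tid in VOCAB.items()
                  if tok.startswith(tail)}  # includes the tail itself
    bias = {(tid,): 10.0 for tid in candidates.values()}
    return tokens[:-1], bias

prompt = ["http", ":"]            # greedy tokenization of "http:"
healed, bias = heal_prompt(prompt)
print(healed)                     # ['http'] -- the tail ':' was trimmed
print(sorted(bias))               # [(1,), (3,)] -- ids of ':' and '://'
```

With the bias in place, the model is free to regenerate either : (reproducing the original prompt) or the healthier single token ://.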

ahmed-moubtahij (Contributor, Author) commented Jan 11, 2024

I'm glad you appreciate this use of your feature! Hyrum's law at play :)

Sounds good, I'll get started on a dev branch. Thanks for the pointer!

gante (Member) commented Mar 29, 2024

Hi @Ayenem 👋

I noticed you closed your latest PR. Is there anything we can help you with? 🤗

ahmed-moubtahij (Contributor, Author) commented Mar 29, 2024

Hi @gante thanks for the follow-up!

I deleted my branch because it was becoming a mess, which automatically closed the PR. I hadn't thought about that, but it makes sense. I'll open a new PR when my changes are done.

I've been working on removing the dependency on pygtrie by adding functionality to HF's Trie. I've implemented it in my repo and it works.
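The kind of functionality being described can be sketched with a minimal trie supporting a prefix-extension lookup (the extensions method name is hypothetical, not the actual transformers Trie API): given the trimmed prompt tail, it returns every vocabulary token that starts with it.

```python
# Minimal trie with a prefix-extension lookup; a sketch of the kind of
# addition described above, not the actual transformers Trie API.
class Trie:
    def __init__(self):
        self.root = {}

    def add(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = ""  # sentinel marking the end of a complete token

    def extensions(self, prefix):
        """Return every stored token that starts with `prefix`."""
        node = self.root
        for ch in prefix:
            if ch not in node:
                return []
            node = node[ch]
        out, stack = [], [(node, prefix)]
        while stack:
            cur, acc = stack.pop()
            for ch, child in cur.items():
                if ch == "":
                    out.append(acc)       # hit an end-of-token sentinel
                else:
                    stack.append((child, acc + ch))
        return out

trie = Trie()
for tok in [":", "://", ":=", "//"]:
    trie.add(tok)
print(sorted(trie.extensions(":")))  # [':', '://', ':=']
```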

I added my changes to tokenization_utils::Trie and my Trie tests to test_tokenization_utils::TrieTest, but running the latter throws AttributeError on my added Trie methods, as if test_tokenization_utils.py doesn't see them.

I commented out the entire Trie class in tokenization_utils.py and test_tokenization_utils still ran successfully, which is weird. I don't know whether test_tokenization_utils imported Trie from some cache, or whether some config or pathing was in play, but it passed with the entire Trie class commented out.

I'll retry from scratch with a new PR 😅

EDIT:
I made a new fork. Same issue:
[Screenshot — Left: TrieTest in test_tokenization_utils.py; Middle: Trie class in tokenization_utils.py, commented out; Right: tests passing for test_tokenization_utils.py even though the Trie class shouldn't be accessible.]

So whenever I add new tests to TrieTest, it complains about AttributeError in the Trie class even when the attributes are actually there.

EDIT 2:
I think I found the issue... test_tokenization_utils.py is probably importing Trie from the transformers package installed in my environment instead of the Trie from the source checkout. Investigating...

EDIT 3:
Problem solved, re-opened in #30081
