
Token healing (under 40 LOC) #28346

Closed · ahmed-moubtahij opened this issue Jan 4, 2024 · 5 comments · Fixed by #30081
Labels
Feature request Request for a new feature

Comments

ahmed-moubtahij (Contributor) commented Jan 4, 2024

Feature request

Token healing rectifies the token boundary bias in greedy tokenization. It does this by trimming and regrowing the prompt to better align with the model's tokenizer, thus enhancing generation quality. The improvement is clearest with completion models.

Token boundary bias is a silent performance killer that doesn't seem to be very well known. It has a clear impact on completion quality, though I'm not sure where it would fit as a transformers feature.

A more thorough explanation of the problem: The Art of Prompt Design: Prompt Boundaries and Token Healing | by Scott Lundberg.

Motivation

Given a completion prompt ending in a partial URL that stops at a colon, the model may have seen the expected completion :// as a single token during training. However, the prompt's tail token : tells it that the next token is unlikely to be //, so it generates a wrong completion. Such errors compound in auto-regressive language models.
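The boundary bias can be illustrated with a toy greedy tokenizer over a hypothetical vocabulary (this is a sketch, not the actual transformers tokenizer): the full URL contains :// as one token, but a prompt cut off after the colon ends with a bare :, a boundary the model rarely saw followed by // in training.

```python
# Toy greedy longest-match tokenizer over a hypothetical vocabulary,
# illustrating token boundary bias; not the real transformers tokenizer.
VOCAB = {"http", ":", "/", "//", "://", "example", ".", "com"}

def tokenize(text):
    """Greedily match the longest vocabulary entry at each position."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest match first
            if text[i:j] in VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"untokenizable character at {i}: {text[i]!r}")
    return tokens

# Training data contains full URLs, so '://' is learned as ONE token:
print(tokenize("http://example.com"))  # ['http', '://', 'example', '.', 'com']

# A prompt cut off after the colon ends with ':' instead, a token the
# model has rarely seen followed by '//':
print(tokenize("http:"))               # ['http', ':']
```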

Your contribution

My implementation (under 40 LOC): https://github.com/Ayenem/TokenHealer/tree/main

ahmed-moubtahij changed the title from "Token healing" to "Token healing (under 40 LOC)" on Jan 4, 2024
ArthurZucker added the "Feature request" label on Jan 5, 2024
ArthurZucker (Collaborator) commented

FYI @gante

gante (Member) commented Jan 10, 2024

@Ayenem that's a really cool use of sequence_bias! I had read about token healing in the past, and never thought about it when I added sequence_bias.

Would you be up for adding it to transformers? I think we could include it inside generate (e.g. here), passing a simple token_healing=True flag. What do you think?
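The idea of riding on sequence_bias can be sketched in plain Python (the helper name heal_prompt and the toy vocabulary are hypothetical, not the transformers API): trim the prompt's tail token, then bias the first generated token toward vocabulary entries that extend the trimmed text.

```python
# Hypothetical sketch of token healing on top of a sequence_bias-style
# mechanism; heal_prompt and VOCAB are illustrative, not the real API.
VOCAB = {"http": 0, ":": 1, "//": 2, "://": 3, "example": 4}

def heal_prompt(tokens):
    """Trim the prompt's tail token and return (shortened prompt, bias dict).

    The bias dict maps single-token sequences to a large positive bias,
    mirroring transformers' sequence_bias format ({(token_id,): float}),
    so generation regrows the tail with any token that extends it.
    """
    tail = tokens[-1]
    candidates = {tok: tid for tok, tid in VOCAB.items()
                  if tok.startswith(tail)}  # includes the tail itself
    bias = {(tid,): 10.0 for tid in candidates.values()}
    return tokens[:-1], bias

prompt = ["http", ":"]            # greedy tokenization of "http:"
healed, bias = heal_prompt(prompt)
print(healed)                     # ['http'] -- the tail ':' was trimmed
print(sorted(bias))               # [(1,), (3,)] -- ids of ':' and '://'
```

With the bias in place, the model is free to regenerate either : (reproducing the original prompt) or the healthier single token ://.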

ahmed-moubtahij (Contributor, Author) commented Jan 11, 2024

I'm glad you appreciate this use of your feature! Hyrum's law at play :)

Sounds good, I'll get started on a dev branch. Thanks for the pointer!

gante (Member) commented Mar 29, 2024

Hi @Ayenem 👋

I noticed you closed your latest PR. Is there anything we can help you with? 🤗

ahmed-moubtahij (Contributor, Author) commented Mar 29, 2024

Hi @gante thanks for the follow-up!

I deleted my branch because it was becoming a mess, which automatically closed the PR. I hadn't thought about that, but it makes sense. I'll open a new PR when my changes are done.

I've been working on removing the dependency on pygtrie by adding functionality to HF's Trie. I've implemented it in my repo and it works.
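The kind of functionality being described can be sketched with a minimal trie supporting a prefix-extension lookup (the extensions method name is hypothetical, not the actual transformers Trie API): given the trimmed prompt tail, it returns every vocabulary token that starts with it.

```python
# Minimal trie with a prefix-extension lookup; a sketch of the kind of
# addition described above, not the actual transformers Trie API.
class Trie:
    def __init__(self):
        self.root = {}

    def add(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = ""  # sentinel marking the end of a complete token

    def extensions(self, prefix):
        """Return every stored token that starts with `prefix`."""
        node = self.root
        for ch in prefix:
            if ch not in node:
                return []
            node = node[ch]
        out, stack = [], [(node, prefix)]
        while stack:
            cur, acc = stack.pop()
            for ch, child in cur.items():
                if ch == "":
                    out.append(acc)       # hit an end-of-token sentinel
                else:
                    stack.append((child, acc + ch))
        return out

trie = Trie()
for tok in [":", "://", ":=", "//"]:
    trie.add(tok)
print(sorted(trie.extensions(":")))  # [':', '://', ':=']
```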

I added my changes to tokenization_utils::Trie and my Trie tests to test_tokenization_utils::TrieTest, but running the latter throws AttributeError on my added Trie methods, as if test_tokenization_utils.py doesn't see them.

I commented out the entire Trie class in tokenization_utils.py and test_tokenization_utils still ran successfully, which is weird. I don't know whether test_tokenization_utils imported Trie from some cache, or whether some config or pathing was in play, but it passed with the entire Trie class commented out.

I'll retry from scratch with a new PR 😅

EDIT:
I made a new fork. Same issue:
[Screenshot — Left: TrieTest in test_tokenization_utils.py; Middle: Trie class in tokenization_utils.py, commented out; Right: tests passing for test_tokenization_utils.py even though the Trie class shouldn't be accessible.]

So whenever I add new tests to TrieTest, it complains about AttributeError in the Trie class even when the attributes are actually there.

EDIT 2:
I think I found the issue... test_tokenization_utils.py is probably importing Trie from the transformers package installed in my environment instead of the Trie from the source checkout. Investigating...

EDIT 3:
Problem solved, re-opened in #30081
