
token healing impl #29081

Closed · wants to merge 9 commits into from

Conversation

@ahmed-moubtahij (Contributor) commented Feb 18, 2024

What does this PR do?

Token healing rectifies the token boundary bias in greedy tokenization. It does this by trimming and regrowing the prompt to better align with the model's tokenizer, thus enhancing generation quality. The improvement is clearest with completion models.

Token boundary bias is a silent performance killer that doesn't seem to be very well known. It has a clear impact on completion quality.

A more thorough explanation of the problem: The Art of Prompt Design: Prompt Boundaries and Token Healing | by Scott Lundberg.

Motivation

Given a completion prompt with a partial URL ending with :, the model might have seen the expected completion :// as a single token in training. However, the prompt's tail token : tells it that the next token is not //, so it generates a wrong completion. Such errors compound in auto-regressive language models.
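
As a quick illustration (not part of the PR; just a sketch using GPT-2, and the exact token splits vary by tokenizer), the partial prompt ends on a different tail token than the one the model would have seen if the text had continued, which is what token healing trims and regrows:

  # Sketch: how a partial prompt changes the tail token (exact splits depend on the tokenizer).
  from transformers import AutoTokenizer

  tok = AutoTokenizer.from_pretrained("gpt2")

  print(tok.tokenize("The link is https://"))  # the trailing "://" typically ends up as one token
  print(tok.tokenize("The link is https:"))    # the trailing ":" is a different token, biasing
                                               # the model against generating "//" next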

Fixes #28346

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@gante (Member) commented Feb 19, 2024

CI is failing due to an automatic update in the pytest package; we are tracking it. Will let you know when it is sorted -- it will need a rebase

@ahmed-moubtahij (Contributor, Author) commented:

> CI is failing due to an automatic update in the pytest package, we are tracking it. Will let you know when it is sorted -- it will need a rebase

Thanks for the follow-up!

@gante (Member) commented Feb 19, 2024

@Ayenem main is fixed; rebasing should make CI green unless there are PR-specific issues :)

@ahmed-moubtahij (Contributor, Author) commented Feb 20, 2024

In case it's relevant, here are (some) listed remotes with git branch -r:

  origin/HEAD -> origin/main
  origin/heal_tokens
  origin/main
  origin/token_healing
  upstream/'delete-delete-doc'
  upstream/BritneyMuller-housekeeping-patch
  upstream/_dummy_fix_weight_only_usage
  upstream/_dummy_fix_weight_only_usage_2
  upstream/add-chat-glm
  upstream/add-deci-lm
  upstream/add-encode-special-tokens
  upstream/add-flash-decoding
  upstream/add-mamba
  upstream/add-prefix-space
  upstream/add-quantization-workflow

@gante (Member) commented Feb 26, 2024

(@Ayenem we're trying to fix the merge conflicts for you, and we're experimenting with a few GH permissions on our side. You may see a few test commits 🤗 )

@gante (Member) commented Feb 28, 2024

Now rebased after #29320 was merged, which addresses what was causing the last set of errors seen here. If everything went well, we should see a green CI here 🤞

@gante requested a review from ArthurZucker, February 28, 2024 12:54
@gante (Member) commented Feb 28, 2024

@Ayenem FYI, I've reverted the tokenizer input to your original suggestion (tokenizer passed to generate), after a discussion I had with @Rocketknight1. That way, the input is standardized and matches another incoming PR (#28932) 🤗
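
For illustration, usage would look roughly like the sketch below. The argument name for enabling the feature (assumed here to be token_healing) is an assumption and may differ from what the PR finally exposes; what is grounded in this discussion is that the tokenizer is passed directly to generate:

  # Sketch of the intended call pattern: the tokenizer is passed to generate() by the caller.
  from transformers import AutoModelForCausalLM, AutoTokenizer

  tokenizer = AutoTokenizer.from_pretrained("gpt2")
  model = AutoModelForCausalLM.from_pretrained("gpt2")

  inputs = tokenizer("The link is https:", return_tensors="pt")
  out = model.generate(**inputs, tokenizer=tokenizer, token_healing=True, max_new_tokens=8)
  print(tokenizer.decode(out[0], skip_special_tokens=True))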

@ahmed-moubtahij (Contributor, Author) commented:

> @Ayenem FYI, I've reverted the tokenizer input to your original suggestion (tokenizer passed to generate), after a discussion I had with @Rocketknight1. That way, the input is standardized and matches another incoming PR (#28932) 🤗

It does feel better to offload the tokenizer choice and loading to the caller. Thanks again for following up on this 🙏

@ahmed-moubtahij (Contributor, Author) commented:

CI is green! It was possible :')

@gante (Member) commented Mar 5, 2024

ping @ArthurZucker :)

@ArthurZucker (Collaborator) commented:

Sorry for the late review on it!

@ArthurZucker (Collaborator) left a review comment:

Left a few nits: mostly, safely import and protect the function, as the new dependency is (or should be) optional. Potentially use our own trie?

@@ -22,6 +22,7 @@
 import torch
 import torch.distributed as dist
+from pygtrie import CharTrie
@ArthurZucker (Collaborator) commented on this diff:
If this is an optional dependency we need to protect the import.
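
For reference, a module-level guard could look roughly like the sketch below; is_pygtrie_available is a hypothetical helper here, mirroring the existing is_*_available utilities, and this is not the PR's code:

  # Sketch: guard the optional import so transformers still imports without pygtrie installed.
  from ..utils import is_pygtrie_available  # hypothetical helper, analogous to is_torch_available

  if is_pygtrie_available():
      from pygtrie import CharTrie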

Comment on lines +1822 to +1823
"""
if tokenizer is None:
@ArthurZucker (Collaborator) commented:
Suggested change
-"""
-if tokenizer is None:
+"""
+requires_backends(self, ["pygtrie"])

we also need to make sure this function errors out correctly if used
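
A sketch of what that call-time check could look like, assuming pygtrie were registered in transformers' backend mapping so that requires_backends can raise a helpful ImportError (the registration itself is hypothetical, and this is not the PR's final code):

  # Sketch: fail fast with an install hint when heal_tokens is called without pygtrie installed.
  # requires_backends comes from transformers' import utilities.
  def heal_tokens(self, input_ids, tokenizer=None):
      requires_backends(self, ["pygtrie"])  # raises an ImportError explaining how to install pygtrie
      if tokenizer is None:
          raise ValueError(...)  # existing check: a tokenizer must be passed to `generate`
      ...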

"argument of `generate`."
)
bos_id, pad_id = tokenizer.bos_token_id, tokenizer.pad_token_id
vocab_trie = CharTrie(tokenizer.get_vocab())
@ArthurZucker (Collaborator) commented:
BTW we have https://github.com/huggingface/transformers/blob/main/src/transformers/tokenization_utils.py#L52

which could be used for this? It would remove the dependency (it might be additional work as well).
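
As a rough sketch of what dropping the dependency could involve: the CharTrie is only used for prefix lookups over the vocabulary, so one dependency-free (if less efficient) alternative is a linear scan; adapting the internal Trie from tokenization_utils.py would likely need a small prefix-lookup extension, which is the "additional work" mentioned above.

  # Dependency-free prefix lookup over the vocab (a sketch, not the PR's implementation):
  # returns the ids of all vocabulary tokens that extend a given prefix string.
  def extensions_of(vocab: dict, prefix: str) -> list:
      return [idx for tok, idx in vocab.items() if tok.startswith(prefix)]

  # e.g. extensions_of(tokenizer.get_vocab(), ":") would include the id of "://" if it is in the vocab.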

input_ids = torch.where(input_ids == bos_id, pad_id, input_ids)

tail_ids = input_ids[:, -1].tolist()
space_tok = tokenizer.tokenize(" ")[0]
@ArthurZucker (Collaborator) commented:
Not 100% sure this will always do what you want; specifically, for tokenizers that add a prefix space you could get [▁▁]
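
A small illustration of that caveat (a sketch; the exact behavior depends on the checkpoint, and meta-llama/Llama-2-7b-hf is gated, but any SentencePiece tokenizer that prepends a prefix space behaves similarly):

  # Sketch: SentencePiece tokenizers with a prefix space may not return a lone "▁" here.
  from transformers import AutoTokenizer

  sp_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
  print(sp_tok.tokenize(" "))  # may print ['▁▁'] instead of ['▁'], so tokenize(" ")[0]
                               # is not necessarily the single-space token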

@LeonardoEmili (Contributor) commented:
Hi @Ayenem, thanks for this feature. I was curious to try it early and see how it works on my domain data, but I ran into issues during generation with the example data provided (stack trace attached below). Could you share an example script showing how to test it?

Traceback (most recent call last):
  File "token_heal.py", line 33, in <module>
    output = model.generate(
  File "/home/leonardo/.local/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/leonardo/projects/transformers/src/transformers/generation/utils.py", line 1439, in generate
    input_ids = self.heal_tokens(input_ids, tokenizer)
  File "/home/leonardo/projects/transformers/src/transformers/generation/utils.py", line 1861, in heal_tokens
    seq_bias[(tail_id,)] += 1.0
KeyError: (518,)

Environment used:

  • transformers: I'm checked out at the head of your fork, this specific commit
  • pygtrie version: 2.5.0
  • Python version: 3.8.10
  • Model: AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
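
For what it's worth, the KeyError in the trace above suggests the tail token id is not always among the keys of seq_bias; one possible guard (a sketch, not necessarily the PR's eventual fix) is to only boost the tail token when it is present:

  # Sketch: avoid the KeyError by only boosting the tail token if it is a key of seq_bias.
  if (tail_id,) in seq_bias:
      seq_bias[(tail_id,)] += 1.0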

@ahmed-moubtahij deleted the heal_tokens branch March 28, 2024 21:24
@ahmed-moubtahij mentioned this pull request Apr 6, 2024
Successfully merging this pull request may close these issues:

  Token healing (under 40 LOC)