
Add correct batched handling for apply_chat_template #29222

Merged — 28 commits merged into main from batched_apply_chat_template on Mar 20, 2024

Conversation

Rocketknight1
Member

@Rocketknight1 Rocketknight1 commented Feb 22, 2024

apply_chat_template has had a few issues since it was written. Firstly, by default it returns the naked input_ids rather than a dict, and secondly it doesn't support rendering a batch of chats simultaneously. This PR makes a few changes:

  • Batched chats are now supported, and we sniff the input to figure out what the user is passing
  • return_dict now defaults to None. For now, we interpret this as False to maintain backward compatibility, but this PR adds a warning that the default behaviour will be changing to True to match other tokenizer methods.
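The "input sniffing" mentioned in the first bullet can be sketched as follows. This is a minimal illustration of the idea, not the PR's actual implementation: a single chat is a list of message dicts, while a batch is a list of such lists, so checking the type of the first element is enough to tell them apart.

```python
def is_batched(conversation):
    """Guess whether the input is a single chat or a batch of chats.

    Simplified sketch: a single chat is a list of message dicts; a batch
    is a list of such lists.
    """
    return bool(conversation) and isinstance(conversation[0], (list, tuple))


single_chat = [{"role": "user", "content": "Hi there!"}]
chat_batch = [single_chat, [{"role": "user", "content": "Hello!"}]]

print(is_batched(single_chat))  # False
print(is_batched(chat_batch))   # True
```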

cc @siddk @lewtun who have both requested this or something like it!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@Rocketknight1 Rocketknight1 marked this pull request as ready for review February 22, 2024 19:37
@Rocketknight1 Rocketknight1 force-pushed the batched_apply_chat_template branch from 684008a to ed7136f Compare February 23, 2024 14:22
@Rocketknight1
Member Author

Should be ready for review now! cc @ArthurZucker

Collaborator

@ArthurZucker ArthurZucker left a comment


LGTM !

Comment on lines 1733 to 1753
"In version 4.40, `return_dict` will be set to `True` by default. "
"Please explicitly set `return_dict` to `False` to maintain the current behaviour, "
"or set it to `True` to get the new behaviour immediately."
)
Collaborator


would be nice to explain why this should be set to True for example? I have no idea

Member Author


I changed my mind about this and removed the warning to make this a simpler PR!

)

if not batched:
rendered = rendered[0]
Collaborator


should we not always return a batched output? (breaking but we can warn)

Member Author


I'm not sure - other tokenizer methods don't auto-batch a single input, right? (And sorry for taking so long to reply here!)
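The pattern under discussion (render everything as a batch internally, then unwrap when the caller passed a single chat) can be sketched as below. The rendering step is a hypothetical stand-in, not the real Jinja-based template rendering, but it mirrors the `if not batched: rendered = rendered[0]` line from the diff above.

```python
def apply_template(conversation):
    """Hypothetical sketch of batch-then-unwrap handling for chat templates."""
    batched = bool(conversation) and isinstance(conversation[0], (list, tuple))
    if not batched:
        conversation = [conversation]  # promote a single chat to a batch of one
    # Stand-in rendering: join messages as "role: content" lines per chat
    rendered = [
        "\n".join(f"{m['role']}: {m['content']}" for m in chat)
        for chat in conversation
    ]
    if not batched:
        rendered = rendered[0]  # single chat in, single string out
    return rendered


print(apply_template([{"role": "user", "content": "hi"}]))    # "user: hi"
print(apply_template([[{"role": "user", "content": "hi"}]]))  # ["user: hi"]
```

This keeps the non-batched return type unchanged (a single string rather than a one-element list), which is the backward-compatibility concern raised in this thread.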

@Rocketknight1 Rocketknight1 force-pushed the batched_apply_chat_template branch from da681cb to cb0bbb6 Compare March 12, 2024 13:42
@Rocketknight1
Member Author

This should be ready for re-review now, cc @amyeroberts @ArthurZucker! I simplified the PR by removing the deprecation warning; I'm not sure we want to move to return_dict=True that quickly anyway. As a result, this shouldn't cause any behaviour changes now — it only adds new functionality.

Collaborator

@amyeroberts amyeroberts left a comment


Thanks for adding this!

Two things before merge:

  • Question about the default value of return_dict
  • Let's wait for @ArthurZucker to get back to confirm desired batching behaviour

@@ -1730,18 +1730,24 @@ def apply_chat_template(
- `'pt'`: Return PyTorch `torch.Tensor` objects.
- `'np'`: Return NumPy `np.ndarray` objects.
- `'jax'`: Return JAX `jnp.ndarray` objects.
return_dict (`bool`, *optional*, defaults to `False`):
return_dict (`bool`, *optional*):
Collaborator


Why change the default to None here? AFAICT, this doesn't change things. It gets set to False if tokenize is True, but it's only used in truth checks on L1763 and L1773 (which shouldn't really do this if the value can be None anyway), and False and None will have the same result there.

Member Author


Fixed! You're right - this is a leftover from when I was planning to slowly make return_dict=True the default.

Rocketknight1 and others added 2 commits March 13, 2024 17:56
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
Rocketknight1 and others added 8 commits March 13, 2024 17:57
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
# Conflicts:
#	src/transformers/tokenization_utils_base.py
@Rocketknight1
Member Author

Merging this now that the branch cut has passed!

@Rocketknight1 Rocketknight1 merged commit 9d99948 into main Mar 20, 2024
21 checks passed
@Rocketknight1 Rocketknight1 deleted the batched_apply_chat_template branch March 20, 2024 15:50