Customise the separator used for splicing in DataCollatorWithFlattening #33114

beep-bebop · 2024-08-26T04:42:57Z

What does this PR do?

#31629 added DataCollatorWithFlattening, which packs the examples of a mini-batch into one long sequence, uses -100 as the separator between the spliced samples, and returns position ids for the attention calculation.
Different models may use different token ids to separate samples during training. For example, when using the Qwen model for post pre-training, short samples can be packed into long samples to speed up training and reduce memory usage, separated by <|endoftext|>, whose token id is 151643. Allowing the user to customise the separator is therefore a more flexible implementation: it lets this data collator be used when building pre-training datasets for different models.
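
To illustrate the idea, here is a simplified pure-Python sketch of the packing described above. It is not the library's code: the function name `flatten_examples` is hypothetical, the parameter name `separator_id` is chosen for illustration, and it assumes the separator overwrites the first label of each example, with position ids restarting at 0 for every sample.

```python
from typing import Dict, List


def flatten_examples(
    features: List[Dict[str, List[int]]],
    separator_id: int = -100,  # assumed default, matching the -100 mentioned above
) -> Dict[str, List[int]]:
    """Pack several tokenized examples into one long sequence (illustrative only)."""
    packed: Dict[str, List[int]] = {"input_ids": [], "labels": [], "position_ids": []}
    for example in features:
        input_ids = example["input_ids"]
        # Fall back to the input ids when no labels are provided.
        labels = example.get("labels", input_ids)
        packed["input_ids"] += input_ids
        # Assumption: the first label of each example is replaced by the separator.
        packed["labels"] += [separator_id] + labels[1:]
        # Position ids restart at 0 so each sample can be told apart in attention.
        packed["position_ids"] += list(range(len(input_ids)))
    return packed


batch = [{"input_ids": [11, 12, 13]}, {"input_ids": [21, 22]}]

default_packed = flatten_examples(batch)
# Using the Qwen <|endoftext|> token id from the description as the separator.
qwen_packed = flatten_examples(batch, separator_id=151643)
```

With the default, the separator positions in `labels` hold -100; passing 151643 places the `<|endoftext|>` id at those positions instead, while `input_ids` and `position_ids` are unchanged.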

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

Models:

text models: @ArthurZucker

ArthurZucker

LGTM, indeed this makes sense.
Can you just update the documentation of this data collator please!

beep-bebop · 2024-08-28T02:18:17Z

LGTM, indeed this makes sense. Can you just update the documentation of this data collator please!

Updated! Feel free to edit if needed :) @ArthurZucker

ArthurZucker · 2024-08-28T13:22:10Z

Thanks 🤗

HuggingFaceDocBuilderDev · 2024-08-28T13:41:51Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

…ng (huggingface#33114)
* Customising the separator used for splicing in DataCollatorWithFlattening
* update DataCollatorWithFlattening docs
Co-authored-by: weifangyuan <i.weifangyuan@yuewen.com>

ArthurZucker approved these changes Aug 27, 2024

weifangyuan added 2 commits August 28, 2024 10:01

Customising the separator used for splicing in DataCollatorWithFlattening (32fe1af)

update DataCollatorWithFlattening docs (0a6539a)

beep-bebop force-pushed the custom-separator branch from 07c6db0 to 0a6539a Compare August 28, 2024 02:03

ArthurZucker merged commit 5c84682 into huggingface:main Aug 28, 2024
20 checks passed
