add_insertion_noise in DenoisingDataset does not respect max_source_positions #2297

Open
dkavaler opened this issue Jul 3, 2020 · 1 comment

dkavaler commented Jul 3, 2020

🐛 Bug

add_insertion_noise in DenoisingDataset does not respect --max-source-positions.

This can become an issue when specifying, e.g., --mask 0.3 --mask-length span-poisson --poisson-lambda 3.5 for the denoising task.

In the call to add_whole_word_mask, if the input source tokens are already at the maximum size allowed by the model, the resulting call to add_insertion_noise can yield a result that is longer than --max-source-positions.
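The length growth can be illustrated with a simplified, pure-Python sketch. This is not fairseq's implementation: `insert_noise` and `MASK` are hypothetical names, and the sketch assumes that insertion noise adds roughly `ceil(len(tokens) * p)` tokens at interior positions without removing any existing ones.

```python
import math
import random

MASK = "<mask>"  # hypothetical placeholder for the inserted noise token


def insert_noise(tokens, p, rng=None):
    """Insert ceil(len(tokens) * p) MASK tokens at random interior positions.

    Assumes `tokens` has at least two elements (e.g. BOS and EOS), so that
    interior slots exist. The output is always len(tokens) + n long.
    """
    rng = rng or random.Random()
    n = int(math.ceil(len(tokens) * p))
    if n == 0:
        return list(tokens)
    total = len(tokens) + n
    # Pick n slots for the inserted tokens, never the first or last position.
    noise_slots = set(rng.sample(range(1, total - 1), n))
    original = iter(tokens)
    return [MASK if i in noise_slots else next(original) for i in range(total)]


max_source_positions = 16
# A source sequence that already fills the whole budget.
source = ["<s>"] + [f"w{i}" for i in range(max_source_positions - 2)] + ["</s>"]
noisy = insert_noise(source, p=0.125, rng=random.Random(0))
print(len(source), len(noisy))  # prints "16 18"
```

Under these assumptions, a source that is already `max_source_positions` long always comes back longer than the limit whenever at least one token is inserted.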

To Reproduce

1. Prepare a dataset whose inputs have a length equal to --max-source-positions.

2. Run a denoising task with --mask 0.3 --mask-length span-poisson --poisson-lambda 3.5.

3. Eventually, the dataset returns a batch whose token count exceeds --max-source-positions.

Expected behavior

The call to add_whole_word_mask (or really, add_insertion_noise) should not return a token sequence longer than --max-source-positions.
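One possible way to get this behavior is to cap the number of inserted tokens by the room left under the limit. The following is only a sketch of that idea, not a patch against fairseq: `capped_num_inserts` is a hypothetical helper, and it assumes the insertion count is derived as `ceil(num_tokens * p)`.

```python
import math


def capped_num_inserts(num_tokens, p, max_source_positions):
    """Return how many tokens may be inserted without exceeding the limit.

    Hypothetical helper: `requested` is the count insertion noise would
    normally add; `room` is the space left under max_source_positions.
    """
    requested = int(math.ceil(num_tokens * p))
    room = max(0, max_source_positions - num_tokens)
    return min(requested, room)


# A full-length input leaves no room, so nothing is inserted.
print(capped_num_inserts(512, 0.25, 512))  # prints 0
# A nearly full input gets only as many insertions as fit.
print(capped_num_inserts(500, 0.25, 512))  # prints 12
# A short input is unaffected by the cap.
print(capped_num_inserts(100, 0.25, 512))  # prints 25
```

Capping changes the effective noise ratio for inputs near the limit. An alternative would be to truncate the source before adding noise, which keeps the ratio but drops original tokens.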

Environment

  • fairseq Version (e.g., 1.0 or master): master
  • PyTorch Version (e.g., 1.0): 1.5
  • OS (e.g., Linux): Linux
  • How you installed fairseq (pip, source): pip install -e .

lematt1991 (Contributor) commented

CC @myleott @ngoyal2707

facebook-github-bot pushed a commit that referenced this issue Sep 20, 2021
…change (#2297)

Summary:
# Before submitting

- [ ] Was this discussed/approved via a Github issue? (no need for typos, doc improvements)
- [ ] Did you read the [contributor guideline](https://github.com/pytorch/fairseq/blob/master/CONTRIBUTING.md)?
- [ ] Did you make sure to update the docs?
- [ ] Did you write any new necessary tests?

## What does this PR do?
Fixes # (issue).

## PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

## Did you have fun?
Make sure you had fun coding.

Pull Request resolved: fairinternal/fairseq-py#2297

Reviewed By: alexeib

Differential Revision: D30906090

Pulled By: dianaml0

fbshipit-source-id: 941d30db7f766c9077a1b5bb2a04680f57e2e070
sorenmulli pushed a commit to sorenmulli/fairseq that referenced this issue Oct 4, 2021
…change (facebookresearch#2297)
