add_insertion_noise in DenoisingDataset does not respect max_source_positions #2297

dkavaler · 2020-07-03T00:02:25Z

🐛 Bug

add_insertion_noise in DenoisingDataset does not respect --max-source-positions.

This can become an issue when specifying, e.g., --mask 0.3 --mask-length span-poisson --poisson-lambda 3.5 for the denoising task.

When add_whole_word_mask is called on source tokens that are already at the maximum length allowed by the model, its internal call to add_insertion_noise can return a sequence that is longer than --max-source-positions.
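The length growth can be illustrated with a minimal pure-Python sketch. It assumes, as a simplification, that insertion noise adds ceil(len(tokens) * p) extra tokens at interior positions. The function `insert_noise` and the `MASK` placeholder are hypothetical stand-ins for illustration, not fairseq's actual code.

```python
import math
import random

MASK = -1  # hypothetical placeholder id for an inserted token


def insert_noise(tokens, p, rng=None):
    """Toy stand-in for insertion noise: adds ceil(len(tokens) * p) tokens.

    Assumes len(tokens) >= 2 so that there is an interior to insert into.
    """
    rng = rng or random.Random(0)
    n = math.ceil(len(tokens) * p)
    out = list(tokens)
    for _ in range(n):
        # Insert after the first token and before the last one,
        # so the first and last tokens keep their positions.
        pos = rng.randint(1, len(out) - 1)
        out.insert(pos, MASK)
    return out


max_source_positions = 8
source = list(range(max_source_positions))  # already at the limit
noised = insert_noise(source, p=0.25)
print(len(source), len(noised))  # 8 10
```

Because the input is already at the limit, any nonzero number of insertions pushes the result past it.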

To Reproduce

Use a dataset whose inputs have a length equal to --max-source-positions.

Run a denoising task with --mask 0.3 --mask-length span-poisson --poisson-lambda 3.5.

Eventually, the dataset will return a sample whose token count is greater than --max-source-positions.

Expected behavior

The call to add_whole_word_mask (or really, add_insertion_noise) should not return a set of tokens with a length longer than --max-source-positions.
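One way this expectation could be met is to cap the number of inserted tokens by the room left under the limit. The sketch below is only an assumption about a possible approach, not the fix adopted in fairseq; `cap_inserts` and its parameters are hypothetical names.

```python
def cap_inserts(num_tokens, requested_inserts, max_positions):
    """Return how many tokens may be inserted without exceeding max_positions."""
    # Room left under the limit; never negative, even for over-long inputs.
    room = max(0, max_positions - num_tokens)
    return min(requested_inserts, room)


# An input already at the limit gets no insertions at all.
at_limit = cap_inserts(num_tokens=512, requested_inserts=3, max_positions=512)

# An input two tokens short of the limit gets at most two insertions.
near_limit = cap_inserts(num_tokens=510, requested_inserts=3, max_positions=512)

print(at_limit, near_limit)  # 0 2
```

An alternative would be to truncate the sequence after noising, but that would drop real source tokens instead of skipping inserted ones.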

Environment

fairseq Version (e.g., 1.0 or master): master
PyTorch Version (e.g., 1.0): 1.5
OS (e.g., Linux): Linux
How you installed fairseq (pip, source): pip install -e .


lematt1991 · 2020-07-06T13:09:49Z

CC @myleott @ngoyal2707


dkavaler added bug needs triage labels Jul 3, 2020

lematt1991 removed the needs triage label Jul 6, 2020

myleott assigned ngoyal2707 Jul 16, 2020
