Moved SpecialTokens assignment after the modification to avoid "Collection Modified" error #7328
Conversation
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #7328 +/- ##
==========================================
+ Coverage 68.88% 68.89% +0.01%
==========================================
Files 1473 1473
Lines 274201 274277 +76
Branches 28419 28421 +2
==========================================
+ Hits 188881 188972 +91
+ Misses 77998 77984 -14
+ Partials 7322 7321 -1
Flags with carried forward coverage won't be shown.
@shaltielshmid thanks for submitting this! @tarekgh any other thoughts?
I haven't tested this; it's just a theory. But my fear with the lowercase option is that even though we identify the special tokens and keep them as a separate unit, they are converted to lowercase, so they won't be matched at the vocabulary lookup stage.
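A minimal sketch of the concern (the ids here are hypothetical, and this is not the library's lookup code):

```csharp
using System;
using System.Collections.Generic;

class VocabLookupSketch
{
    static void Main()
    {
        // Vocabulary files typically store the special tokens in their original casing.
        var vocab = new Dictionary<string, int> { ["[CLS]"] = 101, ["hello"] = 7592 };

        // If the special token is lowercased before the lookup, it is missed,
        // because Dictionary keys are matched case-sensitively by default.
        Console.WriteLine(vocab.ContainsKey("[CLS]")); // True
        Console.WriteLine(vocab.ContainsKey("[cls]")); // False
    }
}
```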
@shaltielshmid Thank you for catching that! Could you please add a test using the same code that reproduced the issue?
I marked this for 4.0, as it will be good to service this fix. CC @ericstj
Copilot reviewed 1 out of 1 changed files in this pull request and generated no suggestions.
This is not true. When
We don't need to add the lowercased tokens to the vocab. We don't lowercase the whole vocab, and we shouldn't do it. As I mentioned, with enabling
@tarekgh Thank you for the response! Looking into it further, there seems to be a bigger question here. In the vocabulary, the special tokens are stored in uppercase (e.g.,
When the user does not specify the SpecialTokens, the special tokens dictionary is created in the code. When this is done, the tokens are normalized before being added to the SpecialTokens dictionary, and they are also appended to the vocabulary in their lowercase form:
machinelearning/src/Microsoft.ML.Tokenizers/Model/BertTokenizer.cs, lines 715 to 726 in 81122c4
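As a rough, illustrative sketch of the behavior described (this is not the actual code at that permalink; the names and ids are assumptions):

```csharp
using System;
using System.Collections.Generic;

class SpecialTokenNormalizationSketch
{
    // Sketch: the lowercased form of a special token is added both to the
    // SpecialTokens map (so its keys end up lowercased) and to the vocab.
    static void AddSpecialToken(
        Dictionary<string, int> vocab, Dictionary<string, int> specialTokens, string token)
    {
        if (!vocab.TryGetValue(token, out int id))
        {
            return; // token missing from the vocab file; nothing to map
        }

        string lowered = token.ToLowerInvariant(); // "[CLS]" -> "[cls]"
        specialTokens[lowered] = id;               // SpecialTokens keys are lowercased
        vocab[lowered] = id;                       // lowercase alias added to the vocab
    }

    static void Main()
    {
        var vocab = new Dictionary<string, int> { ["[CLS]"] = 101 };
        var specialTokens = new Dictionary<string, int>();

        AddSpecialToken(vocab, specialTokens, "[CLS]");

        Console.WriteLine(vocab.ContainsKey("[cls]"));         // True
        Console.WriteLine(specialTokens.ContainsKey("[cls]")); // True
    }
}
```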
This solves the issue of identifying the lowercased special tokens when tokenizing the text, but it also brings up a few questions:
1] There is a hidden assumption here that the SpecialTokens dictionary keys will be lowercased.
2] What if the lowercased special token already existed in the vocabulary as a regular token? Of course, this is not a very likely scenario, but it is worth considering.
3] In the
This leads to an issue with the way the SpecialTokens are handled when they are specified explicitly: if the SpecialTokens keys are passed in uppercase (as would be expected), the code creates a copy of the dictionary where each key appears in both uppercase and lowercase. This is different from what happens when the dictionary is created internally, and it also causes a crash when creating the
Example code to reproduce: the same test you pointed me to here, with the following change. Replace the line that creates the BertTokenizer with this:

```csharp
var specialTokens = new Dictionary<string, int>()
{
    { "[PAD]", 0 },
    { "[UNK]", 1 },
    { "[CLS]", 2 },
    { "[SEP]", 3 },
    { "[MASK]", 4 },
};

// Create two separate options, since during Create the dictionary is
// manipulated and the Options instances are mutable.
var options1 = new BertOptions() { SpecialTokens = specialTokens.ToDictionary() };
var options2 = new BertOptions() { SpecialTokens = specialTokens.ToDictionary() };
BertTokenizer[] bertTokenizers = [BertTokenizer.Create(vocabFile, options1), BertTokenizer.Create(vocabStream, options2)];
```

This will throw an error during the assert stage, since the first token will be converted to an
In order for this to work, you need to take the change from my commit in this PR, and also replace the following line:

with:

```csharp
SpecialTokensReverse = options.SpecialTokens is not null
    ? options.SpecialTokens.GroupBy(kvp => kvp.Value).ToDictionary(g => g.Key, g => g.First().Key)
    : null;
```

With all this being said, the simplest solution is to make the special tokens handling the same in both cases: create a copy of the special tokens dictionary with only the lowercase keys, and add them explicitly to the vocabulary (as is done in AddSpecialToken). What do you think?
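To illustrate why the GroupBy form avoids the crash, here is a small standalone sketch (the tokens and ids are hypothetical):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class ReverseMapSketch
{
    static void Main()
    {
        // After the uppercase/lowercase duplication, two keys share one id.
        var specialTokens = new Dictionary<string, int>
        {
            ["[PAD]"] = 0,
            ["[pad]"] = 0,
        };

        // A naive reversal throws ArgumentException: id 0 would be added as a key twice.
        // specialTokens.ToDictionary(kvp => kvp.Value, kvp => kvp.Key);

        // Grouping by id first keeps one representative key per id.
        var reverse = specialTokens
            .GroupBy(kvp => kvp.Value)
            .ToDictionary(g => g.Key, g => g.First().Key);

        Console.WriteLine(reverse[0]); // "[PAD]" (in practice, the first inserted key)
    }
}
```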
Thanks @shaltielshmid!
That is right when the lowercase option is specified. This is important because we lowercase the input text before processing it.
This doesn't matter much. The current code is just doing
We can think of normalizing the tokens when creating this reverse mapping. Note that this is internal so far anyway.
Although we have this as an internal property today, it is possible we'll need to expose it in the future to allow mapping the special token id to the string token. Having it as a dictionary should be correct for that.
I stand corrected regarding adding the lowercased special tokens to the vocab. I forgot we are already doing that when the special tokens are not provided in the options. Yes, we need to have consistent behavior whether or not the special tokens are provided.

```csharp
if (options.SpecialTokens is not null)
{
    if (lowerCase)
    {
        Dictionary<string, int> dic = options.SpecialTokens.ToDictionary(kvp => kvp.Key, kvp => kvp.Value);

        foreach (var kvp in options.SpecialTokens)
        {
            AddSpecialToken(vocab, dic, kvp.Key, lowerCase: true);
        }

        // I commented out the following line too, to avoid overwriting the special
        // tokens in the options. We may consider doing the same in the case when
        // the special tokens are not provided, too.
        // options.SpecialTokens = dic;
    }
}
```

One thing we can consider is not overwriting the special tokens inside the options. This can be done by storing the special tokens in a local variable and using it in the line

```csharp
options.PreTokenizer ??= options.ApplyBasicTokenization
    ? PreTokenizer.CreateWordOrPunctuation(options.SplitOnSpecialTokens ? options.SpecialTokens : null)
    : PreTokenizer.CreateWhiteSpace();
```

This should remove the need for the change

```csharp
SpecialTokensReverse = options.SpecialTokens is not null
    ? options.SpecialTokens.GroupBy(kvp => kvp.Value).ToDictionary(g => g.Key, g => g.First().Key)
    : null;
```

But I don't mind having this change anyway, just in case anyone provides special tokens with duplicated ids. Let me know if there is anything unclear or if you want me to help edit your PR. Thanks for your help!
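A self-contained sketch of that local-variable idea (the Options type here is a hypothetical stand-in, not the real BertOptions):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class LocalCopySketch
{
    // Hypothetical stand-in for the real options type.
    class Options
    {
        public Dictionary<string, int> SpecialTokens { get; set; } = new();
    }

    static void Main()
    {
        var options = new Options
        {
            SpecialTokens = new Dictionary<string, int> { ["[CLS]"] = 2 },
        };

        // Work on a local copy instead of assigning back into the options,
        // so the caller's dictionary keeps its original (uppercase) keys.
        var specialTokens = options.SpecialTokens.ToDictionary(kvp => kvp.Key, kvp => kvp.Value);
        foreach (var kvp in options.SpecialTokens)
        {
            specialTokens[kvp.Key.ToLowerInvariant()] = kvp.Value; // lowercase alias
        }

        // From here on, the tokenizer would use the local copy internally.
        Console.WriteLine(string.Join(", ", specialTokens.Keys));         // [CLS], [cls] (in practice)
        Console.WriteLine(string.Join(", ", options.SpecialTokens.Keys)); // [CLS] only
    }
}
```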
Thank you for the detailed response @tarekgh! I updated the code as discussed, and added tests for both. Two things I'd like to note:
1] When the code dynamically creates the special tokens dictionary, I kept a separate dictionary of the un-normalized tokens, which I assigned to BertOptions.SpecialTokens so that it propagates onward to the WordPieceTokenizer class.
2] This is a general comment about the tokenizer with lowercasing: in Hugging Face's
This will result in different behavior, where in Python we would have:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained('bert-base-uncased')
tok.tokenize("[cls] hello")
# Output: ['[', 'cl', '##s', ']', 'hello']
```

And in Microsoft.ML:

```csharp
tokenizer = BertTokenizer.Create(vocab_file, options);
tokenizer.EncodeToTokens("[cls] hello", out _);
// Output: ["[CLS]", "hello"]
```

Not critical to change, but we should be aware that this discrepancy exists. If you're interested, I'm happy to try to create a separate PR that addresses this issue.
@shaltielshmid your changes look good. I left a comment; please address it when you can.
Thanks for your comment - great point, I updated the code accordingly.
LGTM!
Thanks @shaltielshmid
/backport to release/4.0
Started backporting to release/4.0: https://github.com/dotnet/machinelearning/actions/runs/12184398388
Hey all! I was working with BertTokenizer, and noticed that when I specified "BasicTokenization" and "Lowercase", I was getting the "Collection Modified" error, since the dictionary is iterated over and updated in the same loop.
Solution: just assign the dictionary after the loop.
I didn't include an issue/tests since this was just a simple typo fix. I'm happy to expand further if needed.
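To illustrate the failure mode, here is a minimal standalone repro sketch (simplified; this is not the actual tokenizer code):

```csharp
using System;
using System.Collections.Generic;

class CollectionModifiedSketch
{
    static void Main()
    {
        var specialTokens = new Dictionary<string, int> { ["[CLS]"] = 2, ["[SEP]"] = 3 };

        // Buggy pattern: adding entries to the dictionary being enumerated
        // throws InvalidOperationException ("Collection was modified ...").
        // foreach (var kvp in specialTokens)
        // {
        //     specialTokens[kvp.Key.ToLowerInvariant()] = kvp.Value;
        // }

        // Fix: build the updated entries into a separate dictionary while
        // enumerating, and assign it only after the loop completes.
        var updated = new Dictionary<string, int>(specialTokens);
        foreach (var kvp in specialTokens)
        {
            updated[kvp.Key.ToLowerInvariant()] = kvp.Value;
        }
        specialTokens = updated;

        Console.WriteLine(string.Join(", ", specialTokens.Keys));
        // [CLS], [SEP], [cls], [sep] (insertion order in practice; not guaranteed)
    }
}
```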