
Moved SpecialTokens assignment after the modification to avoid "Collection Modified" error #7328

Merged
7 commits merged into dotnet:main on Dec 5, 2024

Conversation

shaltielshmid
Contributor

We are excited to review your PR.

So we can do the best job, please check:

  • [v] There's a descriptive title that will make sense to other developers some time from now.
  • [v] There are associated issues. All PRs should have issue(s) associated, unless it's a trivial, self-evident change such as fixing a typo. You can use the format Fixes #nnnn in your description to cause GitHub to automatically close the issue(s) when your PR is merged.
  • [v] Your change description explains what the change does, why you chose your approach, and anything else that reviewers should know.
  • You have included any necessary tests in the same PR.

Hey all! I was working with BertTokenizer and noticed that when I specified "BasicTokenization" and "Lowercase", I was getting the "Collection Modified" error, since the special-tokens dictionary is iterated over and updated in the same loop.
Solution: just assign the dictionary after the loop.

I didn't include an issue/tests since this was just a simple fix. I'm happy to expand further if needed.
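
For context, a minimal standalone sketch of the failure mode and the fix (toy values, not the library's actual code):

```csharp
using System;
using System.Collections.Generic;

var specialTokens = new Dictionary<string, int> { ["[CLS]"] = 2, ["[SEP]"] = 3 };

// Buggy pattern: adding new (lowercased) keys while enumerating the same dictionary
// throws InvalidOperationException ("Collection was modified; enumeration operation
// may not execute.").
//
// foreach (var kvp in specialTokens)
// {
//     specialTokens[kvp.Key.ToLowerInvariant()] = kvp.Value;
// }

// Fixed pattern: accumulate into a new dictionary and assign it after the loop.
var normalized = new Dictionary<string, int>();
foreach (var kvp in specialTokens)
{
    normalized[kvp.Key.ToLowerInvariant()] = kvp.Value;
}
specialTokens = normalized;

Console.WriteLine(string.Join(", ", specialTokens.Keys)); // [cls], [sep]
```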


codecov bot commented Dec 2, 2024

Codecov Report

Attention: Patch coverage is 98.86364% with 1 line in your changes missing coverage. Please review.

Project coverage is 68.89%. Comparing base (fb7cc25) to head (25a4c6d).
Report is 1 commit behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/Microsoft.ML.Tokenizers/Model/BertTokenizer.cs | 94.44% | 0 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7328      +/-   ##
==========================================
+ Coverage   68.88%   68.89%   +0.01%     
==========================================
  Files        1473     1473              
  Lines      274201   274277      +76     
  Branches    28419    28421       +2     
==========================================
+ Hits       188881   188972      +91     
+ Misses      77998    77984      -14     
+ Partials     7322     7321       -1     
| Flag | Coverage Δ |
| --- | --- |
| Debug | 68.89% <98.86%> (+0.01%) ⬆️ |
| production | 63.30% <94.73%> (+<0.01%) ⬆️ |
| test | 89.43% <100.00%> (+0.01%) ⬆️ |

Flags with carried forward coverage won't be shown.

| Files with missing lines | Coverage Δ |
| --- | --- |
| ...icrosoft.ML.Tokenizers/Model/WordPieceTokenizer.cs | 75.69% <100.00%> (ø) |
| ...icrosoft.ML.Tokenizers.Tests/BertTokenizerTests.cs | 100.00% <100.00%> (ø) |
| src/Microsoft.ML.Tokenizers/Model/BertTokenizer.cs | 63.70% <94.44%> (+4.01%) ⬆️ |

... and 7 files with indirect coverage changes

@michaelgsharp
Member

@shaltielshmid thanks for submitting this!

@tarekgh any other thoughts?

@shaltielshmid
Contributor Author

I haven't tested this, it's just a theory, but a concern I have with lowercasing is that even though we identify the special tokens and keep them as separate units, because they are converted to lowercase they won't be recognized at the vocabulary lookup stage.
I think we need to add them to the vocab as well, as is done in the AddSpecialToken function. @michaelgsharp what do you think?

@tarekgh
Member

tarekgh commented Dec 4, 2024

@shaltielshmid Thank you for catching that! Could you please add a test using the same code that reproduced the issue?

@tarekgh tarekgh self-assigned this Dec 4, 2024
@tarekgh tarekgh added this to the ML.NET 4.0 milestone Dec 4, 2024
@tarekgh
Member

tarekgh commented Dec 4, 2024

I marked this for 4.0, as it will be good to service this fix.

CC @ericstj

@tarekgh tarekgh requested a review from Copilot December 4, 2024 17:24


Copilot reviewed 1 out of 1 changed files in this pull request and generated no suggestions.

@tarekgh
Member

tarekgh commented Dec 4, 2024

> they are converted to lowercase they aren't going to be identified at the vocabulary lookup stage.

This is not true. When the lowercase option is turned on, the whole input text is lowercased before processing, which guarantees that the special tokens are always handled correctly.

> I think we need to add it to the vocab as well, as is done in the AddSpecialToken function.

We don't need to add the lowercased form to the vocab, and we don't (and shouldn't) lowercase the whole vocab. As I mentioned, with lowercasing enabled the input text is always lowercased before processing, and this produces the correct result. You can try it yourself; you may also look at the test:

Assert.Equal("[cls] hello, how are you? [sep]", normalizedText);

@shaltielshmid
Contributor Author

shaltielshmid commented Dec 4, 2024

@tarekgh Thank you for the response!

Looking into it further, there seems to be a bigger question here.

In the vocabulary, the special tokens are stored in uppercase (e.g., [CLS], [MASK], etc.).

When the user does not specify SpecialTokens, the special-tokens dictionary is created in the code. When this is done, the tokens are normalized before being added to the SpecialTokens dictionary, and they are also appended to the vocabulary in their lowercase form:

if (lowerCase)
{
    // Lowercase the special tokens so the pre-tokenization can find them, since we lowercase the input text.
    // We don't even need case-insensitive comparisons, as the input text is lowercased.
    normalizedToken = token.ToLowerInvariant();

    // Add the lowercased special tokens to the vocab if they are not already there.
    // This allows matching during the encoding process.
    vocab[new StringSpanOrdinalKey(normalizedToken)] = id;
}
specialTokens[normalizedToken] = id;

This solves the issue of identifying the lowercased special tokens when tokenizing the text, but also brings up a few questions:

1] There is a hidden assumption here that the SpecialTokens dictionary keys will be lowercased.

2] What if the lowercased special token already existed in the vocabulary, as a regular token? Of course, this is not a very likely scenario, but worth considering.

3] In the WordPieceTokenizer.cs file, a reversed dictionary is created (SpecialTokensReverse) - does it make sense that this dictionary's value points to the lowercased token?

  • It seems like the only usage of this dictionary is to check whether an ID is a special token, so we could just replace this variable with a HashSet of the IDs (see the sketch below).
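
For illustration, a hedged sketch of that alternative (toy values, not the library's code):

```csharp
using System;
using System.Collections.Generic;

var specialTokens = new Dictionary<string, int> { ["[CLS]"] = 2, ["[cls]"] = 2 };

// If SpecialTokensReverse is only used to ask "is this id a special token?",
// a set of the ids answers that directly, and two keys sharing an id are harmless.
var specialTokenIds = new HashSet<int>(specialTokens.Values);

Console.WriteLine(specialTokenIds.Contains(2)); // True
```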

This leads to an issue with the way the SpecialTokens are handled when specified explicitly:

If the SpecialTokens keys are passed in uppercase (as would be expected), the code creates a copy of the dictionary in which each key appears both in uppercase and lowercase. This differs from what happens when the dictionary is created internally, and it also crashes when creating the SpecialTokensReverse dictionary, since multiple keys now point to the same value.
In addition, since the lowercased keys aren't added to the vocabulary, they won't be converted to the correct token.

Example code to reproduce: The same test you pointed me to here, with the following change:

Replace the line that creates the BertTokenizer with this:

var specialTokens = new Dictionary<string, int>() {
    { "[PAD]", 0 },
    { "[UNK]", 1 },
    { "[CLS]", 2 },
    { "[SEP]", 3 },
    { "[MASK]", 4 },
};
// Create two separate options objects, since the dictionary is manipulated during Create and the options are mutable.
var options1 = new BertOptions() { SpecialTokens = specialTokens.ToDictionary() };
var options2 = new BertOptions() { SpecialTokens = specialTokens.ToDictionary() }; 
BertTokenizer[] bertTokenizers = [BertTokenizer.Create(vocabFile, options1), BertTokenizer.Create(vocabStream, options2)];

This will throw during the assert stage, since the first token is converted to [UNK] because a lowercase [cls] doesn't exist in the vocabulary.

In order for this to work, you need to take the change from my commit in this PR, and also replace the following line:

SpecialTokensReverse = specialTokens is not null ? specialTokens.ToDictionary(kvp => kvp.Value, kvp => kvp.Key) : null;

with:

SpecialTokensReverse = options.SpecialTokens is not null ? options.SpecialTokens.GroupBy(kvp => kvp.Value).ToDictionary(g => g.Key, g => g.First().Key) : null;
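
For illustration, a standalone sketch (toy values) of why the naive reversal throws once a key appears in both cases, and why the GroupBy form doesn't:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// After the lowercasing copy, two keys can map to the same id:
var specialTokens = new Dictionary<string, int> { ["[CLS]"] = 2, ["[cls]"] = 2 };

// Naive reversal: ToDictionary throws ArgumentException
// ("An item with the same key has already been added.")
// var reverse = specialTokens.ToDictionary(kvp => kvp.Value, kvp => kvp.Key);

// Grouping by id first keeps one representative key per id:
var reverse = specialTokens
    .GroupBy(kvp => kvp.Value)
    .ToDictionary(g => g.Key, g => g.First().Key);

Console.WriteLine(reverse[2]); // "[CLS]" (the first key seen for id 2)
```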

With all this said, the simplest solution is to make the special-token handling the same in both cases: create a copy of the special tokens dictionary with only the lowercase keys, and add them explicitly to the vocabulary (as is done in AddSpecialToken).

What do you think?

@tarekgh
Member

tarekgh commented Dec 4, 2024

Thanks @shaltielshmid!

> 1] There is a hidden assumption here that the SpecialTokens dictionary keys will be lowercased.

That is right when the lowercase option is specified. This is important because we lowercase the input text before processing it.

> 2] What if the lowercased special token already existed in the vocabulary, as a regular token? Of course, this is not a very likely scenario, but worth considering.

This doesn't matter much. The current code just does vocab[new StringSpanOrdinalKey(normalizedToken)] = id;, so it simply overrides the id value there, which even corrects the problem if the vocab was using the wrong id for that special token while lowercasing was specified.
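
For readers less familiar with C# dictionary semantics, a toy sketch of why the overwrite is benign:

```csharp
using System;
using System.Collections.Generic;

var vocab = new Dictionary<string, int> { ["[cls]"] = 100 }; // pre-existing regular entry (toy id)

vocab["[cls]"] = 2;        // the indexer silently overwrites; no exception
// vocab.Add("[cls]", 2);  // Add, by contrast, would throw ArgumentException here

Console.WriteLine(vocab["[cls]"]); // 2
```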

> 3] In the WordPieceTokenizer.cs file, a reversed dictionary is created (SpecialTokensReverse) - does it make sense that this dictionary's value points to the lowercased token?

We can consider normalizing the tokens when creating this reverse mapping. Note that this is internal so far anyway.

> It seems like the only usage of this dictionary is to check whether an ID is a special token, so we could just replace this variable with a HashSet of the IDs.

Although this is an internal property today, we may need to expose it in the future to allow mapping a special-token id to its string token. Keeping it as a dictionary is the right shape for that.

I stand corrected regarding adding the lowercased special tokens to the vocab; I forgot we already do that when the special tokens are not provided in the options. Yes, we need consistent behavior whether or not the special tokens are provided.
We could call AddSpecialToken on the provided special tokens:

if (options.SpecialTokens is not null)
{
    if (lowerCase)
    {
        Dictionary<string, int> dic = options.SpecialTokens.ToDictionary(kvp => kvp.Key, kvp => kvp.Value);

        foreach (var kvp in options.SpecialTokens)
        {
            AddSpecialToken(vocab, dic, kvp.Key, lowerCase: true);
        }

        // I commented out the following line too, to avoid overwriting the special tokens in the options.
        // We may consider doing the same when the special tokens are not provided, too.
        // options.SpecialTokens = dic;
    }
}

One thing we can consider is not overwriting the special tokens inside the options. This can be done by storing the special tokens in a local variable and using it in the line

options.PreTokenizer ??= options.ApplyBasicTokenization ? PreTokenizer.CreateWordOrPunctuation(options.SplitOnSpecialTokens ? options.SpecialTokens : null) : PreTokenizer.CreateWhiteSpace();
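
Concretely, the substitution might look like this sketch, where specialTokensToUse is a hypothetical local name and the exact types in the real code may differ:

```csharp
// Hypothetical sketch: 'dic' is assumed to hold the lowercased copy built above,
// so options.SpecialTokens itself stays untouched.
var specialTokensToUse = lowerCase && options.SpecialTokens is not null ? dic : options.SpecialTokens;

options.PreTokenizer ??= options.ApplyBasicTokenization
    ? PreTokenizer.CreateWordOrPunctuation(options.SplitOnSpecialTokens ? specialTokensToUse : null)
    : PreTokenizer.CreateWhiteSpace();
```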

This should remove the need for the change

SpecialTokensReverse = options.SpecialTokens is not null ? options.SpecialTokens.GroupBy(kvp => kvp.Value).ToDictionary(g => g.Key, g => g.First().Key) : null;

But I don't mind having that change anyway, just in case anyone provides special tokens with duplicate ids.

Let me know if there is anything unclear or if you want me to help edit your PR. Thanks for your help!

@shaltielshmid
Contributor Author

Thank you for the detailed response, @tarekgh!

I updated the code as discussed, and added tests for both.

Two things I'd like to note:

1] When the code dynamically creates the special tokens dictionary, I kept a separate dictionary of the un-normalized tokens, which I assigned to BertOptions.SpecialTokens so that it propagates onwards to the WordPieceTokenizer class.

  • I kept the tokens un-normalized so that the behavior matches between passing the tokens in explicitly and having the code create them.
  • We still need to modify the BertOptions variable so that SpecialTokens gets propagated to the WordPieceTokenizer, even though we would rather not modify it.

2] This is a general comment about the tokenizer with lowercasing: in Hugging Face's tokenizers library, they extract the special tokens before normalization, so they only match exact-case occurrences. (To be precise, they store a flag for each special token specifying whether it is extracted pre- or post-normalization, and they run two extraction passes: one before normalization and one before pre-tokenization.)

This will result in different behavior, where in Python we would have:

from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained('bert-base-uncased')
tok.tokenize("[cls] hello")
# Output: ['[', 'cl', '##s', ']', 'hello']

And in Microsoft.ML:

var tokenizer = BertTokenizer.Create(vocabFile, options);
tokenizer.EncodeToTokens("[cls] hello", out _);
// Output: ["[CLS]", "hello"]

Not critical to change, but we should be aware that there is this discrepancy. If you're interested, I'm happy to try and create a separate PR which addresses this issue.

@tarekgh
Member

tarekgh commented Dec 5, 2024

@shaltielshmid your changes look good; I left a comment, if you can address it.

@shaltielshmid
Contributor Author

Thanks for your comment - great point, I updated the code accordingly.

Member

@tarekgh tarekgh left a comment


LGTM!

Thanks @shaltielshmid

@tarekgh tarekgh merged commit 01c4164 into dotnet:main Dec 5, 2024
25 checks passed
@tarekgh
Member

tarekgh commented Dec 5, 2024

/backport to release/4.0


github-actions bot commented Dec 5, 2024

Started backporting to release/4.0: https://github.com/dotnet/machinelearning/actions/runs/12184398388
