
Moved SpecialTokens assignment after the modification to avoid "Collection Modified" error #7328

Merged
7 commits merged into dotnet:main on Dec 5, 2024

Conversation

shaltielshmid
Contributor

We are excited to review your PR.

So we can do the best job, please check:

  • [v] There's a descriptive title that will make sense to other developers some time from now.
  • [v] There are associated issues. All PRs should have issue(s) associated, unless it's a trivial, self-evident change such as fixing a typo. You can use the format Fixes #nnnn in your description to cause GitHub to automatically close the issue(s) when your PR is merged.
  • [v] Your change description explains what the change does, why you chose your approach, and anything else that reviewers should know.
  • You have included any necessary tests in the same PR.

Hey all! I was working with BertTokenizer and noticed that when I specified "BasicTokenization" and "Lowercase", I was getting the "Collection Modified" error, since the special-tokens dictionary is iterated over and updated in the same loop.
Solution: just assign the dictionary after the loop.

I didn't include an issue/tests since this was just a simple fix. I'm happy to expand further if needed.
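
For context, a minimal standalone sketch of the failure mode and the fix (toy values, not the library's actual code):

```csharp
using System;
using System.Collections.Generic;

var specialTokens = new Dictionary<string, int> { ["[CLS]"] = 2, ["[SEP]"] = 3 };

// Buggy pattern: adding new (lowercased) keys while enumerating the same dictionary
// throws InvalidOperationException ("Collection was modified; enumeration operation
// may not execute.").
//
// foreach (var kvp in specialTokens)
// {
//     specialTokens[kvp.Key.ToLowerInvariant()] = kvp.Value;
// }

// Fixed pattern: accumulate into a new dictionary and assign it after the loop.
var normalized = new Dictionary<string, int>();
foreach (var kvp in specialTokens)
{
    normalized[kvp.Key.ToLowerInvariant()] = kvp.Value;
}
specialTokens = normalized;

Console.WriteLine(string.Join(", ", specialTokens.Keys)); // [cls], [sep]
```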


codecov bot commented Dec 2, 2024

Codecov Report

Attention: Patch coverage is 98.86364% with 1 line in your changes missing coverage. Please review.

Project coverage is 68.89%. Comparing base (fb7cc25) to head (25a4c6d).
Report is 1 commit behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/Microsoft.ML.Tokenizers/Model/BertTokenizer.cs | 94.44% | 0 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7328      +/-   ##
==========================================
+ Coverage   68.88%   68.89%   +0.01%     
==========================================
  Files        1473     1473              
  Lines      274201   274277      +76     
  Branches    28419    28421       +2     
==========================================
+ Hits       188881   188972      +91     
+ Misses      77998    77984      -14     
+ Partials     7322     7321       -1     
| Flag | Coverage Δ |
| --- | --- |
| Debug | 68.89% <98.86%> (+0.01%) ⬆️ |
| production | 63.30% <94.73%> (+<0.01%) ⬆️ |
| test | 89.43% <100.00%> (+0.01%) ⬆️ |

Flags with carried forward coverage won't be shown.

| Files with missing lines | Coverage Δ |
| --- | --- |
| ...icrosoft.ML.Tokenizers/Model/WordPieceTokenizer.cs | 75.69% <100.00%> (ø) |
| ...icrosoft.ML.Tokenizers.Tests/BertTokenizerTests.cs | 100.00% <100.00%> (ø) |
| src/Microsoft.ML.Tokenizers/Model/BertTokenizer.cs | 63.70% <94.44%> (+4.01%) ⬆️ |

... and 7 files with indirect coverage changes

@michaelgsharp
Member

@shaltielshmid thanks for submitting this!

@tarekgh any other thoughts?

@shaltielshmid
Contributor Author

I haven't tested this, it's just a theory, but a concern I have with lowercasing is that even though we identify the special tokens and keep them as separate units, because they are converted to lowercase they won't be recognized at the vocabulary lookup stage.
I think we need to add them to the vocab as well, as is done in the AddSpecialToken function. @michaelgsharp what do you think?

@tarekgh
Member

tarekgh commented Dec 4, 2024

@shaltielshmid Thank you for catching that! Could you please add a test using the same code that reproduced the issue?

@tarekgh tarekgh self-assigned this Dec 4, 2024
@tarekgh tarekgh added this to the ML.NET 4.0 milestone Dec 4, 2024
@tarekgh
Member

tarekgh commented Dec 4, 2024

I marked this for 4.0, as it will be good to service this fix.

CC @ericstj

@tarekgh tarekgh requested a review from Copilot December 4, 2024 17:24


Copilot reviewed 1 out of 1 changed files in this pull request and generated no suggestions.

@tarekgh
Member

tarekgh commented Dec 4, 2024

> they are converted to lowercase they aren't going to be identified at the vocabulary lookup stage.

This is not true. When the lowercase option is turned on, the whole input text is lowercased before processing, which guarantees that the special tokens are always handled correctly.

> I think we need to add it to the vocab as well, as is done in the AddSpecialToken function.

We don't need to add the lowercased form to the vocab, and we don't (and shouldn't) lowercase the whole vocab. As I mentioned, with lowercasing enabled the input text is always lowercased before processing, and this produces the correct result. You can try it yourself; you may also look at the test:

Assert.Equal("[cls] hello, how are you? [sep]", normalizedText);

@shaltielshmid
Contributor Author

shaltielshmid commented Dec 4, 2024

@tarekgh Thank you for the response!

Looking into it further, there seems to be a bigger question here.

In the vocabulary, the special tokens are stored in uppercase (e.g., [CLS], [MASK], etc.).

When the user does not specify SpecialTokens, the special-tokens dictionary is created in the code. When this is done, the tokens are normalized before being added to the SpecialTokens dictionary, and they are also appended to the vocabulary in their lowercase form:

if (lowerCase)
{
    // Lowercase the special tokens so the pre-tokenization can find them, since we lowercase the input text.
    // We don't even need case-insensitive comparisons, as the input text is lowercased.
    normalizedToken = token.ToLowerInvariant();

    // Add the lowercased special tokens to the vocab if they are not already there.
    // This allows matching during the encoding process.
    vocab[new StringSpanOrdinalKey(normalizedToken)] = id;
}
specialTokens[normalizedToken] = id;

This solves the issue of identifying the lowercased special tokens when tokenizing the text, but also brings up a few questions:

1] There is a hidden assumption here that the SpecialTokens dictionary keys will be lowercased.

2] What if the lowercased special token already existed in the vocabulary, as a regular token? Of course, this is not a very likely scenario, but worth considering.

3] In the WordPieceTokenizer.cs file, a reversed dictionary is created (SpecialTokensReverse) - does it make sense that this dictionary's value points to the lowercased token?

  • It seems like the only usage of this dictionary is to check whether an ID is a special token, so we could just replace this variable with a HashSet of the IDs (see the sketch below).
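
For illustration, a hedged sketch of that alternative (toy values, not the library's code):

```csharp
using System;
using System.Collections.Generic;

var specialTokens = new Dictionary<string, int> { ["[CLS]"] = 2, ["[cls]"] = 2 };

// If SpecialTokensReverse is only used to ask "is this id a special token?",
// a set of the ids answers that directly, and two keys sharing an id are harmless.
var specialTokenIds = new HashSet<int>(specialTokens.Values);

Console.WriteLine(specialTokenIds.Contains(2)); // True
```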

This leads to an issue with the way the SpecialTokens are handled when specified explicitly:

If the SpecialTokens keys are passed in uppercase (as would be expected), the code creates a copy of the dictionary in which each key appears both in uppercase and lowercase. This differs from what happens when the dictionary is created internally, and it also crashes when creating the SpecialTokensReverse dictionary, since multiple keys now point to the same value.
In addition, since the lowercased keys aren't added to the vocabulary, they won't be converted to the correct token.

Example code to reproduce: The same test you pointed me to here, with the following change:

Replace the line that creates the BertTokenizer with this:

var specialTokens = new Dictionary<string, int>() {
    { "[PAD]", 0 },
    { "[UNK]", 1 },
    { "[CLS]", 2 },
    { "[SEP]", 3 },
    { "[MASK]", 4 },
};
// Create two separate options objects, since the dictionary is manipulated during Create and the options are mutable.
var options1 = new BertOptions() { SpecialTokens = specialTokens.ToDictionary() };
var options2 = new BertOptions() { SpecialTokens = specialTokens.ToDictionary() }; 
BertTokenizer[] bertTokenizers = [BertTokenizer.Create(vocabFile, options1), BertTokenizer.Create(vocabStream, options2)];

This will throw during the assert stage, since the first token is converted to [UNK] because a lowercase [cls] doesn't exist in the vocabulary.

In order for this to work, you need to take the change from my commit in this PR, and also replace the following line:

SpecialTokensReverse = specialTokens is not null ? specialTokens.ToDictionary(kvp => kvp.Value, kvp => kvp.Key) : null;

with:

SpecialTokensReverse = options.SpecialTokens is not null ? options.SpecialTokens.GroupBy(kvp => kvp.Value).ToDictionary(g => g.Key, g => g.First().Key) : null;
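
For illustration, a standalone sketch (toy values) of why the naive reversal throws once a key appears in both cases, and why the GroupBy form doesn't:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// After the lowercasing copy, two keys can map to the same id:
var specialTokens = new Dictionary<string, int> { ["[CLS]"] = 2, ["[cls]"] = 2 };

// Naive reversal: ToDictionary throws ArgumentException
// ("An item with the same key has already been added.")
// var reverse = specialTokens.ToDictionary(kvp => kvp.Value, kvp => kvp.Key);

// Grouping by id first keeps one representative key per id:
var reverse = specialTokens
    .GroupBy(kvp => kvp.Value)
    .ToDictionary(g => g.Key, g => g.First().Key);

Console.WriteLine(reverse[2]); // "[CLS]" (the first key seen for id 2)
```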

With all this said, the simplest solution is to make the special-token handling the same in both cases: create a copy of the special tokens dictionary with only the lowercase keys, and add them explicitly to the vocabulary (as is done in AddSpecialToken).

What do you think?

@tarekgh
Member

tarekgh commented Dec 4, 2024

Thanks @shaltielshmid!

> 1] There is a hidden assumption here that the SpecialTokens dictionary keys will be lowercased.

That is right when the lowercase option is specified. This is important because we lowercase the input text before processing it.

> 2] What if the lowercased special token already existed in the vocabulary, as a regular token? Of course, this is not a very likely scenario, but worth considering.

This doesn't matter much. The current code just does vocab[new StringSpanOrdinalKey(normalizedToken)] = id;, so it simply overrides the id value there, which even corrects the problem if the vocab was using the wrong id for that special token while lowercasing was specified.
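
For readers less familiar with C# dictionary semantics, a toy sketch of why the overwrite is benign:

```csharp
using System;
using System.Collections.Generic;

var vocab = new Dictionary<string, int> { ["[cls]"] = 100 }; // pre-existing regular entry (toy id)

vocab["[cls]"] = 2;        // the indexer silently overwrites; no exception
// vocab.Add("[cls]", 2);  // Add, by contrast, would throw ArgumentException here

Console.WriteLine(vocab["[cls]"]); // 2
```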

> 3] In the WordPieceTokenizer.cs file, a reversed dictionary is created (SpecialTokensReverse) - does it make sense that this dictionary's value points to the lowercased token?

We can consider normalizing the tokens when creating this reverse mapping. Note that this is internal so far anyway.

> It seems like the only usage of this dictionary is to check whether an ID is a special token, so we could just replace this variable with a HashSet of the IDs.

Although this is an internal property today, we may need to expose it in the future to allow mapping a special-token id to its string token. Keeping it as a dictionary is the right shape for that.

I stand corrected regarding adding the lowercased special tokens to the vocab; I forgot we already do that when the special tokens are not provided in the options. Yes, we need consistent behavior whether or not the special tokens are provided.
We could call AddSpecialToken on the provided special tokens:

if (options.SpecialTokens is not null)
{
    if (lowerCase)
    {
        Dictionary<string, int> dic = options.SpecialTokens.ToDictionary(kvp => kvp.Key, kvp => kvp.Value);

        foreach (var kvp in options.SpecialTokens)
        {
            AddSpecialToken(vocab, dic, kvp.Key, lowerCase: true);
        }

        // I commented out the following line too, to avoid overwriting the special tokens in the options.
        // We may consider doing the same when the special tokens are not provided, too.
        // options.SpecialTokens = dic;
    }
}

One thing we can consider is not overwriting the special tokens inside the options. This can be done by storing the special tokens in a local variable and using it in the line

options.PreTokenizer ??= options.ApplyBasicTokenization ? PreTokenizer.CreateWordOrPunctuation(options.SplitOnSpecialTokens ? options.SpecialTokens : null) : PreTokenizer.CreateWhiteSpace();
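
Concretely, the substitution might look like this sketch, where specialTokensToUse is a hypothetical local name and the exact types in the real code may differ:

```csharp
// Hypothetical sketch: 'dic' is assumed to hold the lowercased copy built above,
// so options.SpecialTokens itself stays untouched.
var specialTokensToUse = lowerCase && options.SpecialTokens is not null ? dic : options.SpecialTokens;

options.PreTokenizer ??= options.ApplyBasicTokenization
    ? PreTokenizer.CreateWordOrPunctuation(options.SplitOnSpecialTokens ? specialTokensToUse : null)
    : PreTokenizer.CreateWhiteSpace();
```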

This should remove the need for the change

SpecialTokensReverse = options.SpecialTokens is not null ? options.SpecialTokens.GroupBy(kvp => kvp.Value).ToDictionary(g => g.Key, g => g.First().Key) : null;

But I don't mind having that change anyway, just in case anyone provides special tokens with duplicate ids.

Let me know if there is anything unclear or if you want me to help edit your PR. Thanks for your help!

@shaltielshmid
Contributor Author

Thank you for the detailed response, @tarekgh!

I updated the code as discussed, and added tests for both.

Two things I'd like to note:

1] When the code dynamically creates the special tokens dictionary, I kept a separate dictionary of the un-normalized tokens, which I assigned to BertOptions.SpecialTokens so that it propagates onwards to the WordPieceTokenizer class.

  • I kept the tokens un-normalized so that the behavior matches between passing the tokens in explicitly and having the code create them.
  • We still need to modify the BertOptions variable so that SpecialTokens gets propagated to the WordPieceTokenizer, even though we would rather not modify it.

2] This is a general comment about the tokenizer with lowercasing: in Hugging Face's tokenizers library, they extract the special tokens before normalization, so they only match exact-case occurrences. (To be precise, they store a flag for each special token specifying whether it is extracted pre- or post-normalization, and they run two extraction passes: one before normalization and one before pre-tokenization.)

This will result in different behavior, where in Python we would have:

from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained('bert-base-uncased')
tok.tokenize("[cls] hello")
# Output: ['[', 'cl', '##s', ']', 'hello']

And in Microsoft.ML:

var tokenizer = BertTokenizer.Create(vocabFile, options);
tokenizer.EncodeToTokens("[cls] hello", out _);
// Output: ["[CLS]", "hello"]

Not critical to change, but we should be aware that there is this discrepancy. If you're interested, I'm happy to try and create a separate PR which addresses this issue.

@tarekgh
Member

tarekgh commented Dec 5, 2024

@shaltielshmid your changes look good; I left a comment, if you can address it.

@shaltielshmid
Contributor Author

Thanks for your comment - great point, I updated the code accordingly.

Member

@tarekgh tarekgh left a comment


LGTM!

Thanks @shaltielshmid

@tarekgh tarekgh merged commit 01c4164 into dotnet:main Dec 5, 2024
25 checks passed
@tarekgh
Member

tarekgh commented Dec 5, 2024

/backport to release/4.0


github-actions bot commented Dec 5, 2024

Started backporting to release/4.0: https://github.com/dotnet/machinelearning/actions/runs/12184398388
