[Tokenizer] fix char offset #2137

Merged on Nov 5, 2022 (1 commit)

Conversation


@lanking520 lanking520 commented Nov 5, 2022

Description

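Fixes the tokenizer's character offsets: token spans were being returned as UTF-8 byte offsets rather than character offsets.
Fixes #2112.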

@lanking520
Contributor Author

@andreabrduque FYI

@@ -389,4 +389,29 @@ public void testTruncationAndPaddingForPairInputs() throws IOException {
            Assert.assertEquals(encoding.getIds().length, 8);
        }
    }

    @Test
    public void testSpecialTokenHandling() throws IOException {

We might not need this test; the test above it already covers special characters.

@codecov-commenter

Codecov Report

Base: 72.08% // Head: 71.40% // Decreases project coverage by -0.68% ⚠️

Coverage data is based on head (67cd1cf) compared to base (bb5073f).
Patch coverage: 71.54% of modified lines in pull request are covered.

Additional details and impacted files
@@             Coverage Diff              @@
##             master    #2137      +/-   ##
============================================
- Coverage     72.08%   71.40%   -0.69%     
- Complexity     5126     6292    +1166     
============================================
  Files           473      624     +151     
  Lines         21970    27847    +5877     
  Branches       2351     3004     +653     
============================================
+ Hits          15838    19883    +4045     
- Misses         4925     6503    +1578     
- Partials       1207     1461     +254     
| Impacted Files | Coverage Δ |
|---|---|
| api/src/main/java/ai/djl/modality/cv/Image.java | 69.23% <ø> (-4.11%) ⬇️ |
| ...rc/main/java/ai/djl/modality/cv/MultiBoxPrior.java | 76.00% <ø> (ø) |
| ...rc/main/java/ai/djl/modality/cv/output/Joints.java | 71.42% <ø> (ø) |
| .../main/java/ai/djl/modality/cv/output/Landmark.java | 100.00% <ø> (ø) |
| ...main/java/ai/djl/modality/cv/output/Rectangle.java | 72.41% <0.00%> (ø) |
| ...i/djl/modality/cv/translator/BigGANTranslator.java | 21.42% <0.00%> (-5.24%) ⬇️ |
| .../modality/cv/translator/ImageFeatureExtractor.java | 0.00% <0.00%> (ø) |
| .../ai/djl/modality/cv/translator/YoloTranslator.java | 27.77% <0.00%> (+18.95%) ⬆️ |
| ...modality/cv/translator/wrapper/FileTranslator.java | 44.44% <ø> (ø) |
| ...y/cv/translator/wrapper/InputStreamTranslator.java | 44.44% <ø> (ø) |

... and 557 more


@lanking520 lanking520 merged commit b50d7fc into deepjavalibrary:master Nov 5, 2022
@andreabrduque
Contributor

> @andreabrduque FYI

Thanks for this one. It was indeed returning UTF-8 byte offsets instead of the correct char spans :)
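
For readers unfamiliar with the distinction, here is a minimal sketch of why UTF-8 byte offsets and character offsets diverge as soon as the input contains non-ASCII text. This is not the code from this PR: the class `CharOffsetSketch` and its methods are hypothetical names used only for illustration, and it assumes that "char offset" means an index into a Java `String` (UTF-16 code units) and that the byte offset falls on a code point boundary of well-formed text.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration, not part of DJL.
final class CharOffsetSketch {

    /**
     * Converts an offset counted in UTF-8 bytes into an offset counted in
     * Java chars (UTF-16 code units). Assumes byteOffset lies on a code
     * point boundary and that text contains no unpaired surrogates.
     */
    static int byteToCharOffset(String text, int byteOffset) {
        int bytes = 0;
        int chars = 0;
        while (chars < text.length() && bytes < byteOffset) {
            int codePoint = text.codePointAt(chars);
            bytes += utf8Length(codePoint);
            chars += Character.charCount(codePoint);
        }
        return chars;
    }

    /** Number of bytes UTF-8 uses to encode a single code point. */
    static int utf8Length(int codePoint) {
        if (codePoint < 0x80) {
            return 1;
        }
        if (codePoint < 0x800) {
            return 2;
        }
        if (codePoint < 0x10000) {
            return 3;
        }
        return 4;
    }

    public static void main(String[] args) {
        String text = "caf\u00e9 au lait"; // "café au lait"
        // The prefix "café " is 5 chars but 6 UTF-8 bytes, because é takes 2 bytes.
        int byteStart = "caf\u00e9 ".getBytes(StandardCharsets.UTF_8).length;
        int charStart = byteToCharOffset(text, byteStart);
        System.out.println(byteStart + " -> " + charStart); // prints "6 -> 5"
        // Slicing with the char offset yields the intended token.
        System.out.println(text.substring(charStart, charStart + 2)); // prints "au"
    }
}
```

Using the byte offset 6 directly in `substring` would return "u " instead of "au", which is the kind of shifted span described in the linked issue.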

Development

Successfully merging this pull request may close these issues.

Returned character token spans are not correct for some inputs
4 participants