Fix misalignment between token offsets returned from the API and samples in the UI #821
Issue Description
Right now, for Decoder-Only models, there is a small misalignment bug between what is processed by the DQ client and what we see in the UI. Essentially, in many situations there is a disconnect between what the model "sees" (the input on which it returns alignment / token data) and the string that we display in the UI.
Current Flow
1. The user logs the input text plus the `labels`/`targets`.
2. The input is turned into a `formatted_prompt`, which is what the model actually processes and returns alignment / token data on.
3. The response text shown in the UI is sliced out of the `formatted_prompt`.
From this flow, we need the logged text and the sliced response text (from the `formatted_prompt`) to EXACTLY match. But currently they don't always, for example in the case below:
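For illustration, here is a minimal, hypothetical sketch of the mismatch. The prompt, span, and values are made up for this example, and the real client logic is more involved:

```python
formatted_prompt = 'Instruction: extract the JSON.\nResponse: {"name": "foo"}'

# Hypothetical: the character span computed for the response region starts
# one character early, so the slice picks up the preceding space.
start = formatted_prompt.index(' {"name"')
sliced_response_text = formatted_prompt[start:]

# The target text the user logged separately.
logged_target = '{"name": "foo"}'

print(repr(sliced_response_text))  # ' {"name": "foo"}'
print(repr(logged_target))         # '{"name": "foo"}'
assert sliced_response_text == logged_target  # AssertionError: off by one
```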
Here we see there is an added space in the `sliced_response_text`. While subtle, and honestly often hard to see in the UI, for certain cases this off-by-one can really screw things up.

Solution
The proposed solution is to get rid of having the user log `labels`/`targets` and instead directly infer them from the `sliced_response_text`. This way, there will be no discrepancy between what the model sees and what is in the UI! This change works quite well with the current system design because, for computing token alignment, we already decode the response tokens in the function `align_response_tokens_to_character_spans` (see the sketch after the list below). Other benefits are:
- Proper handling of the `EOS` token and other special tokens, which we have wanted to make the default anyway! NOTE: this is just for Decoder-Only models.
- Consistency between `Inputs` and `Formatted Prompt`.
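As a rough illustration of the idea, here is a minimal sketch assuming a Hugging Face fast tokenizer. The function name `align_response_tokens_to_character_spans` comes from this issue, but the body shown here is a hypothetical simplification, not the client's actual code:

```python
from transformers import AutoTokenizer


def align_response_tokens_to_character_spans(tokenizer, response_token_ids):
    """Decode the response tokens, then re-tokenize the decoded text to
    recover per-token character spans within it."""
    decoded = tokenizer.decode(response_token_ids)
    encoding = tokenizer(
        decoded, return_offsets_mapping=True, add_special_tokens=False
    )
    return decoded, encoding["offset_mapping"]


tok = AutoTokenizer.from_pretrained("gpt2")
response_ids = tok(' {"name": "foo"}', add_special_tokens=False)["input_ids"]

# The decoded text IS the target now. Nothing separate for the user to log,
# so the UI string and the model's string cannot drift apart.
sliced_response_text, token_spans = align_response_tokens_to_character_spans(
    tok, response_ids
)
print(repr(sliced_response_text))  # ' {"name": "foo"}'
print(token_spans)                 # per-token (start, end) character spans
```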
Demonstration
Example run with the error: run
If you look at the token DEPs, you can see that the highlighted tokens don't make sense / are clearly off by one. The tokenizer would never encode `{"m` as one token. It really means to encode ` {"` (with the leading space), but the space is missing!

Example run with the fix: run
In the first sample you can see that `{"` is properly tokenized! Also, looking at the other samples, you see that the space appears before tokens, NOT after. This was another sign that, in retrospect, I should have seen in other runs but did not notice!
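To see the leading-space behavior for yourself, here is a small sketch using GPT-2's byte-level BPE, assumed here only as a representative tokenizer:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
text = ' {"name": "foo"}'
enc = tok(text, return_offsets_mapping=True, add_special_tokens=False)

for start, end in enc["offset_mapping"]:
    print(repr(text[start:end]))
# Exact splits depend on the vocabulary, but every space shows up at the
# START of a token's span, never trailing the previous token.
```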