
Tokenization with truncation, offset support and tests #47

Closed
wants to merge 7 commits

Conversation

fialhocoelho

  • Support for Truncation and Offset: The tokenize() function has been updated to support truncation and offset mapping.
  • New Tests: Tests have been added that check the tokenization output, focusing on the truncation and offset functionality to ensure reproducibility.
  • Makefile Addition: A Makefile has been added to generate the files needed for the gRPC imports in vllm/tests/test_server_tokenize_truncate.py.
  • Direct Use of tokenizer.encode_plus(): The tokenizer.encode_plus() function is called directly to avoid modifying the tokenizer_group.encode_async function, which is part of the upstream code; this keeps the upstream codebase intact. It does not affect performance, because tokenizer_group.encode_async does not run Tokenizer.encode in a background thread anyway (a sketch follows below).
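
A minimal sketch of what truncation plus offset support via encode_plus() can look like, assuming a Hugging Face fast tokenizer; the function signature and the names truncate_input_tokens and return_offsets follow the PR description rather than the exact request protos:

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any fast tokenizer works

    def tokenize(text: str, truncate_input_tokens: int = 0, return_offsets: bool = False):
        kwargs = {}
        if truncate_input_tokens > 0:
            # Keep at most `truncate_input_tokens` tokens
            kwargs["truncation"] = True
            kwargs["max_length"] = truncate_input_tokens
        batch_encoding = tokenizer.encode_plus(
            text=text,
            return_offsets_mapping=return_offsets,  # offsets require a fast tokenizer
            **kwargs,
        )
        return batch_encoding["input_ids"], batch_encoding.get("offset_mapping")

    ids, offsets = tokenize("Hello world, this is a test.",
                            truncate_input_tokens=4, return_offsets=True)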

@fialhocoelho changed the title from "Tokenization with Truncation and Offset Support ad tests" to "Tokenization with truncation, offset support and tests" on Jun 14, 2024
Signed-off-by: Jefferson Fialho <jfialho@ibm.com>
@fialhocoelho force-pushed the jeff-token-truncate branch from 6a521af to 4fbef92 on June 15, 2024 05:46
@fialhocoelho marked this pull request as ready for review June 15, 2024 05:53
batch_encoding = self.tokenizer.encode_plus(
    text=req.text,
    return_offsets_mapping=request.return_offsets
)
Contributor

@njhill, here the code calls encode_plus directly because adding an async wrapper would require changing the vLLM code. I don't think it has a performance impact, though, because the actual encoding in encode_async is not offloaded to a thread pool or anything:

    async def encode_async(
            self,
            prompt: str,
            request_id: Optional[str] = None,
            lora_request: Optional[LoRARequest] = None) -> List[int]:
        tokenizer = await self.get_lora_tokenizer_async(lora_request)
        ret = tokenizer.encode(prompt)
        self._raise_if_input_too_long(ret, lora_request)
        return ret
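
Since tokenizer.encode(prompt) above runs synchronously on the event loop, calling encode_plus() directly in the async handler blocks for the same amount of time, so there is no regression. If offloading ever became desirable, something along these lines could move the blocking call onto a worker thread (a sketch only, not part of this PR or of vLLM):

    import asyncio

    async def encode_plus_offloaded(tokenizer, text: str, return_offsets: bool = False):
        # Run the blocking tokenization in a worker thread instead of on the event loop
        return await asyncio.to_thread(
            tokenizer.encode_plus, text, return_offsets_mapping=return_offsets
        )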

@joerunde
Collaborator

joerunde commented Jun 20, 2024

for 515

@joerunde
Collaborator

@prashantgupta24 can you help coordinate getting this into the tgis adapter?

tokenized results.
"""
# Log the incoming tokenization request for metrics
service_metrics.observe_tokenization_request(request)
Member

Suggested change:
- service_metrics.observe_tokenization_request(request)
+ service_metrics.count_tokenization_request(request)

Member

I'm going to be making this change in the adapter repo

@njhill
Contributor

njhill commented Jul 12, 2024

Closing since this was ported to the vllm-tgis-adapter repo. Thanks @fialhocoelho.

@njhill closed this Jul 12, 2024
@prashantgupta24 deleted the jeff-token-truncate branch July 12, 2024 23:58
tdoublep added a commit that referenced this pull request Jan 20, 2025
…U issue (#47)

This PR contains some changes:

- It enables the `only_last_token` option when calling FMS. This only
copies the logits from the last token back in the first iteration, which
leads to a faster TTFT. We will need to check whether this creates issues for
some vLLM features like `prompt_logprobs` at a later stage. (See the sketch
after this list.)
- There is a bug in torch 2.3.1 that gives incorrect results on CPU
for prompts longer than 512 tokens. Setting `attn_algorithm=math` works
around this for the time being.
- Change the mask format to what FMS is currently using (note that I have
not tested the right-padding part of the code, since we are not using it).
- Load the model with the precision specified by the vLLM user (rather than
hard-coding it).
- Remove some unused code.
- Refresh aiu_setup to the latest version from aiu-fms.
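
A rough sketch of how the first two items above might look at the call site; `model` stands in for an FMS model, and the exact plumbing of `only_last_token` and `attn_algorithm` through the adapter is an assumption based on this commit message:

    import torch

    def prefill_logits(model, input_ids: torch.Tensor) -> torch.Tensor:
        # only_last_token=True: copy back only the last position's logits during
        # the first (prefill) iteration, which improves TTFT.
        # attn_algorithm="math": works around incorrect CPU results in torch 2.3.1
        # for prompts longer than 512 tokens.
        return model(
            input_ids,
            only_last_token=True,
            attn_algorithm="math",
        )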

---------

Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>