[Feature Branch] KV Cache Interface #1083
Merged
Conversation
* initial commit
* coreys simplifications
* finishing the second model static
* ready, time for beautification
* ready for review
* moved the code to examples
* fix eos logic
* add argument num_tokens_to_generate
* initial commit
* change order
* Update examples/codegen/README.md (Co-authored-by: corey-nm <109536191+corey-nm@users.noreply.github.com>)
* … window not yet implemented!
* …esult. Hey, this is good news still
* …E: tokens past the base seq len are repeated
* …in the wrong place
dbogunowicz commented Jul 7, 2023
* …lmagic/deepsparse into feature/damian/fb_kv_cache
bfineran previously approved these changes Jul 10, 2023
LGTM - we 100% need a bit more testing, let's make a plan for that. Let's also include the deepsparse vs ort perplexities in the description
rahul-tuli previously approved these changes Jul 11, 2023
Co-authored-by: Rahul Tuli <rahul@neuralmagic.com>
bfineran previously approved these changes Jul 11, 2023
rahul-tuli previously approved these changes Jul 12, 2023
bfineran approved these changes Jul 12, 2023
Feature Preview
Feature branch that aggregates all the features constituting the KV Cache Interface implementation. This includes:
* No-cache inference
* Single-token engine decoding only:
```
2023-06-27 07:55:20 deepsparse.transformers.engines.nl_decoder_engine INFO Overwriting in-place the input shapes of the transformer model at /home/ubuntu/damian/sparseml/deployment/model.onnx
2023-06-27 07:55:24 deepsparse.utils.onnx INFO Overwriting in-place the batch size of the model at /home/ubuntu/damian/sparseml/deployment/model.onnx
2023-06-27 07:56:37 deepsparse.transformers.engines.nl_decoder_engine INFO Overwriting in-place the input shapes of the transformer model at /home/ubuntu/damian/sparseml/deployment/model.onnx
2023-06-27 07:56:40 deepsparse.utils.onnx INFO Overwriting in-place the batch size of the model at /home/ubuntu/damian/sparseml/deployment/model.onnx

['\n\nThe president of the United States is the head of the executive branch of government. The president is the head of the executive branch of government, and the president is the head of the executive branch of government. The president is the head of the executive branch of government, and the president is the head of the executive branch of government.\n\nThe president is the head of the executive branch of government, and the president is the head of the executive branch of government. The president is the head of the executive branch of government, and the president is the head of the executive branch of government. The president is the head of the executive']
```
* Single-token engine and multi-token engine decoding:
```
2023-06-27 07:57:53 deepsparse.transformers.engines.nl_decoder_engine INFO Overwriting in-place the input shapes of the transformer model at /home/ubuntu/damian/sparseml/deployment/model.onnx
2023-06-27 07:58:47 deepsparse.utils.onnx INFO Overwriting in-place the batch size of the model at /home/ubuntu/damian/sparseml/deployment/model.onnx
2023-06-27 07:58:52 deepsparse.transformers.engines.nl_decoder_engine INFO Overwriting in-place the input shapes of the transformer model at /home/ubuntu/damian/sparseml/deployment/model.onnx
2023-06-27 07:58:58 deepsparse.utils.onnx INFO Overwriting in-place the batch size of the model at /home/ubuntu/damian/sparseml/deployment/model.onnx

['Who is the president of the United States?Who is the president of the United States?Who is the president of the United States?Who is the president of the United States?Who is the president of the United States?Who is the president of the United States?Who is the president of the United States?Who is the president of the United States?Who is the president of the United States?Who is the president of the United States?Who is the president of the United States?Who is the president of the United States?Who is the president of the United States?Who is the president of the United States?Who is']
```
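For reference, generations like the ones above can be produced through a text-generation pipeline. A minimal sketch; the task name, argument names, and model path are assumptions taken from the logs above, not the exact API of this branch:

```python
from deepsparse import Pipeline

# Sketch only: task name, model path, and argument names are assumptions
# based on the logs above, not the exact API of this branch.
pipeline = Pipeline.create(
    task="text_generation",
    model_path="/home/ubuntu/damian/sparseml/deployment",  # directory containing model.onnx
)
output = pipeline(sequences="Who is the president of the United States?")
print(output)
```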
Testing Scope
Manual Tests
The script below terminates without raising an error
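A minimal stand-in for such a script, under the same Pipeline assumptions as the sketch above (model path and prompt are placeholders):

```python
from deepsparse import Pipeline

# Smoke test: the check passes if generation completes without raising.
pipeline = Pipeline.create(task="text_generation", model_path="deployment/")
output = pipeline(sequences="def fibonacci(n):")
assert output is not None
print("manual test passed")
```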
Testing with `eval_downstream`
HF baseline:
Result with kv cache model
Result with non-kv cache model
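For reference, an HF baseline perplexity can be computed with the standard transformers API; the model name and evaluation text below are placeholders:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def hf_perplexity(model_name: str, text: str) -> float:
    """Perplexity of `text` under a Hugging Face causal LM."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # Labels are shifted internally; loss is the mean token NLL.
        loss = model(input_ids, labels=input_ids).loss
    return float(torch.exp(loss))

print(hf_perplexity("gpt2", "The president of the United States is the head of state."))
```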
Current Limitations
We are currently unable to use the internal `LIB.kv_cache` object for cache manipulation. We are also unable to run multi-token inference in the engine due to an issue with "zero-length" cache ingestion: in deepsparse engine inference, whenever a sequence would normally be processed by the multi-token engine, the single-token engine takes over instead.
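A hypothetical sketch of that fallback; the engine objects and names here are illustrative, not the actual deepsparse internals:

```python
def decode(tokens, single_token_engine, multi_token_engine):
    """Run decoding, falling back to the single-token engine."""
    # Normally a long prompt would be ingested in chunks by the multi-token
    # engine. Because "zero-length" cache ingestion is not yet supported,
    # the single-token engine processes every position instead.
    MULTI_TOKEN_SUPPORTED = False  # current limitation

    if MULTI_TOKEN_SUPPORTED and len(tokens) > 1:
        return multi_token_engine(tokens)

    logits = None
    for token in tokens:
        logits = single_token_engine([token])  # one token at a time
    return logits
```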