GenAI is running 2x as fast as vanilla onnxruntime #836

Closed
elephantpanda opened this issue Aug 24, 2024 · 2 comments

elephantpanda commented Aug 24, 2024

OK, this is not a bug, but I am running phi-mini-int4 using the usual onnxruntime C# API and it is 2x as slow as the GenAI code. I am using the DirectML C# managed API, testing with sequence_length=1 on each iteration and using bound inputs and outputs. Basically I am just calling this in a loop, without changing the input between calls for testing, but it is still not as fast as GenAI:
session.RunWithBinding(runOptions, binding);

So in that sense I can say well done for making genai so fast. 🙂

On the other hand, I wonder if you can share the settings or source code for things like sessionOptions and so on. GenAI is good but I really need to use the full capability of onnxruntime API. Since I believe GenAI is built on top of onnxruntime, it would be nice to be able to see the source code for this so I can make my app using onnxruntime API as fast as the GenAI code.

I am using the managed onnxruntime library from nuget 1.19.1 and it is using the DirectML.dll which was installed with genai.

Thanks for any help you can give.

Member

yufenglee commented Aug 26, 2024

As I replied in microsoft/onnxruntime#21847 (comment), in GenAI the past and present KV caches share the same buffer. Please refer to the code here:

  if (past_present_share_buffer_) {
    for (int i = 0; i < layer_count_ * 2; ++i) {
      state_.inputs_[input_index_ + i] = presents_[i].get();
    }
  }
}

void KV_Cache::Update(std::span<const int32_t> beam_indices, int current_length) {
  // If we're sharing past & present buffers there is nothing to do here, so early exit
  if (past_present_share_buffer_)
  ...

This is controlled by the past_present_share_buffer option in genai_config.json:
https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-onnx/blob/main/directml/directml-int4-awq-block-128/genai_config.json#L51
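
To make the effect of the shared buffer concrete, here is a minimal toy sketch of the idea, not the GenAI implementation. The names ToyKvCache, RunDecode and kHeadDim are hypothetical and exist only for illustration; plain std::vector stands in for device tensors, and a single "layer" is modeled. It assumes that without sharing, each step produces a new present tensor holding the old past plus one new row, while with sharing the new row is written in place into a buffer preallocated for the maximum length.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical toy model: each cached token contributes kHeadDim floats.
constexpr std::size_t kHeadDim = 4;

struct ToyKvCache {
  bool share_buffer;              // plays the role of past_present_share_buffer
  std::vector<float> past;        // read by the "model" when not sharing
  std::vector<float> present;     // written by the "model"
  std::size_t length = 0;         // number of tokens cached so far
  std::size_t floats_copied = 0;  // how much old cache data had to be copied

  ToyKvCache(bool share, std::size_t max_length) : share_buffer(share) {
    // When sharing, one buffer is allocated up front for the maximum length.
    if (share_buffer) present.assign(max_length * kHeadDim, 0.0f);
  }

  // One decoding step with sequence_length = 1.
  // When sharing, the caller must keep length below max_length.
  void Step(float value) {
    if (share_buffer) {
      // In place: only the new token's row is written.
      for (std::size_t d = 0; d < kHeadDim; ++d)
        present[length * kHeadDim + d] = value;
    } else {
      // New present tensor = copy of the whole past + one new row.
      present.assign(past.begin(), past.end());
      floats_copied += past.size();
      present.insert(present.end(), kHeadDim, value);
    }
    ++length;
  }

  // Mirrors the early exit quoted above: nothing to do when sharing.
  void Update() {
    if (share_buffer) return;
    std::swap(past, present);  // this step's present becomes the next past
  }

  // The buffer holding the up-to-date cache after Update().
  const std::vector<float>& Cache() const {
    return share_buffer ? present : past;
  }
};

// Runs `steps` decoding steps and reports how many floats were copied.
std::size_t RunDecode(bool share, std::size_t steps) {
  ToyKvCache cache(share, steps);
  for (std::size_t t = 0; t < steps; ++t) {
    cache.Step(static_cast<float>(t));
    cache.Update();
  }
  return cache.floats_copied;
}
```

In this toy model, RunDecode(false, 8) copies 112 floats (the copied amount grows with every step), while RunDecode(true, 8) copies none. Real numbers depend on the model, the execution provider and how the attention operator handles the cache, so treat this only as an illustration of why a loop over session.RunWithBinding with separate past and present tensors can be slower.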

Contributor

natke commented Sep 5, 2024

Closing this issue as I believe we have an explanation. Please re-open or create a new issue if needed.

natke closed this as completed Sep 5, 2024