[Performance] Why does genai run 2x as fast as vanilla managed onnxruntime? #21847
Source code of genai: https://github.com/microsoft/onnxruntime-genai. For example, use I/O binding to bind past and present to a fixed buffer. Otherwise, copying the KV cache will slow down generation significantly.
If you look at the GenAI code, the GenAI library doesn't use I/O binding, but it passes preallocated output OrtValues to the Session::Run() function. This has the same performance benefit, as it avoids copies and allocations. I'm not sure if this is convenient in the C# APIs.
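In C#, a roughly equivalent pattern is the OrtValue-based Run overload with caller-supplied outputs; a minimal sketch, assuming an existing InferenceSession named `session` and illustrative input/output names and shapes:

```csharp
// Create the input and output OrtValues once, over managed arrays, and reuse them each step.
// (The KV cache, attention mask and position inputs a real decoder needs are omitted here.)
long[] inputIds = new long[1];                      // sequence_length = 1
float[] logits = new float[1 * 1 * 32064];          // 32064 = phi-3-mini vocab size (illustrative)

using var inputValue = OrtValue.CreateTensorValueFromMemory(inputIds, new long[] { 1, 1 });
using var outputValue = OrtValue.CreateTensorValueFromMemory(logits, new long[] { 1, 1, 32064 });

string[] inputNames = { "input_ids" };
string[] outputNames = { "logits" };
using var runOptions = new RunOptions();

long nextToken = 0;                                 // whatever the sampler picked last step
for (int step = 0; step < 100; step++)
{
    inputIds[0] = nextToken;                        // update the input in place, no new allocation
    session.Run(runOptions, inputNames, new[] { inputValue }, outputNames, new[] { outputValue });
    // logits[] now holds this step's output; no managed copy of the output tensor was made
}
```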
Thanks, I will try it. BTW I am using the .NET Standard 2.0 API for onnxruntime. I don't know if it would make a difference using a different version like .NET 6.0? (I assumed it wouldn't, since it's mostly just calling functions in the dll?) If I can give you some more information about why I want to use the onnxruntime API rather than the genai API: mainly I would like more control over manipulating the inputs and outputs, e.g. the input tokens and the output probability vectors, which unfortunately is not currently accessible with the genai API (even though it's good for getting up and running fast, which is appreciated). In an ideal world it would be nice if these two libraries had more compatibility, such as using the same tensor format. Thanks. These are my session options so far, which I tried to copy from the genai code. Apart from the execution provider, the others don't seem to have much effect:
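Something along these lines (a representative sketch of DirectML session options; the exact values here are illustrative, not the original snippet):

```csharp
var sessionOptions = new SessionOptions();
sessionOptions.GraphOptimizationLevel = GraphOptimizationLevel.ORT_ENABLE_ALL;
sessionOptions.EnableMemoryPattern = false;                 // memory patterns are not supported with DML
sessionOptions.ExecutionMode = ExecutionMode.ORT_SEQUENTIAL;
sessionOptions.AppendExecutionProvider_DML(0);              // device id 0; requires the DirectML package
using var session = new InferenceSession(@"phi3-mini-int4\model.onnx", sessionOptions);
```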
I changed the bound output, but now when I call it, it slows down again. So back to the drawing board...
You do not need IOBinding. With the new OrtValue-based API you can achieve the same performance and avoid much of the garbage collection. https://onnxruntime.ai/docs/tutorials/csharp/basic_csharp.html
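For example, a small sketch of wrapping existing managed memory as a tensor and reading results back without extra allocations (the shape values are illustrative):

```csharp
int batch = 1, seqLen = 1, vocab = 32064;            // illustrative shapes
float[] buffer = new float[batch * seqLen * vocab];

// Wrap an existing managed array as a tensor OrtValue without copying it.
using var value = OrtValue.CreateTensorValueFromMemory(buffer, new long[] { batch, seqLen, vocab });

// Read tensor contents back as a span, again without allocating new managed arrays.
ReadOnlySpan<float> data = value.GetTensorDataAsSpan<float>();
```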
OK thanks. Well that makes things a bit easier. 😊 I'm still not sure why my onnxruntime code is slower than the genai code. I'll see if I can share my project. Or if there is already a pure C# onnxruntime API project that someone has made for an LLM, it would be nice to look at. I think it's actually the model itself that is running faster using the genai code. There's probably some trick I missed somewhere. 🤔 Or perhaps it's just the managed .NET runtime that is missing some trick (like does it support int4?). Or perhaps there's some setting I'm missing when passing back in the cached key values. I'll keep trying at it.
We (the GenAI team) have been trying to figure out what types of custom scoring people will be doing, to keep the API simple. Can you share more about what custom scoring you're doing? We have some proposed APIs to return the logits and let you append tokens during the generation loop, but with all of the different providers (cuda/directml/etc) it's tricky optimizing the data flow to avoid copies. A simple pseudocode of what you're doing, perhaps with an imaginary GenAI API, would be great, so we can see if we can make it possible.
Hi, thanks for your reply. Here is an example. One problem I'm having is that sometimes GenAI generates a premature END token, and I want to tell it to pick a different one. In other words, I want to change the probability of certain tokens at various steps, or just have my own custom function to select the token myself given the probabilities. I'd also like it for experimentation, to try out different algorithms such as my own implementation of beam search, or speculative decoding (using a smaller model to predict a few tokens in advance). It is nice to have hard-coded solutions but I'd also like the flexibility to experiment. For making an app, especially in a game, it is important to be able to experiment, optimise and find different "tricks". I would be quite happy if there was a function for this. Here is some pseudo code for a chat-like model:
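Something like this; the generator functions used here (GetLogits, AppendTokens, AppendToken, IsDone) are hypothetical names for the sort of API meant, not the current GenAI API:

```csharp
// Hypothetical generation loop with a custom token-selection step.
var tokens = tokenizer.Encode(systemPrompt + userMessage);
generator.AppendTokens(tokens);

while (!generator.IsDone())
{
    float[] logits = generator.GetLogits();          // hypothetical: logits for the last position

    int next = MyCustomSampler(logits);              // my own selection, e.g. temperature / top-k
    if (next == endTokenId && !EndIsAllowedYet())
    {
        logits[endTokenId] = float.NegativeInfinity; // suppress the premature END token
        next = MyCustomSampler(logits);              // pick a different one
    }

    generator.AppendToken(next);                     // hypothetical
}
```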
So GenAi works great except for a few issues:
So these are my main roadblocks. For balance here are my points about why I would like to use GenAi over pure onnxruntime code:
Hope this helps 🙂
A big improvement from GenAI that is not mentioned above is that the past and present KV cache share the same buffer, i.e., only the KV for newly generated tokens needs to be appended to the existing cache. It avoids copying the past KV cache. For the issues with genai, we can discuss in detail in the GenAI repo.
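As a toy illustration of the difference (not the actual GenAI code):

```csharp
// Without a shared buffer: each step allocates a new cache and copies the old one forward.
float[] GrowByCopy(float[] pastKv, float[] newKv)
{
    var present = new float[pastKv.Length + newKv.Length];
    Array.Copy(pastKv, present, pastKv.Length);                 // O(sequence length) copy per step
    Array.Copy(newKv, 0, present, pastKv.Length, newKv.Length);
    return present;
}

// With a shared buffer sized for max_length: only the new tokens' KV is written, in place.
void AppendInPlace(float[] sharedKv, ref int usedLength, float[] newKv)
{
    Array.Copy(newKv, 0, sharedKv, usedLength, newKv.Length);
    usedLength += newKv.Length;
}
```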
Interesting, perhaps that is what is giving the big speed-up? 🤔 Well, who knows.
Thanks.
This is great to know. So for your case, would these hypothetical APIs let you do what you want?
Would OgaTensor always being in CPU memory be a problem, or would you expect it to be in DML device memory?
I think that's about right. For me personally I might prefer something slightly different. As for CPU, from my perspective that doesn't bother me, as it's only 32064 values, which is barely anything. That's just my opinion, and I'd most likely do the calculation on the CPU. This would get the logits/probability for only one token, although for something like speculative decoding it requires getting the logits for more than one position in the output, so in an ideal world this would be supported too. P.S. As well as getting the logits, something like a RemoveLastToken() function would also be useful.
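To illustrate the kind of CPU-side selection meant here, a minimal sketch assuming the logits for one position are handed back as a float span:

```csharp
// Softmax over the vocabulary on the CPU, optionally suppressing the END token, then sample.
static int SelectToken(ReadOnlySpan<float> logits, int endTokenId, bool allowEnd, Random rng)
{
    double max = double.MinValue;
    for (int i = 0; i < logits.Length; i++) max = Math.Max(max, logits[i]);

    var probs = new double[logits.Length];
    double sum = 0;
    for (int i = 0; i < logits.Length; i++)
    {
        probs[i] = (i == endTokenId && !allowEnd) ? 0 : Math.Exp(logits[i] - max);
        sum += probs[i];
    }

    double r = rng.NextDouble() * sum;               // sample from the (unnormalised) distribution
    double acc = 0;
    for (int i = 0; i < probs.Length; i++)
    {
        acc += probs[i];
        if (acc >= r) return i;
    }
    return logits.Length - 1;                        // numerical fallback
}
```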
Yes, you can try disabling the past_present_share_buffer option and you will be able to see the difference.
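The option sits in the model's genai_config.json under the search section; a minimal excerpt (all other fields omitted):

```json
{
  "search": {
    "past_present_share_buffer": false
  }
}
```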
I tried it. Unfortunately it gives me an error if I disable it (is this expected?). Here is the error:
(The context length is 4096 and my input string comes to 11 tokens.) Do I have to pad the input?
I see. You're using DML. It is required for the DML EP.
Returning the raw logits is the clearest for an API like this. Softmax is just one of the internal steps that might be used in processing the logits, and there are variations on it. For speculative decoding, it sounds like you need 'GetLogits()' to be sized to match the number of tokens added. So when adding multiple speculated tokens, you'd get back the same count in the returned logits. For 'RemoveLastToken()' we are planning on adding a 'Rewind()' function that lets you rewind the generation process by any number of tokens. This should cover what you need.
Yes, that sounds like it covers everything 🙂. I can't think of any other things, but other people might have some ideas. (Just to be clear, with the speculative decoding it's getting the logits (or predicted token) for several positions of the output in a single iteration (a new token plus the past N tokens), rather than accumulating them over several iterations, then looking at the past N tokens, seeing which were predicted correctly and rejecting the others.) It's probably not a big deal at the moment, since it would require a smaller model compatible with the phi-3 tokenizer, and I'm not sure there is one at the moment. It works best for highly predictable text, like code or speech recognition (like Whisper). I have tried this before with other models and could get up to a 2x speed-up, sometimes more. So it's worth supporting if possible. There are also even more complicated versions of this using batches, which I just learned about today! Another thing logits would be useful for is calculating an average "confidence score" for a sentence, by averaging the probabilities that were used to select each token.
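As a sketch of that flow, using the proposed GetLogits()/Rewind() shape (all names, including the draft model, are hypothetical):

```csharp
int[] draft = draftModel.PredictNextTokens(n);        // hypothetical small model sharing the tokenizer
generator.AppendTokens(draft);
float[][] logits = generator.GetLogits();             // proposed: one row per appended position

// Sketch only: assumes draft[0] was already checked against the previous step's logits,
// and that logits[i] (the prediction after draft[i]) verifies draft[i + 1].
int accepted = 1;
for (int i = 0; i + 1 < draft.Length; i++)
{
    if (ArgMax(logits[i]) != draft[i + 1]) break;      // first mismatch: stop accepting
    accepted++;
}

generator.Rewind(draft.Length - accepted);             // proposed: rewind generation by N tokens

static int ArgMax(float[] row)
{
    int best = 0;
    for (int i = 1; i < row.Length; i++) if (row[i] > row[best]) best = i;
    return best;
}
```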
This issue has been automatically marked as stale due to inactivity and will be closed in 30 days if no further activity occurs. If further support is needed, please provide an update and/or more details.
Describe the issue
I am running phi3-mini-int4 using the usual onnxruntime C# API and it is 2x as slow as when I use the genai code. I am using the DirectML C# managed API and am testing it with sequence_length=1 each iteration and using bound inputs and outputs. Basically I am just calling this in a loop, not changing the input each time for testing, but it is still not as fast as genai:
session.RunWithBinding(runOptions, binding);
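Roughly, the binding setup around that call looks like this (a simplified sketch; tensor names and shapes are illustrative, and the KV-cache bindings a real decoder needs are omitted):

```csharp
using var runOptions = new RunOptions();
using var binding = session.CreateIoBinding();

// Bind the input to a fixed buffer that is updated in place each step.
var inputIds = new DenseTensor<long>(new[] { 1, 1 });                // sequence_length = 1
using var inputValue = FixedBufferOnnxValue.CreateFromTensor(inputIds);
binding.BindInput("input_ids", inputValue);

// Bind the output to a preallocated buffer so the logits are written straight into it.
var logits = new DenseTensor<float>(new[] { 1, 1, 32064 });
using var logitsValue = FixedBufferOnnxValue.CreateFromTensor(logits);
binding.BindOutput("logits", logitsValue);

for (int step = 0; step < 100; step++)
{
    session.RunWithBinding(runOptions, binding);
}
```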
So in that sense I can say well done for making genai so fast. 🙂
On the other hand, I wonder if you can share the settings or source code for things like sessionOptions and so on. GenAI is good, but I really need the full capability of the onnxruntime API. Since I believe GenAI is built on top of onnxruntime, it would be nice to be able to see the source code for this, so I can make my app using the onnxruntime API as fast as the GenAI code.
I am using the managed onnxruntime library from nuget 1.19.1 and it is using the DirectML.dll which was installed with genai.
Thanks for any help you can give.
To reproduce
Running a phi-3 model using the genai code and then trying to run the same model using the onnxruntime C# API.
Urgency
No response
Platform
Windows
OS Version
10
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
1.19.1
ONNX Runtime API
C#
Architecture
X64
Execution Provider
DirectML
Execution Provider Library Version
No response
Model File
No response
Is this a quantized model?
Yes