GenAI is running 2x as fast as vanilla onnxruntime #836

elephantpanda · 2024-08-24T04:35:08Z

OK, this is not a bug. I am running phi-mini-int4 through the usual onnxruntime C# API, and it is 2x slower than the GenAI code. I am using the DirectML C# managed API, testing with sequence_length=1 on each iteration and with bound inputs and outputs. For the test I simply call this in a loop without changing the input each time, yet it is still not as fast as GenAI:
session.RunWithBinding(runOptions, binding);

So in that sense I can say well done for making genai so fast. 🙂

On the other hand, I wonder if you can share the settings or source code for things like sessionOptions and so on. GenAI is good but I really need to use the full capability of onnxruntime API. Since I believe GenAI is built on top of onnxruntime, it would be nice to be able to see the source code for this so I can make my app using onnxruntime API as fast as the GenAI code.

I am using the managed onnxruntime library from nuget 1.19.1 and it is using the DirectML.dll which was installed with genai.

Thanks for any help you can give.

yufenglee · 2024-08-26T21:51:29Z

As I replied in microsoft/onnxruntime#21847 (comment): in GenAI, the past and present KV caches share the same buffer. Please refer to:

onnxruntime-genai/src/models/kv_cache.cpp, lines 188 to 197 in 82fdb5e:

      if (past_present_share_buffer_) {
        for (int i = 0; i < layer_count_ * 2; ++i) {
          state_.inputs_[input_index_ + i] = presents_[i].get();
        }
      }
    }

    void KV_Cache::Update(std::span<const int32_t> beam_indices, int current_length) {
      // If we're sharing past & present buffers there is nothing to do here, so early exit
      if (past_present_share_buffer_)

This behavior is controlled by the past_present_share_buffer option in genai_config.json:
https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-onnx/blob/main/directml/directml-int4-awq-block-128/genai_config.json#L51
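To illustrate why sharing the past and present buffers can matter so much at sequence_length=1, here is a minimal sketch of a toy single-layer cache. It is not GenAI's implementation: `StepSeparate`, `SharedKV`, `CountSeparate`, and `CountShared` are hypothetical names made up for this example, and counting copied floats is only a rough stand-in for the real memory traffic on the device.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model (assumption): one layer, one head, each token adds `head_dim` floats.

// Separate past/present: every step builds a new "present" tensor that holds
// the whole past plus the new token, so the past is copied again each step.
std::vector<float> StepSeparate(const std::vector<float>& past,
                                const std::vector<float>& new_kv,
                                std::size_t& floats_copied) {
  std::vector<float> present;
  present.reserve(past.size() + new_kv.size());
  present.insert(present.end(), past.begin(), past.end());
  present.insert(present.end(), new_kv.begin(), new_kv.end());
  floats_copied += past.size() + new_kv.size();
  return present;
}

// Shared past/present: one buffer preallocated for the maximum length.
// Each step writes only the new token's values in place.
struct SharedKV {
  std::vector<float> buffer;
  std::size_t head_dim;
  std::size_t length = 0;  // number of tokens currently stored

  SharedKV(std::size_t max_length, std::size_t dim)
      : buffer(max_length * dim), head_dim(dim) {}

  void Append(const std::vector<float>& new_kv, std::size_t& floats_copied) {
    assert(new_kv.size() == head_dim);
    assert((length + 1) * head_dim <= buffer.size());
    std::copy(new_kv.begin(), new_kv.end(), buffer.begin() + length * head_dim);
    floats_copied += new_kv.size();
    ++length;
  }
};

// Total floats copied over `steps` decoding steps without buffer sharing.
std::size_t CountSeparate(std::size_t steps, std::size_t head_dim) {
  std::size_t floats_copied = 0;
  std::vector<float> past;
  const std::vector<float> new_kv(head_dim, 1.0f);
  for (std::size_t s = 0; s < steps; ++s) {
    past = StepSeparate(past, new_kv, floats_copied);
  }
  return floats_copied;
}

// Total floats copied over `steps` decoding steps with a shared buffer.
std::size_t CountShared(std::size_t steps, std::size_t head_dim) {
  std::size_t floats_copied = 0;
  SharedKV cache(steps, head_dim);
  const std::vector<float> new_kv(head_dim, 1.0f);
  for (std::size_t s = 0; s < steps; ++s) {
    cache.Append(new_kv, floats_copied);
  }
  return floats_copied;
}
```

In this toy model the separate-buffer path copies an amount that grows quadratically with the number of generated tokens, while the shared-buffer path grows linearly. For example, `CountSeparate(4, 2)` returns 20 and `CountShared(4, 2)` returns 8. With plain onnxruntime bindings, getting the same effect would presumably mean binding the same preallocated buffer as both the past input and the present output of each layer, provided the model and execution provider support it.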

natke · 2024-09-05T21:37:35Z

Closing this issue, as I believe we have an explanation. Please re-open it or create a new issue if needed.

github-actions bot added the ep:DML label Aug 24, 2024

natke closed this as completed Sep 5, 2024
