Hi @un-certainty, yes, if you are using CUDAExecutionProvider, IO Binding is probably helpful. I don't have a proper benchmark at hand, though.
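For reference, here is a minimal sketch of what IO Binding looks like with onnxruntime's Python API. The model path, input shape, and the input/output names (`input_ids`, `logits`) are illustrative assumptions, not tied to a specific export:

```python
import numpy as np
import onnxruntime as ort

# Hypothetical decoder export; replace with your own model path.
session = ort.InferenceSession(
    "decoder_model.onnx", providers=["CUDAExecutionProvider"]
)

# Dummy batch of token ids (shape and vocab size are arbitrary here).
input_ids = np.random.randint(0, 50257, size=(1, 16), dtype=np.int64)

io_binding = session.io_binding()

# Copy the input to the GPU once, then bind it so ORT reads it in place.
input_on_gpu = ort.OrtValue.ortvalue_from_numpy(input_ids, "cuda", 0)
io_binding.bind_ortvalue_input("input_ids", input_on_gpu)

# Bind the output to the GPU so ORT allocates it there instead of copying
# the result back to host memory after every run.
io_binding.bind_output("logits", "cuda", 0)

session.run_with_iobinding(io_binding)
logits = io_binding.get_outputs()[0]  # an OrtValue still living on the GPU
```

The point of the binding is to avoid the host-to-device and device-to-host copies that `session.run()` would otherwise do on every call, which matters most when you feed the outputs of one step back in as inputs (as in autoregressive decoding).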
Also, I wonder: if the caches are preserved on GPU, could that cause a memory explosion? When the QPS is high and sequences are long, there will be many intermediate tensors, and I'm not sure whether this could lead to OOM.
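As an aside on the OOM concern, the CUDA execution provider does expose options to bound its memory arena. A minimal sketch, assuming a local `decoder_model.onnx` and an arbitrary 2 GiB cap:

```python
import onnxruntime as ort

cuda_options = {
    "device_id": 0,
    "gpu_mem_limit": 2 * 1024 * 1024 * 1024,      # cap the arena at 2 GiB
    "arena_extend_strategy": "kSameAsRequested",  # grow only as much as needed
}
session = ort.InferenceSession(
    "decoder_model.onnx",
    providers=[("CUDAExecutionProvider", cuda_options), "CPUExecutionProvider"],
)
```

This bounds the provider's allocations rather than the size of your bound tensors, so with high QPS and long sequences you would still need to manage how long the cache OrtValues are kept alive.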
Feature request
PR #647 was merged, adding support for a merged without/with-past decoder as a single ONNX file, along with inference in ORTModelForCausalLM (a usage sketch follows the list below).
Some key steps still remain:
- `CleanUnusedInitializersAndNodeArgs` warnings are printed only with subgraphs, tracked in microsoft/onnxruntime#14694:

        2023-02-10 16:29:24.868007832 [W:onnxruntime:, graph.cc:3487 CleanUnusedInitializersAndNodeArgs] Removing initializer '/transformer/h.4/attn/Constant_18_output_0'. It is not used by any node and should be removed from the model.

- `bloom` […] that is currently ugly
- `codegen` does not support `-with-past` in `tasks.py`
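For context, a rough sketch of how the merged decoder is meant to be consumed through ORTModelForCausalLM. The exact `from_pretrained` flags (`export`, `use_cache`) may differ across optimum versions, so treat this as an assumption rather than the canonical API:

```python
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Export gpt2 to ONNX on the fly and run it with ONNX Runtime; with the
# merged decoder, the without-past and with-past branches live in one file.
model = ORTModelForCausalLM.from_pretrained("gpt2", export=True, use_cache=True)

inputs = tokenizer("Hello, my dog is", return_tensors="pt")
generated = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```

The first generation step runs the without-past branch of the merged graph, and subsequent steps run the with-past branch, so only one ONNX file needs to be loaded instead of two.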
Motivation
Reduce memory usage
Your contribution
/