Add IOBinding support to ONNX Runtime module #421
Conversation
Thanks for working on this killer feature @JingyaHuang 🔥 !!
I've left a few nits, but the API design looks great to me. Would you mind sharing a small code example in the PR description once you have everything ready for a final review?
I'm especially interested to know if quantization / optimization play nicely with the current implementation :)
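For reference, one quick way to sanity-check the quantization interplay could look like the sketch below. It uses the ORTQuantizer API; the avx512_vnni dynamic config, model checkpoint, and save directory are illustrative choices, not taken from this PR, and the final reload assumes an execution provider that supports IO binding.

import torch
from optimum.onnxruntime import ORTModelForSequenceClassification, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# Export a vanilla ONNX model, then quantize it (dynamic int8, arbitrary config).
model = ORTModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english", from_transformers=True
)
quantizer = ORTQuantizer.from_pretrained(model)
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
quantizer.quantize(save_dir="quantized_model", quantization_config=qconfig)

# Reload the quantized model with IO binding enabled and run it as usual
# (assuming the chosen provider supports device-bound inputs/outputs).
q_model = ORTModelForSequenceClassification.from_pretrained(
    "quantized_model", file_name="model_quantized.onnx", use_io_binding=True
)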
Hi @lewtun, thanks for the review!!!
I will apply the helper to the other ORT models and update the PR description with a snippet once it is finished.
Awesome work @JingyaHuang! I added some first comments to io_binding_helper.py and the first model.
Most of my comments are about performance. I am not sure we need to keep things dynamically typed, since we have dedicated classes for each "task", which allows us to have more "static/defined" code in the forward method to improve latency.
It would be great if you could take a look at my comments and evaluate the performance of replacing those "dynamic loops" with a more "static" approach.
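To make the trade-off concrete, here is a minimal sketch of the two styles using the plain onnxruntime IOBinding API. The function names and the single "input_ids"/"logits" names are illustrative, not the helper's actual interface.

import numpy as np
import torch
import onnxruntime as ort

def forward_dynamic(session: ort.InferenceSession, input_ids: torch.Tensor):
    # Generic path: introspect the session and loop over its declared outputs.
    io_binding = session.io_binding()
    io_binding.bind_input(
        "input_ids", "cuda", 0, np.int64,
        tuple(input_ids.shape), input_ids.data_ptr(),
    )
    for output in session.get_outputs():  # the "dynamic loop"
        io_binding.bind_output(output.name, "cuda")
    session.run_with_iobinding(io_binding)
    return io_binding.get_outputs()

def forward_static(session: ort.InferenceSession, input_ids: torch.Tensor,
                   logits_buffer: torch.Tensor):
    # Task-specific path: the class knows its outputs up front, so it can
    # bind a pre-allocated buffer directly and skip the introspection loop.
    io_binding = session.io_binding()
    io_binding.bind_input(
        "input_ids", "cuda", 0, np.int64,
        tuple(input_ids.shape), input_ids.data_ptr(),
    )
    io_binding.bind_output(
        "logits", "cuda", 0, np.float32,
        tuple(logits_buffer.shape), logits_buffer.data_ptr(),
    )
    session.run_with_iobinding(io_binding)
    return logits_buffer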
@philschmid Thanks for reviewing! I have made some modifications according to the comments, and tests are also added.
Hi folks, thanks for helping out and reviewing. I think the PR is ready for final review. Now the IO binding is applied directly to all ORTModels except for […]. Also, since our last discussion, the buffers' size for […]. @lewtun Here is a (rough) snippet that I use for benchmarking:

from pathlib import Path
import numpy as np
import pandas as pd
from time import perf_counter
import torch
from transformers import AutoConfig, AutoModelForSeq2SeqLM, AutoTokenizer
from optimum.onnxruntime.modeling_seq2seq import ORTModelForSeq2SeqLM
from optimum.onnxruntime import ORTOptimizer
from optimum.onnxruntime.configuration import OptimizationConfig
model_id = "facebook/bart-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
onnx_path = Path("results_seq2seq/")
seq_lengths = [8, 16, 32, 64, 128, 256, 512]
# Load vanilla onnx model
model = ORTModelForSeq2SeqLM.from_pretrained(model_id, from_transformers=True)
model.save_pretrained(onnx_path)
tokenizer.save_pretrained(onnx_path)
# Graph optimization
optimizer = ORTOptimizer.from_pretrained(model)
optimization_config = OptimizationConfig(optimization_level=2) # enable all optimizations
optimizer.optimize(save_dir=onnx_path, optimization_config=optimization_config)
def benchmark(seq_len, model, tokenizer, device, iterations=200):
    # prepare data
    seq_len = "l " * (seq_len - 2)
    payload = tokenizer(seq_len, return_tensors="pt")
    payload = {key: val.to(device) for key, val in payload.items()}
    latencies = []
    # warm up
    for _ in range(10):
        _ = model.generate(**payload)
    # timed run
    for _ in range(iterations):
        start_time = perf_counter()
        _ = model.generate(**payload)
        latency = perf_counter() - start_time
        latencies.append(latency)
    # compute run statistics
    time_avg_ms = 1000 * np.mean(latencies)
    time_p95_ms = 1000 * np.percentile(latencies, 95)
    return {"seq_len": payload["input_ids"].shape[1], "time_avg_ms": time_avg_ms, "time_p95_ms": time_p95_ms}
device = torch.device("cuda:0")
# Baseline: PyTorch
config = AutoConfig.from_pretrained(model_id, use_cache=True)
pt_model = AutoModelForSeq2SeqLM.from_config(config)
pt_model.to(device)
# Case 1: Vanilla onnx with IO binding
v_onnx_model = ORTModelForSeq2SeqLM.from_pretrained(
    model_id, from_transformers=True, use_cache=True, use_io_binding=True
)
v_onnx_model.to(device)
# Case 2: graph optimized onnx with IOBinding
optim_onnx_model = ORTModelForSeq2SeqLM.from_pretrained(
    model_id="results_seq2seq",
    encoder_file_name="encoder_model_optimized.onnx",
    decoder_file_name="decoder_model_optimized.onnx",
    decoder_with_past_file_name="decoder_with_past_model_optimized.onnx",
    use_cache=True,
    use_io_binding=True,
)
optim_onnx_model.to(device)
# Benchmark
res = []
for seq_len in seq_lengths:
    print("seq_len: ", seq_len)
    pt = benchmark(seq_len, pt_model, tokenizer, device, iterations=500)
    res.append({**pt, "model": "pt"})
    v_onnx = benchmark(seq_len, v_onnx_model, tokenizer, device, iterations=500)
    res.append({**v_onnx, "model": "v_onnx"})
    optim_onnx = benchmark(seq_len, optim_onnx_model, tokenizer, device, iterations=500)
    res.append({**optim_onnx, "model": "optim_onnx"})
df = pd.DataFrame(res)
print(df)
chart_df = pd.merge(
    df[df.model == "pt"][["seq_len", "time_p95_ms"]],
    df[df.model == "v_onnx"][["seq_len", "time_p95_ms"]],
    on="seq_len",
)
chart_df = chart_df.rename(
    columns={
        "time_p95_ms_x": "pt_p95",
        "time_p95_ms_y": "v_onnx_p95",
    }
)
chart_df = pd.merge(
    chart_df,
    df[df.model == "optim_onnx"][["seq_len", "time_p95_ms"]],
    on="seq_len",
)
chart_df = chart_df.rename(
    columns={
        "time_p95_ms": "optim_onnx_p95",
    }
)
chart_df["io_improvement/pt"] = f"{round((chart_df['pt_p95'] - chart_df['v_onnx_p95']) / chart_df['pt_p95'] * 100,2)}%"
chart_df["io+optim/pt"] = f"{round((chart_df['pt_p95'] - chart_df['optim_onnx_p95']) / chart_df['pt_p95'] * 100,2)}%"
ax = chart_df.plot(x="seq_len", y=["pt_p95", "v_onnx_p95", "optim_onnx_p95"], kind="line")
ax.figure.savefig("gpu_res_iobinding_seq2seq.png", dpi=900)
print(chart_df.head(10))
chart_df.to_csv("gpu_res_iobinding_seq2seq.csv")

@echarlaix I added some seq2seq models to […]. Also gently pinging @philschmid, @michaelbenayoun and @fxmarty.
Huge PR @JingyaHuang 🔥 🚀
I just left a couple of minor comments.
Awesome work @JingyaHuang 🚀 ✅ Looks good to me. Everything else can be a follow-up PR.
I left two minor comments.
Context
As reported by users, there is sometimes a significant performance drop when running inference on accelerator devices, and the slow-down is especially significant for decoders (and therefore also for seq2seq models). This is due to the large overhead of copying data between the host and the device.
This PR introduces ONNX Runtime's IOBinding to place inputs and pre-allocate outputs on the device.
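A minimal sketch of the difference, with placeholder names (model.onnx, input_ids, logits) and assuming the CUDA execution provider:

import numpy as np
import torch
import onnxruntime as ort

session = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])

# Without IOBinding: feeding numpy arrays forces a host -> device copy of the
# inputs before the run and a device -> host copy of the outputs after it.
logits = session.run(None, {"input_ids": np.ones((1, 8), dtype=np.int64)})[0]

# With IOBinding: inputs and outputs stay on the GPU across calls.
input_ids = torch.ones((1, 8), dtype=torch.int64, device="cuda")
binding = session.io_binding()
binding.bind_input("input_ids", "cuda", 0, np.int64,
                   tuple(input_ids.shape), input_ids.data_ptr())
binding.bind_output("logits", "cuda")  # let ORT allocate the output on device
session.run_with_iobinding(binding)
logits_on_gpu = binding.get_outputs()[0]  # an OrtValue living on the device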
What does this PR do?
Associated issues:
#362 #365 #404 #414