
Add IOBinding support to ONNX Runtime module #421

Merged · 53 commits · Nov 2, 2022
Conversation

@JingyaHuang (Contributor) commented Oct 13, 2022

Context

As reported by users, there is sometimes a significant performance drop when using devices for acceleration. The slowdown is especially significant for decoders (and therefore also for seq2seq models). This is due to the large overhead of copying data between the host and the device.

This PR introduces ONNX Runtime's IOBinding to arrange inputs and pre-allocate outputs on the device.
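
For reference, here is a minimal sketch of what plain ONNX Runtime IOBinding looks like (the model path and tensor names are placeholders, not taken from this PR); the ORTModels wrap this pattern for the user:

import numpy as np
import onnxruntime as ort

# Hypothetical model exported to ONNX, run on GPU
session = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])
io_binding = session.io_binding()

# Bind the input once and keep the output on the device, instead of letting
# every run() copy tensors back and forth between host and device.
input_ids = np.ones((1, 128), dtype=np.int64)
io_binding.bind_cpu_input("input_ids", input_ids)
io_binding.bind_output("logits", device_type="cuda", device_id=0)

session.run_with_iobinding(io_binding)
logits = io_binding.copy_outputs_to_cpu()[0]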

What does this PR do?

  • Create TypeHelper to prepare IO binding (an illustrative type-mapping sketch follows this list)
  • Integrate IO binding into ORTModels
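
Illustrative only (this is not the PR's actual TypeHelper implementation): pre-allocating output buffers requires a mapping from ONNX Runtime's element-type strings to torch dtypes, along these lines:

import torch

ORT_TO_TORCH_DTYPE = {
    "tensor(float)": torch.float32,
    "tensor(float16)": torch.float16,
    "tensor(int64)": torch.int64,
    "tensor(int32)": torch.int32,
    "tensor(bool)": torch.bool,
}

def output_dtypes(session):
    # session.get_outputs() returns NodeArg objects exposing .name and .type,
    # e.g. ("logits", "tensor(float)")
    return {output.name: ORT_TO_TORCH_DTYPE[output.type] for output in session.get_outputs()}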

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Associated issues:
#362 #365 #404 #414

@HuggingFaceDocBuilderDev commented Oct 13, 2022

The documentation is not available anymore as the PR was closed or merged.

@lewtun (Member) left a comment


Thanks for working on this killer feature @JingyaHuang 🔥 !!

I've left a few nits, but the API design looks great to me. Would you mind sharing a small code example in the PR description once you have everything ready for a final review?

I'm especially interested to know whether quantization / optimization play nicely with the current implementation :)

JingyaHuang and others added 13 commits October 17, 2022 14:49
@JingyaHuang (Contributor, Author) left a comment


Hi @lewtun, thanks for the review!!!

I will apply the helper to the other ORT models and update the PR description with a snippet once it is finished.

@philschmid (Contributor) left a comment


Awesome work @JingyaHuang! I added some first comments to io_binding_helper.py and the first model.
Most of my comments are about performance. I am not sure we need to keep things this dynamic, since we have dedicated classes for each "task", which allows us to have more "static/defined" code in the forward method to improve latency.
It would be great if you could take a look at my comments and evaluate the performance of replacing those "dynamic loops" with a more "static" approach.
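
For illustration, the two approaches could look roughly like this with the raw IOBinding API (the output name, shape, and dtype are placeholders, not code from this PR):

import numpy as np
import torch

def bind_outputs_dynamic(session, io_binding):
    # Dynamic: rebind every output the session reports, on every forward call
    for output in session.get_outputs():
        io_binding.bind_output(output.name, "cuda", 0)

def bind_logits_static(io_binding, batch_size, num_labels):
    # Static: a task-specific model already knows its output name and dtype,
    # so it can bind a pre-allocated buffer directly and skip the metadata loop
    logits = torch.empty((batch_size, num_labels), dtype=torch.float32, device="cuda")
    io_binding.bind_output(
        "logits",
        device_type="cuda",
        device_id=0,
        element_type=np.float32,
        shape=tuple(logits.shape),
        buffer_ptr=logits.data_ptr(),
    )
    return logits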

@JingyaHuang (Contributor, Author) left a comment


@philschmid Thanks for reviewing! I have made some modifications according to your comments, and tests have been added as well.

@JingyaHuang (Contributor, Author) commented Oct 26, 2022

Hi folks,

Thanks for helping out and reviewing. I think the PR is ready for final review. IO binding is now applied directly to all ORTModels except ORTModelForCustomTasks, for which I will open another PR.

Also, since our last discussion, the buffer sizes for ORTModelForSeq2SeqLM have been reduced.

@lewtun Here is a (rough) snippet that I use for benchmarking.

from pathlib import Path
import numpy as np
import pandas as pd
from time import perf_counter
import torch
from transformers import AutoConfig, AutoModelForSeq2SeqLM, AutoTokenizer
from optimum.onnxruntime.modeling_seq2seq import ORTModelForSeq2SeqLM
from optimum.onnxruntime import ORTOptimizer, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig, OptimizationConfig

model_id = "facebook/bart-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
onnx_path = Path("results_seq2seq/")
seq_lengths = [8, 16, 32, 64, 128, 256, 512]

# Load vanilla onnx model
model = ORTModelForSeq2SeqLM.from_pretrained(model_id, from_transformers=True)
model.save_pretrained(onnx_path)
tokenizer.save_pretrained(onnx_path)

# Graph optimization
optimizer = ORTOptimizer.from_pretrained(model)
optimization_config = OptimizationConfig(optimization_level=2)  # basic + extended graph optimizations
optimizer.optimize(save_dir=onnx_path, optimization_config=optimization_config)

def benchmark(seq_len, model, tokenizer, device, iterations=200):
    # prepare data
    seq_len = "l " * (seq_len - 2)
    payload = tokenizer(seq_len, return_tensors="pt")
    payload = {key: val.to(device) for key, val in payload.items()}
    latencies = []
    # warm up
    for _ in range(10):
        _ = model.generate(**payload)
    # Timed run
    for _ in range(iterations):
        start_time = perf_counter()
        _ = model.generate(**payload)
        latency = perf_counter() - start_time
        latencies.append(latency)
    # Compute run statistics
    time_avg_ms = 1000 * np.mean(latencies)
    time_p95_ms = 1000 * np.percentile(latencies, 95)
    return {"seq_len": payload["input_ids"].shape[1], "time_avg_ms": time_avg_ms, "time_p95_ms": time_p95_ms}

device = torch.device("cuda:0")
# Baseline: PyTorch
config = AutoConfig.from_pretrained(model_id, use_cache=True)
pt_model = AutoModelForSeq2SeqLM.from_config(config)
pt_model.to(device)
# Case 1: Vanilla onnx with IO binding
v_onnx_model = ORTModelForSeq2SeqLM.from_pretrained(
    model_id, from_transformers=True, use_cache=True, use_io_binding=True
)
v_onnx_model.to(device)
# Case 2: graph optimized onnx with IOBinding
optim_onnx_model = ORTModelForSeq2SeqLM.from_pretrained(
    model_id="results_seq2seq",
    encoder_file_name="encoder_model_optimized.onnx",
    decoder_file_name="decoder_model_optimized.onnx",
    decoder_with_past_file_name="decoder_with_past_model_optimized.onnx",
)
optim_onnx_model.to(device)

# Benchmark
res = []
for seq_len in seq_lengths:
    print("seq_len: ", seq_len)
    pt = benchmark(seq_len, pt_model, tokenizer, device, iterations=500)
    res.append({**pt, "model": "pt"})

    v_onnx = benchmark(seq_len, v_onnx_model, tokenizer, device, iterations=500)
    res.append({**v_onnx, "model": "v_onnx"})

    optim_onnx = benchmark(seq_len, optim_onnx_model, tokenizer, device, iterations=500)
    res.append({**optim_onnx, "model": "optim_onnx"})

df = pd.DataFrame(res)
print(df)

chart_df = pd.merge(
    df[df.model == "pt"][["seq_len", "time_p95_ms"]],
    df[df.model == "v_onnx"][["seq_len", "time_p95_ms"]],
    on="seq_len",
)
chart_df = chart_df.rename(
    columns={
        "time_p95_ms_x": "pt_p95",
        "time_p95_ms_y": "v_onnx_p95",
    }
)
chart_df = pd.merge(
    chart_df,
    df[df.model == "optim_onnx"][["seq_len", "time_p95_ms"]],
    on="seq_len",
)
chart_df = chart_df.rename(
    columns={
        "time_p95_ms": "optim_onnx_p95",
    }
)

chart_df["io_improvement/pt"] = f"{round((chart_df['pt_p95'] - chart_df['v_onnx_p95']) / chart_df['pt_p95'] * 100,2)}%"
chart_df["io+optim/pt"] = f"{round((chart_df['pt_p95'] - chart_df['optim_onnx_p95']) / chart_df['pt_p95'] * 100,2)}%"

# pandas plotting needs matplotlib installed; .plot returns a matplotlib Axes
ax = chart_df.plot(x="seq_len", y=["pt_p95", "v_onnx_p95", "optim_onnx_p95"], kind="line")
ax.figure.savefig("gpu_res_iobinding_seq2seq.png", dpi=900)

print(chart_df.head(10))
chart_df.to_csv("gpu_res_iobinding_seq2seq.csv")

@echarlaix I added some seq2seq models to ORTConfigManager in order to pass the modeling tests, since those are tested with graph optimization. Whether a model can be optimized is now decided by check_optimization_supported_model_or_raise (also updated in ORTOptimizer).

Also gently pinging @philschmid @michaelbenayoun and @fxmarty.

@regisss (Contributor) left a comment


Huge PR @JingyaHuang 🔥 🚀
I just left a couple of minor comments.

@philschmid (Contributor) left a comment


Awesome work @JingyaHuang 🚀✅ Looks good to me. Everything else can be a follow-up PR.
I left two minor comments.
