Issue using GPT2 ONNX export with past key values #552
@jplu Are you using the Optimum release version (1.5.1)? If so, directly feeding an encoder-decoder exported model from
Can you try instead to do directly:
If you would like to go first through the path of
So using your command, that would look like:
Oh I see! This is crystal clear, thanks a lot for the explanation. I will wait for the next release then. Any ETA? Last question: will it be possible to use the ONNX file generated by the last command you gave directly through ONNX Runtime? I guess the error I'm getting now is caused by the same problem, right?
I think this week or next week is a good bet! Yes, you'll be able to use the exported ONNX file directly through ONNX Runtime. What the
Longer term, we're thinking it could be useful to have an ONNX model export that can handle the generation end-to-end: #526
Perfect! Waiting a single week is perfectly OK 👌 Out of curiosity, I will test with the main branch to see if I can get it to work, and will let you know in this thread if I encounter any issue. Indeed, the generation is the hardest part to handle; on my side, I basically host all my ONNX models in a Triton server, and I have
The ideal world, the dream, would indeed be a true end-to-end model that handles tokenization + inference for plain encoders, and tokenization + inference + generation for decoder and encoder-decoder models.
I will wait for the official release; it seems to be a bit unstable for now:
gives:
The way to install was:
cc @michaelbenayoun we should add tests for the CLI
Thanks @fxmarty for the fix! Nevertheless, these two pieces of code:
And
still raise the errors:
And:
With the ONNX model generated by:
Yes, apologies, merging the previous PR closed this automatically! Basically, gpt2 is decoder-only, and the support was not yet implemented: #554
However, if you try for example
or with a larger model:
you will see the different files for encoder / decoder / decoder with past. Those can be fed directly into an ORTModel:
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("/path/to/m2m100_onnx_ort")
model = ORTModelForSeq2SeqLM.from_pretrained("/path/to/m2m100_onnx_ort", from_transformers=False, use_cache=True)
tokens = tokenizer("My name is Felix and I like you", return_tensors="pt")
outputs_model = model.generate(**tokens, forced_bos_token_id=tokenizer.get_lang_id("fr"))
print(tokenizer.decode(outputs_model[0]))
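The exported model can presumably also be wrapped in a transformers pipeline (hence the pipeline import above). A minimal sketch, assuming the translation pipeline accepts the ORT model object as a drop-in replacement and forwards the src_lang/tgt_lang arguments to the M2M100 tokenizer:

from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForSeq2SeqLM

# Same (placeholder) local export directory as in the snippet above.
model_dir = "/path/to/m2m100_onnx_ort"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = ORTModelForSeq2SeqLM.from_pretrained(model_dir, from_transformers=False, use_cache=True)

# The ORT model is used in place of the PyTorch model.
translator = pipeline("translation", model=model, tokenizer=tokenizer)
print(translator("My name is Felix and I like you", src_lang="en", tgt_lang="fr"))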
It is ok, no worries! I tried with the model you suggested and, indeed, I get all three files, and each works like a charm. Even in pure ORT with:
import onnxruntime as ort
from transformers import AutoTokenizer
import numpy as np
sess_encoder = ort.InferenceSession('output/encoder_model.onnx', providers=["CPUExecutionProvider"])
sess_decoder = ort.InferenceSession('output/decoder_model.onnx', providers=["CPUExecutionProvider"])
sess_decoder_pkv = ort.InferenceSession('output/decoder_with_past_model.onnx', providers=["CPUExecutionProvider"])
tokenizer = AutoTokenizer.from_pretrained("output/")
inputs_encoder = dict(tokenizer("My name is Julien and I like", return_tensors="np"))
outputs_encoder = sess_encoder.run(None, inputs_encoder)
inputs_decoder = {
"encoder_hidden_states": outputs_encoder[0],
"encoder_attention_mask": inputs_encoder["attention_mask"],
"input_ids": inputs_encoder["input_ids"]
}
sess_decoder.run(None, inputs_decoder)
inputs_decoder_pkv = inputs_decoder
shape = (1, 16, len(inputs_encoder["input_ids"][0]), 64)
for i in range(12):
inputs_decoder_pkv[f"past_key_values.{i}.encoder.key"] = np.random.uniform(0, 1, shape).astype(np.float32)
inputs_decoder_pkv[f"past_key_values.{i}.encoder.value"] = np.random.uniform(0, 1, shape).astype(np.float32)
inputs_decoder_pkv[f"past_key_values.{i}.decoder.key"] = np.random.uniform(0, 1, shape).astype(np.float32)
inputs_decoder_pkv[f"past_key_values.{i}.decoder.value"] = np.random.uniform(0, 1, shape).astype(np.float32)
sess_decoder_pkv.run(None, inputs_decoder_pkv)
I keep this issue open as it was mostly about decoder-only, but I'm sure it will be ok once your PR is merged!
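For what it's worth, the past_key_values input names and their expected shapes do not have to be hardcoded: they can be read back from the ONNX session itself. A minimal sketch, assuming the same exported files as above and that the symbolic name of the batch axis contains "batch" (any other symbolic axis is treated as a sequence length):

import numpy as np
import onnxruntime as ort

sess_decoder_pkv = ort.InferenceSession("output/decoder_with_past_model.onnx", providers=["CPUExecutionProvider"])

batch_size, past_length = 1, 8  # arbitrary concrete sizes for the dynamic axes

dummy_past = {}
for inp in sess_decoder_pkv.get_inputs():
    if not inp.name.startswith("past_key_values"):
        continue
    # inp.shape mixes ints with strings for dynamic axes; replace the symbolic
    # axes with concrete sizes to build a dummy tensor of the right rank.
    shape = [
        dim if isinstance(dim, int)
        else (batch_size if (dim and "batch" in str(dim)) else past_length)
        for dim in inp.shape
    ]
    dummy_past[inp.name] = np.random.uniform(0, 1, shape).astype(np.float32)

print({name: value.shape for name, value in dummy_past.items()})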
Hi @fxmarty!! Thanks a lot for the addition, I have updated the package. This piece of code:
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import GPT2Tokenizer
model = ORTModelForCausalLM.from_pretrained("output/", from_transformers=False, use_cache=True)
tokenizer = GPT2Tokenizer.from_pretrained("output/")
tokens = tokenizer("My name is Julien and I like", return_tensors="pt")
outputs_model = model.generate(**tokens)
Now works perfectly!! But, unfortunately, this one:
import onnxruntime as ort
from transformers import GPT2Tokenizer
import numpy as np
sess = ort.InferenceSession('output/decoder_with_past_model.onnx', providers=["CPUExecutionProvider"])
tokenizer = GPT2Tokenizer.from_pretrained("output/")
tokens = dict(tokenizer("My name is Julien and I like", return_tensors="np"))
shape = (1, 12, len(tokens["input_ids"][0]), 64)
for i in range(12):
tokens[f"past_key_values.{i}.key"] = np.random.uniform(0, 1, shape).astype(np.float32)
tokens[f"past_key_values.{i}.value"] = np.random.uniform(0, 1, shape).astype(np.float32)
sess.run(None, tokens)
Still raises the exact same error for me:
The models are still generated with:
Is there anything I'm doing wrong?
Yes, I think this is expected. Looking at the shapes in
So this code works:
import onnxruntime as ort
from transformers import GPT2Tokenizer
import numpy as np
sess = ort.InferenceSession("/home/fxmarty/hf_internship/optimum/gpt2_onnx/decoder_with_past_model.onnx", providers=["CPUExecutionProvider"])
tokenizer = GPT2Tokenizer.from_pretrained("/home/fxmarty/hf_internship/optimum/gpt2_onnx")
tokens = dict(tokenizer("My name is Julien and I like", return_tensors="np"))
shape = (1, 12, len(tokens["input_ids"][0]) - 1, 64)
tokens["input_ids"] = np.array([[4]], dtype=np.int64)
for i in range(12):
tokens[f"past_key_values.{i}.key"] = np.random.uniform(0, 1, shape).astype(np.float32)
tokens[f"past_key_values.{i}.value"] = np.random.uniform(0, 1, shape).astype(np.float32)
sess.run(None, tokens)
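The shape constraint above can also be checked directly on the exported file: each input's (possibly symbolic) shape is available from the session, which makes the relationship between the input_ids length and the past length visible. A small sketch using the same path as above:

import onnxruntime as ort

sess = ort.InferenceSession(
    "/home/fxmarty/hf_internship/optimum/gpt2_onnx/decoder_with_past_model.onnx",
    providers=["CPUExecutionProvider"],
)

# Dynamic axes are reported as strings (symbolic dimensions), fixed axes as ints.
for inp in sess.get_inputs():
    print("input:", inp.name, inp.shape, inp.type)
for out in sess.get_outputs():
    print("output:", out.name, out.shape, out.type)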
Oh I missed that part! Thanks a lot for correcting me.
System Info
Who can help?
@JingyaHuang @ec
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
Command line to export a GPT2 model:
Gives the following output logs:
Even though there is an error in the close-values validation, that's ok. Now I would like to run the model with the following Python code:
And I get the following error:
Do I have to feed the past_key_values.X.value and past_key_values.X.key inputs myself with random values?
When I try to do this directly with onnxruntime, I also get an error. Here is what I do:
And I get the following error:
Expected behavior
I expect proper generation and usage with onnxruntime. The final goal is to use it through a Triton server.
I am certainly missing something, but the documentation is not clear on how to properly use seq2seq and causal-lm models with past key values, either directly with onnxruntime or with optimum.
Thanks a lot in advance for any advice you can provide :)