Validating ONNX model fails for GPT-J #607
Comments
Hi @Eichhof, the model requires a higher tolerance for validation.
Fixed in #609
Thank you very much @mht-sharma. In addition, at the beginning of the output (see above) it says that some weights were not used. Finally, why is --for-ort needed?
Hi @Eichhof, the inputs are generated randomly for validation. Hence, sometimes the model might be sensitive to the inputs, which results in the error. You could run the command again and the model should validate successfully. An error of <= 1e-4 is generally acceptable; however, if the error is high, it needs to be looked into.
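For illustration, here is a minimal sketch of how such a comparison can be reproduced by hand; the folder gptj_onnx/ is the export target used in this issue, the prompt is arbitrary, and this is not the exporter's own validation code, just a way to inspect the worst-case logit difference:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")

# Reference PyTorch model and the exported ONNX model (folder name from this issue).
pt_model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")
ort_model = ORTModelForCausalLM.from_pretrained("gptj_onnx/")

with torch.no_grad():
    pt_logits = pt_model(**inputs).logits
ort_logits = ort_model(**inputs).logits

# The exporter validates against an absolute tolerance; here we just print the
# worst-case difference for one fixed input.
print("max abs diff:", (pt_logits - ort_logits).abs().max().item())
```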
Note that ...
Thank you very much for the information @mht-sharma and @fxmarty. Do you have a comment regarding the other warning mentioned above?
Any ideas regarding the unused weights and --for-ort?
Hi @Eichhof, sorry for the late reply. If you are working on language modeling, where you do text generation, I would advise using --for-ort (a short loading sketch follows after this comment).
For now, this is the only way we have to use past key values in the decoding. @JingyaHuang did a very nice PR (#587) to merge the two models into one, thus avoiding duplicating the memory use. You may be interested in having a look, but it is not yet integrated with the ORTModelForXX classes. We do a release today, so the feature should be included. About the weights, I'm not sure, I'll have a look asap.
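To make the above concrete, here is a minimal sketch of the inference side; the folder gptj_onnx/ is the export target used in this issue, and use_cache=True is my assumption about how ORTModelForCausalLM picks up the with-past decoder at the time:

```python
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")

# With use_cache=True, both decoder_model.onnx and decoder_with_past_model.onnx
# are loaded, which is where the memory duplication discussed here comes from.
model = ORTModelForCausalLM.from_pretrained("gptj_onnx/", use_cache=True)

inputs = tokenizer("Hello, my name is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))
```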
Thank you very much @fxmarty. When will it be integrated into the ORTModelForXX classes? Right now, does duplicating memory usage mean that ORTModelForXX needs double the GPU memory (i.e., 28 GB of VRAM)? Or do you refer to CPU memory? If you refer to GPU memory, that would be a problem because I only have a 24 GB GPU. It would be great if you could look into the weights problem.
Hi @Eichhof, I agree it's a huge issue, and I think it's high priority. You may want to have a look at today's release notes, notably the section "Experimental support to merge ONNX decoder with/without past key values": https://github.com/huggingface/optimum/releases/tag/v1.6.0. We'll gradually improve the documentation to reflect the new features (notably on the export side).
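For reference, a hypothetical sketch of what the experimental merge could look like; the import path optimum.onnx.merge_decoders and its signature are my assumption based on the release notes, and the file names are the default exporter outputs, so double-check against the actual documentation:

```python
# Assumed API: optimum.onnx.merge_decoders from the v1.6.0 experimental feature.
from optimum.onnx import merge_decoders

merge_decoders(
    "gptj_onnx/decoder_model.onnx",            # decoder without past key values
    "gptj_onnx/decoder_with_past_model.onnx",  # decoder with past key values
    save_path="gptj_onnx/decoder_model_merged.onnx",
)
```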
Thank you @fxmarty. I will have a look at the new release and the experimental support for the merging. I hope that the merging works for GPT-J.
Hey @Eichhof, I implemented support for past key/values in the decoder on my own. My version does not require two models to be loaded into memory; I think that is a terrible idea, since many decoders weigh a lot. Also, I got rid of many bugs in the implementation of the ORTModelForCausalLM class, which I found while trying to use it. You can check out my version here (unfortunately I'm not going to do a PR):
Thank you very much @hivaze. That sounds very interesting. I will give it a try. How can I use your code? Do I only have to replace the original script with your script?
@fxmarty I was trying to merge the models with the PR from @JingyaHuang but I'm getting an error.
Yes, you just need to copy the file and replace the usual modeling_decoder.py with mine. Or you can redo the imports in my file so that you can use it as a plug-in script outside the library (there are relative imports, they just need to be made absolute). Also remove one more print statement from the forward() method; it's just there for debugging. If you want to use the model with cache for generation, you can just call ... . After this you will be comfortable using the generate method, since you usually use it without worrying about anything.
Thank you very much @hivaze.
I don't get it. What do you mean exactly?
After running the export, several files are produced. I guess all these files are necessary?
You need only these three files if you want to use cached keys/values in inference:
After copying the script you need to call ...
Of course, I think this should all be discussed in a separate issue, or even a PR, but you would do me a great favor if you help me test it, and then maybe I can do a PR.
@hivaze Thanks for working on this! I was wondering how you are handling the case where there aren't yet any past key values? The motive at first for introducing these two models was to be able to handle the special first-pass case, resulting in models with different inputs. Looking back, I think it was the easiest solution to implement back then, even if of course it's really not good memory-wise. So I'm looking forward to supporting only a single ONNX for the decoder! cc @JingyaHuang
Oh, it's a trick to generate a fake cache. We only need to generate (randn) a cache of keys and values for each layer, for a fake past text of length 1. Then we can safely use the attention mask, masking this fake cache with zero. As far as I have checked, this method really works and does not affect the output of the model when forwarding real tokens. I do not consider this to be a correct solution to the problem; I just realized through experiments that it works (tested only on GPT-J).
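Here is my reading of that fake-cache trick as a sketch (shapes derived from the GPT-J config; this is not @hivaze's actual code):

```python
import torch
from transformers import AutoConfig

config = AutoConfig.from_pretrained("EleutherAI/gpt-j-6B")
batch_size, fake_len, real_len = 1, 1, 5  # real_len: length of the tokenized prompt
num_heads = config.n_head
head_dim = config.n_embd // config.n_head

# One (key, value) pair per layer for a fake "past" of length 1, filled with random values.
fake_past = tuple(
    (
        torch.randn(batch_size, num_heads, fake_len, head_dim),
        torch.randn(batch_size, num_heads, fake_len, head_dim),
    )
    for _ in range(config.n_layer)
)

# Mask the fake position with 0 so attention ignores it; real tokens get 1.
attention_mask = torch.cat(
    [
        torch.zeros(batch_size, fake_len, dtype=torch.long),
        torch.ones(batch_size, real_len, dtype=torch.long),
    ],
    dim=1,
)
```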
@fxmarty Thank you! However, can you explain in a few words the key idea of the PR's changes? It's not clear to me whether it is now possible to use only the decoder with past_key_values as inputs to generate text.
The strategy is to use a single decoder that accepts past key values as inputs. In the first pass, dummy past key values must be passed (they will simply not be used). To be honest, this is a bit of a hack, and there should be a cleaner solution than this: https://discuss.huggingface.co/t/how-does-the-onnx-exporter-work-for-generationmodel-with-past-key-value/31316/8?u=fxmarty
Thanks for the update. Sounds great! How can I use a single ONNX without/with past key values for GPT-J? When loading the ONNX model exported from GPT-J, it takes more than 10 minutes until the model is loaded (loading a Hugging Face GPT-J model takes around 10 s). In addition, when the ONNX model is loaded, it takes around 2 GB of GPU memory and 55 GB of CPU memory. In comparison, the Hugging Face GPT-J model takes 14 GB of GPU memory and around 10 GB of CPU memory. Why is that?
Hey @Eichhof, it seems that you're loading your GPT-J ONNX model in its fp32 version. You need to convert your .onnx model to fp16 and then load it. Moreover, ONNX will always take up more space in the memory of the graphics card, because it has a static graph, unlike PyTorch. Maybe the problem is related to the new mechanism, of course; I have not tried it yet. But I hope the information is still useful for you :)
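For what it's worth, a sketch of one common way to do the fp16 conversion offline with onnxconverter-common; the file names are assumptions, and (as discussed further down in this thread) GPT-J specifically may hit numerical issues with this kind of conversion:

```python
import onnx
from onnxconverter_common import float16  # pip install onnxconverter-common

model = onnx.load("gptj_onnx/decoder_model.onnx")
# Keep fp32 inputs/outputs so the calling code does not have to change dtypes.
model_fp16 = float16.convert_float_to_float16(model, keep_io_types=True)

# GPT-J is far above the 2 GB protobuf limit, so weights must be stored as external data.
onnx.save_model(
    model_fp16,
    "gptj_onnx_fp16/decoder_model.onnx",
    save_as_external_data=True,
    all_tensors_to_one_file=True,
)
```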
There is a PR open to export in fp16 with a dedicated argument. Overall I've found ONNX Runtime to be a bit painful to use on GPU, with the TensorRT support limited (see this), but let's hope it gets better. I still have to test #647 on large models on GPU to see the memory usage. I will keep you updated here!
@hivaze I'm using your version now. @fxmarty Is the fp16 export already available on the main branch? Should I rather use that? Can I also quantize my GPT-J ONNX model so that it uses less memory? I have read here about ORTQuantizer to apply dynamic quantization.
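For context, dynamic quantization with ORTQuantizer looks roughly like the sketch below; the file and folder names are assumptions, and note that dynamic int8 quantization in ONNX Runtime targets CPU execution, so it would not reduce GPU memory:

```python
from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# Point the quantizer at one of the exported decoder files (file name is an assumption).
quantizer = ORTQuantizer.from_pretrained("gptj_onnx/", file_name="decoder_model.onnx")

# Dynamic (calibration-free) quantization configuration targeting AVX512-VNNI CPUs.
dqconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)

quantizer.quantize(save_dir="gptj_onnx_quantized/", quantization_config=dqconfig)
```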
@fxmarty Do you already have an estimate of when the PR will be ready for exporting ONNX with fp16?
The two take a different path:
@fxmarty I just saw that both are merged now. Is there any difference between using one or the other?
The two export paths behave differently. By the way, with the ONNX Runtime float16 conversion, there is an issue I haven't been able to solve yet specifically with GPT-J, so for now this architecture is not tested: optimum/tests/exporters/onnx/test_exporters_onnx_cli.py (lines 174 to 204 at 9d76da2)
Further reading: #785 (comment) (and the following answers). It could be a bug in ONNX Runtime.
Tracking the issue in #800. So for now I would recommend using ...
So I will try to use that.
@Eichhof I'll try and get back to you.
@fxmarty Do you have any news regarding the decrease in response time and memory?
Hi @Eichhof, I had a short test with CUDAExecutionProvider. The model is exported with: ... Here's the result: ...
As for memory, it appears ONNX Runtime CUDAExecutionProvider is still very bad: microsoft/onnxruntime#14526 (comment)

Scripts:

```python
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer
import time

model_id = "gptj_onnx"
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")

print("loading model")
start = time.time()
model = ORTModelForCausalLM.from_pretrained(model_id, provider="CUDAExecutionProvider")
print(f"Loading took: {time.time() - start:.2f} s")

prompt = "ORT fast or slow"
inp = tokenizer(prompt, return_tensors="pt").to("cuda")

# warmup
res = model.generate(**inp, num_beams=1, min_length=50, max_length=50)

n_batch = 20
start = time.time()
for i in range(n_batch):
    res = model.generate(**inp, num_beams=1, min_length=50, max_length=50)
end = time.time()
ort_time = end - start
print(f"ORT: {ort_time / n_batch:.3f} s")
```

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import time

model_id = "EleutherAI/gpt-j-6B"
tokenizer = AutoTokenizer.from_pretrained(model_id)

print("loading model")
start = time.time()
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
print(f"Loading took: {time.time() - start:.2f} s")

prompt = "ORT fast or slow"
inp = tokenizer(prompt, return_tensors="pt").to("cuda")

# warmup
res = model.generate(**inp, num_beams=1, min_length=50, max_length=50)

n_batch = 20
with torch.inference_mode():
    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    start_event.record()
    for i in range(n_batch):
        res = model.generate(**inp, num_beams=1, min_length=50, max_length=50)
    end_event.record()
    torch.cuda.synchronize()

pt_time = start_event.elapsed_time(end_event) * 1e-3
print(f"PT: {pt_time / n_batch:.3f} s")
```
System Info
Who can help?
@lewtun @michaelbenayoun
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
I installed optimum with

pip install optimum[onnxruntime-gpu]

Then I ran

python -m optimum.exporters.onnx --task causal-lm-with-past --model EleutherAI/gpt-j-6B gptj_onnx/

to export GPT-J to ONNX. The output of this call is as follows:

Expected behavior
Validation of the ONNX model should succeed.