Assertion failed: Failed to deserialize cuda engine #1324
Comments
Tagging @QiJune for visibility and a bugfix.
@darrenglow Are you still facing this issue? I encountered the same error. Triton currently is incompatible with
Here's the complete working code:
@hshabbirh Before the official Docker image is updated, you can build a Triton server image yourself. Then tensorrt_llm 0.9.0 will work with that image.
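One way to confirm which release built a given engine before rebuilding anything: the output directory written by trtllm-build contains a config.json. The helper below is a minimal sketch; the assumption that the file records a top-level "version" field may not hold for every release.

```python
import json
from pathlib import Path

def engine_builder_version(engine_dir: str) -> str:
    """Return the TensorRT-LLM version recorded in an engine directory.

    Assumption: trtllm-build writes a config.json into the output
    directory with a top-level "version" field; the exact layout may
    differ between releases.
    """
    config = json.loads(Path(engine_dir, "config.json").read_text())
    return config.get("version", "<unknown>")
```

If the version printed here differs from the tensorrt_llm package installed in the serving container, that mismatch is a likely cause of the deserialization failure.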
System Info
GPU: A100-40G
Who can help?
@Tracin
@byshiue
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
trtllm-build --checkpoint_dir /tmp/mnt/tllm_checkpoint_1gpu_awq_int8_kv_cache \
    --output_dir ./tmp/trt_engines/int8_kv_cache_int4_AWQ/1-gpu/ \
    --gemm_plugin bfloat16 \
    --gpt_attention_plugin bfloat16 \
    --strongly_typed \
    --max_batch_size 64 \
    --max_input_len 1024 \
    --max_output_len 2048
python3 ../run.py --max_output_len=2048 \
    --tokenizer_dir /tmp/mnt/model \
    --engine_dir=/app/tensorrt_llm/examples/llama/tmp/trt_engines/int8_kv_cache_int4_AWQ/1-gpu \
    --input_file test.txt
5. The error occurs as follows.
Expected behavior
Run the engine successfully.
actual behavior
When I try to run the engine:
additional notes
I have also tried the version from #1274 before, but the problem persists.
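A "Failed to deserialize cuda engine" assertion usually points to the engine having been built by a different TensorRT-LLM/TensorRT release than the runtime loading it. The sketch below is a quick pre-flight check; note that "same major.minor release required" is an assumption based on the behavior discussed here, not a documented guarantee.

```python
def same_release(engine_version: str, runtime_version: str) -> bool:
    """Compare the major.minor release of two version strings.

    Assumption: serialized engines are generally only loadable by the
    same TensorRT-LLM release that built them, so differing major.minor
    versions are a likely cause of "Failed to deserialize cuda engine".
    """
    def major_minor(version: str) -> list:
        return version.split(".")[:2]
    return major_minor(engine_version) == major_minor(runtime_version)
```

For example, an engine built with 0.8.0 loaded by a 0.9.0 runtime would fail this check and should be rebuilt with the matching trtllm-build.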