Assertion failed: Failed to deserialize cuda engine #1324

Closed
andakai opened this issue Mar 20, 2024 · 3 comments
Labels: bug (Something isn't working)


andakai commented Mar 20, 2024

System Info

GPU: A100-40G

Who can help?

@Tracin
@byshiue

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. Build the image. The latest commit at that point was "Update TensorRT-LLM" (#1315). I followed the doc here: https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/build_from_source.md#option-1-build-tensorrt-llm-in-one-step
  2. Quantize the model
python ../quantization/quantize.py --model_dir /tmp/mnt/model_path \
                                   --output_dir /tmp/mnt/tllm_checkpoint_1gpu_awq_int8_kv_cache \
                                   --dtype bfloat16 \
                                   --qformat int4_awq \
                                   --awq_block_size 128 \
                                   --kv_cache_dtype int8 \
                                   --calib_size 32
  3. Build the engine:

trtllm-build --checkpoint_dir /tmp/mnt/tllm_checkpoint_1gpu_awq_int8_kv_cache \
             --output_dir ./tmp/trt_engines/int8_kv_cache_int4_AWQ/1-gpu/ \
             --gemm_plugin bfloat16 \
             --gpt_attention_plugin bfloat16 \
             --strongly_typed \
             --max_batch_size 64 \
             --max_input_len 1024 \
             --max_output_len 2048

  4. Run the engine:

python3 ../run.py --max_output_len=2048 \
                  --tokenizer_dir /tmp/mnt/model \
                  --engine_dir=/app/tensorrt_llm/examples/llama/tmp/trt_engines/int8_kv_cache_int4_AWQ/1-gpu \
                  --input_file test.txt
  5. The error shown under "Actual behavior" occurs.

Expected behavior

Run the engine successfully.

Actual behavior

When I try to run the engine:

Traceback (most recent call last):
  File "/app/tensorrt_llm/examples/llama/../run_v2.py", line 668, in <module>
    main(args)
  File "/app/tensorrt_llm/examples/llama/../run_v2.py", line 496, in main
    runner = runner_cls.from_dir(**runner_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/model_runner_cpp.py", line 173, in from_dir
    session = GptSession(config=session_config,
RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: Failed to deserialize cuda engine (/src/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:68)
1       0x7fb42126d734 tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 100
2       0x7fb4212839ea /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0xcc9ea) [0x7fb4212839ea]
3       0x7fb42139b100 tensorrt_llm::runtime::GptSession::GptSession(tensorrt_llm::runtime::GptSession::Config const&, tensorrt_llm::runtime::GptModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, void const*, unsigned long, std::shared_ptr<nvinfer1::ILogger>) + 928
4       0x7fb42f03dfc8 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0xb2fc8) [0x7fb42f03dfc8]
5       0x7fb42efec06b /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0x6106b) [0x7fb42efec06b]
6       0x7fb42efd1b1e /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0x46b1e) [0x7fb42efd1b1e]
7       0x55bf7556110e python3(+0x15a10e) [0x55bf7556110e]
8       0x55bf75557a7b _PyObject_MakeTpCall + 603
9       0x55bf7556facb python3(+0x168acb) [0x55bf7556facb]
10      0x55bf75570635 _PyObject_Call + 277
11      0x55bf7556c087 python3(+0x165087) [0x55bf7556c087]
12      0x55bf75557e2b python3(+0x150e2b) [0x55bf75557e2b]
13      0x7fb397b0396b /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so(+0x3c696b) [0x7fb397b0396b]
14      0x55bf75557a7b _PyObject_MakeTpCall + 603
15      0x55bf75551150 _PyEval_EvalFrameDefault + 30112
16      0x55bf7556f7f1 python3(+0x1687f1) [0x55bf7556f7f1]
17      0x55bf75570492 PyObject_Call + 290
18      0x55bf7554c5d7 _PyEval_EvalFrameDefault + 10791
19      0x55bf755619fc _PyFunction_Vectorcall + 124
20      0x55bf7554a26d _PyEval_EvalFrameDefault + 1725
21      0x55bf755469c6 python3(+0x13f9c6) [0x55bf755469c6]
22      0x55bf7563c256 PyEval_EvalCode + 134
23      0x55bf75667108 python3(+0x260108) [0x55bf75667108]
24      0x55bf756609cb python3(+0x2599cb) [0x55bf756609cb]
25      0x55bf75666e55 python3(+0x25fe55) [0x55bf75666e55]
26      0x55bf75666338 _PyRun_SimpleFileObject + 424
27      0x55bf75665f83 _PyRun_AnyFileObject + 67
28      0x55bf75658a5e Py_RunMain + 702
29      0x55bf7562f02d Py_BytesMain + 45
30      0x7fb449ebfd90 /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7fb449ebfd90]
31      0x7fb449ebfe40 __libc_start_main + 128
32      0x55bf7562ef25 _start + 37

Additional notes

I also tried the version from #1274 earlier, but hit the same problem.
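
A quick way to sanity-check whether this is a build/runtime version mismatch (the usual cause of this assertion) is to compare the TensorRT-LLM version recorded in the engine directory's config.json with the installed runtime. This is a minimal sketch, not from the original report; it assumes trtllm-build wrote a top-level "version" field to config.json, as recent releases do:

# Minimal sketch: compare the TensorRT-LLM version recorded by trtllm-build
# with the runtime that tries to load the engine. Assumes config.json has a
# top-level "version" field (present in recent releases).
import json

import tensorrt_llm

engine_dir = "/app/tensorrt_llm/examples/llama/tmp/trt_engines/int8_kv_cache_int4_AWQ/1-gpu"
with open(f"{engine_dir}/config.json") as f:
    build_version = json.load(f).get("version", "<not recorded>")

print("engine built with:", build_version)
print("runtime version  :", tensorrt_llm.__version__)
# A mismatch here typically surfaces as
# "Assertion failed: Failed to deserialize cuda engine".

If the two versions differ, rebuilding the engine with the currently installed version (or installing the version the engine was built with) avoids the deserialization failure.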


hshabbirh commented Apr 8, 2024

tagging @QiJune for visibility and bugfix

@darrenglow Are you still facing this issue? I encountered the same "Failed to deserialize cuda engine" error and was able to resolve it with an ugly fix. It comes down to a compatibility issue between the Triton Inference Server and the tensorrt-llm package used to build the TRT engine.

Triton is currently incompatible with tensorrt-llm==0.9.0, but that version is required to run convert_checkpoint.py. The trick is to use version 0.9.0 to run convert_checkpoint.py and then downgrade to version 0.8.0 to build the engine:

pip3 install tensorrt_llm==0.8.0 -U --extra-index-url https://pypi.nvidia.com

Here's the complete working code:

### Pull and run Nvidia Triton server with TensorRT-LLM backend
docker run --rm -it --net host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all -v /path/to/tensorrtllm_backend:/tensorrtllm_backend nvcr.io/nvidia/tritonserver:24.03-trtllm-python-py3 bash


### Install tensorrt_llm lib
pip3 install tensorrt_llm -U --pre --extra-index-url https://pypi.nvidia.com

### Clone `tensorrtllm_backend` repo
git clone https://github.com/triton-inference-server/tensorrtllm_backend.git
cd tensorrtllm_backend/
git submodule update --init --recursive

### LFS
curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | bash
apt-get install git-lfs
git lfs install
git lfs pull

### Clone Mistral7B-v0.1 model
mkdir HF_mistral_7b
cd HF_mistral_7b/
git clone https://huggingface.co/mistralai/Mistral-7B-v0.1

python3 ../tensorrt_llm/examples/llama/convert_checkpoint.py --model_dir ./Mistral-7B-v0.1 \
                             --output_dir ./tllm_checkpoint_1gpu_mistral \
                             --dtype float16

### Re-install `tensorrt-llm` to downgrade
# `tensorrt-llm`==0.8.0
# `tensorrt`==9.2.0.post12.dev5
pip3 install tensorrt_llm==0.8.0 -U --pre --extra-index-url https://pypi.nvidia.com


### Build tensorrt engine
trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_mistral \
            --output_dir ./trt_engine \
            --gemm_plugin float16 \
            --max_input_len 32256

### Go back to the `tensorrtllm_backend` root (the `all_models`, `tools` and `scripts` paths below live there)
cd ..

### Make tensorrt model repo
mkdir -p  /tensorrt/mistral7b-repos/
cp -r ./all_models/inflight_batcher_llm/* /tensorrt/mistral7b-repos/

### Fill model template
export HF_MISTRAL_MODEL=/opt/tritonserver/tensorrtllm_backend/HF_mistral_7b/Mistral-7B-v0.1
export ENGINE_PATH=/opt/tritonserver/tensorrtllm_backend/HF_mistral_7b/trt_engine

python3 tools/fill_template.py -i /tensorrt/mistral7b-repos/preprocessing/config.pbtxt tokenizer_dir:${HF_MISTRAL_MODEL},triton_max_batch_size:64,preprocessing_instance_count:1

python3 tools/fill_template.py -i /tensorrt/mistral7b-repos/postprocessing/config.pbtxt tokenizer_dir:${HF_MISTRAL_MODEL},triton_max_batch_size:64,postprocessing_instance_count:1

python3 tools/fill_template.py -i /tensorrt/mistral7b-repos/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:64,decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False

python3 tools/fill_template.py -i /tensorrt/mistral7b-repos/ensemble/config.pbtxt triton_max_batch_size:64

python3 tools/fill_template.py -i /tensorrt/mistral7b-repos/tensorrt_llm/config.pbtxt triton_max_batch_size:64,decoupled_mode:False,max_beam_width:1,engine_dir:${ENGINE_PATH},max_tokens_in_paged_kv_cache:2560,max_attention_window_size:2560,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0

### Run the Triton inference server
python3 scripts/launch_triton_server.py --model_repo=/tensorrt/mistral7b-repos/
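
Once the server is up, a quick smoke test against the HTTP generate endpoint can confirm the engine deserialized and serves requests. This is a minimal sketch, not part of the original comment; it assumes the default HTTP port 8000, the "ensemble" model from the inflight_batcher_llm repo above, and that the `requests` package is available:

# Minimal smoke test for the deployed model; assumes the default HTTP port
# (8000) and the "ensemble" model from the inflight_batcher_llm repo above.
# Requires the `requests` package.
import requests

payload = {
    "text_input": "What is machine learning?",
    "max_tokens": 32,
    "bad_words": "",
    "stop_words": "",
}
resp = requests.post(
    "http://localhost:8000/v2/models/ensemble/generate",
    json=payload,
    timeout=60,
)
resp.raise_for_status()
print(resp.json().get("text_output"))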

QiJune (Collaborator) commented Apr 19, 2024

@hshabbirh The tensorrtllm_backend image nvcr.io/nvidia/tritonserver:24.03-trtllm-python-py3 is still based on tensorrt_llm 0.8.0, so only engines built with tensorrt_llm 0.8.0 can be deployed on it.

Until the official tritonserver Docker image is updated, you can try building a Triton server image yourself:

https://github.com/triton-inference-server/tensorrtllm_backend/tree/main?tab=readme-ov-file#option-2-launch-triton-server-within-the-triton-container-built-via-buildpy-script

Then engines built with tensorrt_llm 0.9.0 will work with that image.
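
To confirm which TensorRT-LLM version a given Triton container actually ships (and therefore which engines it can deserialize), a quick check run inside the container, added here as an illustration:

# Run inside the tritonserver container: the engine must have been built
# with this same tensorrt_llm version to deserialize successfully.
import tensorrt_llm

print(tensorrt_llm.__version__)  # 0.8.0 for the 24.03-trtllm-python-py3 image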


github-actions bot commented Jun 5, 2024

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.

@github-actions github-actions bot added the stale label Jun 5, 2024
@hijkzzz hijkzzz removed the stale label Jun 7, 2024
@hijkzzz hijkzzz closed this as completed Jun 7, 2024