Assertion failed: Failed to deserialize cuda engine #1324

Closed
andakai opened this issue Mar 20, 2024 · 3 comments
Labels: bug (Something isn't working)


andakai commented Mar 20, 2024

System Info

GPU: A100-40G

Who can help?

@Tracin
@byshiue

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. Build the image. The latest commit at that point was "Update TensorRT-LLM" (#1315). I followed the doc here: https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/build_from_source.md#option-1-build-tensorrt-llm-in-one-step
  2. Quantize the model
python ../quantization/quantize.py --model_dir /tmp/mnt/model_path \
                                   --output_dir /tmp/mnt/tllm_checkpoint_1gpu_awq_int8_kv_cache \
                                   --dtype bfloat16 \
                                   --qformat int4_awq \
                                   --awq_block_size 128 \
                                   --kv_cache_dtype int8 \
                                   --calib_size 32
  3. Build the engine:

trtllm-build --checkpoint_dir /tmp/mnt/tllm_checkpoint_1gpu_awq_int8_kv_cache \
             --output_dir ./tmp/trt_engines/int8_kv_cache_int4_AWQ/1-gpu/ \
             --gemm_plugin bfloat16 \
             --gpt_attention_plugin bfloat16 \
             --strongly_typed \
             --max_batch_size 64 \
             --max_input_len 1024 \
             --max_output_len 2048

  4. Run the engine:

python3 ../run.py --max_output_len=2048 \
                  --tokenizer_dir /tmp/mnt/model \
                  --engine_dir=/app/tensorrt_llm/examples/llama/tmp/trt_engines/int8_kv_cache_int4_AWQ/1-gpu \
                  --input_file test.txt
  5. The error shown under "Actual behavior" occurs.

Expected behavior

Run the engine successfully.

Actual behavior

When I try to run the engine:

Traceback (most recent call last):
  File "/app/tensorrt_llm/examples/llama/../run_v2.py", line 668, in <module>
    main(args)
  File "/app/tensorrt_llm/examples/llama/../run_v2.py", line 496, in main
    runner = runner_cls.from_dir(**runner_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/model_runner_cpp.py", line 173, in from_dir
    session = GptSession(config=session_config,
RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: Failed to deserialize cuda engine (/src/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:68)
1       0x7fb42126d734 tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 100
2       0x7fb4212839ea /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0xcc9ea) [0x7fb4212839ea]
3       0x7fb42139b100 tensorrt_llm::runtime::GptSession::GptSession(tensorrt_llm::runtime::GptSession::Config const&, tensorrt_llm::runtime::GptModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, void const*, unsigned long, std::shared_ptr<nvinfer1::ILogger>) + 928
4       0x7fb42f03dfc8 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0xb2fc8) [0x7fb42f03dfc8]
5       0x7fb42efec06b /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0x6106b) [0x7fb42efec06b]
6       0x7fb42efd1b1e /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0x46b1e) [0x7fb42efd1b1e]
7       0x55bf7556110e python3(+0x15a10e) [0x55bf7556110e]
8       0x55bf75557a7b _PyObject_MakeTpCall + 603
9       0x55bf7556facb python3(+0x168acb) [0x55bf7556facb]
10      0x55bf75570635 _PyObject_Call + 277
11      0x55bf7556c087 python3(+0x165087) [0x55bf7556c087]
12      0x55bf75557e2b python3(+0x150e2b) [0x55bf75557e2b]
13      0x7fb397b0396b /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so(+0x3c696b) [0x7fb397b0396b]
14      0x55bf75557a7b _PyObject_MakeTpCall + 603
15      0x55bf75551150 _PyEval_EvalFrameDefault + 30112
16      0x55bf7556f7f1 python3(+0x1687f1) [0x55bf7556f7f1]
17      0x55bf75570492 PyObject_Call + 290
18      0x55bf7554c5d7 _PyEval_EvalFrameDefault + 10791
19      0x55bf755619fc _PyFunction_Vectorcall + 124
20      0x55bf7554a26d _PyEval_EvalFrameDefault + 1725
21      0x55bf755469c6 python3(+0x13f9c6) [0x55bf755469c6]
22      0x55bf7563c256 PyEval_EvalCode + 134
23      0x55bf75667108 python3(+0x260108) [0x55bf75667108]
24      0x55bf756609cb python3(+0x2599cb) [0x55bf756609cb]
25      0x55bf75666e55 python3(+0x25fe55) [0x55bf75666e55]
26      0x55bf75666338 _PyRun_SimpleFileObject + 424
27      0x55bf75665f83 _PyRun_AnyFileObject + 67
28      0x55bf75658a5e Py_RunMain + 702
29      0x55bf7562f02d Py_BytesMain + 45
30      0x7fb449ebfd90 /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7fb449ebfd90]
31      0x7fb449ebfe40 __libc_start_main + 128
32      0x55bf7562ef25 _start + 37

Additional notes

I also tried the version from #1274 earlier, but hit the same problem.
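
A quick way to sanity-check whether this is a build/runtime version mismatch (the usual cause of this assertion) is to compare the TensorRT-LLM version recorded in the engine directory's config.json with the installed runtime. This is a minimal sketch, not from the original report; it assumes trtllm-build wrote a top-level "version" field to config.json, as recent releases do:

# Minimal sketch: compare the TensorRT-LLM version recorded by trtllm-build
# with the runtime that tries to load the engine. Assumes config.json has a
# top-level "version" field (present in recent releases).
import json

import tensorrt_llm

engine_dir = "/app/tensorrt_llm/examples/llama/tmp/trt_engines/int8_kv_cache_int4_AWQ/1-gpu"
with open(f"{engine_dir}/config.json") as f:
    build_version = json.load(f).get("version", "<not recorded>")

print("engine built with:", build_version)
print("runtime version  :", tensorrt_llm.__version__)
# A mismatch here typically surfaces as
# "Assertion failed: Failed to deserialize cuda engine".

If the two versions differ, rebuilding the engine with the currently installed version (or installing the version the engine was built with) avoids the deserialization failure.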


hshabbirh commented Apr 8, 2024

tagging @QiJune for visibility and bugfix

@darrenglow Are you still facing this issue? I encountered the same "Failed to deserialize cuda engine" error and was able to resolve it with an ugly fix. It comes down to a compatibility issue between the Triton Inference Server and the tensorrt-llm package used to build the TRT engine.

Triton is currently incompatible with tensorrt-llm==0.9.0, but that version is required to run convert_checkpoint.py. The trick is to use version 0.9.0 to run convert_checkpoint.py and then downgrade to version 0.8.0 to build the engine:

pip3 install tensorrt_llm==0.8.0 -U --extra-index-url https://pypi.nvidia.com

Here's the complete working code:

### Pull and run Nvidia Triton server with TensorRT-LLM backend
docker run --rm -it --net host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all -v /path/to/tensorrtllm_backend:/tensorrtllm_backend nvcr.io/nvidia/tritonserver:24.03-trtllm-python-py3 bash


### Install tensorrt_llm lib
pip3 install tensorrt_llm -U --pre --extra-index-url https://pypi.nvidia.com

### Clone `tensorrtllm_backend` repo
git clone https://github.com/triton-inference-server/tensorrtllm_backend.git
cd tensorrtllm_backend/
git submodule update --init --recursive

### LFS
curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | bash
apt-get install git-lfs
git lfs install
git lfs pull

### Clone Mistral7B-v0.1 model
mkdir HF_mistral_7b
cd HF_mistral_7b/
git clone https://huggingface.co/mistralai/Mistral-7B-v0.1

python3 ../tensorrt_llm/examples/llama/convert_checkpoint.py --model_dir ./Mistral-7B-v0.1 \
                             --output_dir ./tllm_checkpoint_1gpu_mistral \
                             --dtype float16

### Re-install `tensorrt-llm` to downgrade
# `tensorrt-llm`==0.8.0
# `tensorrt`==9.2.0.post12.dev5
pip3 install tensorrt_llm==0.8.0 -U --pre --extra-index-url https://pypi.nvidia.com


### Build tensorrt engine
trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_mistral \
            --output_dir ./trt_engine \
            --gemm_plugin float16 \
            --max_input_len 32256

### Go back to the `tensorrtllm_backend` root (the `all_models`, `tools` and `scripts` paths below live there)
cd ..

### Make tensorrt model repo
mkdir -p  /tensorrt/mistral7b-repos/
cp -r ./all_models/inflight_batcher_llm/* /tensorrt/mistral7b-repos/

### Fill model template
export HF_MISTRAL_MODEL=/opt/tritonserver/tensorrtllm_backend/HF_mistral_7b/Mistral-7B-v0.1
export ENGINE_PATH=/opt/tritonserver/tensorrtllm_backend/HF_mistral_7b/trt_engine

python3 tools/fill_template.py -i /tensorrt/mistral7b-repos/preprocessing/config.pbtxt tokenizer_dir:${HF_MISTRAL_MODEL},triton_max_batch_size:64,preprocessing_instance_count:1

python3 tools/fill_template.py -i /tensorrt/mistral7b-repos/postprocessing/config.pbtxt tokenizer_dir:${HF_MISTRAL_MODEL},triton_max_batch_size:64,postprocessing_instance_count:1

python3 tools/fill_template.py -i /tensorrt/mistral7b-repos/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:64,decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False

python3 tools/fill_template.py -i /tensorrt/mistral7b-repos/ensemble/config.pbtxt triton_max_batch_size:64

python3 tools/fill_template.py -i /tensorrt/mistral7b-repos/tensorrt_llm/config.pbtxt triton_max_batch_size:64,decoupled_mode:False,max_beam_width:1,engine_dir:${ENGINE_PATH},max_tokens_in_paged_kv_cache:2560,max_attention_window_size:2560,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0

### Run the Triton inference server
python3 scripts/launch_triton_server.py --model_repo=/tensorrt/mistral7b-repos/
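
Once the server is up, a quick smoke test against the HTTP generate endpoint can confirm the engine deserialized and serves requests. This is a minimal sketch, not part of the original comment; it assumes the default HTTP port 8000, the "ensemble" model from the inflight_batcher_llm repo above, and that the `requests` package is available:

# Minimal smoke test for the deployed model; assumes the default HTTP port
# (8000) and the "ensemble" model from the inflight_batcher_llm repo above.
# Requires the `requests` package.
import requests

payload = {
    "text_input": "What is machine learning?",
    "max_tokens": 32,
    "bad_words": "",
    "stop_words": "",
}
resp = requests.post(
    "http://localhost:8000/v2/models/ensemble/generate",
    json=payload,
    timeout=60,
)
resp.raise_for_status()
print(resp.json().get("text_output"))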

QiJune (Collaborator) commented Apr 19, 2024

@hshabbirh The tensorrtllm_backend image nvcr.io/nvidia/tritonserver:24.03-trtllm-python-py3 is still based on tensorrt_llm 0.8.0, so only engines built with tensorrt_llm 0.8.0 can be deployed on it.

Until the official tritonserver Docker image is updated, you can try building a Triton server image yourself:

https://github.com/triton-inference-server/tensorrtllm_backend/tree/main?tab=readme-ov-file#option-2-launch-triton-server-within-the-triton-container-built-via-buildpy-script

Then engines built with tensorrt_llm 0.9.0 will work with that image.
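
To confirm which TensorRT-LLM version a given Triton container actually ships (and therefore which engines it can deserialize), a quick check run inside the container, added here as an illustration:

# Run inside the tritonserver container: the engine must have been built
# with this same tensorrt_llm version to deserialize successfully.
import tensorrt_llm

print(tensorrt_llm.__version__)  # 0.8.0 for the 24.03-trtllm-python-py3 image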


github-actions bot commented Jun 5, 2024

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.

@github-actions github-actions bot added the stale label Jun 5, 2024
@hijkzzz hijkzzz removed the stale label Jun 7, 2024
@hijkzzz hijkzzz closed this as completed Jun 7, 2024