Update TensorRT-LLM #1315

Merged 1 commit into main on Mar 19, 2024
Conversation


@kaiyux kaiyux commented Mar 19, 2024

  • Features
    • Support running GptSession without OpenMPI (#1220)
    • Add Python bindings for new C++ executor API, see documentation and examples in examples/bindings
    • [BREAKING CHANGE] TopP sampling optimization with deterministic AIR TopP algorithm is enabled by default
  • API
    • [BREAKING CHANGE] Refactored the GPT model to the unified build workflow, see examples/gpt/README.md for the latest commands.
    • [BREAKING CHANGE] Refactored Qwen model to the unified build workflow, see examples/qwen/README.md for the latest commands.
    • [BREAKING CHANGE] Moved all LoRA-related flags from the convert_checkpoint.py script and the checkpoint content to the trtllm-build command, to better generalize the feature to more models.
    • [BREAKING CHANGE] Removed the use_prompt_tuning flag and options from the convert_checkpoint.py script and the checkpoint content, to better generalize the feature to more models. Use trtllm-build --max_prompt_embedding_table_size instead.
    • [BREAKING CHANGE] Changed the trtllm-build --world_size flag to --auto_parallel; the option is now used by the auto parallel planner only.
    • [BREAKING CHANGE] AsyncLLMEngine is removed; the tensorrt_llm.GenerationExecutor class is refactored to work both when launched explicitly with mpirun at the application level and when given an MPI communicator created by mpi4py.
    • [BREAKING CHANGE] examples/server is removed; see examples/app instead.
  • Bug fixes
  • Benchmark
    • Support arbitrary dataset from HuggingFace for C++ benchmarks, see “Prepare dataset” section in benchmarks/cpp/README.md
  • Infra
    • Base Docker image for TensorRT-LLM is updated to nvcr.io/nvidia/pytorch:24.02-py3
      • The dependent PyTorch version is updated to 2.2.
    • Base Docker image for TensorRT-LLM backend is updated to nvcr.io/nvidia/tritonserver:24.02-py3
    • The dependent CUDA version is updated to 12.3.2 (a.k.a. 12.3 Update 2)
  • Documentation
    • Add documents for new C++ executor API, see docs/source/executor.md
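Two of the breaking trtllm-build changes above lend themselves to a quick illustration. The sketch below is illustrative only: the checkpoint and output directory names are hypothetical placeholders, and only the --max_prompt_embedding_table_size and --auto_parallel flags are taken from these notes.

```shell
# Prompt tuning is now configured at engine build time instead of via
# convert_checkpoint.py (directory names are hypothetical):
trtllm-build --checkpoint_dir ./gpt_ckpt \
    --output_dir ./gpt_engine \
    --max_prompt_embedding_table_size 1024

# --world_size was replaced by --auto_parallel, which is consumed only
# by the auto parallel planner:
trtllm-build --checkpoint_dir ./gpt_ckpt \
    --output_dir ./gpt_engine \
    --auto_parallel 2
```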

@kaiyux kaiyux merged commit 66ca337 into main Mar 19, 2024
@Shixiaowei02 Shixiaowei02 deleted the kaiyu/update branch March 19, 2024 09:39