System Info
TensorRT-LLM release 0.9.0
Who can help?
@Tracin
Reproduction
In the 0.9.0 release, I try to apply auto parallel + quantization to the Llama 70B model; each of them works fine independently.
I convert the checkpoint using
python3 ./TensorRT-LLM/examples/llama/convert_checkpoint.py --model_dir ./llama-2/llama-2-70b --dtype float16 --output_dir ./model_profile_tmp/ckpt/2 --use_weight_only --weight_only_precision int4
and then I try auto parallel using
trtllm-build --checkpoint_dir ./model_profile_tmp/ckpt/2/ --gemm_plugin float16 --use_custom_all_reduce disable --output_dir ./model_profile_tmp/engine/2/ --workers 8 --max_batch_size 1 --auto_parallel 8 --weight_only_precision int4
Expected behavior
It should build the engine as float16.
Actual behavior
However, it fails with:
concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/lib/python3.10/concurrent/futures/process.py", line 246, in _process_worker
r = call_item.fn(*call_item.args, **call_item.kwargs)
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 291, in build_and_save
engine = build_model(build_config,
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 284, in build_model
return build(model, build_config)
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/builder.py", line 669, in build
model = optimize_model(model, use_unfused_qkv_gemm=use_auto_parallel)
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 890, in optimize_model
model = unfuse_qkv_gemm(model)
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 767, in unfuse_qkv_gemm
gemm.weight.value = weight
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/parameter.py", line 120, in value
assert v.shape == self._shape,
AssertionError: The value updated is not the same shape as the original. Updated: (8192, 5120), original: (8192, 8192)
Additional notes
It looks like auto parallel cannot handle weight-only quantization, which changes the weight dimensions in TRT: the int4 weights are stored packed, so their shape no longer matches the float16 shape that unfuse_qkv_gemm expects.
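For context on the specific numbers in the assertion, here is a minimal sketch (illustrative only, not TensorRT-LLM's actual packing code) of why int4 weight-only packing produces a tensor whose stored shape differs from the float16 shape. It assumes two 4-bit values are packed per byte, which halves the last dimension, and uses a hypothetical fused QKV width of 10240 (8192 for Q plus 1024 each for K/V under GQA); these dimensions are assumptions for illustration, not read from the checkpoint:

```python
import numpy as np

# Hypothetical fused QKV weight: rows = hidden size, cols = fused Q+K+V width.
fp16_weight = np.zeros((8192, 10240), dtype=np.float16)

# int4 weight-only quantization packs two 4-bit values into each int8 byte,
# so the stored tensor has half the columns of the float16 original.
packed_shape = (fp16_weight.shape[0], fp16_weight.shape[1] // 2)
print(packed_shape)  # (8192, 5120) -- the "Updated" shape in the assertion

# unfuse_qkv_gemm, however, compares against the unpacked float16 shape it
# expects for the unfused Q GEMM, e.g. (8192, 8192), so the assert fires.
assert packed_shape != (8192, 8192)
```

This would explain why the build succeeds when either feature is used alone: without auto parallel, unfuse_qkv_gemm is never invoked, and without quantization, the stored shape matches the expected float16 shape.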