Unsupported auto parallel + int4 quantization on models #1626

Closed
2 of 4 tasks
Hudayday opened this issue May 19, 2024 · 3 comments

System Info

Tensorrt-LLM rel 0.9.0

Who can help?

@Tracin

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

In the 0.9.0 release, I try to apply auto parallel + int4 weight-only quantization to the Llama 70B model; each of them works fine on its own.

I convert the checkpoint using

python3 ./TensorRT-LLM/examples/llama/convert_checkpoint.py --model_dir ./llama-2/llama-2-70b --dtype float16 --output_dir ./model_profile_tmp/ckpt/2 --use_weight_only --weight_only_precision int4

and then I build with auto parallel using

trtllm-build --checkpoint_dir ./model_profile_tmp/ckpt/2/ --gemm_plugin float16 --use_custom_all_reduce disable --output_dir ./model_profile_tmp/engine/2/ --workers 8 --max_batch_size 1 --auto_parallel 8 --weight_only_precision int4

Expected behavior

It should build the engine as float16.

Actual behavior

However, it fails with:

concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/lib/python3.10/concurrent/futures/process.py", line 246, in _process_worker
r = call_item.fn(*call_item.args, **call_item.kwargs)
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 291, in build_and_save
engine = build_model(build_config,
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 284, in build_model
return build(model, build_config)
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/builder.py", line 669, in build
model = optimize_model(model, use_unfused_qkv_gemm=use_auto_parallel)
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 890, in optimize_model
model = unfuse_qkv_gemm(model)
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 767, in unfuse_qkv_gemm
gemm.weight.value = weight
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/parameter.py", line 120, in value
assert v.shape == self._shape,
AssertionError: The value updated is not the same shape as the original. Updated: (8192, 5120), original: (8192, 8192)

Additional notes

It looks like auto parallel cannot handle quantization that changes the weight dimensions in TRT.
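
One plausible reading of the mismatch (an outside guess, not taken from the TensorRT-LLM source): int4 weight-only quantization packs two 4-bit values into each int8 element, which halves the stored last dimension of the fused QKV weight, while unfuse_qkv_gemm still expects the unquantized per-matrix shapes. A minimal numpy sketch of that arithmetic, assuming Llama-2-70B dimensions (hidden size 8192, GQA with 8 KV heads):

import numpy as np

hidden = 8192                                # Llama-2-70B hidden size
q_cols, k_cols, v_cols = 8192, 1024, 1024    # assumed per-layer Q/K/V output dims (GQA, 8 KV heads)

fused_cols = q_cols + k_cols + v_cols        # 10240 columns in the fused QKV GEMM
# packing two int4 values per int8 element halves the stored last dimension
packed_qkv = np.zeros((hidden, fused_cols // 2), dtype=np.int8)

print(packed_qkv.shape)      # (8192, 5120) -- the "Updated" shape from the assertion
print((hidden, q_cols))      # (8192, 8192) -- the unquantized Q shape the unfuse pass expects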

Hudayday added the bug label May 19, 2024
yuxianq commented May 21, 2024

Auto parallel does not support working with quantization right now. We will add an assertion to make the error message clearer.
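
For reference, the kind of fail-fast guard described here could look roughly like the following. This is a hypothetical sketch only, not the actual patch; the function name, parameters, and message text are assumptions:

def check_build_config(auto_parallel: bool, has_quantization: bool) -> None:
    # Hypothetical early check, raised before any graph rewriting such as unfuse_qkv_gemm.
    if auto_parallel and has_quantization:
        raise AssertionError(
            "auto parallel does not support quantized checkpoints yet; "
            "disable --auto_parallel or build from an unquantized checkpoint")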

byshiue added the triaged label May 23, 2024
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.

kaiyux (Member) commented Jun 25, 2024

A clearer error message has been added to the latest main branch; closing.

Please let us know if there are any questions, thanks.
