[Bug]: internvl2-8b generates an endless, looping answer to any question #7349

haoduoyu1203 · 2024-08-09T11:03:51Z

Your current environment

Environment: WSL2, Ubuntu 22.04

Collecting environment information...
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.30.2
Libc version: glibc-2.35

Python version: 3.11.9 (main, Apr 19 2024, 16:48:06) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.153.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Is CUDA available: False
CUDA runtime version: 12.1.66
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Vendor ID: AuthenticAMD
Model name: AMD Ryzen 7 5700X 8-Core Processor
CPU family: 25
Model: 33
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 1
Stepping: 2
BogoMIPS: 6787.23
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr arat umip vaes vpclmulqdq rdpid fsrm
Hypervisor vendor: Microsoft
Virtualization type: full
L1d cache: 256 KiB (8 instances)
L1i cache: 256 KiB (8 instances)
L2 cache: 4 MiB (8 instances)
L3 cache: 32 MiB (1 instance)
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Mitigation; safe RET, no microcode
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] pyzmq==26.1.0
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] transformers==4.44.0
[pip3] triton==3.0.0
[conda] numpy 1.26.4 pypi_0 pypi
[conda] nvidia-nccl-cu12 2.20.5 pypi_0 pypi
[conda] pyzmq 26.1.0 pypi_0 pypi
[conda] torch 2.4.0 pypi_0 pypi
[conda] torchvision 0.19.0 pypi_0 pypi
[conda] transformers 4.44.0 pypi_0 pypi
[conda] triton 3.0.0 pypi_0 pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.4
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
Could not collect

🐛 Describe the bug

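Single machine with one 3090. I started the server with:
vllm serve /home/a/lmdeploy/InternVL2-8B --dtype auto --max-model-len 8192 --api-key token-abc123 --gpu_memory_utilization 1 --trust-remote-code --port 23333 --enforce-eager --dtype=half
After startup, any question produces an answer that loops endlessly.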

INFO 08-09 18:47:10 logger.py:36] Received request chat-2ace427447e347df91be974ae56f34ce: prompt: '<|im_start|>system\n\nCurrent model: /home/a/lmdeploy/InternVL2-8B\nCurrent date: 2024-08-09T10:47:10.457Z\n\nYou are a helpful assistant. You can help me by answering my questions. You can also ask me questions.<|im_end|>\n<|im_start|>user\n你好<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=8116, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [1, 92543, 9081, 402, 5564, 1762, 334, 740, 5278, 14327, 13468, 277, 35758, 301, 1214, 1070, 30924, 314, 285, 294, 309, 364, 5564, 2554, 334, 262, 638, 1311, 285, 2418, 285, 2640, 291, 734, 334, 2713, 334, 734, 281, 21211, 349, 402, 2770, 657, 395, 11100, 17993, 281, 1592, 777, 1638, 884, 684, 35728, 983, 4917, 281, 1592, 777, 1225, 2705, 884, 4917, 281, 92542, 364, 92543, 1008, 364, 77230, 92542, 364, 92543, 525, 11353, 364], lora_request: None, prompt_adapter_request: None.
INFO: 172.29.192.1:54495 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 08-09 18:47:10 async_llm_engine.py:174] Added request chat-2ace427447e347df91be974ae56f34ce.
DEBUG 08-09 18:47:10 async_llm_engine.py:611] Waiting for new requests...
DEBUG 08-09 18:47:10 async_llm_engine.py:625] Got new requests!
INFO 08-09 18:47:11 metrics.py:406] Avg prompt throughput: 7.7 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 08-09 18:47:16 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 37.7 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 08-09 18:47:21 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 42.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.2%, CPU KV cache usage: 0.0%.
INFO 08-09 18:47:26 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 41.8 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 08-09 18:47:31 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 42.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.4%, CPU KV cache usage: 0.0%.
INFO 08-09 18:47:36 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 42.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.5%, CPU KV cache usage: 0.0%.
INFO 08-09 18:47:41 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 41.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.6%, CPU KV cache usage: 0.0%.
INFO 08-09 18:47:46 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 41.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.8%, CPU KV cache usage: 0.0%.
INFO 08-09 18:47:51 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 41.7 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.9%, CPU KV cache usage: 0.0%.
INFO 08-09 18:47:56 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 41.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 1.0%, CPU KV cache usage: 0.0%.
INFO 08-09 18:48:01 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 41.3 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 1.1%, CPU KV cache usage: 0.0%.
INFO 08-09 18:48:06 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 40.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 1.2%, CPU KV cache usage: 0.0%.
INFO 08-09 18:48:11 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 41.3 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 1.3%, CPU KV cache usage: 0.0%.
INFO 08-09 18:48:16 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 40.9 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 1.4%, CPU KV cache usage: 0.0%.
Sending a single "你好" ("hello") produces an answer that never stops.

DarkLight1337 · 2024-08-09T11:38:12Z

You probably have to supply additional stop_token_ids. @Isotr0py can you update the example to show which stop_token_ids are required for each model variant? Many users have been confused by this.

Isotr0py · 2024-08-09T13:15:55Z

@haoduoyu1203 You can add "<|im_start|>" and "<|im_end|>" to the stop_token_ids.
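A minimal sketch of what such a request could look like against the server started in this issue. It assumes the vLLM OpenAI-compatible endpoint accepts `stop_token_ids` as an extra field in the chat completion body, and that 92543 and 92542 are the IDs of `<|im_start|>` and `<|im_end|>`, as read from `prompt_token_ids` in the request log above (verify them with the model's tokenizer). The function names, the `max_tokens` default, and the default host are illustrative, not part of vLLM.

```python
import json
import urllib.request
from typing import Dict, List, Sequence

# Assumed IDs for <|im_start|> and <|im_end|>, inferred from the log in this issue.
INTERNVL2_STOP_TOKEN_IDS: List[int] = [92543, 92542]


def build_chat_request(model: str, user_message: str,
                       stop_token_ids: Sequence[int],
                       max_tokens: int = 512) -> Dict:
    """Build a /v1/chat/completions body that carries explicit stop token IDs."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
        # vLLM-specific extra parameter; not part of the upstream OpenAI API.
        "stop_token_ids": list(stop_token_ids),
    }


def post_chat_request(payload: Dict,
                      base_url: str = "http://localhost:23333",
                      api_key: str = "token-abc123") -> Dict:
    """Send the payload to the server; port and API key match the command in this issue."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


payload = build_chat_request(
    "/home/a/lmdeploy/InternVL2-8B", "你好", INTERNVL2_STOP_TOKEN_IDS
)
print(json.dumps(payload, ensure_ascii=False, indent=2))
```

Calling `post_chat_request(payload)` would send the request to a running server. When using the official `openai` Python client instead, the same field would typically be passed through `extra_body={"stop_token_ids": [...]}`.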

haoduoyu1203 added the "bug" (Something isn't working) label Aug 9, 2024

Isotr0py mentioned this issue Aug 9, 2024

[VLM][Doc] Add stop_token_ids to InternVL example #7354

Merged

DarkLight1337 closed this as completed in #7354 Aug 9, 2024
