[Bug]: Qwen2 Moe FP8 not supported on L40 #6264

TopIdiot · 2024-07-09T12:44:38Z

Your current environment

Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: Could not collect
Clang version: Could not collect
CMake version: version 3.30.0
Libc version: glibc-2.35

Python version: 3.11.9 (main, Apr 19 2024, 16:48:06) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.4.241-1-tlinux4-0017.6-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA L40
GPU 1: NVIDIA L40

Nvidia driver version: 535.161.07
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   52 bits physical, 48 bits virtual
Byte Order:                      Little Endian
CPU(s):                          384
On-line CPU(s) list:             0-383
Vendor ID:                       AuthenticAMD
Model name:                      AMD EPYC 9K84 96-Core Processor
CPU family:                      25
Model:                           17
Thread(s) per core:              2
Core(s) per socket:              96
Socket(s):                       2
Stepping:                        0
BogoMIPS:                        5200.06
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid amd_dcm tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single ibpb vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 avx512_bf16 clzero xsaveerptr wbnoinvd arat avx512vbmi umip avx512_vbmi2 vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid fsrm
Hypervisor vendor:               KVM
Virtualization type:             full
L1d cache:                       6 MiB (192 instances)
L1i cache:                       6 MiB (192 instances)
L2 cache:                        192 MiB (192 instances)
L3 cache:                        768 MiB (24 instances)
NUMA node(s):                    2
NUMA node0 CPU(s):               0-191
NUMA node1 CPU(s):               192-383
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Not affected
Vulnerability Retbleed:          Not affected
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1:        Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
Vulnerability Spectre v2:        Vulnerable, IBPB: disabled, STIBP: disabled, PBRSB-eIBRS: Not affected
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] torch==2.3.0
[pip3] torchvision==0.18.0
[pip3] transformers==4.42.3
[pip3] triton==2.3.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NODE    0-191   0               N/A
GPU1    NODE     X      0-191   0               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

🐛 Describe the bug

After loading an FP8 Qwen2 MoE model, the server aborts during startup:

python -m vllm.entrypoints.openai.api_server --model ./moe --port 8081 --host 0.0.0.0 --trust-remote-code --tensor-parallel-size 2
INFO 07-09 12:19:33 api_server.py:206] vLLM API server version 0.5.1
INFO 07-09 12:19:33 api_server.py:207] args: Namespace(host='0.0.0.0', port=8081, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='./moe', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, model_loader_extra_config=None, preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
INFO 07-09 12:19:33 config.py:698] Defaulting to use mp for distributed inference
INFO 07-09 12:19:33 llm_engine.py:169] Initializing an LLM engine (v0.5.1) with config: model='./moe', speculative_config=None, tokenizer='./moe', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=16384, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=fp8, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=./moe, use_v2_block_manager=False, enable_prefix_caching=False)
WARNING 07-09 12:19:34 tokenizer.py:126] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
(VllmWorkerProcess pid=22996) INFO 07-09 12:19:34 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=22996) INFO 07-09 12:19:35 utils.py:741] Found nccl from library libnccl.so.2
INFO 07-09 12:19:35 utils.py:741] Found nccl from library libnccl.so.2
INFO 07-09 12:19:35 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=22996) INFO 07-09 12:19:35 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 07-09 12:19:35 custom_all_reduce_utils.py:232] reading GPU P2P access cache from /root/.config/vllm/gpu_p2p_access_cache_for_0,1.json
(VllmWorkerProcess pid=22996) INFO 07-09 12:19:35 custom_all_reduce_utils.py:232] reading GPU P2P access cache from /root/.config/vllm/gpu_p2p_access_cache_for_0,1.json
WARNING 07-09 12:19:35 fp8.py:45] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
(VllmWorkerProcess pid=22996) WARNING 07-09 12:19:35 fp8.py:45] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
(VllmWorkerProcess pid=22996) INFO 07-09 12:21:45 model_runner.py:255] Loading model weights took 8.0809 GB
INFO 07-09 12:21:46 model_runner.py:255] Loading model weights took 8.0809 GB
Conversion from/to f8e4m3nv is only supported on compute capability >= 90

UNREACHABLE executed at /project/lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp:823!
Conversion from/to f8e4m3nv is only supported on compute capability >= 90

UNREACHABLE executed at /project/lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp:823!
/home/qspace/data/mmsprwelmmodelsvra/rce2004389934/vllm/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 2 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
Aborted (core dumped)

The config.json is:

{
  "_name_or_path": "/data/home/yowenchen/vllm/model_qwen_moe",
  "architectures": [
    "Qwen2MoeForCausalLM"
  ],
  "attention_dropout": 0,
  "decoder_sparse_step": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 1408,
  "max_position_embeddings": 16384,
  "max_window_layers": 28,
  "mlp_only_layers": [],
  "model_type": "qwen2_moe",
  "moe_intermediate_size": 1408,
  "norm_topk_prob": true,
  "num_attention_heads": 16,
  "num_experts": 64,
  "num_experts_per_tok": 6,
  "num_hidden_layers": 28,
  "num_key_value_heads": 16,
  "output_router_logits": false,
  "quantization_config": {
    "activation_scheme": "dynamic",
    "quant_method": "fp8"
  },
  "rms_norm_eps": 1e-05,
  "rope_theta": 10000.0,
  "router_aux_loss_coef": 0.001,
  "shared_expert_intermediate_size": 2816,
  "sliding_window": 4096,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.41.2",
  "use_cache": true,
  "use_sliding_window": false,
  "vocab_size": 102400
}
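
Note that the launch arguments show quantization=None while the engine config reports quantization=fp8: the FP8 mode is picked up from the "quantization_config" block of the checkpoint's config.json, not from a command-line flag. A minimal sketch for checking what a checkpoint declares before launching the server (detect_quant_method is a hypothetical helper written for this illustration, not a vLLM function):

```python
import json
from pathlib import Path
from typing import Optional


def detect_quant_method(config_path: str) -> Optional[str]:
    """Return the quant_method declared in a Hugging Face style config.json.

    Hypothetical helper for illustration only; it is not part of vLLM.
    Returns None when the checkpoint declares no quantization_config.
    """
    config = json.loads(Path(config_path).read_text())
    # The block may be missing entirely or set to null, so fall back to {}.
    quant_cfg = config.get("quantization_config") or {}
    return quant_cfg.get("quant_method")
```

For the config.json above, detect_quant_method("./moe/config.json") would return "fp8", which matches the "Detected fp8 checkpoint" warning in the log.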


robertgshaw2-redhat · 2024-07-09T13:20:35Z

fp8 not yet supported for Qwen. WIP PR: #6088

TopIdiot · 2024-07-25T03:12:59Z

fp8 not yet supported for Qwen. WIP PR: #6088

@robertgshaw2-neuralmagic Hello, the error still exists in version 0.5.3.

robertgshaw2-redhat · 2024-07-25T03:17:17Z

Fp8 is now supported for Qwen, but MoE Fp8 requires compute_capability == 9.0 (aka Hopper GPUs).

Our MoE kernels are currently implemented using Triton, which requires triton==3.0 for Fp8 on Ada Lovelace. We are limited by PyTorch's version of triton.

We look forward to supporting Fp8 MoE on Ada Lovelace once these dependencies are enabled.
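
The constraint described above can be sketched as a small pre-flight check. This is only an illustration of the rule as stated in this thread and in the Triton error message, not vLLM's actual logic: fp8_moe_supported is a hypothetical helper, and the (8, 9) compute capability for Ada Lovelace cards such as the L40 is supplied here as an assumption that does not appear in the thread.

```python
from typing import Tuple

HOPPER = (9, 0)        # compute capability 9.0
ADA_LOVELACE = (8, 9)  # compute capability 8.9 (assumed here for the L40)


def fp8_moe_supported(capability: Tuple[int, int], triton_version: str) -> bool:
    """Rough check of whether Triton FP8 MoE kernels can be expected to compile.

    Hypothetical helper that encodes the rule stated in this thread:
      * compute capability >= 9.0 works with the Triton shipped by PyTorch;
      * Ada Lovelace (8.9) additionally needs Triton 3.0 or newer;
      * anything older is treated as unsupported.
    """
    # Keep only the leading major number, e.g. "2.3.0" -> 2.
    triton_major = int(triton_version.split(".")[0])
    if capability >= HOPPER:
        return True
    if capability == ADA_LOVELACE:
        return triton_major >= 3
    return False


# Values from the environment report in this issue: L40 with triton==2.3.0.
print(fp8_moe_supported((8, 9), "2.3.0"))
```

On a real machine the two inputs can be read with torch.cuda.get_device_capability() and triton.__version__. With the reporter's setup the check returns False, which is consistent with the "only supported on compute capability >= 90" abort in the log.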

github-actions · 2024-10-25T02:02:06Z

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

github-actions · 2024-11-25T02:04:54Z

This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!

TopIdiot added the bug label on Jul 9, 2024.

github-actions bot added the stale label on Oct 25, 2024.

github-actions bot closed this as not planned on Nov 25, 2024.
