
[Bug]: ValidationError when loading fp8-dynamic model with empty "sparsity_config" #12044

Closed
lksj92hs opened this issue Jan 14, 2025 · 4 comments · Fixed by #12057
Labels
bug Something isn't working

Comments


Your current environment

OS: Ubuntu Server 22.04 LTS
GPU: Nvidia H200
Driver: 550.127.08

The output of `pip list`
Package                           Version
--------------------------------- -------------------
aiohappyeyeballs                  2.4.4
aiohttp                           3.11.11
aiohttp-cors                      0.7.0
aiosignal                         1.3.2
airportsdata                      20241001
annotated-types                   0.7.0
anyio                             4.8.0
argcomplete                       3.4.0
astor                             0.8.1
attrs                             24.3.0
bitsandbytes                      0.45.0
blake3                            1.0.2
cachetools                        5.5.0
certifi                           2024.12.14
charset-normalizer                3.4.1
click                             8.1.8
cloudpickle                       3.1.0
colorful                          0.5.6
compressed-tensors                0.8.1
datasets                          3.2.0
depyf                             0.18.0
dill                              0.3.8
diskcache                         5.6.3
distlib                           0.3.9
distro                            1.9.0
einops                            0.8.0
fastapi                           0.115.6
filelock                          3.16.1
flashinfer                        0.1.6+cu121torch2.4
frozenlist                        1.5.0
fsspec                            2024.9.0
gguf                              0.10.0
google-api-core                   2.24.0
google-auth                       2.37.0
googleapis-common-protos          1.66.0
grpcio                            1.57.0
grpcio-tools                      1.57.0
h11                               0.14.0
httpcore                          1.0.7
httptools                         0.6.4
httpx                             0.28.1
huggingface-hub                   0.24.5
idna                              3.10
importlib_metadata                8.5.0
iniconfig                         2.0.0
interegular                       0.3.3
Jinja2                            3.1.5
jiter                             0.8.2
jsonschema                        4.23.0
jsonschema-specifications         2024.10.1
lark                              1.2.2
linkify-it-py                     2.0.3
llvmlite                          0.43.0
lm-format-enforcer                0.10.9
markdown-it-py                    3.0.0
MarkupSafe                        3.0.2
mdit-py-plugins                   0.4.2
mdurl                             0.1.2
memray                            1.15.0
mistral_common                    1.5.1
mpmath                            1.3.0
msgpack                           1.1.0
msgspec                           0.19.0
multidict                         6.1.0
multiprocess                      0.70.16
nest-asyncio                      1.6.0
networkx                          3.4.2
numba                             0.60.0
numpy                             1.26.4
nvidia-cublas-cu12                12.4.5.8
nvidia-cuda-cupti-cu12            12.4.127
nvidia-cuda-nvrtc-cu12            12.4.127
nvidia-cuda-runtime-cu12          12.4.127
nvidia-cudnn-cu12                 9.1.0.70
nvidia-cufft-cu12                 11.2.1.3
nvidia-curand-cu12                10.3.5.147
nvidia-cusolver-cu12              11.6.1.9
nvidia-cusparse-cu12              12.3.1.170
nvidia-ml-py                      12.560.30
nvidia-nccl-cu12                  2.21.5
nvidia-nvjitlink-cu12             12.4.127
nvidia-nvtx-cu12                  12.4.127
openai                            1.59.7
opencensus                        0.11.4
opencensus-context                0.1.3
opencv-python-headless            4.10.0.84
outlines                          0.1.11
outlines_core                     0.1.26
packaging                         24.2
pandas                            2.2.3
partial-json-parser               0.2.1.1.post5
pillow                            10.4.0
pip                               24.3.1
platformdirs                      4.3.6
pluggy                            1.5.0
prometheus_client                 0.21.1
prometheus-fastapi-instrumentator 7.0.0
propcache                         0.2.1
proto-plus                        1.25.0
protobuf                          4.25.5
psutil                            6.1.1
py-cpuinfo                        9.0.0
py-spy                            0.4.0
py3nvml                           0.2.7
pyairports                        2.1.1
pyarrow                           18.1.0
pyasn1                            0.6.1
pyasn1_modules                    0.4.1
pybind11                          2.13.6
pycountry                         24.6.1
pydantic                          2.10.5
pydantic_core                     2.27.2
Pygments                          2.19.1
PyJWT                             2.7.0
pytest                            8.3.4
python-dateutil                   2.9.0.post0
python-dotenv                     1.0.1
pytz                              2024.2
PyYAML                            6.0.2
pyzmq                             26.2.0
ray                               2.40.0
referencing                       0.35.1
regex                             2024.11.6
requests                          2.32.3
rich                              13.9.4
rpds-py                           0.22.3
rsa                               4.9
safetensors                       0.5.2
sentencepiece                     0.2.0
setuptools                        75.8.0
six                               1.17.0
smart-open                        7.1.0
sniffio                           1.3.1
starlette                         0.41.3
sympy                             1.13.1
textual                           1.0.0
tiktoken                          0.7.0
tokenizers                        0.21.0
torch                             2.5.1
torchvision                       0.20.1
tqdm                              4.67.1
transformers                      4.48.0
triton                            3.1.0
typing_extensions                 4.12.2
tzdata                            2024.2
uc-micro-py                       1.0.3
urllib3                           2.3.0
uvicorn                           0.34.0
uvloop                            0.21.0
virtualenv                        20.28.1
vllm                              0.6.6.post1
watchfiles                        1.0.4
websockets                        14.1
wheel                             0.45.1
wrapt                             1.17.2
xformers                          0.0.28.post3
xgrammar                          0.1.9
xmltodict                         0.14.2
xxhash                            3.5.0
yarl                              1.18.3
zipp                              3.21.0

Model Input Dumps

No response

🐛 Describe the bug

When starting vLLM like this:

python -m vllm.entrypoints.openai.api_server --model /models/llama-3.3-70b-instruct-fp8-dynamic --host localhost --port 10000

The following error occurs:

INFO 01-14 13:24:33 api_server.py:712] vLLM API server version 0.6.6.post1
INFO 01-14 13:24:33 api_server.py:713] args: Namespace(host='localhost', port=10000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/data/textgen_cache/models/llama-3.3-70b-instruct-fp8-dynamic', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='xgrammar', logits_processor_pattern=None, distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', generation_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False)
INFO 01-14 13:24:33 api_server.py:199] Started engine process with PID 11114
INFO 01-14 13:24:37 config.py:510] This model supports multiple tasks: {'embed', 'reward', 'generate', 'classify', 'score'}. Defaulting to 'generate'.
WARNING 01-14 13:24:38 arg_utils.py:1103] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False.
INFO 01-14 13:24:38 config.py:1458] Chunked prefill is enabled with max_num_batched_tokens=2048.
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/ubuntu/venv/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 774, in <module>
    uvloop.run(run_server(args))
  File "/home/ubuntu/venv/lib/python3.11/site-packages/uvloop/__init__.py", line 105, in run
    return runner.run(wrapper())
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/asyncio/runners.py", line 120, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/home/ubuntu/venv/lib/python3.11/site-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
           ^^^^^^^^^^
  File "/home/ubuntu/venv/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 740, in run_server
    async with build_async_engine_client(args) as engine_client:
  File "/usr/lib/python3.11/contextlib.py", line 204, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/venv/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 118, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
  File "/usr/lib/python3.11/contextlib.py", line 204, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/venv/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 210, in build_async_engine_client_from_engine_args
    engine_config = engine_args.create_engine_config()
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/venv/lib/python3.11/site-packages/vllm/engine/arg_utils.py", line 1238, in create_engine_config
    config = VllmConfig(
             ^^^^^^^^^^^
  File "<string>", line 18, in __init__
  File "/home/ubuntu/venv/lib/python3.11/site-packages/vllm/config.py", line 3114, in __post_init__
    self.quant_config = VllmConfig._get_quantization_config(
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/venv/lib/python3.11/site-packages/vllm/config.py", line 3058, in _get_quantization_config
    quant_config = get_quant_config(model_config, load_config)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/venv/lib/python3.11/site-packages/vllm/model_executor/model_loader/weight_utils.py", line 151, in get_quant_config
    return quant_cls.from_config(hf_quant_config)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/venv/lib/python3.11/site-packages/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py", line 96, in from_config
    sparsity_scheme_map = cls._sparsity_scheme_map_from_config(
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/venv/lib/python3.11/site-packages/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py", line 119, in _sparsity_scheme_map_from_config
    sparsity_config = SparsityCompressionConfig.model_validate(
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/venv/lib/python3.11/site-packages/pydantic/main.py", line 627, in model_validate
    return cls.__pydantic_validator__.validate_python(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pydantic_core._pydantic_core.ValidationError: 1 validation error for SparsityCompressionConfig
format
  Field required [type=missing, input_value={}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.10/v/missing
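
For reference, the failure reproduces with plain Pydantic 2.x: a model with a required format field rejects an empty dict. A minimal sketch (the stand-in model below is hypothetical and only mirrors the required format field, not the actual SparsityCompressionConfig from compressed-tensors):

```python
# Minimal illustration of the ValidationError above. This stand-in model
# is hypothetical; it only mirrors the required "format" field of
# compressed-tensors' SparsityCompressionConfig.
from pydantic import BaseModel, ValidationError

class SparsityConfigStandIn(BaseModel):
    format: str  # required, so an empty dict cannot satisfy it

try:
    # Effectively what vLLM does with "sparsity_config": {}
    SparsityConfigStandIn.model_validate({})
except ValidationError as exc:
    print(exc)  # -> "format Field required [type=missing, input_value={}, ...]"
```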

The model in question is cortecs/Llama-3.3-70B-Instruct-FP8-Dynamic from Hugging Face: https://huggingface.co/cortecs/Llama-3.3-70B-Instruct-FP8-Dynamic/blob/main/config.json

The error does not occur with vLLM 0.6.4.post1 or 0.6.5; it starts happening with 0.6.6.

When the line containing "sparsity_config": {} is removed from the model's config.json, the error doesn't happen and the model works fine even with 0.6.6.post1. While this works as a workaround (a sketch of it follows the config below), the issue is still worth fixing, as there are potentially many models with an empty sparsity_config. The model's full config.json:

{
  "_name_or_path": "/output/Llama-3.3-70B-Instruct-FP8-Dynamic",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": [
    128001,
    128008,
    128009
  ],
  "head_dim": 128,
  "hidden_act": "silu",
  "hidden_size": 8192,
  "initializer_range": 0.02,
  "intermediate_size": 28672,
  "max_position_embeddings": 131072,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 64,
  "num_hidden_layers": 80,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "quantization_config": {
    "config_groups": {
      "group_0": {
        "input_activations": {
          "actorder": null,
          "block_structure": null,
          "dynamic": true,
          "group_size": null,
          "num_bits": 8,
          "observer": null,
          "observer_kwargs": {},
          "strategy": "token",
          "symmetric": true,
          "type": "float"
        },
        "output_activations": null,
        "targets": [
          "Linear"
        ],
        "weights": {
          "actorder": null,
          "block_structure": null,
          "dynamic": false,
          "group_size": null,
          "num_bits": 8,
          "observer": "minmax",
          "observer_kwargs": {},
          "strategy": "channel",
          "symmetric": true,
          "type": "float"
        }
      }
    },
    "format": "float-quantized",
    "global_compression_ratio": 1.463543865167781,
    "ignore": [
      "lm_head"
    ],
    "kv_cache_scheme": null,
    "quant_method": "compressed-tensors",
    "quantization_status": "compressed",
    "sparsity_config": {}
  },
  "rms_norm_eps": 1e-05,
  "rope_scaling": {
    "factor": 8.0,
    "high_freq_factor": 4.0,
    "low_freq_factor": 1.0,
    "original_max_position_embeddings": 8192,
    "rope_type": "llama3"
  },
  "rope_theta": 500000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.47.0",
  "use_cache": true,
  "vocab_size": 128256
}
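
A minimal sketch of the workaround described above, assuming the local model path from the startup command (adjust the path as needed):

```python
# Workaround sketch: drop "sparsity_config" from the model's config.json,
# but only when it is present and empty.
import json
from pathlib import Path

config_path = Path("/models/llama-3.3-70b-instruct-fp8-dynamic/config.json")
config = json.loads(config_path.read_text())

quant = config.get("quantization_config", {})
if quant.get("sparsity_config") == {}:
    del quant["sparsity_config"]
    config_path.write_text(json.dumps(config, indent=2) + "\n")
```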

dsikka (Contributor) commented Jan 14, 2025

Hi @lksj92hs, can you share the recipe you used to quantize?

dsikka (Contributor) commented Jan 14, 2025

Can you also share your version of compressed-tensors?

lksj92hs (Author) commented

Thank you for responding!

I haven't quantized the model myself, but it is on Hugging Face: https://huggingface.co/cortecs/Llama-3.3-70B-Instruct-FP8-Dynamic. Since this is a regression from 0.6.5, I think it's worth fixing, and the fix seems quite simple: ignore sparsity_config if it is an empty dict (or any other falsy value).
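
A minimal sketch of the guard I have in mind (the function name and signature below are hypothetical; this is not the actual vLLM code or the patch in #12057):

```python
# Sketch of the proposed guard: treat a falsy sparsity_config (missing,
# None, or {}) as "no sparsity" instead of handing it to validation.
from typing import Any, Optional

def parse_sparsity_config(hf_quant_config: dict) -> Optional[dict[str, Any]]:
    sparsity_config = hf_quant_config.get("sparsity_config")
    if not sparsity_config:  # covers None and the empty dict from this model
        return None  # nothing to validate, so no ValidationError
    # Otherwise fall through to the existing validation path, e.g.:
    # return SparsityCompressionConfig.model_validate(sparsity_config)
    return sparsity_config
```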

The version of compressed-tensors is listed in the output of pip list in the initial submission:

compressed-tensors                0.8.1

rahul-tuli (Contributor) commented

Hi @lksj92hs,

Thank you for bringing up this issue! The bug related to the ValidationError when loading fp8-dynamic models with an empty "sparsity_config" has been resolved. The fix is included in PR #12057.

Please pull the latest changes and let us know if you encounter any further issues. We appreciate your contribution to improving the project!
