
[Bug]: ValidationError when loading fp8-dynamic model with empty "sparsity_config" #12044

Closed
lksj92hs opened this issue Jan 14, 2025 · 4 comments · Fixed by #12057
Labels
bug Something isn't working

Comments


Your current environment

OS: Ubuntu Server 22.04 LTS
GPU: Nvidia H200
Driver: 550.127.08

The output of `pip list`
Package                           Version
--------------------------------- -------------------
aiohappyeyeballs                  2.4.4
aiohttp                           3.11.11
aiohttp-cors                      0.7.0
aiosignal                         1.3.2
airportsdata                      20241001
annotated-types                   0.7.0
anyio                             4.8.0
argcomplete                       3.4.0
astor                             0.8.1
attrs                             24.3.0
bitsandbytes                      0.45.0
blake3                            1.0.2
cachetools                        5.5.0
certifi                           2024.12.14
charset-normalizer                3.4.1
click                             8.1.8
cloudpickle                       3.1.0
colorful                          0.5.6
compressed-tensors                0.8.1
datasets                          3.2.0
depyf                             0.18.0
dill                              0.3.8
diskcache                         5.6.3
distlib                           0.3.9
distro                            1.9.0
einops                            0.8.0
fastapi                           0.115.6
filelock                          3.16.1
flashinfer                        0.1.6+cu121torch2.4
frozenlist                        1.5.0
fsspec                            2024.9.0
gguf                              0.10.0
google-api-core                   2.24.0
google-auth                       2.37.0
googleapis-common-protos          1.66.0
grpcio                            1.57.0
grpcio-tools                      1.57.0
h11                               0.14.0
httpcore                          1.0.7
httptools                         0.6.4
httpx                             0.28.1
huggingface-hub                   0.24.5
idna                              3.10
importlib_metadata                8.5.0
iniconfig                         2.0.0
interegular                       0.3.3
Jinja2                            3.1.5
jiter                             0.8.2
jsonschema                        4.23.0
jsonschema-specifications         2024.10.1
lark                              1.2.2
linkify-it-py                     2.0.3
llvmlite                          0.43.0
lm-format-enforcer                0.10.9
markdown-it-py                    3.0.0
MarkupSafe                        3.0.2
mdit-py-plugins                   0.4.2
mdurl                             0.1.2
memray                            1.15.0
mistral_common                    1.5.1
mpmath                            1.3.0
msgpack                           1.1.0
msgspec                           0.19.0
multidict                         6.1.0
multiprocess                      0.70.16
nest-asyncio                      1.6.0
networkx                          3.4.2
numba                             0.60.0
numpy                             1.26.4
nvidia-cublas-cu12                12.4.5.8
nvidia-cuda-cupti-cu12            12.4.127
nvidia-cuda-nvrtc-cu12            12.4.127
nvidia-cuda-runtime-cu12          12.4.127
nvidia-cudnn-cu12                 9.1.0.70
nvidia-cufft-cu12                 11.2.1.3
nvidia-curand-cu12                10.3.5.147
nvidia-cusolver-cu12              11.6.1.9
nvidia-cusparse-cu12              12.3.1.170
nvidia-ml-py                      12.560.30
nvidia-nccl-cu12                  2.21.5
nvidia-nvjitlink-cu12             12.4.127
nvidia-nvtx-cu12                  12.4.127
openai                            1.59.7
opencensus                        0.11.4
opencensus-context                0.1.3
opencv-python-headless            4.10.0.84
outlines                          0.1.11
outlines_core                     0.1.26
packaging                         24.2
pandas                            2.2.3
partial-json-parser               0.2.1.1.post5
pillow                            10.4.0
pip                               24.3.1
platformdirs                      4.3.6
pluggy                            1.5.0
prometheus_client                 0.21.1
prometheus-fastapi-instrumentator 7.0.0
propcache                         0.2.1
proto-plus                        1.25.0
protobuf                          4.25.5
psutil                            6.1.1
py-cpuinfo                        9.0.0
py-spy                            0.4.0
py3nvml                           0.2.7
pyairports                        2.1.1
pyarrow                           18.1.0
pyasn1                            0.6.1
pyasn1_modules                    0.4.1
pybind11                          2.13.6
pycountry                         24.6.1
pydantic                          2.10.5
pydantic_core                     2.27.2
Pygments                          2.19.1
PyJWT                             2.7.0
pytest                            8.3.4
python-dateutil                   2.9.0.post0
python-dotenv                     1.0.1
pytz                              2024.2
PyYAML                            6.0.2
pyzmq                             26.2.0
ray                               2.40.0
referencing                       0.35.1
regex                             2024.11.6
requests                          2.32.3
rich                              13.9.4
rpds-py                           0.22.3
rsa                               4.9
safetensors                       0.5.2
sentencepiece                     0.2.0
setuptools                        75.8.0
six                               1.17.0
smart-open                        7.1.0
sniffio                           1.3.1
starlette                         0.41.3
sympy                             1.13.1
textual                           1.0.0
tiktoken                          0.7.0
tokenizers                        0.21.0
torch                             2.5.1
torchvision                       0.20.1
tqdm                              4.67.1
transformers                      4.48.0
triton                            3.1.0
typing_extensions                 4.12.2
tzdata                            2024.2
uc-micro-py                       1.0.3
urllib3                           2.3.0
uvicorn                           0.34.0
uvloop                            0.21.0
virtualenv                        20.28.1
vllm                              0.6.6.post1
watchfiles                        1.0.4
websockets                        14.1
wheel                             0.45.1
wrapt                             1.17.2
xformers                          0.0.28.post3
xgrammar                          0.1.9
xmltodict                         0.14.2
xxhash                            3.5.0
yarl                              1.18.3
zipp                              3.21.0

Model Input Dumps

No response

🐛 Describe the bug

When starting vLLM like this:

python -m vllm.entrypoints.openai.api_server --model /models/llama-3.3-70b-instruct-fp8-dynamic --host localhost --port 10000

The following error occurs:

INFO 01-14 13:24:33 api_server.py:712] vLLM API server version 0.6.6.post1
INFO 01-14 13:24:33 api_server.py:713] args: Namespace(host='localhost', port=10000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/data/textgen_cache/models/llama-3.3-70b-instruct-fp8-dynamic', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='xgrammar', logits_processor_pattern=None, distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', generation_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False)
INFO 01-14 13:24:33 api_server.py:199] Started engine process with PID 11114
INFO 01-14 13:24:37 config.py:510] This model supports multiple tasks: {'embed', 'reward', 'generate', 'classify', 'score'}. Defaulting to 'generate'.
WARNING 01-14 13:24:38 arg_utils.py:1103] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False.
INFO 01-14 13:24:38 config.py:1458] Chunked prefill is enabled with max_num_batched_tokens=2048.
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/ubuntu/venv/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 774, in <module>
    uvloop.run(run_server(args))
  File "/home/ubuntu/venv/lib/python3.11/site-packages/uvloop/__init__.py", line 105, in run
    return runner.run(wrapper())
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/asyncio/runners.py", line 120, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/home/ubuntu/venv/lib/python3.11/site-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
           ^^^^^^^^^^
  File "/home/ubuntu/venv/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 740, in run_server
    async with build_async_engine_client(args) as engine_client:
  File "/usr/lib/python3.11/contextlib.py", line 204, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/venv/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 118, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
  File "/usr/lib/python3.11/contextlib.py", line 204, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/venv/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 210, in build_async_engine_client_from_engine_args
    engine_config = engine_args.create_engine_config()
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/venv/lib/python3.11/site-packages/vllm/engine/arg_utils.py", line 1238, in create_engine_config
    config = VllmConfig(
             ^^^^^^^^^^^
  File "<string>", line 18, in __init__
  File "/home/ubuntu/venv/lib/python3.11/site-packages/vllm/config.py", line 3114, in __post_init__
    self.quant_config = VllmConfig._get_quantization_config(
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/venv/lib/python3.11/site-packages/vllm/config.py", line 3058, in _get_quantization_config
    quant_config = get_quant_config(model_config, load_config)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/venv/lib/python3.11/site-packages/vllm/model_executor/model_loader/weight_utils.py", line 151, in get_quant_config
    return quant_cls.from_config(hf_quant_config)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/venv/lib/python3.11/site-packages/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py", line 96, in from_config
    sparsity_scheme_map = cls._sparsity_scheme_map_from_config(
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/venv/lib/python3.11/site-packages/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py", line 119, in _sparsity_scheme_map_from_config
    sparsity_config = SparsityCompressionConfig.model_validate(
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/venv/lib/python3.11/site-packages/pydantic/main.py", line 627, in model_validate
    return cls.__pydantic_validator__.validate_python(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pydantic_core._pydantic_core.ValidationError: 1 validation error for SparsityCompressionConfig
format
  Field required [type=missing, input_value={}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.10/v/missing
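
For reference, the failure reproduces with plain Pydantic 2.x: a model with a required format field rejects an empty dict. A minimal sketch (the stand-in model below is hypothetical and only mirrors the required format field, not the actual SparsityCompressionConfig from compressed-tensors):

```python
# Minimal illustration of the ValidationError above. This stand-in model
# is hypothetical; it only mirrors the required "format" field of
# compressed-tensors' SparsityCompressionConfig.
from pydantic import BaseModel, ValidationError

class SparsityConfigStandIn(BaseModel):
    format: str  # required, so an empty dict cannot satisfy it

try:
    # Effectively what vLLM does with "sparsity_config": {}
    SparsityConfigStandIn.model_validate({})
except ValidationError as exc:
    print(exc)  # -> "format Field required [type=missing, input_value={}, ...]"
```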

The model in question is cortecs/Llama-3.3-70B-Instruct-FP8-Dynamic from Hugging Face: https://huggingface.co/cortecs/Llama-3.3-70B-Instruct-FP8-Dynamic/blob/main/config.json

The error does not occur with vLLM 0.6.4.post1 or 0.6.5; it starts happening with 0.6.6.

When the line containing "sparsity_config": {} is removed from the model's config.json, the error doesn't happen and the model works fine even with 0.6.6.post1. While this works as a workaround (a sketch of it follows the config below), the issue is still worth fixing, as there are potentially many models with an empty sparsity_config. The model's full config.json:

{
  "_name_or_path": "/output/Llama-3.3-70B-Instruct-FP8-Dynamic",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": [
    128001,
    128008,
    128009
  ],
  "head_dim": 128,
  "hidden_act": "silu",
  "hidden_size": 8192,
  "initializer_range": 0.02,
  "intermediate_size": 28672,
  "max_position_embeddings": 131072,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 64,
  "num_hidden_layers": 80,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "quantization_config": {
    "config_groups": {
      "group_0": {
        "input_activations": {
          "actorder": null,
          "block_structure": null,
          "dynamic": true,
          "group_size": null,
          "num_bits": 8,
          "observer": null,
          "observer_kwargs": {},
          "strategy": "token",
          "symmetric": true,
          "type": "float"
        },
        "output_activations": null,
        "targets": [
          "Linear"
        ],
        "weights": {
          "actorder": null,
          "block_structure": null,
          "dynamic": false,
          "group_size": null,
          "num_bits": 8,
          "observer": "minmax",
          "observer_kwargs": {},
          "strategy": "channel",
          "symmetric": true,
          "type": "float"
        }
      }
    },
    "format": "float-quantized",
    "global_compression_ratio": 1.463543865167781,
    "ignore": [
      "lm_head"
    ],
    "kv_cache_scheme": null,
    "quant_method": "compressed-tensors",
    "quantization_status": "compressed",
    "sparsity_config": {}
  },
  "rms_norm_eps": 1e-05,
  "rope_scaling": {
    "factor": 8.0,
    "high_freq_factor": 4.0,
    "low_freq_factor": 1.0,
    "original_max_position_embeddings": 8192,
    "rope_type": "llama3"
  },
  "rope_theta": 500000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.47.0",
  "use_cache": true,
  "vocab_size": 128256
}
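
A minimal sketch of the workaround described above, assuming the local model path from the startup command (adjust the path as needed):

```python
# Workaround sketch: drop "sparsity_config" from the model's config.json,
# but only when it is present and empty.
import json
from pathlib import Path

config_path = Path("/models/llama-3.3-70b-instruct-fp8-dynamic/config.json")
config = json.loads(config_path.read_text())

quant = config.get("quantization_config", {})
if quant.get("sparsity_config") == {}:
    del quant["sparsity_config"]
    config_path.write_text(json.dumps(config, indent=2) + "\n")
```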

dsikka (Contributor) commented Jan 14, 2025

Hi @lksj92hs, can you share the recipe you used to quantize?

dsikka (Contributor) commented Jan 14, 2025

Can you also share your version of compressed-tensors?

lksj92hs (Author) commented

Thank you for responding!

I haven't quantized the model myself, but it is on Hugging Face: https://huggingface.co/cortecs/Llama-3.3-70B-Instruct-FP8-Dynamic. Since this is a regression from 0.6.5, I think it's worth fixing, and the fix seems quite simple: ignore sparsity_config if it is an empty dict (or any other falsy value).
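
A minimal sketch of the guard I have in mind (the function name and signature below are hypothetical; this is not the actual vLLM code or the patch in #12057):

```python
# Sketch of the proposed guard: treat a falsy sparsity_config (missing,
# None, or {}) as "no sparsity" instead of handing it to validation.
from typing import Any, Optional

def parse_sparsity_config(hf_quant_config: dict) -> Optional[dict[str, Any]]:
    sparsity_config = hf_quant_config.get("sparsity_config")
    if not sparsity_config:  # covers None and the empty dict from this model
        return None  # nothing to validate, so no ValidationError
    # Otherwise fall through to the existing validation path, e.g.:
    # return SparsityCompressionConfig.model_validate(sparsity_config)
    return sparsity_config
```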

The version of compressed-tensors is listed in the output of pip list in the initial submission:

compressed-tensors                0.8.1

rahul-tuli (Contributor) commented

Hi @lksj92hs,

Thank you for bringing up this issue! The bug related to the ValidationError when loading fp8-dynamic models with an empty "sparsity_config" has been resolved. The fix is included in PR #12057.

Please pull the latest changes and let us know if you encounter any further issues. We appreciate your contribution to improving the project!
