Fp8 Quantization Support #62

Merged
merged 26 commits into main from sa/fp8 on Jun 20, 2024

Conversation

@Satrat Satrat commented May 22, 2024

Adds a new fp8 quantization format. For now, fp8 is assumed to be torch.float8_e4m3fn; in the future we could expand to support torch.float8_e5m2 as well by expanding QuantizationArgs.
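
As a rough sketch (not the exact class definition), the float type would be selected through the same QuantizationArgs fields used in the recipe below; treating num_bits=8 with type "float" as torch.float8_e4m3fn is the assumption described above:

from compressed_tensors.quantization import QuantizationArgs

# hypothetical usage sketch: 8-bit float weights, per-tensor symmetric scheme
fp8_args = QuantizationArgs(num_bits=8, type="float", symmetric=True, strategy="tensor")

# assumption: num_bits=8 + type="float" currently resolves to torch.float8_e4m3fn;
# supporting torch.float8_e5m2 would mean adding a way to select the fp8 variant here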

The main change here is adding some additional checks to deal with float vs int quantization, since the range and rounding are calculated differently. Because the logic for fp8 compression is the same as int8 aside from a difference in the cast, I merged them into a single compressor. However, the int8 compressor can still be referenced by its original "int-quantized" name, so this won't break anything on the sparseml or vllm side.
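
To illustrate the float-vs-int difference with plain torch (a minimal sketch, not the compressor's actual code): for float quantization the representable range comes from torch.finfo of the fp8 dtype and the cast itself does the rounding, while int quantization uses the integer bit-width range with explicit rounding:

import torch

def quantize_per_tensor(x: torch.Tensor, num_bits: int = 8, is_float: bool = True):
    if is_float:
        # float quantization: range comes from the fp8 dtype itself
        q_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn
        scale = x.abs().max() / q_max
        # casting to fp8 rounds to the nearest representable value
        q = (x / scale).clamp(-q_max, q_max).to(torch.float8_e4m3fn)
    else:
        # int quantization: symmetric int8 range with explicit rounding
        q_max = 2 ** (num_bits - 1) - 1  # 127 for int8
        scale = x.abs().max() / q_max
        q = (x / scale).round().clamp(-q_max - 1, q_max).to(torch.int8)
    return q, scale

# decompression is the same cast-and-rescale in both cases
x = torch.randn(64, 64)
q_fp8, scale_fp8 = quantize_per_tensor(x, is_float=True)
x_hat = q_fp8.to(torch.float32) * scale_fp8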

Testing

Added unit tests covering compression/decompression and scale/zero-point calculations.
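
The tests themselves are in this PR; as a standalone sketch of the kind of round-trip check added (plain torch, not the test suite's actual helpers):

import torch

def test_fp8_roundtrip():
    torch.manual_seed(0)
    weight = torch.randn(128, 128)

    # per-tensor symmetric scale against the fp8 representable range
    scale = weight.abs().max() / torch.finfo(torch.float8_e4m3fn).max
    compressed = (weight / scale).to(torch.float8_e4m3fn)
    decompressed = compressed.to(torch.float32) * scale

    # fp8 is lossy, so only check that the reconstruction error stays small
    assert torch.allclose(weight, decompressed, atol=0.1 * weight.abs().max().item())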

Example/Evaluation

Requires sparseml FP8 PR to run: neuralmagic/sparseml#2306

import torch
from sparseml.transformers import SparseAutoModelForCausalLM, oneshot


# define a sparseml recipe for GPTQ floating point W8A8 quantization
recipe = """
test_stage:
    quant_modifiers:
        GPTQModifier:
            sequential_update: false
            ignore: ["lm_head"]
            config_groups:
                group_0:
                    weights:
                        num_bits: 8
                        type: "float"
                        symmetric: true
                        strategy: "tensor"
                    input_activations:
                        num_bits: 8
                        type: "float"
                        symmetric: true
                        strategy: "tensor"
                    targets: ["Linear"]
"""

# setting device_map to auto to spread the model evenly across all available GPUs
model_stub = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"
model = SparseAutoModelForCausalLM.from_pretrained(
    model_stub, torch_dtype=torch.float32, device_map="auto"
)

# uses SparseML's built-in preprocessing for ultra chat
dataset = "ultrachat-200k"

# save location of quantized model out
output_dir = "/network/sadkins/llama1.1b_fp8_gptq"

# set dataset config parameters
splits = {"calibration": "train_gen[:5%]"}
max_seq_length = 512
pad_to_max_length = False
num_calibration_samples = 512

# apply recipe to the model and save quantized output compressed to fp8
oneshot(
    model=model,
    dataset=dataset,
    recipe=recipe,
    output_dir=output_dir,
    splits=splits,
    max_seq_length=max_seq_length,
    pad_to_max_length=pad_to_max_length,
    num_calibration_samples=num_calibration_samples,
    save_compressed=True
)
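
Once oneshot finishes, the compressed fp8 checkpoint lands in output_dir; reloading it for a quick sanity check should just be the usual from_pretrained call (a sketch, assuming the linked sparseml branch handles decompression of the fp8 format on load):

from sparseml.transformers import SparseAutoModelForCausalLM

# hypothetical reload of the checkpoint saved above
reloaded = SparseAutoModelForCausalLM.from_pretrained(output_dir, device_map="auto")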

Evaluated with sparseml.evaluate /network/sadkins/llama1.1b_fp8_gptq -d wikitext -i lm-evaluation-harness. Perplexity looks good at 14.53; it was 14.43 for the dense input model.

@Satrat Satrat marked this pull request as ready for review May 22, 2024 19:22

@mgoin (Member) left a comment

nice work!

bfineran previously approved these changes May 28, 2024
Review thread on src/compressed_tensors/quantization/utils/helpers.py (outdated, resolved)
@Satrat Satrat requested a review from bfineran May 29, 2024 21:51
@Satrat Satrat requested a review from mgoin May 30, 2024 14:28
dbogunowicz previously approved these changes Jun 12, 2024
bfineran previously approved these changes Jun 17, 2024
@Satrat Satrat requested a review from dbogunowicz June 17, 2024 18:19
dbogunowicz previously approved these changes Jun 18, 2024
@Satrat Satrat dismissed stale reviews from dbogunowicz and bfineran via 7101f33 June 19, 2024 17:45
@Satrat Satrat requested a review from dbogunowicz June 20, 2024 13:58
@Satrat Satrat merged commit 75436f6 into main Jun 20, 2024
1 check passed
@Satrat Satrat deleted the sa/fp8 branch June 20, 2024 14:21