Fp8 Quantization Support #62

Satrat · 2024-05-22T12:53:34Z

Adds a new fp8 quantization format. For now, fp8 is assumed to be torch.float8_e4m3fn; in the future we could support torch.float8_e5m2 as well by expanding QuantizationArgs.

The main change here is a set of additional checks that distinguish float from int quantization, since the range and rounding are calculated differently. Because the logic for fp8 compression is the same as for int8 aside from the cast, I merged them into a single compressor. The int8 compressor can still be referenced by its original "int-quantized" name, so this won't break anything on the sparseml or vllm side.
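To make the float vs int difference concrete, here is a minimal pure-Python sketch. It is not the code from this PR: `round_to_fp8_e4m3` and `fake_quantize` are hypothetical names, and the rounding helper is only a stand-in for clamping and then casting to torch.float8_e4m3fn, assuming the e4m3 layout (4 exponent bits with bias 7, 3 mantissa bits, largest finite value 448).

```python
import math

INT8_MIN, INT8_MAX = -128, 127   # range of a signed 8-bit integer
FP8_E4M3_MAX = 448.0             # largest finite e4m3 value (assumed layout)


def round_to_fp8_e4m3(x: float) -> float:
    """Clamp x to [-448, 448] and round it to the nearest e4m3 value.

    Hypothetical stand-in for a clamp followed by a cast to
    torch.float8_e4m3fn; ties are rounded half-to-even.
    """
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), FP8_E4M3_MAX)
    _, e = math.frexp(a)             # a = m * 2**e with 0.5 <= m < 1
    exponent = max(e - 1, -6)        # below 2**-6 the values are subnormal
    step = 2.0 ** (exponent - 3)     # spacing of representable values here
    return sign * round(a / step) * step  # Python's round() is half-to-even


def fake_quantize(x: float, scale: float, kind: str) -> float:
    """Quantize x with the given scale, then map it back to a float."""
    scaled = x / scale
    if kind == "int":
        # integers: round to a whole number, then clamp to the int8 range
        q = max(INT8_MIN, min(INT8_MAX, round(scaled)))
    else:
        # floats: clamp to the fp8 range, then round to a representable value
        q = round_to_fp8_e4m3(scaled)
    return q * scale


print(fake_quantize(0.3, 1.0, "int"))     # 0.0
print(fake_quantize(0.3, 1.0, "float"))   # 0.3125
print(fake_quantize(1000.0, 1.0, "int"))  # 127.0
```

The int path has a uniform step of 1 across a range of [-128, 127], while the float path has a wider range but a step that grows with magnitude, which is why the two cases need separate range and rounding logic even though the rest of the pipeline can be shared.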

Testing

Added unit tests covering compression/decompression and scale/zero-point calculations.
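As a rough illustration of what a scale/zero-point check can look like, the sketch below uses a generic affine quantization formula. It is not the test code added in this PR: `compute_qparams`, `quantize`, and `dequantize` are hypothetical helpers, and the formulas are an assumption rather than the library's exact implementation.

```python
def compute_qparams(min_val, max_val, q_min, q_max, symmetric):
    """Return (scale, zero_point) mapping [min_val, max_val] onto [q_min, q_max]."""
    # The observed range is widened so that it always contains zero.
    min_val, max_val = min(min_val, 0.0), max(max_val, 0.0)
    if symmetric:
        # Symmetric: the range is centered on zero, so the zero point is 0.
        max_abs = max(abs(min_val), abs(max_val))
        return max_abs / q_max, 0
    # Asymmetric: spread the full observed range over the quantized range.
    scale = (max_val - min_val) / (q_max - q_min)
    zero_point = round(q_min - min_val / scale)
    return scale, zero_point


def quantize(x, scale, zero_point, q_min, q_max):
    """Integer quantization: scale, round, shift by the zero point, clamp."""
    q = round(x / scale) + zero_point
    return max(q_min, min(q_max, q))


def dequantize(q, scale, zero_point):
    """Inverse of quantize, up to rounding error."""
    return (q - zero_point) * scale


# Asymmetric int8 over an observed range of [-1, 3]
scale, zp = compute_qparams(-1.0, 3.0, -128, 127, symmetric=False)
for x in (-1.0, 0.0, 0.5, 3.0):
    restored = dequantize(quantize(x, scale, zp, -128, 127), scale, zp)
    # In-range values should come back within half a quantization step.
    assert abs(restored - x) <= scale / 2 + 1e-9

# Symmetric fp8 (e4m3 range of +/-448, assumed): zero point is always 0
fp8_scale, fp8_zp = compute_qparams(-2.0, 1.0, -448.0, 448.0, symmetric=True)
assert fp8_zp == 0
```

Two properties are worth asserting in tests of this kind: zero maps back to exactly zero, and the round-trip error for in-range values stays within half a quantization step.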

Example/Evaluation

Requires sparseml FP8 PR to run: neuralmagic/sparseml#2306

import torch
from sparseml.transformers import SparseAutoModelForCausalLM, oneshot


# define a sparseml recipe for GPTQ floating-point W8A8 quantization
recipe = """
test_stage:
    quant_modifiers:
        GPTQModifier:
            sequential_update: false
            ignore: ["lm_head"]
            config_groups:
                group_0:
                    weights:
                        num_bits: 8
                        type: "float"
                        symmetric: true
                        strategy: "tensor"
                    input_activations:
                        num_bits: 8
                        type: "float"
                        symmetric: true
                        strategy: "tensor"
                    targets: ["Linear"]
"""

# setting device_map to auto to spread the model evenly across all available GPUs
model_stub = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"
model = SparseAutoModelForCausalLM.from_pretrained(
    model_stub, torch_dtype=torch.float32, device_map="auto"
)

# uses SparseML's built-in preprocessing for UltraChat
dataset = "ultrachat-200k"

# output location for the quantized model
output_dir = "/network/sadkins/llama1.1b_fp8_gptq"

# set dataset config parameters
splits = {"calibration": "train_gen[:5%]"}
max_seq_length = 512
pad_to_max_length = False
num_calibration_samples = 512

# apply recipe to the model and save quantized output compressed to fp8
oneshot(
    model=model,
    dataset=dataset,
    recipe=recipe,
    output_dir=output_dir,
    splits=splits,
    max_seq_length=max_seq_length,
    pad_to_max_length=pad_to_max_length,
    num_calibration_samples=num_calibration_samples,
    save_compressed=True
)

Evaluated with sparseml.evaluate /network/sadkins/llama1.1b_fp8_gptq -d wikitext -i lm-evaluation-harness. Perplexity looks good at 14.53, compared with 14.43 for the dense input model (about 0.7% higher).

mgoin

nice work!

src/compressed_tensors/compressors/model_compressor.py

src/compressed_tensors/quantization/utils/helpers.py

Sara Adkins added 6 commits May 15, 2024 22:21

small fixes

4ecc35d

initial commit

d53eac1

Merge branch 'main' into sa/fp8

89e5d33

bug fixes

816a0e1

cleanup

6ca7843

clarity comments

9fdb764

Satrat requested review from bfineran, dsikka, rahul-tuli, horheynm and dbogunowicz May 22, 2024 12:53

Sara Adkins added 2 commits May 22, 2024 15:15

clean up compression classes

a2cdba6

fixing zero point issues

1854bb5

Satrat marked this pull request as ready for review May 22, 2024 19:22

mgoin reviewed May 23, 2024

src/compressed_tensors/compressors/model_compressor.py (resolved)

Sara Adkins added 2 commits May 28, 2024 15:40

comment for hack

6a8420a

Merge branch 'main' into sa/fp8

6981d4e

Satrat mentioned this pull request May 28, 2024

FP8 Quantization Support neuralmagic/sparseml#2306

Closed

bfineran previously approved these changes May 28, 2024

src/compressed_tensors/quantization/utils/helpers.py (outdated, resolved)

Merge branch 'main' into sa/fp8

0d9001f

Satrat dismissed bfineran’s stale review via 0d9001f May 29, 2024 19:44

Sara Adkins added 5 commits May 29, 2024 19:46

update quant check

ef57cf4

cleanup fp8 dtypes

3baefc5

cleanup

6443bb8

clean up observer

3a42557

dtype fix

3a4be13

Satrat requested a review from bfineran May 29, 2024 21:51

docstrings

bc98eee

Satrat requested a review from mgoin May 30, 2024 14:28

Merge branch 'main' into sa/fp8

944e27f

Merge branch 'main' into sa/fp8

ac10ffd

dbogunowicz previously approved these changes Jun 12, 2024

Merge branch 'main' into sa/fp8

18dad65

Satrat dismissed dbogunowicz’s stale review via 18dad65 June 14, 2024 13:57

Sara Adkins added 2 commits June 14, 2024 18:35

fixes after rebase

18846ec

test fixes

7b08b9d

bfineran previously approved these changes Jun 17, 2024

Satrat requested a review from dbogunowicz June 17, 2024 18:19

dbogunowicz previously approved these changes Jun 18, 2024

Merge branch 'main' into sa/fp8

7101f33

Satrat dismissed stale reviews from dbogunowicz and bfineran via 7101f33 June 19, 2024 17:45

Sara Adkins added 3 commits June 19, 2024 17:47

style

755fee5

get rid of broken segment

6836c80

fix broken code

b6d2470

bfineran approved these changes Jun 19, 2024

Satrat requested a review from dbogunowicz June 20, 2024 13:58

dbogunowicz approved these changes Jun 20, 2024

Satrat merged commit 75436f6 into main Jun 20, 2024
1 check passed

Satrat deleted the sa/fp8 branch June 20, 2024 14:21
