[Backend support] Allow `num_logits_to_keep` as Tensor and change it to `logits_to_keep` + add flag #35757

Cyrilvallez · 2025-01-17T16:07:33Z

What does this PR do?

As per the title. Allowing `num_logits_to_keep` to be a Tensor allows efficient slicing when using the packed tensor format. It will also be useful for us in the future as we integrate the packed format for the FA2 path.
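A minimal sketch of the slicing idea, not the exact code from this PR: an `int` keeps the last N positions along the sequence dimension, while a 1D index tensor selects arbitrary positions, for example the last token of each sequence in a packed batch. The helper name `select_positions` and the variables `seq_lens` and `last_token_idx` are hypothetical and exist only for illustration.

```python
import torch


def select_positions(hidden_states, logits_to_keep=0):
    """Keep only the sequence positions whose logits are needed.

    Hypothetical helper, for illustration only.
    - int n > 0: keep the last n positions
    - int 0: keep every position, since slice(-0, None) == slice(0, None)
    - 1D tensor: keep exactly those indices along the sequence dimension
    """
    if isinstance(logits_to_keep, int):
        slice_indices = slice(-logits_to_keep, None)
    else:
        slice_indices = logits_to_keep
    return hidden_states[:, slice_indices, :]


# Packed batch: three sequences of lengths 3, 2 and 4 concatenated into one row.
seq_lens = torch.tensor([3, 2, 4])
hidden_states = torch.arange(9, dtype=torch.float32).reshape(1, 9, 1).expand(1, 9, 4)

# Index of the last token of each packed sequence.
last_token_idx = torch.cumsum(seq_lens, dim=0) - 1

# Only these three positions would then be passed to the LM head.
kept = select_positions(hidden_states, last_token_idx)
```

With an `int`, a packed batch could only keep a contiguous tail of the concatenated row; the tensor form lets the caller skip the LM head projection for every position it does not need.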

ArthurZucker

Sounds good!
Let's maybe allow for full tensor?

src/transformers/models/aria/modeling_aria.py

ArthurZucker

Thanks for making sure it's compile compatible!

…ggingface#35757) * support * Update modeling_utils.py * style * most models * Other models * fix-copies * tests + generation utils

## Purpose ##
* SparseGPT
  * Fix behavior where `targets` specifies which modules to sparsify, not which layers to target
  * Fix broken behavior with `_infer_owl_layer_sparsity` and add test
  * Fix owl argument validation
  * Add type hints and abstract methods for clarity
* Pipelines
  * Fix bug revealed by decorators added to the llama model definition in the latest transformers release
    * huggingface/transformers#35757
    * For the sequential pipeline, this revealed a bug in torch.fx._symbolic_trace where wrapped functions were not being handled properly
    * Future work could involve upstreaming a bug fix
  * Fix issue caused by changes to llama model definition
    * huggingface/transformers#34858
    * For the layer sequential pipeline, this challenges the assumption that each layer input is the previous layer's output (which was known to be a fragile assumption)
  * Fix issue related to basic pipeline slowdowns and inaccuracy

## Changes ##
* SparseGPT
  * Fully separate `targets` and `sequential_targets`
    * Modify hooks adding logic to reflect this change
  * Fix behavior of `_infer_owl_layer_sparsity` and add test
  * Code clarity
    * Add additional type hints
    * Designate `calibrate_module` as an abstract method on the sgpt mixin
* Pipelines
  * Sequential pipeline: unwrap model forward function to avoid issues with pytorch function patching
  * Layer Sequential Pipeline: Add `maybe_inject_pos_embeddings` to sequential pipeline to hackily support models with `position_embeddings`
  * Basic Pipeline: Fix `on_sequential_batch_end` to call on the end of epoch, rather than every batch
    * Calling every batch was likely causing slowdowns

## Followups ##
* Remove deprecated `sequential_update` option from examples and tests

## Testing ##
* Added `tests/llmcompressor/transformers/obcq/test_obcq_owl.py`
* Tested OBCQ+llama with sequential, layer sequential, and basic pipelines independently

## Regression Evaluations ##
Models were compressed using `examples/sparse_2of4_quantization_fp8/llama3_8b_2of4.py` without fp8 option

<details><summary>sparsegpt</summary>

Main
```
vllm (pretrained=/home/kyle/llm-compressor/Meta-Llama-3-8B-InstructSparseGPTModifierMAIN,dtype=bfloat16,add_bos_token=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: 1
|  Tasks   |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|----------|------:|------|-----:|------|---|-----:|---|-----:|
|winogrande|      1|none  |     5|acc   |↑  |0.6243|±  |0.0136|
```

This branch
```
vllm (pretrained=/home/kyle/llm-compressor/Meta-Llama-3-8B-InstructSparseGPTModifierFEATURE,dtype=bfloat16,add_bos_token=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: 1
|  Tasks   |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|----------|------:|------|-----:|------|---|-----:|---|-----:|
|winogrande|      1|none  |     5|acc   |↑  |0.6306|±  |0.0136|
```
</details>

To test wanda, the `SparseGPTModifier` was replaced with the `WandaPruningModifier`

<details><summary>wanda</summary>

Main
```
vllm (pretrained=/home/kyle/llm-compressor/Meta-Llama-3-8B-InstructWandaPruningModifierMAIN,dtype=bfloat16,add_bos_token=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: 1
|  Tasks   |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|----------|------:|------|-----:|------|---|-----:|---|-----:|
|winogrande|      1|none  |     5|acc   |↑  |0.5912|±  |0.0138|
```

This branch
```
vllm (pretrained=/home/kyle/llm-compressor/Meta-Llama-3-8B-InstructWandaPruningModifierFEATURE,dtype=bfloat16,add_bos_token=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: 1
|  Tasks   |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|----------|------:|------|-----:|------|---|-----:|---|-----:|
|winogrande|      1|none  |     5|acc   |↑  |0.5817|±  |0.0139|
```
</details>

---------

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>

Cyrilvallez requested review from zucchini-nlp, ArthurZucker and Rocketknight1 as code owners January 17, 2025 16:07

Cyrilvallez mentioned this pull request Jan 17, 2025

Efficient Transformers backend support huggingface/text-generation-inference#2858

Closed

ArthurZucker approved these changes Jan 20, 2025

ArthurZucker mentioned this pull request Jan 20, 2025

[Model]: Add transformers backend support vllm-project/vllm#11330

Merged

Cyrilvallez added 7 commits January 21, 2025 17:08

support

2bec2c4

Update modeling_utils.py

9c8c58f

style

7232008

most models

75612ba

Other models

65df3d2

fix-copies

c26ff5f

tests + generation utils

e35d31d

Cyrilvallez force-pushed the tgi-support branch from 73f5f84 to e35d31d Compare January 21, 2025 16:10

ArthurZucker approved these changes Jan 23, 2025

Cyrilvallez merged commit d3af76d into main Jan 23, 2025
26 checks passed

Cyrilvallez deleted the tgi-support branch January 23, 2025 08:47

Cyrilvallez changed the title ~~[Backend support] Allow num_logits_to_keep as Tensor + add flag~~ [Backend support] Allow num_logits_to_keep as Tensor and change it to logits_to_keep + add flag Jan 23, 2025

kylesayrs mentioned this pull request Feb 7, 2025

[Bugfix] SparseGPT, Pipelines vllm-project/llm-compressor#1130

Merged

yundai424 mentioned this pull request Feb 20, 2025

[transformers][FLCE] make compatible with latest (>=4.49.0) XXXForCausalLM.forward APIs linkedin/Liger-Kernel#573

Open
