
int8 dynamic prefill weight only decode #1436

Merged (63 commits) on Dec 30, 2024

Conversation

@jcaip (Contributor) commented Dec 18, 2024

This PR adds a weight_only_decode option to int8_dynamic_activation_int8_weight. When set, it uses dynamic activation quantization for matmuls of shape (> 1, x) * (x, n) and weight-only quantization for the batch_size=1 case.

It also updates generate.py to take a text file as the prompt; we use this to demonstrate the prefill speedups with sh demo_summarize.sh.
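The dispatch rule described above (dynamic activation quantization for prefill batches, weight-only quantization for single-token decode) can be sketched in plain Python. This is a toy illustration, not torchao's actual implementation; `quantize_weight`, `quantize_int8`, and `matmul_int8` are hypothetical helper names, and the weight is stored as a list of columns:

```python
def quantize_int8(vals):
    """Symmetric per-tensor int8 quantization: returns (int8 values, scale)."""
    amax = max(abs(v) for v in vals) or 1.0
    scale = amax / 127.0
    return [max(-128, min(127, round(v / scale))) for v in vals], scale

def quantize_weight(cols):
    """Quantize an (x, n) weight, given as n columns of length x, per-tensor."""
    flat = [v for col in cols for v in col]
    amax = max(abs(v) for v in flat) or 1.0
    scale = amax / 127.0
    q_cols = [[max(-128, min(127, round(v / scale))) for v in col] for col in cols]
    return q_cols, scale

def matmul_int8(x, w_q, w_scale):
    """x: list of float rows; w_q: quantized weight columns; returns float rows."""
    out = []
    if len(x) > 1:
        # prefill (batch_size > 1): dynamically quantize activations too,
        # so the inner products run in int8 x int8
        for row in x:
            x_q, x_scale = quantize_int8(row)
            out.append([sum(a * b for a, b in zip(x_q, col)) * x_scale * w_scale
                        for col in w_q])
    else:
        # decode (batch_size == 1): weight-only -- dequantize the weights
        # and keep the activations in floating point
        for row in x:
            out.append([sum(a * (b * w_scale) for a, b in zip(row, col))
                        for col in w_q])
    return out
```

Both branches approximate the same float matmul; the point of the PR is that the compute-bound prefill case benefits from int8 x int8 dynamic quantization, while the memory-bound decode case is better served by weight-only quantization.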

pytorch-bot commented Dec 18, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/1436

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit b144a53 with merge base 567cb46:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label Dec 18, 2024
@jcaip changed the title from "Jcaip/prefill 24 sparse benchmarking" to "int8 dynamic prefill weight only decode" Dec 30, 2024
@jcaip added the topic: improvement and topic: performance labels Dec 30, 2024
@jcaip merged commit 52b6f4d into main Dec 30, 2024
19 of 20 checks passed
amdfaa pushed a commit that referenced this pull request Jan 10, 2025