[CORE] Add Bitblas Format/Kernel Support #39
Conversation
@ZX-ModelCloud Rename usage of BaseQuantizeConfig to QuantizeConfig. BaseQuantizeConfig is kept only for backward compatibility with old AutoGPTQ users.
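A minimal sketch of how such a backward-compatibility alias could look (the field set and warning text are assumptions, not this PR's exact code):

```python
# Hedged sketch: keep BaseQuantizeConfig importable for old AutoGPTQ users
# while steering all internal code to QuantizeConfig. Field list and warning
# wording are illustrative assumptions.
import warnings

class QuantizeConfig:
    def __init__(self, bits: int = 4, group_size: int = 128,
                 desc_act: bool = False, sym: bool = True):
        self.bits = bits
        self.group_size = group_size
        self.desc_act = desc_act
        self.sym = sym

class BaseQuantizeConfig(QuantizeConfig):
    """Deprecated alias kept only for backward compat with AutoGPTQ users."""
    def __init__(self, *args, **kwargs):
        warnings.warn(
            "BaseQuantizeConfig is deprecated; use QuantizeConfig instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        super().__init__(*args, **kwargs)
```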
@ZX-ModelCloud Make sure all new methods are properly type hinted.
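For illustration, the requested style applied to a hypothetical validation helper (the function name and signature are mine, not the PR's; the desc_act constraint is taken from the commit log below):

```python
# Illustrative only: a fully type-hinted signature for a hypothetical helper
# on the quant-linear path.
from typing import Optional

def validate_config(bits: int, group_size: int,
                    desc_act: bool, sym: bool,
                    backend: Optional[str] = None) -> None:
    # Raise on unsupported combinations; BitBLAS requires desc_act=False.
    if backend == "bitblas" and desc_act:
        raise ValueError("BitBLAS only supports desc_act=False")
```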
For now, we will monkeypatch bitblas so we can get it working with this PR in our multi-GPU env. Thanks!
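A minimal sketch of the monkeypatch idea, assuming bitblas.utils.auto_detect_nvidia_target is the detection entry point (it is named in this PR's commit log); the pinned target string is illustrative only:

```python
# Hedged sketch: on a multi-GPU host, replace BitBLAS's auto-detection with a
# function that returns the target for the GPU we actually intend to use.
# The target string below is an example for an A100, not the PR's exact code.
import bitblas.utils

def _pinned_nvidia_target() -> str:
    return "nvidia/nvidia-a100"  # hypothetical hard-pinned TVM target

bitblas.utils.auto_detect_nvidia_target = _pinned_nvidia_target
```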
Hi @Qubitium, thank you for your attention. BitBLAS takes several minutes to compile dynamic shapes when it encounters a new operator configuration; the tuning results are saved by default in a local database. And feel free to open a pull request to BitBLAS with the multi-GPU detection patch!
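To illustrate the compile-then-cache behavior described above, a warm-up along these lines pays the tuning cost once up front (the MatmulConfig fields follow BitBLAS's public examples; the shapes are assumptions):

```python
# Hedged sketch: trigger BitBLAS tuning for one operator configuration at
# startup so the multi-minute compile happens (and is cached) before serving.
# Shapes and dtypes are illustrative; field names follow BitBLAS's README.
import bitblas

config = bitblas.MatmulConfig(
    M=1, N=4096, K=4096,  # one representative GEMM shape
    A_dtype="float16", W_dtype="int4",
    accum_dtype="float16", out_dtype="float16",
)
kernel = bitblas.Matmul(config=config)  # first construction compiles and caches
```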
PR created: microsoft/BitBLAS#58. We will run some benchmarks on a quantized Llama 7/8B model. If it beats out Marlin and ExLlama v1/v2, we will use it as the default. =)
Status update: basic inference tests passed. Now we are moving on to benchmarks.
Thanks for all of your hard work! @Qubitium @ZX-ModelCloud
* Fix model loader code that was using a bad default of float16
* Try to fix from_quantized error
* Set dtype to fp16 in inference mode
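The dtype fix in these commits amounts to loading weights in fp16 explicitly for inference instead of trusting an implicit default; a hedged sketch (from_quantized appears in this PR's log, but the class name, import path, and keyword are assumptions):

```python
# Hedged sketch of the fp16-at-inference fix; `from_quantized` is named in
# the commit log, while the model class and keyword below are assumptions.
import torch
from gptqmodel import GPTQModel  # hypothetical import path

model = GPTQModel.from_quantized(
    "path/to/quantized-model",
    torch_dtype=torch.float16,  # be explicit instead of relying on the default
)
```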
* Support BitBLAS
* Add use_bitblas flag to load_model_tokenizer function and main function
* Fix bitblas backend initialization and matrix multiplication
* Remove print statement and save quantized model
* Fix bitblas backend initialization and matrix multiplication
* BitBLAS support
* Revert example
* Cleanup
* Add Apache license
* Clean
* Cleanup
* Clean HIP
* msg
* Fix package name
* Monkeypatch auto_detect_nvidia_target()
* Fix safe_save code that was mistakenly removed
* Pass GPU ID
* Fix wrong FORMAT state set at end of conversion + remove compat and replace print with logger.info
* BitBLAS will be compiled by default
* Add SUPPORTED_BITS and SUPPORTED_GROUP_SIZES to BaseQuantLinear so we can consolidate validation code
* Remove impossible code
* Refactor structure and make sure params are passed in the same order
* Fix pg509-232 not mapped to A100
* Resolve commit issues
* Resolve commit issues
* Refactor
* Resolve commit
* Lock PCI ID order for tests
* Fix issue: convert_to_marlin() missing 2 required positional arguments: 'sym' and 'desc_act'
* Resolve test
* Fix if statement
* accelerate.utils.modeling.load_checkpoint_in_model() is no longer called when use_bitblas is True
* Add refactor TODO lists
* If we use Marlin or BitBLAS to load the quantized model, it is already converted and we no longer need to call load_checkpoint_in_model()
* Now supports saving quantized models in BitBLAS format
* Cleanup
* Add test_q4_bitblas.py
* Fix unpacked_zeros calculation error
* Add FORMAT.BITBLAS
* Cleanup
* BitBLAS is only supported when desc_act is False

---------

Co-authored-by: LeiWang <leiwang1999@outlook.com>
Co-authored-by: diegomontoya <xing@fictionpress.com>
Co-authored-by: Qubitium <Qubitium@modelcloud.ai>
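Pulling the user-visible pieces of this log together, usage might look like the following (use_bitblas, FORMAT.BITBLAS, and saving in BitBLAS format are named above; the surrounding API is assumed):

```python
# Hedged end-to-end sketch based only on names in the commit log above
# (use_bitblas, FORMAT.BITBLAS, saving quant models in BitBLAS format);
# the class name, import path, and keyword plumbing are assumptions.
from gptqmodel import GPTQModel  # hypothetical import path

# Load a GPTQ checkpoint through the BitBLAS kernel path.
model = GPTQModel.from_quantized("path/to/gptq-model", use_bitblas=True)

# Re-save so subsequent loads skip the GPTQ -> BitBLAS conversion step.
model.save_quantized("path/to/bitblas-model")  # stored as FORMAT.BITBLAS
```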