[CORE] Add Bitblas Format/Kernel Support #39

Merged
Qubitium merged 49 commits into main from zx_support_BitBLAS on Jun 25, 2024

Conversation

ZX-ModelCloud (Collaborator)

No description provided.

@Qubitium (Collaborator) commented Jun 21, 2024

@ZX-ModelCloud please rename usages of BaseQuantizeConfig to QuantizeConfig. BaseQuantizeConfig is kept only for backward compatibility with old AutoGPTQ users.
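
For reference, a minimal sketch of the naming pattern described above, assuming a dataclass-style config; the fields and defaults here are illustrative and not copied from the PR:

```python
from dataclasses import dataclass


@dataclass
class QuantizeConfig:
    # Core GPTQ quantization settings (illustrative defaults).
    bits: int = 4
    group_size: int = 128
    desc_act: bool = False
    sym: bool = True


# Deprecated alias kept only for backward compatibility with old AutoGPTQ user
# code; new code and internal call sites should use QuantizeConfig directly.
BaseQuantizeConfig = QuantizeConfig
```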

@Qubitium (Collaborator)

@ZX-ModelCloud Make sure all new methods are properly type hinted.
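
As an illustration of the requested style (the class and method below are hypothetical, not taken from this PR):

```python
from typing import Optional

import torch


class BitblasQuantLinear:  # hypothetical class used only for illustration
    def pack(self, linear: torch.nn.Linear, scales: torch.Tensor,
             zeros: Optional[torch.Tensor] = None) -> None:
        # Every new method should annotate its parameters and return type like this.
        ...
```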

@Qubitium changed the title from "support bitblas" to "Add Bitblas qlinear" on Jun 21, 2024
@Qubitium changed the title from "Add Bitblas qlinear" to "[CORE] Add Bitblas qlinear" on Jun 21, 2024
@Qubitium (Collaborator) commented Jun 21, 2024

@LeiWang1999

  1. I have pushed a PR to fix BitBLAS, since GPU detection via nvidia-smi is incomplete/broken: [FIX] GPU detection in multi-GPU env and OEM A100 not matching TVM microsoft/BitBLAS#58
  2. Using a quantized TinyLlama model as a test, the first layer takes about 300s to convert to BitBLAS, while from the 2nd layer onward conversion is very fast. Is TVM/BitBLAS JIT-compiling the kernel for layer 0 and reusing it for subsequent layers?

For now, we will monkeypatch bitblas so we can get it working with this PR in our multi-GPU env (a sketch of the patch follows below).

Thanks!
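
Roughly, the monkeypatch looks like the sketch below. It assumes auto_detect_nvidia_target() is importable from bitblas.utils and returns a TVM target string; the hard-coded target is just an example for our A100 machines:

```python
import bitblas.utils


def pinned_auto_detect_nvidia_target(*args, **kwargs) -> str:
    # Skip the nvidia-smi based detection, which misbehaves on multi-GPU hosts and
    # with OEM A100 boards (e.g. PG509-232), and pin the TVM target instead.
    return "nvidia/nvidia-a100"  # example target string; adjust per GPU


# Replace the detection function before any BitBLAS operators are constructed.
bitblas.utils.auto_detect_nvidia_target = pinned_auto_detect_nvidia_target
```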

@LeiWang1999 (Contributor)

Hi @Qubitium, thank you for your attention. BitBLAS takes several minutes to compile dynamic shapes when it encounters a new operator configuration. The tuning results are saved by default in a database located at ~/.cache/bitblas, so the next time layer 0 is initialized the operator loads directly from the database, bypassing the ~5-minute wait.

And feel free to open a pull request to BitBLAS with the multi-GPU detection patch!
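
To make the caching behavior concrete, here is a rough sketch based on the public bitblas.Matmul API (exact config fields may differ by version). The first construction of a given operator config triggers tuning and persists the result under ~/.cache/bitblas; an identical construction later loads from that database instead of recompiling:

```python
import bitblas

# First time this exact config is seen, BitBLAS tunes and compiles the kernel
# (the multi-minute wait) and stores the result in ~/.cache/bitblas.
config = bitblas.MatmulConfig(
    M=1,        # single-token decode shape
    N=5632,     # illustrative TinyLlama MLP dimensions
    K=2048,
    A_dtype="float16",
    W_dtype="int4",
    accum_dtype="float16",
    out_dtype="float16",
    layout="nt",
    with_bias=False,
    group_size=128,
    with_scaling=True,
    with_zeros=True,
    zeros_mode="quantized",
)
matmul = bitblas.Matmul(config=config)

# A later process constructing an operator with the same config should load the
# tuned kernel straight from the database and skip compilation entirely.
```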

@Qubitium (Collaborator)

> And feel free to open a pull request to BitBLAS with the multi-GPU detection patch!

PR created: microsoft/BitBLAS#58

We will run some benchmarks on a quantized Llama 7/8B model. If BitBLAS beats out Marlin and Exllama v1/v2, we will use it as the default =)
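
A rough sketch of the kind of throughput comparison we have in mind (not the actual harness; the backend-selection kwarg shown here is an assumption, and the real loader parameter may be named differently):

```python
import time

import torch
from transformers import AutoTokenizer
from gptqmodel import GPTQModel


def tokens_per_second(model, tokenizer, prompt: str, new_tokens: int = 128) -> float:
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")
    torch.cuda.synchronize()
    start = time.perf_counter()
    model.generate(input_ids, max_new_tokens=new_tokens, do_sample=False)
    torch.cuda.synchronize()
    return new_tokens / (time.perf_counter() - start)


model_id = "TinyLlama-1.1B-gptq-4bit"  # placeholder path to a quantized checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
for backend in ("marlin", "exllama_v2", "bitblas"):
    model = GPTQModel.from_quantized(model_id, backend=backend)  # assumed kwarg
    print(backend, tokens_per_second(model, tokenizer, "The capital of France is"))
```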

@Qubitium (Collaborator)

Status update: Basic inference tests passed. Now we are moving to benchmarks.

@Qubitium marked this pull request as ready for review on Jun 24, 2024
@Qubitium merged commit 22c0276 into main on Jun 25, 2024 (2 of 3 checks passed)
@Qubitium deleted the zx_support_BitBLAS branch on Jun 25, 2024
@LeiWang1999 (Contributor)

Thanks for all of your hard efforts! @Qubitium @ZX-ModelCloud

DeJoker pushed a commit to DeJoker/GPTQModel that referenced this pull request on Jul 19, 2024:
* Fix model loader code that was using a bad default of float16

* Try to fix from_quantized error

* Set dtype to fp16 in inference mode
DeJoker pushed a commit to DeJoker/GPTQModel that referenced this pull request on Jul 19, 2024:
* support bitblas.

* Add use_bitblas flag to load_model_tokenizer function and main function

* Fix bitblas backend initialization and matrix multiplication

* Remove print statement and save quantized model

* Fix bitblas backend initialization and matrix multiplication

* BitBLAS Support

* revert example

* cleanup

* add apache

* clean

* cleanup

* clean hip

* msg

* fix package name

* monkey patch auto_detect_nvidia_target()

* Fix safe_save code that was mistakenly removed

* pass GPU ID

* fix wrong FORMAT state set at end of conversion + remove compat and replace print with logger.info

* bitblas will be compiled by default

* add SUPPORTED_BITS and SUPPORTED_GROUP_SIZES to BaseQuantLinear so we can consolidate validation code

* Remove impossible code

* refactor structure and make sure params are passed in the same order

* fix pg509-232 not mapped to a100

* resolve commit issues

* resolve commit issues

* refactor

* resolve commit

* lock pci id order for tests

* fix issue: convert_to_marlin() missing 2 required positional arguments: 'sym' and 'desc_act'

* resolve test

* Fix if statement

* accelerate.utils.modeling.load_checkpoint_in_model() is no longer called when use_bitblas is True

* add refactor todo lists

* If we use marlin or bitblas to load the quantized model, the model is already a converted model, and we no longer need to call load_checkpoint_in_model()

* Now supports saving quant models in BitBLAS format

* cleanup

* add test_q4_bitblas.py

* Fix unpacked_zeros calculation error

* add FORMAT.BITBLAS

* cleanup

* BITBLAS only supported when desc_act is False

---------

Co-authored-by: LeiWang <leiwang1999@outlook.com>
Co-authored-by: diegomontoya <xing@fictionpress.com>
Co-authored-by: Qubitium <Qubitium@modelcloud.ai>