[CORE] Add Bitblas Format/Kernel Support #39

Merged
Qubitium merged 49 commits into main from zx_support_BitBLAS on Jun 25, 2024

Conversation

ZX-ModelCloud (Collaborator)

No description provided.

@Qubitium (Collaborator) commented Jun 21, 2024

@ZX-ModelCloud please rename usages of BaseQuantizeConfig to QuantizeConfig. BaseQuantizeConfig is kept only for backward compatibility with old AutoGPTQ users.
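
For reference, a minimal sketch of the naming pattern described above, assuming a dataclass-style config; the fields and defaults here are illustrative and not copied from the PR:

```python
from dataclasses import dataclass


@dataclass
class QuantizeConfig:
    # Core GPTQ quantization settings (illustrative defaults).
    bits: int = 4
    group_size: int = 128
    desc_act: bool = False
    sym: bool = True


# Deprecated alias kept only for backward compatibility with old AutoGPTQ user
# code; new code and internal call sites should use QuantizeConfig directly.
BaseQuantizeConfig = QuantizeConfig
```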

@Qubitium (Collaborator)

@ZX-ModelCloud Make sure all new methods are properly type hinted.
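
As an illustration of the requested style (the class and method below are hypothetical, not taken from this PR):

```python
from typing import Optional

import torch


class BitblasQuantLinear:  # hypothetical class used only for illustration
    def pack(self, linear: torch.nn.Linear, scales: torch.Tensor,
             zeros: Optional[torch.Tensor] = None) -> None:
        # Every new method should annotate its parameters and return type like this.
        ...
```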

@Qubitium changed the title from "support bitblas" to "Add Bitblas qlinear" on Jun 21, 2024
@Qubitium changed the title from "Add Bitblas qlinear" to "[CORE] Add Bitblas qlinear" on Jun 21, 2024
@Qubitium (Collaborator) commented Jun 21, 2024

@LeiWang1999

  1. I have pushed a PR to fix BitBLAS, since GPU detection via nvidia-smi is incomplete/broken: [FIX] GPU detection in multi-GPU env and OEM A100 not matching TVM microsoft/BitBLAS#58
  2. Using a quantized TinyLlama model as a test, the first layer takes about 300s to convert to BitBLAS, while from the 2nd layer onward conversion is very fast. Is TVM/BitBLAS JIT-compiling the kernel for layer 0 and reusing it for subsequent layers?

For now, we will monkeypatch bitblas so we can get it working with this PR in our multi-GPU env (a sketch of the patch follows below).

Thanks!
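
Roughly, the monkeypatch looks like the sketch below. It assumes auto_detect_nvidia_target() is importable from bitblas.utils and returns a TVM target string; the hard-coded target is just an example for our A100 machines:

```python
import bitblas.utils


def pinned_auto_detect_nvidia_target(*args, **kwargs) -> str:
    # Skip the nvidia-smi based detection, which misbehaves on multi-GPU hosts and
    # with OEM A100 boards (e.g. PG509-232), and pin the TVM target instead.
    return "nvidia/nvidia-a100"  # example target string; adjust per GPU


# Replace the detection function before any BitBLAS operators are constructed.
bitblas.utils.auto_detect_nvidia_target = pinned_auto_detect_nvidia_target
```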

@LeiWang1999 (Contributor)

Hi @Qubitium, thank you for your attention. BitBLAS takes several minutes to compile dynamic shapes when it encounters a new operator configuration. The tuning results are saved by default in a database located at ~/.cache/bitblas, so the next time layer 0 is initialized the operator loads directly from the database, bypassing the ~5-minute wait.

And feel free to open a pull request to BitBLAS with the multi-GPU detection patch!
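
To make the caching behavior concrete, here is a rough sketch based on the public bitblas.Matmul API (exact config fields may differ by version). The first construction of a given operator config triggers tuning and persists the result under ~/.cache/bitblas; an identical construction later loads from that database instead of recompiling:

```python
import bitblas

# First time this exact config is seen, BitBLAS tunes and compiles the kernel
# (the multi-minute wait) and stores the result in ~/.cache/bitblas.
config = bitblas.MatmulConfig(
    M=1,        # single-token decode shape
    N=5632,     # illustrative TinyLlama MLP dimensions
    K=2048,
    A_dtype="float16",
    W_dtype="int4",
    accum_dtype="float16",
    out_dtype="float16",
    layout="nt",
    with_bias=False,
    group_size=128,
    with_scaling=True,
    with_zeros=True,
    zeros_mode="quantized",
)
matmul = bitblas.Matmul(config=config)

# A later process constructing an operator with the same config should load the
# tuned kernel straight from the database and skip compilation entirely.
```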

@Qubitium (Collaborator)

> And feel free to open a pull request to BitBLAS with the multi-GPU detection patch!

PR created: microsoft/BitBLAS#58

We will run some benchmarks on a quantized Llama 7/8B model. If BitBLAS beats out Marlin and Exllama v1/v2, we will use it as the default =)
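
A rough sketch of the kind of throughput comparison we have in mind (not the actual harness; the backend-selection kwarg shown here is an assumption, and the real loader parameter may be named differently):

```python
import time

import torch
from transformers import AutoTokenizer
from gptqmodel import GPTQModel


def tokens_per_second(model, tokenizer, prompt: str, new_tokens: int = 128) -> float:
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")
    torch.cuda.synchronize()
    start = time.perf_counter()
    model.generate(input_ids, max_new_tokens=new_tokens, do_sample=False)
    torch.cuda.synchronize()
    return new_tokens / (time.perf_counter() - start)


model_id = "TinyLlama-1.1B-gptq-4bit"  # placeholder path to a quantized checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
for backend in ("marlin", "exllama_v2", "bitblas"):
    model = GPTQModel.from_quantized(model_id, backend=backend)  # assumed kwarg
    print(backend, tokens_per_second(model, tokenizer, "The capital of France is"))
```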

@Qubitium (Collaborator)

Status update: Basic inference tests passed. Now we are moving to benchmarks.

@Qubitium marked this pull request as ready for review on Jun 24, 2024
@Qubitium merged commit 22c0276 into main on Jun 25, 2024 (2 of 3 checks passed)
@Qubitium deleted the zx_support_BitBLAS branch on Jun 25, 2024
@LeiWang1999 (Contributor)

Thanks for all of your hard efforts! @Qubitium @ZX-ModelCloud

DeJoker pushed a commit to DeJoker/GPTQModel that referenced this pull request on Jul 19, 2024:
* Fix model loader code that was using a bad default of float16

* Try to fix from_quantized error

* Set dtype to fp16 in inference mode
DeJoker pushed a commit to DeJoker/GPTQModel that referenced this pull request on Jul 19, 2024:
* support bitblas.

* Add use_bitblas flag to load_model_tokenizer function and main function

* Fix bitblas backend initialization and matrix multiplication

* Remove print statement and save quantized model

* Fix bitblas backend initialization and matrix multiplication

* BitBLAS Support

* revert example

* cleanup

* add apache

* clean

* cleanup

* clean hip

* msg

* fix package name

* monkey patch auto_detect_nvidia_target()

* Fix safe_save code that was mistakenly removed

* pass GPU ID

* fix wrong FORMAT state set at end of conversion + remove compat and replace print with logger.info

* bitblas will be compiled by default

* add SUPPORTED_BITS and SUPPORTED_GROUP_SIZES to BaseQuantLinear so we can consolidate validation code

* Remove impossible code

* refactor structure and make sure params are passed in the same order

* fix pg509-232 not mapped to a100

* resolve commit issues

* resolve commit issues

* refactor

* resolve commit

* lock pci id order for tests

* fix issue: convert_to_marlin() missing 2 required positional arguments: 'sym' and 'desc_act'

* resolve test

* Fix if statement

* accelerate.utils.modeling.load_checkpoint_in_model() is no longer called when use_bitblas is True

* add refactor todo lists

* If we use marlin or bitblas to load the quantized model, the model is already a converted model, and we no longer need to call load_checkpoint_in_model()

* Now supports saving quant models in BitBLAS format

* cleanup

* add test_q4_bitblas.py

* Fix unpacked_zeros calculation error

* add FORMAT.BITBLAS

* cleanup

* BITBLAS only supported when desc_act is False

---------

Co-authored-by: LeiWang <leiwang1999@outlook.com>
Co-authored-by: diegomontoya <xing@fictionpress.com>
Co-authored-by: Qubitium <Qubitium@modelcloud.ai>