
BitsandBytes Enablement on ROCm #1207

Conversation

pnunna93
Contributor

Overview

This PR introduces bitsandbytes enablement on ROCm for AMD GPUs. It adds hipified versions of the CUDA kernels and ops, allowing bitsandbytes API function calls to be routed to optimized HIP kernels on AMD GPUs.

In the multi-backend-refactor branch, there is a proposal to separate the various backends to support multiple GPUs/accelerators. The core of bitsandbytes is built on top of PyTorch and selects the API implementation for a given GPU/accelerator based on the device_type of the tensor, as highlighted here. PyTorch on ROCm reports AMD GPUs under the cuda device type, so applications run seamlessly without any code changes. Hence, this PR updates the cuda backend in bitsandbytes to enable its functionality on ROCm for AMD GPUs. It also adds support for ROCm in the CMake build and enables key bitsandbytes functionality on AMD GPUs.
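To illustrate the dispatch described above, here is a minimal, hypothetical sketch (the function name is illustrative, not the actual bitsandbytes API) of routing on the tensor's device type; because ROCm builds of PyTorch expose AMD GPUs as "cuda" devices, a single branch covers both vendors:

```python
# Hypothetical sketch of device-type dispatch; names are illustrative,
# not the real bitsandbytes internals.
def select_backend(device_type: str) -> str:
    # ROCm builds of PyTorch report AMD GPUs as device type "cuda",
    # so this one branch serves both NVIDIA (CUDA kernels) and
    # AMD (hipified kernels) without application-code changes.
    if device_type == "cuda":
        return "cuda"
    return "cpu"

print(select_backend("cuda"))  # cuda (NVIDIA or AMD GPU tensor)
print(select_backend("cpu"))   # cpu  (fallback path)
```

This is why the PR can reuse the existing cuda backend rather than adding a parallel "rocm" backend: the dispatch key is identical on both platforms.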

Summary of Changes

  • Updated CUDA backend to work seamlessly on ROCm

  • Integrated HIP environment into bitsandbytes through hipified versions of CUDA kernels and ops

  • CMake build updates for ROCm

  • Enabled key features in the bitsandbytes functional and autograd APIs

Impact

This PR enables building and supporting bitsandbytes on ROCm for AMD GPUs. Bitsandbytes users can port applications smoothly to AMD GPUs, as it requires minimal changes on their end. In addition, it ensures that the ROCm changes do not affect the CUDA environment, leaving existing CUDA users unaffected.

CC: @Titus-von-Koeller @matthewdouglas @arlo-phoenix

@Titus-von-Koeller Titus-von-Koeller self-assigned this May 13, 2024
Comment on lines 447 to 448
if blocksize is None:
blocksize = 64 if not HIP_ENVIRONMENT else 128
Member

Is there a short explanation we can add here to explain why this is the default, and likewise below why 64 is not supported?

Contributor Author

It's because of the warp-size difference between AMD and NVIDIA GPUs. I have added comments - 410f499
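For context, the default-blocksize logic under discussion can be sketched as follows. The doubling reflects that NVIDIA warps are 32 threads wide while AMD GPUs typically execute 64-wide wavefronts, so the smallest quantization blocksize that maps cleanly onto a wavefront doubles on ROCm (this sketch mirrors the snippet quoted above; `HIP_ENVIRONMENT` is detected at import time in bitsandbytes):

```python
# Sketch of the blocksize default from the review comment above.
# NVIDIA: 32-thread warps -> minimum blocksize 64.
# AMD:    64-wide wavefronts -> minimum blocksize 128.
HIP_ENVIRONMENT = False  # in bitsandbytes this is set when built for ROCm

def default_blocksize(blocksize=None, hip=HIP_ENVIRONMENT):
    if blocksize is None:
        blocksize = 64 if not hip else 128
    return blocksize

print(default_blocksize())          # 64  (CUDA default)
print(default_blocksize(hip=True))  # 128 (ROCm default)
```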

@tpimh

tpimh commented May 15, 2024

Issue #149: can Intel Arc GPUs be supported in a similar manner?

@matthewdouglas
Member

> Issue #149: can Intel Arc GPUs be supported in a similar manner?

@tpimh There's separate work in progress for Intel. So far there's been work on CPU with IPEX (#1178, #1206) and separately a SYCL port: #747.

@tpimh

tpimh commented May 16, 2024

Thanks! This looks promising.

I will try on both AMD and Intel Arc.

@Titus-von-Koeller
Collaborator

Dear @pnunna93,

thanks to you and your team for the amazing work. We're super excited about this and I'm very happy with what I'm seeing at an initial superficial review.

It would be great to have the AMD runner available relatively soon; otherwise it remains quite messy and work-intensive to keep track of the correctness of the various backend implementations. Please let me know what I can do to help and I'll make sure to pull the right strings.

Regarding the review, as communicated in Slack, I first have to focus on wrapping up my deep dive evaluating tensor-driven dispatch via integration with the PyTorch dispatcher through the torch.library APIs. I don't see any reason not to merge your PR, but I need to take another thorough look. I also think it would be helpful for everyone to have clarity on the backend abstraction / dispatch mechanism as soon as possible, so I am prioritizing that; everyone can then refactor their code to account for it.

In that context, one important question came up:

Our paged optimizers use CUDA unified memory, as described in detail here.

Is that feature available on ROCm devices in one way or another? This would be quite important to understand for my analysis, as the handling of unified memory in relation to PyTorch is one of my last open questions. It's quite a special case: it's a cornerstone of preventing OOMs in low-resource environments -- a key feature for our user group -- and is not implemented or accounted for in PyTorch. We therefore use that feature directly through the CUDA APIs; the underlying CUDA function is cudaMemPrefetchAsync, AFAICT.
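For reference, HIP does document direct analogues of these entry points (hipMallocManaged and hipMemPrefetchAsync in the HIP runtime API), though whether they fully cover the paged-optimizer use case on a given AMD GPU would need confirmation from the ROCm side. A minimal, hedged sketch that merely probes for the HIP runtime and the prefetch symbol from Python (library name and availability are assumptions about a ROCm install):

```python
import ctypes

# Hedged sketch: HIP mirrors CUDA's unified-memory API, so the
# cudaMemPrefetchAsync call used by the paged optimizers would map to
# hipMemPrefetchAsync (and cudaMallocManaged to hipMallocManaged).
def load_hip_runtime():
    """Return the HIP runtime library if ROCm is installed, else None."""
    for name in ("libamdhip64.so", "libamdhip64.so.6"):
        try:
            return ctypes.CDLL(name)
        except OSError:
            continue
    return None

hip = load_hip_runtime()
if hip is not None:
    # hipMemPrefetchAsync(ptr, count, device, stream) -> hipError_t
    prefetch = getattr(hip, "hipMemPrefetchAsync", None)
    print("hipMemPrefetchAsync available:", prefetch is not None)
else:
    print("ROCm runtime not found; running on a non-ROCm system")
```

On a system without ROCm this simply reports that the runtime is absent; it performs no allocation or prefetch.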

Thanks 🤗 and congrats on the great work in this PR, we're super excited about this ❤️

@Titus-von-Koeller
Collaborator

Dear @pnunna93 et al,

Unfortunately we're (mostly me alone) quite resource-constrained and humbled by the workload associated with the multi-backend-refactor. I just talked with my colleague @younesbelkada about how best to handle the next steps.

We both took a look at this PR and the one from Intel, and at first glance everything looks really good. At this time, neither Younes nor I is in a position to give detailed feedback, and I need to focus on concretizing the path forward for integrating with the PyTorch dispatcher (tensor-driven dispatch, as requested) through the torch.library Python-level APIs. After extensive research and yesterday's consultation with three PyTorch devs at Meta who are experts on the topic, I need to focus on making this new input concrete.

However, for the purpose of iterative progress (as agreed in our prior conversations), we've decided to already go ahead and merge both the open Intel and AMD branches into multi-backend-refactor, where interested parties can then compile from source and give the new functionality (we're so excited and grateful about this!) a thorough testing.

Once we've made some progress on the torch.library-based refactor, I'll next focus on enabling nightly releases for that branch as well. We're also looking forward to your feedback on this torch.library / tensor-driven dispatch topic once the code is there as a basis for discussion (and for refactoring the backend-specific code towards that new target, once we all agree this is the right path).

Among other things, there's also been extensive ongoing work in the background on things like moving BNB to a new independent/non-profit GitHub org, under the umbrella of Hugging Face and with the support of their infra team for managing the complexities of the CI/CD backend and runners. Also, we're working to make GitHub runners for the different hardware platforms a reality (thanks for your help on that!).

Thanks again for the good work and active collaboration! ❤️ 🚀

@Titus-von-Koeller Titus-von-Koeller merged commit eb3b816 into bitsandbytes-foundation:multi-backend-refactor May 24, 2024
1 of 2 checks passed
@Titus-von-Koeller
Collaborator

Titus-von-Koeller commented May 24, 2024

P.S. Also see this: README: asking for help from volunteer alpha testers

Let us know if you have further thoughts on this and how you think it's best to communicate about this.

@pnunna93 pnunna93 mentioned this pull request Jul 8, 2024