Stub out additional backends #1173

Merged
Changes from all commits

29 commits
67e7ee3
first draft
stevhliu Mar 26, 2024
e3376ab
toctree
stevhliu Mar 26, 2024
c17fb8e
Fix 4bit quantization with blocksize=4096
matthewdouglas Mar 29, 2024
a471456
fix formatting for install_cuda.py
matthewdouglas Mar 29, 2024
494de20
Bump the minor-patch group with 1 update (#1162)
dependabot[bot] Apr 2, 2024
bed0860
Tests: improve memory usage (#1147)
matthewdouglas Apr 2, 2024
2965c76
CHANGELOG.md: mention accuracy changes when quantizing post v0.42
Titus-von-Koeller Apr 2, 2024
76885a4
Merge pull request #1160 from matthewdouglas/quant4bit-blocksize4096
Titus-von-Koeller Apr 2, 2024
bfe2118
README: include download badges
Titus-von-Koeller Apr 4, 2024
0c64a0d
Merge pull request #1148 from stevhliu/fsdp-qlora
Titus-von-Koeller Apr 5, 2024
b2a85a4
Update matplotlib requirement from ~=3.8.3 to ~=3.8.4 in the major group
dependabot[bot] Apr 8, 2024
c0ad874
Build workflow: Add CUDA 12.4 to build matrix
matthewdouglas Apr 8, 2024
ebac862
Exclude Windows from CUDA 12.4.0 build for now
matthewdouglas Apr 8, 2024
af9a073
Merge pull request #1171 from matthewdouglas/build-cu124
Titus-von-Koeller Apr 9, 2024
0c887b7
Merge pull request #1169 from TimDettmers/dependabot/pip/major-45b123…
Titus-von-Koeller Apr 9, 2024
6be3d0f
[docs] Install from source (#1149)
stevhliu Apr 9, 2024
c54053d
Bump scipy from 1.12.0 to 1.13.0 in the minor-patch group (#1170)
dependabot[bot] Apr 9, 2024
7449d71
[`Core`] Change 8-bit serialization weight format (#1164)
younesbelkada Apr 10, 2024
d62516f
(backends) Stub out additional backends; move more functions to backe…
matthewdouglas Apr 11, 2024
4743ff0
CHANGELOG: to reverse chron order + mdformat
Titus-von-Koeller Apr 11, 2024
0c33c0d
ignore CHANGELOG reordering + formatting commit
Titus-von-Koeller Apr 11, 2024
f92c536
CHANGELOG: add v0.43.1
Titus-von-Koeller Apr 11, 2024
4a6fb35
bump version to 0.43.1
Titus-von-Koeller Apr 11, 2024
7b0c4cd
small fix in changelog
Titus-von-Koeller Apr 11, 2024
127788a
bump version to next dev
Titus-von-Koeller Apr 11, 2024
6cecb65
Update pandas requirement from ~=2.2.1 to ~=2.2.2 in the major group …
dependabot[bot] Apr 17, 2024
ffd7d0d
(docs) integrations: fix omission in bf16 related warning (#1183)
Titus-von-Koeller Apr 17, 2024
5b9ef77
Bump the minor-patch group with 2 updates (#1192)
dependabot[bot] Apr 30, 2024
7f13c8f
merge changes from main
Titus-von-Koeller May 3, 2024
3 changes: 3 additions & 0 deletions .git-blame-ignore-revs
@@ -12,3 +12,6 @@ ea7c14f8ef64924f2d0ff80df3cdabf2c7299848

# Reformat with ruff-format
5a4263f4dc05fe8f78f4111beab9f68a81deeab1

# CHANGELOG: to reverse chron order + mdformat
4743ff0d43e04e4cc3e5d8b9e7cd016c0defa36d
4 changes: 3 additions & 1 deletion .github/workflows/python-package.yml
@@ -63,10 +63,12 @@ jobs:
os: [ubuntu-latest, windows-latest]
arch: [x86_64, aarch64]
cuda_version:
["11.7.1", "11.8.0", "12.0.1", "12.1.1", "12.2.2", "12.3.2"]
["11.7.1", "11.8.0", "12.0.1", "12.1.1", "12.2.2", "12.3.2", "12.4.0"]
exclude:
- os: windows-latest # This probably requires arm64 Windows agents
arch: aarch64
- os: windows-latest # The Jimver/cuda-toolkit action used for Windows builds is not updated for 12.4 yet.
cuda_version: "12.4.0"
- os: ubuntu-latest # Temporary. Takes too long, not ready yet.
arch: aarch64
runs-on: ${{ matrix.os }} # One day, we could run them on native agents. Azure supports this now but it's planned only for Q3 2023 for hosted agents
511 changes: 291 additions & 220 deletions CHANGELOG.md

Large diffs are not rendered by default.

2 changes: 2 additions & 0 deletions README.md
@@ -1,5 +1,7 @@
# `bitsandbytes`

[![Downloads](https://static.pepy.tech/badge/bitsandbytes)](https://pepy.tech/project/bitsandbytes) [![Downloads](https://static.pepy.tech/badge/bitsandbytes/month)](https://pepy.tech/project/bitsandbytes) [![Downloads](https://static.pepy.tech/badge/bitsandbytes/week)](https://pepy.tech/project/bitsandbytes)

The `bitsandbytes` library is a lightweight Python wrapper around CUDA custom functions, in particular 8-bit optimizers, matrix multiplication (LLM.int8()), and 8 & 4-bit quantization functions.

The library includes quantization primitives for 8-bit & 4-bit operations, through `bitsandbytes.nn.Linear8bitLt` and `bitsandbytes.nn.Linear4bit` and 8-bit optimizers through `bitsandbytes.optim` module.
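For context on the modules named in the README excerpt above, here is a minimal usage sketch. It assumes a CUDA-capable install; the layer sizes, `threshold`, and learning rate are illustrative, not recommendations.

```python
# Minimal sketch of the 8-bit / 4-bit layers and the 8-bit optimizer mentioned above.
import torch
import bitsandbytes as bnb

# LLM.int8() inference layer; threshold > 0 keeps outlier features in fp16.
int8_linear = bnb.nn.Linear8bitLt(1024, 1024, bias=True, has_fp16_weights=False, threshold=6.0).cuda()

# 4-bit (NF4) layer, as used for QLoRA-style finetuning.
nf4_linear = bnb.nn.Linear4bit(1024, 1024, bias=True, quant_type="nf4", compute_dtype=torch.float16).cuda()

x = torch.randn(8, 1024, dtype=torch.float16, device="cuda")
y_int8 = int8_linear(x)
y_nf4 = nf4_linear(x)

# 8-bit Adam as a drop-in replacement for torch.optim.Adam.
model = torch.nn.Linear(1024, 1024).cuda()
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-4)
```

In both layer types, the `.cuda()` call is what triggers quantization of the loaded weights.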
48 changes: 42 additions & 6 deletions bitsandbytes/__init__.py
@@ -3,6 +3,8 @@
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

import torch

from . import research, utils
from .autograd._functions import (
MatmulLtState,
@@ -12,19 +12,53 @@
matmul_cublas,
mm_cublas,
)
from .backends import register_backend
from .backends.cpu import CPUBackend
from .cextension import lib
from .nn import modules

if lib and lib.compiled_with_cuda:
from .backends import register_backend
from .backends.cuda import CUDABackend
from .optim import adam
# Always register the CPU backend.
register_backend("cpu", CPUBackend())

# Register either CUDA or ROCm backend, if available.
# Only one of these backends can be used at a time, since the torch.device semantics are
# the same for both torch+rocm and torch+cuda (e.g. device name is "cuda")
if torch.cuda.is_available():
# TODO: Consider deferring loading of cextension - should backend class implement that?

if torch.version.cuda:
from .backends.cuda import CUDABackend

register_backend("cuda", CUDABackend())
elif torch.version.hip:
from .backends.rocm import ROCmBackend

register_backend("cuda", ROCmBackend())

# Register MPS backend, if available.
if torch.backends.mps.is_available() and torch.backends.mps.is_built():
from .backends.mps import MPSBackend

register_backend("mps", MPSBackend())

# Register Intel XPU backend, if available.
if hasattr(torch, "xpu") and torch.xpu.is_available():
from .backends.xpu import XPUBackend

register_backend("xpu", XPUBackend())

# TODO: Other potential backends:
# XLA - Google TPU / PJRT runtime
# HPU - Habana / Intel Gaudi
# IPU - Graphcore
# NPU - Ascend
# Note that we may not map 1:1 with a device type, e.g. SYCL, XLA
# In this case, it will be up to each backend to dispatch as needed

register_backend("cuda", CUDABackend())
__pdoc__ = {
"libbitsandbytes": False,
"optim.optimizer.Optimizer8bit": False,
"optim.optimizer.MockArgs": False,
}

__version__ = "0.44.0.dev"
__version__ = "0.43.2.dev"
185 changes: 162 additions & 23 deletions bitsandbytes/backends/base.py
@@ -1,5 +1,5 @@
from abc import ABC, abstractmethod
from typing import Optional, Tuple
from typing import Literal, Optional, Tuple, Union

import torch

@@ -12,48 +12,62 @@ class Backend(ABC):
@abstractmethod
def double_quant(
self,
A,
col_stats=None,
row_stats=None,
out_col=None,
out_row=None,
A: torch.Tensor,
col_stats: Optional[torch.Tensor] = None,
row_stats: Optional[torch.Tensor] = None,
out_col: Optional[torch.Tensor] = None,
out_row: Optional[torch.Tensor] = None,
threshold=0.0,
):
raise NotImplementedError

@abstractmethod
def transform(
self,
A,
to_order,
A: torch.Tensor,
to_order: str,
from_order="row",
out=None,
out: Optional[torch.Tensor] = None,
transpose=False,
state=None,
state: Optional[Tuple[torch.Size, str]] = None,
ld=None,
):
raise NotImplementedError

@abstractmethod
def igemmlt(self, A, B, SA, SB, out=None, Sout=None, dtype=torch.int32):
def igemmlt(
self,
A: torch.Tensor,
B: torch.Tensor,
SA: Tuple[torch.Size, str],
SB: Tuple[torch.Size, str],
out: Optional[torch.Tensor] = None,
Sout: Optional[Tuple[torch.Size, str]] = None,
dtype=torch.int32,
) -> Union[torch.Tensor, Tuple[Optional[Tuple[torch.Tensor, Tuple[torch.Size, str]]]]]:
raise NotImplementedError

@abstractmethod
def mm_dequant(
self,
A,
quant_state,
row_stats,
col_stats,
out=None,
new_row_stats=None,
new_col_stats=None,
bias=None,
):
A: torch.Tensor,
quant_state: Tuple[torch.Size, str],
row_stats: torch.Tensor,
col_stats: torch.Tensor,
out: Optional[torch.Tensor] = None,
new_row_stats: Optional[torch.Tensor] = None,
new_col_stats: Optional[torch.Tensor] = None,
bias: Optional[torch.Tensor] = None,
) -> torch.Tensor:
raise NotImplementedError

@abstractmethod
def extract_outliers(self, A, SA, idx):
def extract_outliers(
self,
A: torch.Tensor,
SA: Tuple[torch.Size, str],
idx: torch.Tensor,
) -> torch.Tensor:
raise NotImplementedError

@abstractmethod
@@ -64,7 +78,7 @@ def quantize_4bit(
out: Optional[torch.Tensor] = None,
blocksize=64,
compress_statistics=False,
quant_type="fp4",
quant_type: Literal["fp4", "nf4"] = "fp4",
quant_storage=torch.uint8,
) -> Tuple[torch.Tensor, QuantState]:
"""
@@ -102,7 +116,7 @@ def dequantize_4bit(
absmax: Optional[torch.Tensor] = None,
out: Optional[torch.Tensor] = None,
blocksize: int = 64,
quant_type="fp4",
quant_type: Literal["fp4", "nf4"] = "fp4",
) -> torch.Tensor:
"""
Dequantizes FP4 blockwise quantized values.
@@ -131,3 +145,128 @@ def dequantize_4bit(
Dequantized tensor.
"""
raise NotImplementedError

@abstractmethod
def gemv_4bit(
self,
A: torch.Tensor,
B: torch.Tensor,
out: Optional[torch.Tensor] = None,
transposed_A=False,
transposed_B=False,
state: QuantState = None,
) -> torch.Tensor:
raise NotImplementedError

@abstractmethod
def quantize_blockwise(
self,
A: torch.Tensor,
code: Optional[torch.Tensor] = None,
absmax: Optional[torch.Tensor] = None,
out: Optional[torch.Tensor] = None,
blocksize=4096,
nested=False,
) -> Tuple[torch.Tensor, QuantState]:
raise NotImplementedError

@abstractmethod
def dequantize_blockwise(
self,
A: torch.Tensor,
quant_state: Optional[QuantState] = None,
absmax: Optional[torch.Tensor] = None,
code: Optional[torch.Tensor] = None,
out: Optional[torch.Tensor] = None,
blocksize: int = 4096,
nested=False,
) -> torch.Tensor:
raise NotImplementedError

@abstractmethod
def optimizer_update_8bit_blockwise(
self,
optimizer_name: str,
g: torch.Tensor,
p: torch.Tensor,
state1: torch.Tensor,
state2: Optional[torch.Tensor],
beta1: float,
beta2: float,
eps: float,
step: int,
lr: float,
qmap1: torch.Tensor,
qmap2: Optional[torch.Tensor],
absmax1: torch.Tensor,
absmax2: Optional[torch.Tensor],
weight_decay: float = 0.0,
gnorm_scale: float = 1.0,
skip_zeros=False,
) -> None:
"""
Performs an in-place optimizer update with one or two optimizer states.

Args:
optimizer_name (`str`): The name of the optimizer, e.g. `adam`
g (`torch.Tensor`): Gradient tensor.
p (`torch.Tensor`): Parameter tensor.
state1 (`torch.Tensor`): Optimizer state 1.
state2 (`torch.Tensor`, optional): Optimizer state 2.
beta1 (`float`): Optimizer beta1.
beta2 (`float`): Optimizer beta2.
eps (`float`): Optimizer epsilon.
step (`int`): Current optimizer step.
lr (`float`): The learning rate.
qmap1 (`torch.Tensor`): Quantization map for the first state.
qmap2 (`torch.Tensor`, optional): Quantization map for the second state.
absmax1 (`torch.Tensor`): Max value for the first state update.
absmax2 (`torch.Tensor`, optional): Max value for the second state update.
weight_decay (`float`, optional): Weight decay. Defaults to 0.0.
gnorm_scale (`float`, optional): The factor to rescale the gradient to the max clip value. Defaults to 1.0.
skip_zeros (`bool`, optional): Whether to skip zero-valued gradients or not. Defaults to False.
"""
raise NotImplementedError

@abstractmethod
def optimizer_update_32bit(
self,
optimizer_name: str,
g: torch.Tensor,
p: torch.Tensor,
state1: torch.Tensor,
beta1: float,
eps: float,
step: int,
lr: float,
state2: Optional[torch.Tensor] = None,
beta2: float = 0.0,
weight_decay: float = 0.0,
gnorm_scale: float = 1.0,
unorm_vec: Optional[torch.Tensor] = None,
max_unorm: float = 0.0,
skip_zeros=False,
) -> None:
"""
Performs an in-place optimizer update with one or two optimizer states.

Universal optimizer update for 32-bit state and 32/16-bit gradients/weights

Args:
optimizer_name (`str`): The name of the optimizer, e.g. `adam`
g (`torch.Tensor`): Gradient tensor.
p (`torch.Tensor`): Parameter tensor.
state1 (`torch.Tensor`): Optimizer state 1.
beta1 (`float`): Optimizer beta1.
eps (`float`): Optimizer epsilon.
step (`int`): Current optimizer step.
lr (`float`): The learning rate.
state2 (`torch.Tensor`, optional): Optimizer state 2. Defaults to None.
beta2 (`float`, optional): Optimizer beta2. Defaults to 0.0.
weight_decay (`float`, optional): Defaults to 0.0.
gnorm_scale (`float`, optional): The factor to rescale the gradient to the max clip value. Defaults to 1.0.
unorm_vec (`torch.Tensor`, optional): The tensor for the update norm. Defaults to None.
max_unorm (`float`, optional): The maximum update norm relative to the weight norm.. Defaults to 0.0.
skip_zeros (`bool`, optional): Whether to skip zero-valued gradients or not. Defaults to False.
"""
raise NotImplementedError
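Putting the pieces together, a new device could be stubbed out in the same spirit as this PR: subclass `Backend`, override each abstract method (here they only raise `NotImplementedError`), and register an instance under the device's `torch.device` type. The `npu` device name and `NPUBackendStub` class are hypothetical and cover only the abstract methods visible in the diff above.

```python
# Illustrative sketch, not part of this PR: a do-nothing backend stub.
from bitsandbytes.backends import register_backend
from bitsandbytes.backends.base import Backend


class NPUBackendStub(Backend):
    """Every Backend method exists, but none are implemented yet."""

    def double_quant(self, *args, **kwargs):
        raise NotImplementedError
    def transform(self, *args, **kwargs):
        raise NotImplementedError
    def igemmlt(self, *args, **kwargs):
        raise NotImplementedError
    def mm_dequant(self, *args, **kwargs):
        raise NotImplementedError
    def extract_outliers(self, *args, **kwargs):
        raise NotImplementedError
    def quantize_4bit(self, *args, **kwargs):
        raise NotImplementedError
    def dequantize_4bit(self, *args, **kwargs):
        raise NotImplementedError
    def gemv_4bit(self, *args, **kwargs):
        raise NotImplementedError
    def quantize_blockwise(self, *args, **kwargs):
        raise NotImplementedError
    def dequantize_blockwise(self, *args, **kwargs):
        raise NotImplementedError
    def optimizer_update_8bit_blockwise(self, *args, **kwargs):
        raise NotImplementedError
    def optimizer_update_32bit(self, *args, **kwargs):
        raise NotImplementedError


# Instantiation succeeds because no abstract methods remain unimplemented.
register_backend("npu", NPUBackendStub())
```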