BesTLA

BesTLA is a lightweight, header-only acceleration library for high-performance GEMM and related computations on Intel platforms. Inspired by CUTLASS, it provides high-level template class abstractions for the elements a computation requires, and lets users construct kernels flexibly by combining templates to meet specific needs while maximizing reuse of existing template classes. Users can also develop custom template classes to extend BesTLA’s computational capabilities. BesTLA includes several types of template classes:

Launcher: Schedules computation-related template classes, allowing users to specify their own computation-related template classes, including GemmCore, Prologue, and Epilogue.
Parallel: Specifies data splitting strategy for task distribution among different cores. BesTLA’s default Parallel template class adopts an L2-cache-fusion concept, i.e., each core tries to temporarily store the data it processes in its L2-cache during each round of gemm-tile computation.
GemmCore: A computation-related template class that provides a micro-kernel for performing a tile gemm computation with a specific ISA. It is the most important template class in BesTLA. Currently, GemmCore supports the following ISAs:
- AVX2: sgemm, u8s8 igemm
- AVX_VNNI: u8s8&s8s8 igemm
- AVX512F: sgemm
- AVX512BW: u8s8 igemm
- AVX512_VNNI: u8s8 igemm
- AMX_BF16: bf16 hgemm
- AMX_INT8: s8s8&u8s8&u8u8&s8u8 igemm
- AVX512_FP16: fp16 hgemm
Prologue: A computation-related template class that preprocesses input data (for example, data type conversion or padding) to meet GemmCore’s input requirements.
Epilogue: A computation-related template class that post-processes the results of gemm-core computations (for example, eltwiseop-fusion) to expand BesTLA’s application scenarios.

BesTLA lets users configure the threading library used for multi-core parallelism (e.g., OpenMP), which makes it easier to integrate BesTLA into their own projects. BesTLA also supports specifying the number of computing threads at runtime, making the allocation of computing resources more flexible.
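
The way a Launcher combines a Prologue, a GemmCore, and an Epilogue can be pictured with the minimal sketch below. Every name in it (PassThroughPrologue, ScalarCore, ReluEpilogue, SketchLauncher) is hypothetical and chosen for illustration only; none of them is BesTLA's actual API, and the scalar loop stands in for an ISA-specific micro-kernel.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical prologue: loads one input element. A real prologue might
// convert data types or pad; this one passes the value through unchanged.
struct PassThroughPrologue {
  static float load(float v) { return v; }
};

// Hypothetical gemm core: scalar reference kernel for one output element,
// C[m][col] = sum over k of A[m][k] * B[k][col].
struct ScalarCore {
  template <class Prologue>
  static float compute(const float* a_row, const float* b, std::size_t col,
                       std::size_t K, std::size_t N) {
    float acc = 0.0f;
    for (std::size_t k = 0; k < K; ++k) {
      acc += Prologue::load(a_row[k]) * Prologue::load(b[k * N + col]);
    }
    return acc;
  }
};

// Hypothetical epilogue: applies ReLU while the accumulator is stored.
struct ReluEpilogue {
  static float store(float acc) { return std::max(acc, 0.0f); }
};

// Hypothetical launcher: schedules the three pieces chosen at compile time.
template <class Prologue, class Core, class Epilogue>
struct SketchLauncher {
  static std::vector<float> run(const std::vector<float>& A,
                                const std::vector<float>& B, std::size_t M,
                                std::size_t N, std::size_t K) {
    std::vector<float> C(M * N);
    for (std::size_t m = 0; m < M; ++m) {
      for (std::size_t n = 0; n < N; ++n) {
        const float acc = Core::template compute<Prologue>(
            A.data() + m * K, B.data(), n, K, N);
        C[m * N + n] = Epilogue::store(acc);
      }
    }
    return C;
  }
};
```

Swapping ReluEpilogue for another type with the same static store function changes the post-processing without touching the core or the launcher, which is the kind of reuse the template design aims for.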

Highlights

Weight-only

BesTLA provides weight-only linear computation capabilities for LLM inference. We provide a series of Prologues that quantize, compress, serialize, and deserialize fp32 weights in different ways. The supported weight-only-quantization configs are given in the table below:

Weight dtype	Compute dtype	Scale dtype	algo
INT8	INT8 / BF16 / FP32	BF16 / FP32	sym / asym
INT4	INT8 / BF16 / FP32	BF16 / FP32	sym / asym
INT3	INT8 / BF16 / FP32	BF16 / FP32	sym / asym
INT2	INT8 / BF16 / FP32	BF16 / FP32	sym / asym
INT5	INT8 / BF16 / FP32	BF16 / FP32	sym / asym
INT6	INT8 / BF16 / FP32	BF16 / FP32	sym / asym
INT7	INT8 / BF16 / FP32	BF16 / FP32	sym / asym
INT1	INT8 / BF16 / FP32	BF16 / FP32	sym / asym
FP8 (E4M3, E5M2)	BF16 / FP32	FP32 / FP8 (E8M0)	sym
FP4 (E2M1)	BF16 / FP32	BF16 / FP32	sym
NF4	BF16 / FP32	BF16 / FP32	sym

The table columns are described below:

Config	Description
Weight dtype	Data type of quantized weight
Compute dtype	Data type of BesTLA internal Gemm computation
Scale dtype	Data type of scales
algo	Quantization algorithm to use (symmetric/asymmetric)
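
To make the sym and asym entries concrete, the sketch below quantizes one group of fp32 weights to a signed range (symmetric, zero point fixed at 0) or an unsigned range (asymmetric, with a zero point). The names QuantizedGroup, quantize_sym, quantize_asym, and dequantize are hypothetical and not part of BesTLA. The scale and zero-point formulas are the common textbook ones and are an assumption here; BesTLA's Prologues may compute and pack them differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical container for one quantized group. Values are kept unpacked
// as int for clarity; a real implementation would pack them into bits.
struct QuantizedGroup {
  std::vector<int> q;
  float scale = 1.0f;
  int zero_point = 0;
};

// Symmetric: signed range [-2^(bits-1), 2^(bits-1)-1], zero point is 0.
inline QuantizedGroup quantize_sym(const std::vector<float>& w, int bits) {
  assert(bits >= 2 && bits <= 8);
  const int qmax = (1 << (bits - 1)) - 1;
  const int qmin = -qmax - 1;
  float absmax = 0.0f;
  for (float v : w) absmax = std::max(absmax, std::fabs(v));
  QuantizedGroup g;
  g.scale = absmax > 0.0f ? absmax / static_cast<float>(qmax) : 1.0f;
  for (float v : w) {
    const int q = static_cast<int>(std::lround(v / g.scale));
    g.q.push_back(std::min(std::max(q, qmin), qmax));
  }
  return g;
}

// Asymmetric: unsigned range [0, 2^bits-1] plus a zero point, so the full
// range is used even when the weights are not centered on zero.
inline QuantizedGroup quantize_asym(const std::vector<float>& w, int bits) {
  assert(!w.empty() && bits >= 1 && bits <= 8);
  const int qmax = (1 << bits) - 1;
  const float lo = *std::min_element(w.begin(), w.end());
  const float hi = *std::max_element(w.begin(), w.end());
  QuantizedGroup g;
  g.scale = hi > lo ? (hi - lo) / static_cast<float>(qmax) : 1.0f;
  g.zero_point = static_cast<int>(std::lround(-lo / g.scale));
  for (float v : w) {
    const int q = static_cast<int>(std::lround(v / g.scale)) + g.zero_point;
    g.q.push_back(std::min(std::max(q, 0), qmax));
  }
  return g;
}

// Recover an approximate fp32 value from element i of a quantized group.
inline float dequantize(const QuantizedGroup& g, std::size_t i) {
  return static_cast<float>(g.q[i] - g.zero_point) * g.scale;
}
```

For a 4-bit group, quantize_sym maps the largest magnitude to 7, while quantize_asym maps the minimum to 0 and the maximum to 15.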

Postop-fusion

BesTLA provides assembly-level postop-fusion through the Epilogue to minimize the overhead caused by data movement. Specifically, the following postop-fusions are supported:

GELU
SWISH
RELU
EXP
TANH
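
For reference, the listed post-ops can be written as scalar functions and applied while the accumulator is stored, as in the sketch below. The names PostOp, apply_postop, and store_with_postop are hypothetical, and the exact erf-based GELU is an assumption; BesTLA's assembly-level kernels may use a different formulation or approximation.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>

// Hypothetical tag for the post-ops listed above.
enum class PostOp { GELU, SWISH, RELU, EXP, TANH };

// Scalar reference definition of each post-op.
inline float apply_postop(PostOp op, float x) {
  switch (op) {
    case PostOp::GELU:  // exact form: x * Phi(x)
      return 0.5f * x * (1.0f + std::erf(x / std::sqrt(2.0f)));
    case PostOp::SWISH:  // x * sigmoid(x)
      return x / (1.0f + std::exp(-x));
    case PostOp::RELU:
      return std::max(x, 0.0f);
    case PostOp::EXP:
      return std::exp(x);
    case PostOp::TANH:
      return std::tanh(x);
  }
  return x;
}

// Fused store: the post-op is applied as each accumulator value is written
// to the destination, so the output is not read back for a second pass.
inline void store_with_postop(const float* acc, float* dst, std::size_t n,
                              PostOp op) {
  for (std::size_t i = 0; i < n; ++i) {
    dst[i] = apply_postop(op, acc[i]);
  }
}
```

Applying the function during the store is what saves data movement: an unfused version would write the raw GEMM result to memory and then load it again to apply the activation.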

Optimized thread pool for hybrid CPUs

Our thread pool is optimized for hybrid CPUs (client CPUs newer than 11th Gen Intel Core). On these CPUs it is much faster than the OpenMP thread pool. We recommend using all threads and cores on a hybrid CPU, i.e., a thread count of P cores * 2 + E cores.
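
The recommended thread count is simple arithmetic, sketched below. The helper name recommended_thread_count is hypothetical, and the formula assumes each P core exposes two hardware threads and each E core exposes one, which is what "P cores * 2 + E cores" implies.

```cpp
#include <cassert>

// Hypothetical helper: thread count that uses every hardware thread on a
// hybrid CPU, assuming 2 threads per P core and 1 thread per E core.
inline int recommended_thread_count(int p_cores, int e_cores) {
  assert(p_cores >= 0 && e_cores >= 0);
  return p_cores * 2 + e_cores;
}
```

For example, a CPU with 8 P cores and 8 E cores gives 8 * 2 + 8 = 24 threads.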

Compilation Requirements and Usage

Compile:

GCC version >= 9.0
CMake version >= 3.12
MSVC version >= 1900
oneAPI version >= 2024.0

Best Performance:

GCC >= 11.0.0
MSVC >= 1930
DPCPP >= 2024.0

Usage:

add_subdirectory(bestla)
target_link_libraries("${YOUR_PROJECT}" neural_speed::bestla)

Benchmark

Build with:

mkdir build
cd build
cmake .. -DBTLA_UT_BENCHMARK=ON -DBTLA_UT_ALL=ON -DCMAKE_BUILD_TYPE=Release
cmake --build . -j
./bin/bestla_benchmark

For more template usage examples, refer to the code in bestla_benchmark.
