JIT compilation #170

yzh119 · 2024-03-11T02:23:42Z

As the number of shape and configuration combinations grows, our pip wheel gets larger and compilation takes longer.

PyTorch supports Just-In-Time compilation of extensions (https://pytorch.org/tutorials/advanced/cpp_extension.html#jit-compiling-extensions), which makes it possible to compile only the kernels for the configurations/shapes that are actually used, reducing both the wheel size and the development overhead on the codebase.

We can release a flashinfer_jit wheel where all kernels are compiled with JIT.
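To make the proposal concrete, here is a minimal sketch of per-configuration JIT loading with `torch.utils.cpp_extension.load`. The naming scheme, the `DTYPE`/`HEAD_DIM` macros, and the `load_kernel` helper are assumptions made for illustration, not flashinfer's actual layout; only `load` and its `name`, `sources`, `extra_cuda_cflags`, and `verbose` arguments are PyTorch's.

```python
from __future__ import annotations

from pathlib import Path


def jit_module_name(dtype: str, head_dim: int) -> str:
    """Build a distinct extension name per configuration (hypothetical scheme)."""
    return f"attn_{dtype}_hd{head_dim}"


def jit_flags(dtype: str, head_dim: int) -> list[str]:
    """Macro definitions that specialize the kernel source (hypothetical macros)."""
    return [f"-DDTYPE={dtype}", f"-DHEAD_DIM={head_dim}"]


def load_kernel(source: Path, dtype: str, head_dim: int):
    """Compile the kernel for one configuration, or reuse a previous build."""
    # Imported lazily so the helpers above work without PyTorch installed.
    from torch.utils.cpp_extension import load

    return load(
        name=jit_module_name(dtype, head_dim),
        sources=[str(source)],
        extra_cuda_cflags=jit_flags(dtype, head_dim),
        verbose=True,
    )
```

Because each configuration gets its own extension name, PyTorch keeps a separate build directory per configuration and skips recompilation when the sources are unchanged, so only the configurations a user actually calls are ever compiled.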


Qubitium · 2024-05-16T03:45:14Z

For reference, the following CI command on main takes about 142 minutes to compile on a 32-core Zen 3 4 GHz server container.

# time FLASHINFER_BUILD_VERSION="999.0.4+cu124torch2.3_gpPadded8_v4_arch8x" FLASHINFER_GROUP_SIZES="1,4,5,6,7,8"  TORCH_CUDA_ARCH_LIST="8.0 8.9" python -m build --no-isolation

real    141m49.680s
user    2389m59.725s
sys     200m31.997s

Without flashinfer_jit, the only way to bring flashinfer whl compilation down to a reasonable timeframe is to lock group_size to the single value needed by the intended model; that cut compilation steps/time by ~8x in my tests.

Changing TORCH_CUDA_ARCH_LIST doesn't have much impact on speed.
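A rough way to see why locking the group size helps is to count kernel instantiations as the Cartesian product of the configuration axes. In the sketch below only the group sizes come from the command above; the head dimensions and dtypes are hypothetical placeholders, and the count says nothing about how long each instantiation takes to compile.

```python
from itertools import product

# Group sizes taken from FLASHINFER_GROUP_SIZES in the command above.
GROUP_SIZES = [1, 4, 5, 6, 7, 8]
# Hypothetical extra axes, for illustration only.
HEAD_DIMS = [64, 128, 256]
DTYPES = ["f16", "bf16"]


def count_instantiations(group_sizes, head_dims, dtypes):
    """Count one kernel instantiation per (group size, head dim, dtype) tuple."""
    return len(list(product(group_sizes, head_dims, dtypes)))


full = count_instantiations(GROUP_SIZES, HEAD_DIMS, DTYPES)
locked = count_instantiations([8], HEAD_DIMS, DTYPES)
print(full, locked, full // locked)
```

The instantiation count shrinks by the number of group sizes dropped. The measured build-time saving can differ, since individual instantiations do not all cost the same to compile.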

This PR implements JIT compilation (#170) for flashinfer. After this PR, flashinfer compiles kernels just-in-time for different input data types and shapes and caches the compiled kernels on disk, instead of pre-compiling a fixed set of kernels into the wheel.

# Motivation

The pip wheel size is exploding as we add support for more data types, more head dimensions, more attention variants, and more kernel implementations. Pre-compiling everything is not sustainable and impedes development speed.

This PR refactors the codebase to use torch's [JIT Compiling Extensions](https://pytorch.org/tutorials/advanced/cpp_extension.html#jit-compiling-extensions) feature instead of pre-compiling kernels in the wheel.

## Attention Variants

We learned from [FlexAttention](https://pytorch.org/blog/flexattention/) and describe every attention variant as a template struct; each instance of the struct can carry closure variables defined in local memory or shared memory. Below are two examples (logits soft cap and ALiBi attention; the programming interface is tentative and will be updated as we improve the programmability of the JIT template):

```cuda
template <typename ParamsT>
struct LogitsSoftCap {
  using DTypeQ = typename ParamsT::DTypeQ;
  using DTypeKV = typename ParamsT::DTypeKV;
  using DTypeO = typename ParamsT::DTypeO;

  // Per-request closure variables, filled in by the constructor.
  uint32_t qo_len, kv_len;
  uint32_t window_left;

  __device__ __host__ LogitsSoftCap(const ParamsT& params, uint32_t batch_idx, uint8_t* smem_ptr) {
    qo_len = params.get_qo_len(batch_idx);
    kv_len = params.get_kv_len(batch_idx);
    window_left = kv_len;
  }

  // Applied to each query element before the QK product.
  template <typename T>
  __device__ __forceinline__ T QueryTransform(const ParamsT& params, T q) {
    return float(q) * params.sm_scale * math::ptx_rcp(params.logits_soft_cap);
  }

  // Applied to each attention logit.
  template <typename T>
  __device__ __forceinline__ T LogitsTransform(const ParamsT& params, T logits, uint32_t batch_idx,
                                               uint32_t qo_idx, uint32_t kv_idx,
                                               uint32_t qo_head_idx, uint32_t kv_head_idx) {
    return params.logits_soft_cap * math::log2e * float(math::tanh(logits));
  }

  // Returns true for every position, so no logit is masked out.
  __device__ __forceinline__ bool LogitsMask(const ParamsT& params, uint32_t batch_idx,
                                             uint32_t qo_idx, uint32_t kv_idx,
                                             uint32_t qo_head_idx, uint32_t kv_head_idx) {
    return true;
  }
};

template <typename ParamsT>
struct ALIBIAttention {
  using DTypeQ = typename ParamsT::DTypeQ;
  using DTypeKV = typename ParamsT::DTypeKV;
  using DTypeO = typename ParamsT::DTypeO;
  using IdType = typename ParamsT::IdType;

  uint32_t qo_len, kv_len;
  uint32_t window_left;

  __device__ __host__ ALIBIAttention(const ParamsT& params, uint32_t batch_idx, uint8_t* smem_ptr) {
    qo_len = params.get_qo_len(batch_idx);
    kv_len = params.get_kv_len(batch_idx);
    window_left = kv_len;
  }

  template <typename T>
  __device__ __forceinline__ T QueryTransform(const ParamsT& params, T q) {
    return float(q) * params.sm_scale * math::log2e;
  }

  // Adds the per-head ALiBi bias, proportional to the signed distance kv_idx - qo_idx.
  template <typename T>
  __device__ __forceinline__ T LogitsTransform(const ParamsT& params, T logits, uint32_t batch_idx,
                                               uint32_t qo_idx, uint32_t kv_idx,
                                               uint32_t qo_head_idx, uint32_t kv_head_idx) {
    return logits + params.alibi_slopes[qo_head_idx] * float(int(kv_idx) - int(qo_idx));
  }

  __device__ __forceinline__ bool LogitsMask(const ParamsT& params, uint32_t batch_idx,
                                             uint32_t qo_idx, uint32_t kv_idx,
                                             uint32_t qo_head_idx, uint32_t kv_head_idx) {
    return true;
  }
};
```

Users can define their own `ParamsT` class and variant classes to implement their own attention variants. We hope this refactor will make the codebase more concise and extensible.

# Roadmap

After this PR, we will add support for:

1. PyPI wheels #153
2. fp8 tensor cores attention: #502
3. different head dimensions: #142 #454 #455
4. flashattention3 #369
5. multi-head latent attention #237
6. Generate ParamsT and Attention variants description from python dsl

The development of these features has been blocked by the wheel size limit (a binary size >= 2GB triggers linking issues). I hope this PR will make development easier in the future.
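As a host-side sanity check, the arithmetic of the two variant structs in the PR description can be mirrored in plain Python. This is a scalar reference sketch, not part of flashinfer: the function names are hypothetical, `math::log2e` is assumed to be the constant log2(e), and `math::ptx_rcp` is treated as an exact reciprocal.

```python
import math

# Assumption: math::log2e in the CUDA snippet is the constant log2(e).
LOG2E = math.log2(math.e)


def soft_cap_query_transform(q, sm_scale, logits_soft_cap):
    """Mirror LogitsSoftCap::QueryTransform: scale q by sm_scale / cap."""
    return q * sm_scale * (1.0 / logits_soft_cap)


def soft_cap_logits_transform(logits, logits_soft_cap):
    """Mirror LogitsSoftCap::LogitsTransform: cap * log2(e) * tanh(logits)."""
    return logits_soft_cap * LOG2E * math.tanh(logits)


def alibi_query_transform(q, sm_scale):
    """Mirror ALIBIAttention::QueryTransform: scale q by sm_scale * log2(e)."""
    return q * sm_scale * LOG2E


def alibi_logits_transform(logits, alibi_slopes, qo_head_idx, qo_idx, kv_idx):
    """Mirror ALIBIAttention::LogitsTransform: add slope * (kv_idx - qo_idx)."""
    return logits + alibi_slopes[qo_head_idx] * float(kv_idx - qo_idx)
```

For example, `soft_cap_logits_transform` stays within `logits_soft_cap * LOG2E` in magnitude however large the input logit is, and `alibi_logits_transform` lowers the logit for keys that lie further behind the query position when the slope is positive.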

yzh119 added the priority: high label Mar 11, 2024

yzh119 mentioned this issue Sep 25, 2024

feat: JIT compilation #507

Merged

yzh119 mentioned this issue Oct 9, 2024

Will AOT compilation still be supported after JIT compilation is added? #510

Closed

yzh119 closed this as completed Oct 11, 2024
