chore(main): release 0.2.0 (flashinfer-ai#476)
🤖 I have created a release *beep* *boop*

---

## [0.2.0](flashinfer-ai/flashinfer@v0.1.6...v0.2.0) (2024-12-17)

[Release Blog](https://flashinfer.ai/2024/12/16/flashinfer-v02-release.html).

### Features

* add `rotary_dim` argument to rope APIs for partial apply rope ([flashinfer-ai#599](flashinfer-ai#599)) ([eb9bc71](flashinfer-ai@eb9bc71))
* add a `use_softmax` field in variant class ([flashinfer-ai#533](flashinfer-ai#533)) ([d81af97](flashinfer-ai@d81af97))
* add an option `non_blocking` to plan function ([flashinfer-ai#622](flashinfer-ai#622)) ([560af6f](flashinfer-ai@560af6f))
* add gemma_rmsnorm and gemma_fused_add_rmsnorm ([flashinfer-ai#477](flashinfer-ai#477)) ([1a6b17e](flashinfer-ai@1a6b17e))
* add group size 3 to GQA decode dispatch ([flashinfer-ai#558](flashinfer-ai#558)) ([6227562](flashinfer-ai@6227562))
* add JIT compilation support for FA3 templates ([flashinfer-ai#672](flashinfer-ai#672)) ([d4e8d79](flashinfer-ai@d4e8d79))
* allow the cascade kernels to be executed using varying sequence lengths ([flashinfer-ai#627](flashinfer-ai#627)) ([92ac440](flashinfer-ai@92ac440))
* CUDAGraph compatibility of multi-level cascade inference APIs ([flashinfer-ai#586](flashinfer-ai#586)) ([2332e8a](flashinfer-ai@2332e8a))
* fix the maximal grid dimension in prefill planning with CUDA graphs ([flashinfer-ai#639](flashinfer-ai#639)) ([86ca89a](flashinfer-ai@86ca89a))
* improve the precision of the FusedAddRMSNormKernel function ([flashinfer-ai#587](flashinfer-ai#587)) ([c7dc921](flashinfer-ai@c7dc921))
* JIT compilation ([flashinfer-ai#507](flashinfer-ai#507)) ([3613a5b](flashinfer-ai@3613a5b))
* modify group-gemm stage number ([flashinfer-ai#497](flashinfer-ai#497)) ([52dab1d](flashinfer-ai@52dab1d))
* non-contiguous query with paged kv cache ([flashinfer-ai#553](flashinfer-ai#553)) ([89f2c4a](flashinfer-ai@89f2c4a))
* pass a dynamic token count to the cascade kernels ([flashinfer-ai#635](flashinfer-ai#635)) ([5fe9f7d](flashinfer-ai@5fe9f7d))
* simplify prefill JIT compilation ([flashinfer-ai#605](flashinfer-ai#605)) ([fe4f898](flashinfer-ai@fe4f898))
* specify gemm backend ([flashinfer-ai#648](flashinfer-ai#648)) ([0cc1a51](flashinfer-ai@0cc1a51))
* support cached cos/sin in rope APIs ([flashinfer-ai#585](flashinfer-ai#585)) ([83e541d](flashinfer-ai@83e541d))
* support huggingface transformer style rope interface ([flashinfer-ai#568](flashinfer-ai#568)) ([4f40420](flashinfer-ai@4f40420))
* support sm90 cutlass group gemm ([flashinfer-ai#509](flashinfer-ai#509)) ([794bdda](flashinfer-ai@794bdda))
* torch custom_op fix for rope ([flashinfer-ai#569](flashinfer-ai#569)) ([3e104bc](flashinfer-ai@3e104bc))
* torch custom_op support: norm ([flashinfer-ai#552](flashinfer-ai#552)) ([f6e0010](flashinfer-ai@f6e0010))
* torch.compile and custom_op support ([flashinfer-ai#554](flashinfer-ai#554)) ([9bf916f](flashinfer-ai@9bf916f))
* warmup for jit kernel tests ([flashinfer-ai#629](flashinfer-ai#629)) ([8f5f349](flashinfer-ai@8f5f349))

### Bug Fixes

* AOT compiler flags on non-sm90 ([flashinfer-ai#522](flashinfer-ai#522)) ([0aa4726](flashinfer-ai@0aa4726))
* batch decode kernel redundant store output to gmem ([flashinfer-ai#505](flashinfer-ai#505)) ([90e42a7](flashinfer-ai@90e42a7))
* compatible with torch 2.2 ([flashinfer-ai#478](flashinfer-ai#478)) ([ac41d1b](flashinfer-ai@ac41d1b))
* flashinfer-ai#452 ([b53a46f](flashinfer-ai@b53a46f))
* remove redundant load ([flashinfer-ai#495](flashinfer-ai#495)) ([2de16b0](flashinfer-ai@2de16b0))
* update bmm fp8 test ([flashinfer-ai#487](flashinfer-ai#487)) ([45eac04](flashinfer-ai@45eac04))

### Performance Improvements

* accelerate JIT compilation speed ([flashinfer-ai#618](flashinfer-ai#618)) ([eaf73fd](flashinfer-ai@eaf73fd))
* Dense and sparse customizable flashattention-3 template ([flashinfer-ai#667](flashinfer-ai#667)) ([51236c9](flashinfer-ai@51236c9))
* fix prefill kernel performance degradation (step 1) ([flashinfer-ai#602](flashinfer-ai#602)) ([595cf60](flashinfer-ai@595cf60))
* fix the performance issue of `append_paged_kv_cache` ([flashinfer-ai#588](flashinfer-ai#588)) ([e15f7c9](flashinfer-ai@e15f7c9))
* improve parallelism in RoPE with pos_ids ([flashinfer-ai#609](flashinfer-ai#609)) ([ff05155](flashinfer-ai@ff05155))
* improve plan performance by using non-blocking memcpy ([flashinfer-ai#547](flashinfer-ai#547)) ([41ebe6d](flashinfer-ai@41ebe6d))
* reduce the read and write of shared memory in the FusedAddRMSNormKernel ([flashinfer-ai#592](flashinfer-ai#592)) ([2043ca2](flashinfer-ai@2043ca2))
* reduce total_num_tiles_q by one ([flashinfer-ai#644](flashinfer-ai#644)) ([553ace5](flashinfer-ai@553ace5))
* remove unnecessary contiguous operation in block sparse attention ([flashinfer-ai#561](flashinfer-ai#561)) ([7a7ad46](flashinfer-ai@7a7ad46))
* speedup jit compilation of prefill attention kernels ([flashinfer-ai#632](flashinfer-ai#632)) ([a059586](flashinfer-ai@a059586))
* use cuda-core implementation for io-bound block-sparse attention ([flashinfer-ai#560](flashinfer-ai#560)) ([3fbf028](flashinfer-ai@3fbf028))

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Zihao Ye <expye@outlook.com>
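The changelog only names the new Gemma-style normalization kernels ([flashinfer-ai#477](flashinfer-ai#477)). Below is a minimal sketch of how they might be called; the exact signatures (the `eps` keyword, the in-place semantics of the fused variant) are assumptions modeled on FlashInfer's existing `rmsnorm`/`fused_add_rmsnorm` entry points, not verified against this release.

```python
import torch
import flashinfer

# Hypothetical shapes for illustration only.
hidden_size = 4096
x = torch.randn(8, hidden_size, dtype=torch.float16, device="cuda")
residual = torch.randn_like(x)
weight = torch.randn(hidden_size, dtype=torch.float16, device="cuda")

# Gemma-style RMSNorm scales by (1 + weight) rather than weight.
y = flashinfer.gemma_rmsnorm(x, weight, eps=1e-6)

# Fused residual-add + Gemma RMSNorm; assumed here to update
# x and residual in place, mirroring fused_add_rmsnorm.
flashinfer.gemma_fused_add_rmsnorm(x, residual, weight, eps=1e-6)
```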
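Likewise, the new `rotary_dim` argument for the rope APIs ([flashinfer-ai#599](flashinfer-ai#599)) lets you apply rotary embeddings to only the leading dimensions of each head. A hedged sketch, assuming `apply_rope_pos_ids` takes ragged `(nnz, num_heads, head_dim)` tensors plus per-token position ids and returns rotated copies of `q` and `k`:

```python
import torch
import flashinfer

num_heads, head_dim = 32, 128
nnz = 16  # total tokens in the ragged batch (illustrative value)
q = torch.randn(nnz, num_heads, head_dim, dtype=torch.float16, device="cuda")
k = torch.randn(nnz, num_heads, head_dim, dtype=torch.float16, device="cuda")
pos_ids = torch.arange(nnz, dtype=torch.int32, device="cuda")

# Partial RoPE: rotate only the first 64 of 128 head dimensions,
# leaving the remaining dimensions untouched.
q_rope, k_rope = flashinfer.apply_rope_pos_ids(q, k, pos_ids, rotary_dim=64)
```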