Releases: EricLBuehler/mistral.rs
v0.3.4
New features
- Qwen2-VL support
- Idefics 3/SmolVLM support
- 🔥 6x prompt performance boost (all benchmarks faster than or comparable to MLX, llama.cpp)!
- More efficient non-PagedAttention KV cache implementation!
- Public tokenization API
Python wheels
The wheels now include support for Windows, Linux, and Mac on x86_64 and aarch64.
MSRV
1.79.0
What's Changed
- Update Dockerfile by @Reckon-11 in #895
- Add the Qwen2-VL model by @EricLBuehler in #894
- ISQ for mistralrs-bench by @EricLBuehler in #902
- Use tokenizers v0.20 by @EricLBuehler in #904
- Fix metal sdpa for v stride by @EricLBuehler in #905
- Better parsing of the image path by @EricLBuehler in #906
- Add some Metal kernels for HQQ dequant by @EricLBuehler in #907
- Handle assistant messages with 'tool_calls' by @Jeadie in #824
- Attention-fused softmax for Metal by @EricLBuehler in #908
- Metal qmatmul mat-mat product (5.4x performance increase) by @EricLBuehler in #909
- Support --dtype in mistralrs bench by @EricLBuehler in #911
- Metal: Use mtl resource shared to avoid one copy by @EricLBuehler in #914
- Preallocated KV cache by @EricLBuehler in #916
- Fixes for kv cache grow by @EricLBuehler in #917
- Dont always compile with fp8, bf16 for cuda by @EricLBuehler in #920
- Expand attnmask on cuda by @EricLBuehler in #923
- Faster CUDA prompt speeds by @EricLBuehler in #925
- Paged Attention alibi support by @EricLBuehler in #926
- Default to SDPA for faster VLlama PP T/s by @EricLBuehler in #927
- VLlama vision model ISQ support by @EricLBuehler in #928
- Support fp8 on Metal by @EricLBuehler in #930
- Bump rustls from 0.23.15 to 0.23.18 by @dependabot in #932
- Calculate perplexity of ISQ models by @EricLBuehler in #931
- Integrate fast MLX kernel for SDPA with long seqlen by @EricLBuehler in #933
- Always cast image to rgb8 for qwenvl2 by @EricLBuehler in #936
- Fix etag missing in hf hub by @EricLBuehler in #934
- Fix some examples for vllama 3.2 by @EricLBuehler in #937
- Improve memory efficiency of vllama by @EricLBuehler in #938
- Implement the Idefics 3 models (Idefics 3, SmolVLM-Instruct) by @EricLBuehler in #939
- Expose a public tokenization API by @EricLBuehler in #940
- Prepare for v0.3.4 by @EricLBuehler in #942
New Contributors
- @Reckon-11 made their first contribution in #895
Full Changelog: v0.3.2...v0.3.4
v0.3.2
Key changes
- General improvements and fixes
- ISQ FP8
- GPTQ Marlin
- 26% performance boost on Metal
- Python package wheels are available. See the various PyPI packages.
What's Changed
- Update docs and deps by @EricLBuehler in #804
- Support Qwen 2.5 by @EricLBuehler in #805
- Update docs with clarifications and notes by @EricLBuehler in #806
- Improved inverting for Attention Mask by @EricLBuehler in #811
- Fix `repeat_interleave` by @EricLBuehler in #812
- Use f32 for neg inf in cross attn mask by @EricLBuehler in #814
- Improve UQFF memory efficiency by @EricLBuehler in #813
- Update Metal, CUDA Candle impls and ISQ by @EricLBuehler in #816
- chore: update pagedattention.cu by @eltociear in #822
- MLlama - if f16, load vision model in f32 by @EricLBuehler in #820
- ci: Upgrade actions by @polarathene in #823
- docs: added a top button because of readme length by @bhargavshirin in #833
- Typo in error of model architecture enum by @nikolaydubina in #835
- Expose config for Rust api, tweak modekind by @EricLBuehler in #841
- Add ISQ FP8 by @EricLBuehler in #832
- Fix Metal F8 build errors by @EricLBuehler in #846
- Bump pyo3 from 0.22.3 to 0.22.4 by @dependabot in #854
- Generate standalone UQFF models by @EricLBuehler in #849
- Update README.MD by @kaleaditya779 in #848
- Add GPTQ Marlin support for 4 and 8 bit by @EricLBuehler in #856
- Adds wrap_help feature to clap by @DaveTJones in #858
- Patch UQFF metal generation by @EricLBuehler in #857
- Add GGUF Qwen 2 by @EricLBuehler in #860
- Avoid duplicate Metal command buffer encodings during ISQ by @EricLBuehler in #861
- Fix for isnanf by @EricLBuehler in #859
- Fix some metal warnings by @EricLBuehler in #862
- Support interactive mode markdown bold/italics via ANSI codes by @EricLBuehler in #879
- Even better V-Llama accuracy by @EricLBuehler in #881
- Trim whitespace (such as carriage returns) from nvidia-smi output. by @asaddi in #880
- MODEL_ID not "MODEL_ID" by @simonw in #863
- Sync ggml metal kernels by @EricLBuehler in #885
- Increase Metal decoding T/s by 26% by @EricLBuehler in #887
- Remove pretty-printer by @EricLBuehler in #889
- Fix typo in documentation by @msk in #888
- fix Half-Quadratic Quantization and Dequantization on CPU by @haricot in #873
- Prepare for v0.3.2 by @EricLBuehler in #891
New Contributors
- @bhargavshirin made their first contribution in #833
- @nikolaydubina made their first contribution in #835
- @kaleaditya779 made their first contribution in #848
- @DaveTJones made their first contribution in #858
- @asaddi made their first contribution in #880
- @simonw made their first contribution in #863
- @msk made their first contribution in #888
- @haricot made their first contribution in #873
Full Changelog: v0.3.1...v0.3.2
v0.3.1
Highlights
- UQFF
- FLUX model
- Llama 3.2 Vision model
MSRV
The MSRV of this release is 1.79.0.
What's Changed
- Enable automatic determination of normal loader type by @EricLBuehler in #742
- Add the `ForwardInputsResult` api by @EricLBuehler in #745
- Implement Mixture of Quantized Experts (MoQE) by @EricLBuehler in #747
- Bump quinn-proto from 0.11.6 to 0.11.8 by @dependabot in #748
- Fix f64-f32 type mismatch for Metal/Accelerate by @EricLBuehler in #752
- Nicer error when misconfigured PagedAttention input metadata by @EricLBuehler in #753
- Update deps, support CUDA 12.6 by @EricLBuehler in #755
- Patch bug when not using PagedAttention by @EricLBuehler in #759
- Fix `MistralRs` Drop impl in tokio runtime by @EricLBuehler in #762
- Use nicer Candle Error APIs by @EricLBuehler in #767
- Support setting seed by @EricLBuehler in #766
- Fix Metal build error with seed by @EricLBuehler in #771
- Fix and add checks for no kv cache by @EricLBuehler in #776
- UQFF: The uniquely powerful quantized file format. by @EricLBuehler in #770
- Add `Scheduler::running_len` by @EricLBuehler in #780
- Deduplicate RoPE caches by @EricLBuehler in #787
- Easier and simpler Rust-side API by @EricLBuehler in #785
- Add some examples for AnyMoE by @EricLBuehler in #788
- Rust API for sampling by @EricLBuehler in #790
- Our first Diffusion model: FLUX by @EricLBuehler in #758
- Fix build bugs with metal, NSUInteger by @EricLBuehler in #792
- Support weight tying in Llama 3.2 GGUF models by @EricLBuehler in #801
- Implement the Llama 3.2 vision models by @EricLBuehler in #796
Full Changelog: v0.3.0...v0.3.1
v0.3.0
Highlights
- New model topology feature: ISQ and device mapping
- 🔥 Faster FlashAttention support when batching
- Removed `plotly` and associated JS dependencies
- φ³ Support Phi 3.5, Phi 3.5 vision, Phi 3.5 MoE
- Improved Rust API ergonomics
- Support multiple (sharded) GGUF files
MSRV
The Rust MSRV of this version is 1.79.0
What's Changed
- Fixes for auto dtype selection with RUST_BACKTRACE=1 by @EricLBuehler in #690
- Add support multiple GGUF files by @EricLBuehler in #692
- Refactor normal and vision loaders by @EricLBuehler in #693
- Fix `split.count` GGUF duplication handling by @EricLBuehler in #695
- Batching example by @EricLBuehler in #694
- Some fixes by @EricLBuehler in #697
- Improve vision rust examples by @EricLBuehler in #698
- Add ISQ topology by @EricLBuehler in #701
- Add custom logits processor API by @EricLBuehler in #702
- Add Gemma 2 PagedAttention support by @EricLBuehler in #704
- Faster RmsNorm in Gemma/Gemma2 by @EricLBuehler in #703
- Fix bug in Metal ISQ by @EricLBuehler in #706
- Support GGUF BF16 tensors by @EricLBuehler in #691
- Better support for FlashAttention: real batching + sliding window + softcap by @EricLBuehler in #707
- Remove some usages of `pub` in models by @EricLBuehler in #708
- Support the Phi 3.5 V model by @EricLBuehler in #710
- Implement the Phi 3.5 MoE model by @EricLBuehler in #709
- Device map topology by @EricLBuehler in #717
- Implement DRY penalty by @EricLBuehler in #637
- Remove plotly and just output CSV loss file by @EricLBuehler in #700
- Using once_cell to reduce MSRV by @EricLBuehler in #724
- Fixes for Windows build by @EricLBuehler in #729
- Even more phi3.5moe fix attempts by @EricLBuehler in #731
- Add example for Phi 3.5 MoE by @EricLBuehler in #733
- Add Phi 3.5 chat template by @EricLBuehler in #734
- Patch ISQ for Mixtral by @EricLBuehler in #730
- Gracefully handle Engine Drop with termination request by @EricLBuehler in #735
- feat(vision): add support for proper file and data image URLs by @Schuwi in #727
- Add new parsing to Python API by @EricLBuehler in #737
- Remove test and add custom error type to Python API by @EricLBuehler in #738
- Update kernels for metal bf16 by @EricLBuehler in #719
- Better `Response` Result API by @EricLBuehler in #739
- More Metal quantized kernel fixes by @EricLBuehler in #740
- [Breaking] Bump version to v0.3.0 by @EricLBuehler in #736
- Final changes for v0.3.0 by @EricLBuehler in #741
Full Changelog: v0.2.5...v0.3.0
v0.2.5
What's Changed
- Refactor ISQ quant parsing by @EricLBuehler in #664
- Refactor server examples to use OpenAI Python client by @EricLBuehler in #665
- Implement prompt chunking by @EricLBuehler in #623
- Python example and server example cleanup by @EricLBuehler in #668
- Implement GPTQ quantization by @EricLBuehler in #467
- Update deps by @EricLBuehler in #672
- Rework the automatic dtype selection feature by @EricLBuehler in #676
- Fix backend Candle fork Metal, flash attn, also Llama linear by @EricLBuehler in #681
- Use converted tokenizer.json in tests by @EricLBuehler in #682
- Refactor ISQ and mistralrs-quant by @EricLBuehler in #683
- Fix metal build for isq by @EricLBuehler in #686
- Add missing error case in automatic dtype selection feature by @ac3xx in #685
- fix null in tool type response by @wseaton in #687
- Implement HQQ quantization by @EricLBuehler in #677
- Bump version to 0.2.5 by @EricLBuehler in #688
Full Changelog: v0.2.4...v0.2.5
Install mistralrs-server 0.2.5
Install prebuilt binaries via shell script
curl --proto '=https' --tlsv1.2 -LsSf https://github.com/EricLBuehler/mistral.rs/releases/download/v0.2.5/mistralrs-server-installer.sh | sh
Download mistralrs-server 0.2.5
| File | Platform | Checksum |
|---|---|---|
| mistralrs-server-aarch64-apple-darwin.tar.xz | Apple Silicon macOS | checksum |
| mistralrs-server-x86_64-apple-darwin.tar.xz | Intel macOS | checksum |
| mistralrs-server-x86_64-unknown-linux-gnu.tar.xz | x64 Linux | checksum |
v0.2.4
What's Changed
- fix build on metal by returning Device by @rgbkrk in #642
- Add invite to Matrix chatroom by @EricLBuehler in #644
- Make sure we don't have dead links by @EricLBuehler in #647
- Fix more links by @EricLBuehler in #648
- Throughput for interactive mode by @EricLBuehler in #655
- Implement tool calling by @EricLBuehler in #649
- Fix device map check for paged attn by @EricLBuehler in #656
- Fix for mistral nemo in gguf by @EricLBuehler in #657
- Fix check of cache config when device mapping + PA by @EricLBuehler in #658
- Biollama in tool calling example by @EricLBuehler in #659
- Biollama in tool calling example by @EricLBuehler in #660
- Examples for simple tool calling by @EricLBuehler in #661
- Bump version to 0.2.4 by @EricLBuehler in #662
Full Changelog: v0.2.3...v0.2.4
MSRV
MSRV is 1.75
Install mistralrs-server 0.2.4
Install prebuilt binaries via shell script
curl --proto '=https' --tlsv1.2 -LsSf https://github.com/EricLBuehler/mistral.rs/releases/download/v0.2.4/mistralrs-server-installer.sh | sh
Download mistralrs-server 0.2.4
| File | Platform | Checksum |
|---|---|---|
| mistralrs-server-aarch64-apple-darwin.tar.xz | Apple Silicon macOS | checksum |
| mistralrs-server-x86_64-apple-darwin.tar.xz | Intel macOS | checksum |
| mistralrs-server-x86_64-unknown-linux-gnu.tar.xz | x64 Linux | checksum |
v0.2.3
What's Changed
- Implement min-p sampling by @EricLBuehler in #625
- Tweak handling when PA cannot allocate by @EricLBuehler in #632
- Update deps by @EricLBuehler in #633
- Improve penalty context window calculation by @EricLBuehler in #636
- Allow setting PagedAttention KV cache allocation from context size by @EricLBuehler in #640
- Bump version to 0.2.3 by @EricLBuehler in #638
Full Changelog: v0.2.2...v0.2.3
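For readers unfamiliar with the min-p sampling mentioned in #625, the general technique can be sketched as follows. This is a minimal illustration of min-p filtering as commonly described (keep tokens whose probability is at least `min_p` times the top probability, then renormalize); it is not taken from the mistral.rs source, and the function name `min_p_filter` is hypothetical.

```rust
/// Illustrative min-p filter (hypothetical helper, not the mistral.rs API).
/// Keeps tokens with probability >= min_p * max_probability, zeroes the
/// rest, and renormalizes the survivors so they sum to 1.
fn min_p_filter(probs: &[f32], min_p: f32) -> Vec<f32> {
    // Largest probability in the distribution (0.0 for empty input).
    let p_max = probs.iter().copied().fold(0.0_f32, f32::max);
    let threshold = min_p * p_max;

    // Zero out every token that falls below the scaled threshold.
    let kept: Vec<f32> = probs
        .iter()
        .map(|&p| if p >= threshold { p } else { 0.0 })
        .collect();

    // Renormalize; if nothing survived, return the zeroed vector unchanged.
    let total: f32 = kept.iter().sum();
    if total > 0.0 {
        kept.iter().map(|&p| p / total).collect()
    } else {
        kept
    }
}
```

Calling `min_p_filter(&[0.5, 0.3, 0.15, 0.05], 0.2)` sets the threshold to 0.2 × 0.5 = 0.1, so the last token is dropped and the remaining three are rescaled before sampling.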
Install mistralrs-server 0.2.3
Install prebuilt binaries via shell script
curl --proto '=https' --tlsv1.2 -LsSf https://github.com/EricLBuehler/mistral.rs/releases/download/v0.2.3/mistralrs-server-installer.sh | sh
Download mistralrs-server 0.2.3
| File | Platform | Checksum |
|---|---|---|
| mistralrs-server-aarch64-apple-darwin.tar.xz | Apple Silicon macOS | checksum |
| mistralrs-server-x86_64-apple-darwin.tar.xz | Intel macOS | checksum |
| mistralrs-server-x86_64-unknown-linux-gnu.tar.xz | x64 Linux | checksum |
v0.2.2
What's Changed
- Fix ctrlc handling for scheduler v2 by @EricLBuehler in #614
- Make `sliding_window` optional for mixtral by @csicar in #616
- Support Llama 3.1 scaled rope by @EricLBuehler in #618
Full Changelog: v0.2.1...v0.2.2
MSRV
MSRV is 1.75.
Install mistralrs-server 0.2.2
Install prebuilt binaries via shell script
curl --proto '=https' --tlsv1.2 -LsSf https://github.com/EricLBuehler/mistral.rs/releases/download/v0.2.2/mistralrs-server-installer.sh | sh
Download mistralrs-server 0.2.2
| File | Platform | Checksum |
|---|---|---|
| mistralrs-server-aarch64-apple-darwin.tar.xz | Apple Silicon macOS | checksum |
| mistralrs-server-x86_64-apple-darwin.tar.xz | Intel macOS | checksum |
| mistralrs-server-x86_64-unknown-linux-gnu.tar.xz | x64 Linux | checksum |
v0.2.1
What's Changed
- Fix path normalize for mistralrs-paged-attn by @EricLBuehler in #592
- ISQ python example by @EricLBuehler in #593
- Add support for mistral nemo by @EricLBuehler in #595
- Fix dtype with QLinear by @EricLBuehler in #600
- Update paged-attn build.rs with NVCC flags by @joshpopelka20 in #604
- Bump openssl from 0.10.64 to 0.10.66 by @dependabot in #605
- Update GitHub issue templates by @EricLBuehler in #607
- Add server throughput logging by @EricLBuehler in #608
- Make the plotly feature optional by @EricLBuehler in #597
- Use OnceLock for Python bindings device by @EricLBuehler in #602
- Topk for X-LoRA scalings by @EricLBuehler in #609
- Fix server cross-origin errors by @openmynet in #610
- Refactor sampler by @EricLBuehler in #611
- Bump version to 0.2.1 by @EricLBuehler in #613
New Contributors
- @dependabot made their first contribution in #605
- @openmynet made their first contribution in #610
Full Changelog: v0.2.0...v0.2.1
Install mistralrs-server 0.2.1
Install prebuilt binaries via shell script
curl --proto '=https' --tlsv1.2 -LsSf https://github.com/EricLBuehler/mistral.rs/releases/download/v0.2.1/mistralrs-server-installer.sh | sh
Download mistralrs-server 0.2.1
| File | Platform | Checksum |
|---|---|---|
| mistralrs-server-aarch64-apple-darwin.tar.xz | Apple Silicon macOS | checksum |
| mistralrs-server-x86_64-apple-darwin.tar.xz | Intel macOS | checksum |
| mistralrs-server-x86_64-unknown-linux-gnu.tar.xz | x64 Linux | checksum |
v0.2.0
New features
- Support .bin, .pt, .pth extensions
- Add Starcoder 2 GGUF
- 🔥 PagedAttention - beating llama.cpp running GGUF plus all the throughput benefits
- Optimized performance and memory usage
Rust MSRV
The MSRV of mistral.rs v0.2.0 is 1.75.
What's Changed
- Fix SWA order (flip it) for Gemma 2 by @EricLBuehler in #554
- Support .bin, .pt, .pth extensions by @EricLBuehler in #557
- Update readme by @EricLBuehler in #558
- Fix Starcoder 2 ISQ by @EricLBuehler in #559
- Update deps by @EricLBuehler in #560
- Add the starcoder2 GGUF arch by @EricLBuehler in #561
- Readme update for starcoder2 gguf by @EricLBuehler in #562
- Fix PyPI release trigger by @EricLBuehler in #566
- Optimize multi-batch and inference performance with PagedAttention by @EricLBuehler in #552
- [Breaking] Version 0.2.0 by @EricLBuehler in #527
- Paged attention support for vision models by @EricLBuehler in #567
- Automatically use paged attn on cuda, get memory size by @EricLBuehler in #568
- Add docs link for vision loader by @EricLBuehler in #570
- Add matching for valid model weight names by @EricLBuehler in #571
- Remove ensure about no paged attn for vision models by @EricLBuehler in #573
- Add percentage utilization support to paged attn by @EricLBuehler in #574
- Include block engine in paged attn metadata by @EricLBuehler in #576
- Update deps and sync Candle by @EricLBuehler in #578
- Optimize CLIP model by @EricLBuehler in #579
- Use softmax_last_dim in CLIP by @EricLBuehler in #580
- Fix method of calculating paged attn with util percent by @EricLBuehler in #581
- Handle windows in paged attn build by @EricLBuehler in #577
- Warn instead of error when paged attn not supported by @EricLBuehler in #583
- Warn instead of error when paged attn for adapters not supported by @EricLBuehler in #584
- Add support for lm_head to adapter models by @EricLBuehler in #586
- Add default plotly feature by @EricLBuehler in #587
- Improve memory handling of PagedAttention with GGUF by @EricLBuehler in #590
- Fix Windows build on cuda w/ PagedAttention by @EricLBuehler in #589
- Update cuda kernels build.rs on windows by @EricLBuehler in #591
- Bump version to 0.2.0 and update docs by @EricLBuehler in #582
Full Changelog: v0.1.26...v0.2.0
Install mistralrs-server 0.2.0
Install prebuilt binaries via shell script
curl --proto '=https' --tlsv1.2 -LsSf https://github.com/EricLBuehler/mistral.rs/releases/download/v0.2.0/mistralrs-server-installer.sh | sh
Download mistralrs-server 0.2.0
| File | Platform | Checksum |
|---|---|---|
| mistralrs-server-aarch64-apple-darwin.tar.xz | Apple Silicon macOS | checksum |
| mistralrs-server-x86_64-apple-darwin.tar.xz | Intel macOS | checksum |
| mistralrs-server-x86_64-unknown-linux-gnu.tar.xz | x64 Linux | checksum |