Releases: Nexesenex/croco.cpp
Croco.Cpp_FrankenFork_v1.77005_b3962
Bugfix release for 1.77004, with:
- K q8_0 / V F16 FA quant dropped.
- Non-FA quants dropped for now; only K q6_0 / V F16 works (and it's the best anyway).
- Fixed the logic that passes the quant, FA, and no-shift options.
- KLite bumped to 182.
Croco.Cpp_FrankenFork_v1.77004_b3962
New lazy release with:
- @ikawrakow's recent work on KV quants integrated (new KV quant IQ4_NL to replace Q4_0, with -1% PPL when applied to both K and V; Q6_0 close to Q8_0 in the K Q6_0 / V Q5_0 couple).
- Use the GUI to discover the new modes.
- Some of Ikawrakow's work on Cuda (on top of some of his work for CPU inference).
- Agray3's Cuda Graph caching PR.
- Some bugfixes (aka, the bugs created by yours truly). ^^
- Note: Llava users, be careful, it might not work or might simply crash.
Second release: with the help text fixed.
I'll make a longer readme when motivated.
Full Changelog: v1.76005_b3906...v1.77004_b3962
Croco.Cpp_FrankenFork_v1.77002_b3934
KVQ 27 (IQ4_NL) doesn't work; I'm leaving it in for further testing. The deleted KV quants are mapped to the closest equal-or-lower-BPW equivalents, so existing config files keep working as they are, until a stable KVQ cocktail of quants is chosen.
Croco.Cpp_FrankenFork_v1.76007_b3917
v1.76007_b3917
Croco.Cpp_FrankenFork_v1.76005_b3906
v1.76005_b3906
Croco.Cpp_FrankenFork_v1.76004_b3896
v1.76004_b3896
Croco.CPP_FrankenFork_v1.75201_b3826
Cuda archs Pascal to Ada.
Full Changelog: v173001_b3524+7...v1.75201_b3826
Kobold.CPP_FrankenFork_v1.73007_b3599-2+8-4
Frankenstein 1.73007 fork of Concedo's KoboldCPP.
Official KoboldCPP experimental branch up to 16/08/2024, 18h GMT+2.
KLite 1.64 up to 16/08/2024, 18h GMT+2.
Based on Llama.CPP b3599, minus 2 commits, plus 8 pertinent LCPP/KCPP commits/PRs and 4 pertinent IK_Llama.cpp commits.
Anniversary edition (it's been a little more than a year since I started amusing myself with KCPP, and it's been a while since the last KCPP-F release, so..).
I'm mostly attentive to the Cuda side of things. The rest might work, or not.
Feedback is always appreciated to spot mistakes so I can try to fix them.
Unroll DISCLAIMER:
The KoboldCPP-Frankenstein builds are not supported by the KoboldCPP team, Github, or Discord channel. They are for greedy testing and amusement only.
My KCPP-Frankenstein version number bumps as soon as the version number in the official experimental branch bumps, by appending a fork counter to the official x.xx number (e.g., official 1.73 becomes 1.73007 here).
They are not "upgrades" over the official version, and they might be bugged at times: only the official KCPP releases are to be considered correctly numbered, reliable, and "fixed".
The LlamaCPP version plus the additional PRs integrated follow my KCPP-Frankenstein versioning in the title, so everybody knows which version they are dealing with.
The official KCPP releases are here: https://github.com/LostRuins/koboldcpp/releases
FRANKENSTEIN TRICKS:
- Vast choice of context sizes in the GUI slider, with 720 steps up to 1M context, as well as BlasBatchSize (physical) from 2 to 4096 and logical from 32 to 4096.
- Enhanced benchmark (reflecting as many indicators as possible, including the KV cache option). Now integrated, in a slightly revamped form, into the official version.
- A better Autorope, thanks to askmyteapot's PR on the official KoboldCPP Github, with an additional negative offset for Llama models to lower the L1/L2 rope a bit, and a positive offset for SOLAR models, to improve perplexity (L1, L2, Solar) or avoid degrading the reasoning abilities too much (L3, not implemented yet) at equal context (a hedged sketch of the rope scaling idea follows this list).
- More chat adapters, on top of those provided officially.
- A slight deboost of pipeline parallelization, from 4 down to 2: 0.5-1% VRAM saved, and less stress on the graphics cards.
- Full chat-window width for the text when you zoom out in the Corpo theme.
- 8 chat-saving slots instead of 6.
- A multi-GPU-capable layer autoloader.
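Below is a minimal sketch of the general idea behind such rope adjustments, using the well-known NTK-aware scaling rule plus an additive offset. Everything here (trained context, head dimension, base, offset value) is an assumption for illustration; the actual autorope code in KCPP/KCPP-F uses its own constants and logic.

```python
# Illustrative only: NTK-aware rope-base scaling with an additive offset,
# in the spirit of the autorope tweak described above. NOT the KCPP code.

def auto_rope_base(target_ctx: int,
                   trained_ctx: int = 4096,   # assumed native context of the model
                   base: float = 10000.0,     # common Llama rope frequency base
                   head_dim: int = 128,       # common Llama head dimension
                   offset: float = 0.0) -> float:
    """Return a rope frequency base for running `target_ctx` tokens."""
    if target_ctx <= trained_ctx:
        return base
    alpha = target_ctx / trained_ctx
    # NTK-aware scaling rule: base * alpha^(d / (d - 2))
    scaled = base * alpha ** (head_dim / (head_dim - 2.0))
    # A negative offset lowers the rope a bit (L1/L2), a positive one
    # raises it (SOLAR), as described in the bullet above.
    return max(base, scaled + offset)

# Example: pushing a hypothetical 4k-trained model to 16k context
# with a small negative offset.
print(auto_rope_base(16384, offset=-10000.0))  # ~30900 instead of ~40900
```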
Unroll the 26 KV cache options (all should be considered experimental except F16, KV Q8_0, and KV Q4_0)
With Flash Attention:
- F16 -> foolproof (the usual KV quant since the beginning of LCPP/KCPP)
- K F16 with: V Q8_0, Q5_1, Q5_0, Q4_1, Q4_0
- K Q8_0 with: V F16, Q8_0 (stable, my current main, part of the LCPP/KCPP main triplet), Q5_1 (maybe unstable), Q5_0 (maybe unstable), Q4_1 (maybe stable), Q4_0 (maybe stable); the rest is untested beyond benches
- K Q5_1 with: V Q5_1, Q5_0, Q4_1, Q4_0
- K Q5_0 with: V Q5_0, Q4_1, Q4_0
- K Q4_1 with: V Q4_1 (stable), Q4_0 (maybe stable)
- KV Q4_0 (quite stable, if we consider that it's part of the LCPP/KCPP main triplet)
Works on the command line, normally also via the GUI, and normally saves into .KCPPS config files.
Without Flash Attention:
- V F16 with K Q8_0, Q5_1, Q5_0, Q4_1, and Q4_0. (A quick BPW sketch for these K/V pairs follows.)
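As a sanity check on where the BPW figures quoted in the mode list further down come from, here is a minimal sketch using the standard llama.cpp cache-type sizes; the `KV_BPW` table and helper are illustrative, not actual KCPP code.

```python
# Standard llama.cpp cache-type sizes in bits per weight (BPW);
# they match the per-mode figures quoted in the list further down.
KV_BPW = {
    "f16": 16.0, "q8_0": 8.5, "q5_1": 6.0,
    "q5_0": 5.5, "q4_1": 5.0, "q4_0": 4.5,
}

def combined_bpw(k: str, v: str) -> float:
    """Average BPW of a K/V cache pair (K and V have the same shape)."""
    return (KV_BPW[k] + KV_BPW[v]) / 2

print(combined_bpw("q8_0", "q4_0"))  # 6.5, matching mode 11 (FA8040) below
```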
FRANKENSTEIN integrates looted PRs:
- Emphasis DFSM by Yoshqu, to try to fix the frequent ** and "" misplacements in chats on KCPP/SillyTavern through a grammar hack. Tested to work properly in SillyTavern to correct the placement of the * and " characters. Read the readme!
More infos: LostRuins@a43da2f#diff-a298e8927af1245e0ec1308617c0fae4554e5dd6d6ef77818dfc93296de7cced
- 1.625 bpw ternary packing for BitNet b1.58, by Compilade; works in CPU mode (only recent, properly converted Bitnet models work, older ones do not), updated in August.
Example: https://huggingface.co/BoscoTheDog/bitnet_b1_58-xl_q8_0_gguf/tree/main
- Johannes Gaessler's Gemma v2 FA PR (still slow in PP and TG unless you run it on Turing or more recent), but it allows a quantized KV cache.
- A Cuda Graph PR (updated today) by Alan Gray.
- A lookup PR by Johannes Gaessler.
- A Tokenizer fixes PR by jaime-m-p.
- A GELU (CPU perfs) PR by Justine Tunney.
- A Chameleon support PR by nopperl.
AND... 4 relevant PRs coming from Ikawrakow's custom Llama.cpp repo ( https://github.com/ikawrakow/ik_llama.cpp ) (see the KCPP-F commit list until a batch of commits with his name appears).
ARGUMENTS: (to be edited; check them in the CLI or use the GUI)
Note: I had to use a simple sequential 0-26 numbering scheme to allow the GUI and the .kcpps preset saving to work properly with KVQ26. The problems with the previous four-digit quant scheme are fixed. (An argparse sketch follows the mode list.)
--quantkv",
help="Sets the KV cache data type quantization.
Unroll the options to set KV Quants :
KCPP official modes (modes 1 and 2 require Flash Attention):
0 = 1616/F16 (16 BPW),
1 = FA8080/KVq8_0 (8.5 BPW),
2 = FA4040/KVq4_0 (4.5BPW),
KCPP-F unofficial modes (require Flash Attention):
3 = FA1680/Kf16-Vq8_0 (12.25BPW),
4 = FA1651/Kf16-Vq5_1 (11BPW),
5 = FA1650/Kf16-Vq5_0 (10.75BPW),
6 = FA1641/Kf16-Vq4_1 (10.5BPW),
7 = FA1640/Kf16-Vq4_0 (10.25BPW),
8 = FA8051/Kq8_0-Vq5_1 (7.25BPW),
9 = FA8050/Kq8_0-Vq5_0 (7BPW),
10 = FA8041/Kq8_0-Vq4_1 (6.75BPW),
11 = FA8040/Kq8_0-Vq4_0 (6.5BPW),
12 = FA5151/KVq5_1 (6BPW),
13 = FA5150/Kq5_1-Vq5_0 (5.75BPW),
14 = FA5141/Kq5_1-Vq4_1 (5.5BPW),
15 = FA5140/Kq5_1-Vq4_0 (5.25BPW),
16 = FA5050/Kq5_0-Vq5_0 (5.5BPW),
17 = FA5041/Kq5_0-Vq4_1 (5.25BPW),
18 = FA5040/Kq5_0-Vq4_0 (5BPW),
19 = FA4141/Kq4_1-Vq4_1 (5BPW),
20 = FA4140/Kq4_1-Vq4_0 (4.75BPW)
21 = 1616/F16 (16 BPW), (same as 0, I just used it for the GUI slider).
22 = 8016/Kq8_0-Vf16 (12.25BPW), both FA and no-FA
23 = 5116/Kq5_1-Vf16 (11BPW), no-FA
24 = 5016/Kq5_0-Vf16 (10.75BPW), no-FA
25 = 4116/Kq4_1-Vf16 (10.50BPW), no-FA
26 = 4016/Kq4_0-Vf16 (10.25BPW), no-FA
choices=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26], default=0)
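For reference, here is a minimal argparse sketch matching the fragment above; the actual declaration in koboldcpp.py may differ in wording and surrounding options.

```python
import argparse

parser = argparse.ArgumentParser()
# Mirrors the --quantkv fragment above: an integer mode from 0 to 26,
# defaulting to 0 (F16). Illustrative; not the real koboldcpp.py code.
parser.add_argument("--quantkv",
                    help="Sets the KV cache data type quantization.",
                    type=int, choices=range(27), default=0)

args = parser.parse_args(["--quantkv", "11"])  # mode 11 = FA8040: K q8_0 / V q4_0
print(args.quantkv)  # 11
```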
Note: the Lowvram option's speed is (logically) boosted due to the smaller KV cache in RAM, from 25%+ with KV Q8_0 to 50%+ with KV Q4_0 (a worked size example follows).
Note: context shift doesn't seem to work with a quantized K cache without FA either, but Smartcontext does!
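To make those savings concrete, here is a back-of-the-envelope KV cache size calculation; the model shape below (32 layers, 4096-wide KV, 8192 context, roughly Llama-2-7B-like) is an assumption for illustration only.

```python
# Approximate KV cache footprint: one K and one V value per layer,
# per position, per element of the KV hidden width.
def kv_cache_bytes(n_layer: int, n_ctx: int, n_embd_kv: int,
                   k_bpw: float, v_bpw: float) -> float:
    bits = n_layer * n_ctx * n_embd_kv * (k_bpw + v_bpw)
    return bits / 8

shape = dict(n_layer=32, n_ctx=8192, n_embd_kv=4096)  # assumed model shape
for name, (k, v) in {"F16": (16.0, 16.0), "KV Q8_0": (8.5, 8.5),
                     "KV Q4_0": (4.5, 4.5)}.items():
    mib = kv_cache_bytes(**shape, k_bpw=k, v_bpw=v) / 2**20
    print(f"{name}: {mib:.0f} MiB")  # 4096, 2176, 1152 MiB respectively
```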
REMARKS:
- You MUST use Flash Attention for anything other than QKV=0 (F16) (flag: --flashattention in the CLI, or in the GUI).
- Contextshift doesn't work with anything other than KV F16, but Smartcontext does.
- BlasBatchSize 512 is still optimal in Cublas, 256 is still the best compromise, and 128 is a savvy choice, optimal with MMQ activated, and now used by default. 64 is perfectly usable and optimal for VRAM-limited scenarios. 32/16 also work, but slower. 8 in MMVQ is worth 16 in Cublas mode; 4, 2, and 1 are MMVQ as well, but very slow, as you can imagine.
CREDITS :
Of course, all credits go to Concedo/LostRuins and the other contributors to KoboldCPP, and to Georgi Gerganov and all the other contributors to LlamaCPP. Special big-up to Johannes Gaessler for the quantized KV cache!
I'm just poking, merging, and building around their work.
Unroll the ARCHS and BUILDS
Archs :
# 37 == CUDA 11 standard for Kepler
# 52 == lowest CUDA 12 standard, for Maxwell
# 60 == f16 CUDA intrinsics
# 61 == integer CUDA intrinsics
# 70 == (assumed) compute capability at which unrolling a loop in mul_mat_q kernels is faster
# 75 == int8 tensor cores
Builds :
- Cublas 12.2 Win (arch 60 61 70 75): works on Ada Lovelace, Ampere, and Turing. It can work on Pascal or more recent as well, and has CUDA F16 activated at compile time.
- Cublas 12.2 Win (arch 52 61 70 75) and 12.1 Linux: works from Maxwell v2 up to Ada, using integer Cuda intrinsics.
- Cublas 11.4.4 Win / 11.5 Linux ("KepMax", arch 35 37 50 52 for Windows, 37 52 61 70 75 for Linux): needed for Kepler, and should also work on Maxwell (experimental; tell me whether that combo works, and if not I'll fall back to 37 52 61 70 75 for Cuda 11).
- The standard build includes only the OpenBLAS, CLBLAST, and Vulkan support provided by the devs.
Builds I compiled that are missing here can be found in Github Actions, when I'm not messing it up:
https://github.com/Nexesenex/kobold.cpp/actions
What's Changed
- b3535 by @Nexesenex in #282
- b3542 by @Nexesenex in #283
- b3557 by @Nexesenex in #285
- b3569 by @Nexesenex in #288
- b3579 by @Nexesenex in #290
- b3583 by @Nexesenex in #293
- b3590 by @Nexesenex in #300
- b3596 by @Nexesenex in #303
- b3599 by @Nexesenex in #304
Full Changelog: v1.73001_b3524+7...v13007_b3599-2+8+4
Kobold.CPP_FrankenFork_v1.73001_b3524+7
Cuda 12-only build for the greedy.
Full build likely after KCPP Official 1.73 is out.
Jart's new GELU PR is merged.
What's Changed
- b3535 by @Nexesenex in #282
- b3542 by @Nexesenex in #283
- Compilade/faster session sizes by @Nexesenex in #284
- b3557 by @Nexesenex in #285
- b3569 by @Nexesenex in #288
- b3579 by @Nexesenex in #290
- b3583 by @Nexesenex in #293
- Dynamic ggml sched max splits by @Nexesenex in #301
Full Changelog: v173001_b3524+7...v1.73001_b3524+7