Preset Option | Compatible Parameters | Incompatible Parameters | Notes |
---|---|---|---|
Use CPU | --usecpu |
--usecublas , --usevulkan , --useclblast , --lowvram , --mmq , --rowsplit , --gpulayers , --tensor_split , --noavx2 |
Default option for CPU-only systems |
Use CLBlast | --useclblast , --gpulayers |
--usecublas , --usevulkan , --noblas , --lowvram , --mmq , --rowsplit , --noavx2 |
Must specify exactly 2 arguments for platform ID and device ID |
Use CuBLAS | --usecublas , --gpulayers , --tensor_split , --lowvram , --mmq , --rowsplit |
--usevulkan , --useclblast , --noblas , --noavx2 |
Best option for NVIDIA GPUs |
Use Vulkan | --usevulkan , --gpulayers , --tensor_split |
--usecublas , --useclblast , --noblas , --lowvram , --mmq , --noavx2 |
Compatible with a wide range of GPUs |
CLBlast NoAVX2 (Old CPU) | Same as "Use CLBlast" | Same as "Use CLBlast" | Automatically sets --noavx2 |
Vulkan NoAVX2 (Old CPU) | Same as "Use Vulkan" | Same as "Use Vulkan" | Automatically sets --noavx2 |
NoAVX2 Mode (Old CPU) | Most general parameters | --usecublas , --usevulkan , --useclblast |
Automatically sets --noavx2 |
Failsafe Mode (Old CPU) | Basic parameters only | --usecublas , --usevulkan , --useclblast , --lowvram , --mmq , --rowsplit , --gpulayers , --tensor_split |
Automatically sets --noavx2 , --noblas , and --nommap |
Notes:
- The
--gpulayers
option is only available for Vulkan, CLBlast, and CuBLAS modes.- The
--tensor_split
option is only available for CuBLAS and Vulkan modes.- The
--lowvram
,--mmq
, and--rowsplit
options are only available for CuBLAS mode.
Flag/Parameter | Compatible Flags | Incompatible Flags | Notes |
---|---|---|---|
--usecpu |
--noavx2 , --contextsize , --threads , --blasthreads , --blasbatchsize , --noshift , --nommap , --usemlock |
--usecublas , --usevulkan , --useclblast , --gpulayers , --tensor_split |
Default CPU mode |
--usecublas |
--gpulayers , --tensor_split , --lowvram , --mmq , --rowsplit , --contextsize , --threads , --blasthreads , --blasbatchsize , --noshift , --nommap , --usemlock |
--usecpu , --usevulkan , --useclblast , --noavx2 |
For NVIDIA GPUs. Can specify GPU ID |
--usevulkan |
--gpulayers , --tensor_split , --contextsize , --threads , --blasthreads , --blasbatchsize , --noshift , --nommap , --usemlock |
--usecpu , --usecublas , --useclblast , --noavx2 , --lowvram , --mmq |
For various GPUs. Can specify Device ID |
--useclblast |
--gpulayers , --contextsize , --threads , --blasthreads , --blasbatchsize , --noshift , --nommap , --usemlock |
--usecpu , --usecublas , --usevulkan , --noavx2 , --tensor_split , --lowvram , --mmq |
Requires platform ID and device ID |
--noavx2 |
--usecpu , --contextsize , --threads , --blasthreads , --blasbatchsize , --noshift , --nommap , --usemlock |
--usecublas , --usevulkan , --useclblast |
Compatibility mode for older CPUs |
--gpulayers |
--usecublas , --usevulkan , --useclblast |
--usecpu |
Specifies layers to offload to GPU |
--tensor_split |
--usecublas , --usevulkan |
--usecpu , --useclblast |
For multi-GPU setups |
--contextsize |
All flags | None | Sets maximum context size |
--threads |
All flags | None | Sets number of CPU threads |
--blasthreads |
All flags | None | Sets number of BLAS threads |
--blasbatchsize |
All flags | None | Sets BLAS batch size |
--noshift |
All flags | None | Disables context shifting |
--nommap |
All flags | None | Disables memory mapping for model loading |
--usemlock |
All flags | None | Enables mlock |
--lowvram |
--usecublas |
--usecpu , --usevulkan , --useclblast |
Low VRAM mode for CUDA |
--mmq |
--usecublas |
--usecpu , --usevulkan , --useclblast |
Enables QuantMatMul for CUDA |
--rowsplit |
--usecublas |
--usecpu , --usevulkan , --useclblast |
Enables row splitting for multi-GPU CUDA |
--lora |
All flags | None | Applies LoRA adapter |
--ropeconfig |
All flags | None | Custom RoPE scaling |
--smartcontext |
All flags | --noshift |
Enables smart context (not recommended) |
--debugmode |
All flags | None | Enables debug output |
--highpriority |
All flags | None | Sets high process priority |
--quiet |
All flags | None | Enables quiet mode |
--multiuser |
All flags | None | Enables multi-user mode |
--remotetunnel |
All flags | None | Creates remote tunnel |
--ssl |
All flags | None | Enables SSL |
--nocertify |
All flags | None | Allows insecure SSL connections |
--password |
All flags | None | Sets access password |
--flashattention |
All flags | None | Enables flash attention |
--quantkv |
All flags | --noshift |
Enables KV cache quantization |
Menu Option | koboldcpp_nocuda.exe | koboldcpp.exe | koboldcpp_oldcpu.exe | koboldcpp_cu12.exe |
---|---|---|---|---|
Use UseCPU | ✓ | ✓ | (no AVX2) | ✓ |
Use CLBlast | ✓ | ✓ | ✓ | ✓ |
Use CuBLAS | ✓ (CUDA 11) | ✓ (CUDA 11) | ✓ (CUDA 12) | |
Use Vulkan | ✓ | ✓ | ✓ | ✓ |
CLBlast NoAVX2 (Old CPU) | ✓ | ✓ | ✓ | ✓ |
Vulkan NoAVX2 (Old CPU) | ✓ | ✓ | ✓ | ✓ |
NoAVX2 Mode (Old CPU) | ✓ | ✓ | ✓ | ✓ |
Failsafe Mode (Old CPU) | ✓ | ✓ | ✓ | ✓ |
DLL Files | koboldcpp_nocuda.exe | koboldcpp.exe | koboldcpp_oldcpu.exe | koboldcpp_cu12.exe |
---|---|---|---|---|
OpenCL.dll | ✓ | ✓ | ✓ | ✓ |
koboldcpp_default.dll | ✓ | ✓ | ✓ | ✓ |
koboldcpp_openblas.dll | ✓ | ✓ | ✓ | |
koboldcpp_failsafe.dll | ✓ | ✓ | ✓ | ✓ |
koboldcpp_noavx2.dll | ✓ | ✓ | ✓ | ✓ |
libopenblas.dll | ✓ | ✓ | ✓ | |
koboldcpp_clblast.dll | ✓ | ✓ | ✓ | ✓ |
koboldcpp_clblast_noavx2.dll | ✓ | ✓ | ✓ | ✓ |
koboldcpp_vulkan_noavx2.dll | ✓ | ✓ | ✓ | ✓ |
clblast.dll | ✓ | ✓ | ✓ | ✓ |
koboldcpp_vulkan.dll | ✓ | ✓ | ✓ | ✓ |
vulkan-1.dll | ✓ | ✓ | ✓ | ✓ |
koboldcpp_cublas.dll | ✓ | ✓ | ✓ | |
cublas64_11.dll | ✓ | ✓ | ||
cublasLt64_11.dll | ✓ | ✓ | ||
cudart64_110.dll | ✓ | ✓ | ||
cublas64_12.dll | ✓ | |||
cublasLt64_12.dll | ✓ | |||
cudart64_12.dll | ✓ | |||
msvcp140.dll | ✓ | ✓ | ✓ | |
msvcp140_codecvt_ids.dll | ✓ | ✓ | ✓ | |
vcruntime140.dll | ✓ | ✓ | ✓ | |
vcruntime140_1.dll | ✓ | ✓ | ✓ |
Mode | Required DLLs |
---|---|
Default/No acceleration | koboldcpp_default.dll |
OpenBLAS | koboldcpp_openblas.dll libopenblas.dll (must exist alongside) |
CLBlast | koboldcpp_clblast.dll clblast.dll (must exist alongside) OpenCL.dll (implied by Makefile's use of OpenCL.lib) |
CuBLAS (NVIDIA GPU acceleration) | koboldcpp_cublas.dll cuda.dll cublas.dll cudart.dll cublasLt.dll |
HipBLAS (AMD GPU acceleration) | koboldcpp_hipblas.dll Additional AMD ROCm libraries (not explicitly listed in Makefile) |
Vulkan | koboldcpp_vulkan.dll vulkan-1.dll |
NoAVX2 variants (for older CPUs) | koboldcpp_noavx2.dll koboldcpp_clblast_noavx2.dll (requires same dependencies as CLBlast) koboldcpp_vulkan_noavx2.dll (requires same dependency as Vulkan) |
Failsafe mode | koboldcpp_failsafe.dll |
--model [filename]
: Model file to load.--port [portnumber]
: Port to listen on.- You can use the foregoing as "positional" arguments but just using the flags makes things more uniform.
--config [filename]
: Load settings from a.kcpps
file. Other arguments will be ignored.--noavx2
: Do not use AVX2 instructions; enables a slower compatibility mode for older devices.--nommap
: If set, do not usemmap
to load newer models.--usemlock
: For Apple systems. Forces the system to keep the model in RAM rather than swapping or compressing. On systems with limited RAM, setting--usemlock
can prevent frequent memory swapping and improve performance. Disabled by default.--skiplauncher
: Doesn't display or use the GUI launcher.--quiet
: Enable quiet mode, which hides generation inputs and outputs in the terminal. Quiet mode is automatically enabled when running a horde worker.--onready [shell command]
: An optional shell command to execute after the model has been loaded.- This is an advanced parameter intended for script or command line usage. You can pass a terminal command (e.g., start a Python script) to be executed after Koboldcpp has finished loading. This runs as a subprocess and can be useful for starting Cloudflare tunnels, displaying URLs, etc.
- Example:
--onready "python script.py"
runs the specified Python script after the model is loaded.
--threads [number]
: Specifies the number of CPU threads to use for text generation.- If a number is not specified a default value is calculated.
- If CPU core count > 1: Uses half the physical cores, with a minimum of 3 and maximum of (physical cores - 1)
- For systems with 1 core: Uses 1 thread.
- Intel CPU Specific: For Intel processors, the maximum default is capped at 8 threads to avoid using efficiency cores
- Usage:
--threads [number]
- Note: If not specified, the program uses the calculated default value
- If a number is not specified a default value is calculated.
The --usecublas
argument enables GPU acceleration using CuBLAS (for NVIDIA GPUs) or hipBLAS (for AMD GPUs). For hipBLAS binaries, check the YellowRoseCx ROCm fork.
--usecublas [lowvram|normal] [main GPU ID] [mmq] [rowsplit]
- Example:
--usecublas lowvram 0 mmq rowsplit
lowvram
ornormal
lowvram
: Prevents offloading to the GPU the KV layers. Suitable for GPUs with limited memory.normal
: Default mode.
main GPU ID
: A number (e.g. 0, 1, 2, or 3) selecting a specific GPU. If not specified, all available GPUs will be used.mmq
: Uses “quantized matrix multiplication” during prompt processing instead of cuBLAS. This is slightly faster and uses slightly less memory for Q4_0, but is slower for K-quants. Generally, cuBLAS is faster but uses slightly more VRAM.rowsplit
: If multiple GPUs are being used, splitting occurs by “rows” instead of “layers,” which can be beneficial on some older GPUs.
- Can use
--flashattention
, which can be faster and more memory efficient.- If
--flashattention
is used--quantkv [level]
can also be used but “context shifting” will be disabled. Here, level 0=f16, 1=q8, 2=q4.
- If
Enables GPU acceleration using Vulkan, which is compatible with a broader range of GPUs and iGPUs. See more info at Vulkan GPU Info.
Device ID
: An integer specifying which GPU device to use. If not provided, it defaults to the first available Vulkan-compatible GPU.
--usevulkan [Device ID]
- Example:
--usevulkan 0
The program first determines how many layers are computed on the GPU(s) based on --gpulayers
. Those layers are split according to the --tensor_split
parameter. Layers not offloaded will be computed on the CPU. It is possible to specify --usecublas
, --usevulkan
, or --useclblast
and not specify --gpulayers
, in which case the prompt processing will occur on the GPU(s) but the per-token inference will not.
- By default, if no GPU IDs are specified after
--usecublas
or--usevulkan
, all compatible GPUs will be used and layers will be distributed equally.- NOTE: This can be bad if the GPUs are different sizes.
- Use
--tensor_split
to control the ratio, e.g.,--tensor_split 4 1
for an 80%/20% split on two GPUs. - The number of values in
--tensor_split
should match the total number of available GPUs.
- Don't use
--tensor_split
. However, you can still use--gpulayers
.
- If some (but not all) GPU IDs are provided after
--usecublas
or--usevulkan
, only those GPUs will be used for layer offloading. - Use
--tensor_split
to control the distribution ratio among the specified GPUs. - The number of values in
--tensor_split
should match the number of GPUs selected.- Example: With four GPUs available but only specifying the last two with
--usecublas 2 3
, using--tensor_split 1 1
would offload an equal amount of layers to the third and fourth GPUs but none to the first two.
- Example: With four GPUs available but only specifying the last two with
- Use
--usecublas
or--usevulkan
without specifying the GPU Ids, which makes available all GPUs for prompt processing. - Only assign layers to certain GPUs. Example: Using
--usecublas
and--tensor_split 5 0 3 2
will offload 50% of the layers to the first GPU, 30% to the third, and 20% to the fourth. However, the second GPU will still be available for other processing that doesn't require layers of the model.
--gpulayers
is supported by--useclblast
but--tensor_split
is not.
Enables GPU acceleration using CLBlast, based on OpenCL. Compatible with a wide range of GPUs including NVIDIA, AMD, Intel, and Intel iGPUs. More info can be found at CLBlast README.
Platform ID
: An integer between 0 and 8 (inclusive).Device ID
: An integer between 0 and 8 (inclusive).
--useclblast [Platform ID] [Device ID]
- Platform ID: An integer between 0 and 8 (inclusive).
- Device ID: An integer between 0 and 8 (inclusive).
- Both arguments are required.
- Example:
--useclblast 1 0
- The API instructions are unclear whether more than one compatible device can be specified. In any event,
--tensor_split
cannot be used.
- The API instructions are unclear whether more than one compatible device can be specified. In any event,
- Only used by CPU, not GPU.
- Enabled in Windows by default, but other platforms require a separate installation.
All BLAS acceleration (including OpenBLAS) can be disabled using --noblas
or --blasbatchsize -1
. Setting to -1 disables BLAS mode but retains other benefits like GPU offload.
Sets the batch size used in BLAS processing.
- Default: 512
- Options: -1, 32, 64, 128, 256, 512, 1024, 2048
Specifies the number of threads to use during BLAS processing.
- If not specified, it uses the same value as
--threads
. - If left blank, it will automatically set to a value slightly less than the CPU count.
- Recommendation: When running with full GPU offload, setting it to 1 thread may be sufficient.
Samplers determine how the AI selects the next token from a list of possible tokens. There are various samplers with different properties, but generally, you will only need a few.
- Controls the sequence in which samplers are applied to the list of token candidates when choosing the next token.
- Hardcoded into the source code as
[6,0,1,3,4,2,5]
to avoid poor outputs (0 = Top K 1 = Top P, 2 = Typical P, 3 = Top A, 4 = Min P, 5 = Temperature, 6 = TFS) - Don't change.
top_p
: 0.92rep_pen
: 1.1Temperature
: 0.7- Leave everything else disabled by default
- Top-K:
- Parameter:
top_k
- Function: Limits the number of possible words to the top K most likely options, removing everything else.
- Usage: Can be used with Top-P. Set value to 0 to disable its effect.
- Parameter:
- Top-A:
- Parameter:
top_a
- Function: Alternative to Top-P. Removes all tokens with a softmax probability less than
top_a * m^2
wherem
is the maximum softmax probability. - Usage: Set value to 0 to disable its effect.
- Parameter:
- Top-P:
- Parameter:
top_p
- Function: Discards unlikely text during sampling. Considers words with the highest cumulative probabilities summing up to P.
- Effect: Low values make the text predictable by removing uncommon tokens.
- Usage: Set value to 1 to disable its effect.
- Parameter:
- TFS (Top-Filter Sampling):
- Parameter:
tfs
- Function: Alternative to Top-P. Removes the least probable words from consideration, using second-order derivatives.
- Benefit: Can improve the quality and coherence of generated text.
- Parameter:
- Typical:
- Parameter:
typical_p
- Function: Selects words randomly with equal probability.
- Effect: Produces more diverse but potentially less coherent text.
- Usage: Set value to 1 to disable its effect.
- Parameter:
- Temperature:
- Parameter:
temperature
- Function: Controls the randomness of the output by scaling probabilities without removing options.
- Effect: Lower values produce more logical, less creative text.
- Parameter:
- Repetition Penalty:
- Parameter:
rep_pen
- Function: Applies a penalty to reduce the usage of recently used words, making the output less repetitive.
- Parameter:
- Controls the memory allocated for maximum context size. Adjust this if you need more RAM for larger contexts.
- Default: 4096
- Supported Values:
- 256, 512, 1024, 2048, 3072, 4096, 6144, 8192, 12288, 16384, 24576, 32768, 49152, 65536, 98304, 131072
- Warning: Use values outside the supported range at your own risk.
--contextsize [Value]
- Example:
--contextsize 8192
allocates memory for a context size of 8192.
Context Shifting is a better version of Smart Context that only works for GGUF models. This feature utilizes KV cache shifting to automatically remove old tokens from context and add new ones without requiring any reprocessing. It is on by default. To disable Context Shifting, use the flag --noshift
.
KoboldCpp now supports a variety of streaming options. Kobold Lite UI supports streaming out of the box, which can be toggled in Kobold Lite settings. Note: the --stream
parameter is now deprecated and should not be used.
- Polled-Streaming (Recommended):
- Default Method: Used by the Kobold Lite UI.
- Mechanism: Polls for updates on the
/api/extra/generate/check
endpoint every second. - Advantages: Relatively fast and simple to use.
- Drawback: Some may find it a bit "chunky" as it does not update instantaneously for every single token.
- Pseudo-Streaming:
- Status: An older method no longer recommended due to performance overheads.
- Usage with Kobold Lite: Enable streaming and append
&streamamount=x
to the end of the Lite URL, wherex
is the number of tokens per request. - Drawback: Has a negative performance impact.
- SSE (True Streaming):
- Supported by: A few third-party clients such as SillyTavern and Agnaistic, available only via the API.
- Mechanism: Provides instantaneous per-token updates.
- Requirements: Requires a persistent connection and special handling on the client side with SSE support.
- Usage: This mode is not used in Lite or the main KoboldAI client. It uses a different API endpoint, so configure it from your third-party client according to their provided instructions.