LLaMA NUMA could be better #1437
But this analysis alone is very informative - thank you for posting this. Might be interesting to see how some of the AVX-512 formats in #1073 perform with this NUMA fine-tuning.
Also I'm finding it interesting that hyper-threading is actually improving inference speeds in this example, since the general consensus in the past was that it degrades performance. Perhaps the high memory bandwidth is a factor here?
I have a 2S Xeon E5 2680 v4 with 256GB DDR4-2400. I also encountered this problem. It would be great to add NUMA support. I can't help with the code, I'm not a programmer, but I can test it on my machine if needed. Maybe try to use something like "gpu-layers", but for each NUMA node? Then the memory bandwidth would be fully utilized. Good luck to you guys.
Most generic motherboards are two-channel, with 1 NUMA node. I have a four-channel, 1-NUMA-node board, but that comes at a cost. If you can afford to care about NUMA, then GPUs are always going to provide better performance IMNSHO.
Why do your 2 NUMA and 4 NUMA have similar speeds?
So TR Pro 3995 shows only 1 NUMA node, 128 CPUs.
128 hypercores. 64 physical cores. I've observed using hypercores tends to slow things down. The theory behind hyperthreading is that if one thread is waiting on memory I/O another thread can be scheduled; however, if all threads are memory I/O bound then hyperthreading simply causes a lot of context switching overhead. Linux's …
LLaMA 65B-f16 is ~122GB. You can get 128GB of DDR4-2400 for around $150 and a 2S 8-channel Xeon E5 v4 to put it in for around $200. How much is any GPU with 128GB of RAM?
It's not surprising SMT usually doesn't help. On my 6-core Ryzen 5 5600G the "prompt eval" time goes up proportionally from -t 6 to -t 4 but the "eval" time stays the same. It takes -t 3 before it even goes up a little. Presumably completely memory bound, what are more threads going to do? But slower CPUs with more memory channels might actually get compute-bound sometimes.
The 2N and 4N systems both have 8 memory channels total. The 4N system is DDR4 instead of DDR3, so it has about 2/3rds more memory bandwidth, but because it's 4N, 3/4ths of random memory accesses will be off the local node instead of 1/2. Node interleave doesn't change that, it just stops all the accesses from being to the same node, which is a larger improvement on the 4N system.
Threadripper 3000 series sort of has two memory nodes but is effectively configured for node interleave: https://www.anandtech.com/show/15044/the-amd-ryzen-threadripper-3960x-and-3970x-review-24-and-32-cores-on-7nm/3 Epyc and older Threadripper (and basically anything with multiple sockets) show the NUMA nodes unless the user enables node interleave in the BIOS. Interleave is a bandwidth/latency trade-off, but turning it on at the system level prevents applications from doing any further optimization. The performance hierarchy for multi-threaded applications is: worst) threads on every node accessing data on a single node; better) node interleave; best) threads on each node accessing data on the local node. System-level interleave removes the possibility of the worst thing by removing the possibility of the best thing.
It looks like the biggest thing degrading performance on NUMA here is actually preload. The default memory policy is to prefer to load pages into the node of the thread that first accessed them. Call mmap() with MAP_POPULATE and the system will by default try to put the whole model on the node of the calling thread. Using 'numactl --interleave=all' changes the memory policy so that mmap stripes the allocation across all the nodes. We can get the same result by calling 'set_mempolicy(MPOL_INTERLEAVE, numa_all_nodes_ptr->maskp, numa_all_nodes_ptr->size)' at the beginning of main(), but set_mempolicy() requires -lnuma. (It's annoying that they put a syscall wrapper in a library that isn't installed by default. In theory we could use syscall(2) instead.)
But blind interleave still isn't optimal, it's just better than what's happening now. And we can get a similar result without -lnuma just by not using preload on NUMA systems. Then the mapping gets faulted in during first use by all the different work threads, which spreads it out across different nodes. Better yet, the way matrix multiplication is currently done by …
llama_print_timings: sample time = 399.99 ms / 512 runs ( 0.78 ms per token)
At least for the first batch of 512. After that it slows down some:
llama_print_timings: sample time = 1594.96 ms / 2048 runs ( 0.78 ms per token)
It could be that at the end of the first batch it generates some new soon-to-be frequently accessed data that all gets stuck on one node again. Currently the fastest thing to do is run once without preload or interleave so the model gets paged in with some relation to the access pattern, then after the OS has it in the page cache, use interleave so mutable state gets spread across nodes:
llama_print_timings: sample time = 1593.20 ms / 2048 runs ( 0.78 ms per token)
But that's only until we can figure out which data is causing the detriment without interleave. We can preferentially load data into any node we want without third party dependencies by setting thread affinity for the current thread to the cores on that node right before first access to that memory. Or if there is some specific data which inherently has to be accessed repeatedly by threads on multiple nodes, it may be faster to keep a copy of it on each node. I'm still trying to understand some of this code. Like this:
There are no operations while the lock is held?
These lock / unlock are actually noops (the code referenced is at lines 13683 to 13686 in d627025).
I am following your analysis with interest.
$ numactl --interleave=all ./main -n 512 -m models/65B/ggml-model-q4_0.bin --ignore-eos -p "Someone told me there is a configuration option called \"/proc/sys/vm/numa_interleave\" -- what" -t 32
$ ls /proc/sys/vm/numa_interleave
/proc/sys/kernel/numa_balancing is enabled by default. Let's try turning it off:
# echo 0 > /proc/sys/kernel/numa_balancing
Now I'm not seeing any further benefit from numactl --interleave. Once the model is no longer prefetched (0d23f8c), de facto turning numa_balancing off was apparently the source of numactl --interleave's remaining advantage.
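For reference, the in-process equivalent of numactl --interleave=all described a few comments up (the set_mempolicy call at the start of main) would look roughly like the following. This is a minimal sketch assuming libnuma (-lnuma) is available, not a patch against llama.cpp:

```c
// Minimal sketch, assuming libnuma (-lnuma): set an interleave policy from
// inside the program before the model is mmap'd / faulted in, which is what
// `numactl --interleave=all` does from the outside.
#include <numa.h>     // numa_available(), numa_all_nodes_ptr
#include <numaif.h>   // set_mempolicy(), MPOL_INTERLEAVE
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 0;
    }
    // Stripe all future allocations (including the model mapping) across
    // every node instead of letting them land on the calling thread's node.
    if (set_mempolicy(MPOL_INTERLEAVE,
                      numa_all_nodes_ptr->maskp,
                      numa_all_nodes_ptr->size) != 0) {
        perror("set_mempolicy");
    }
    // ... load the model and run inference as usual ...
    return 0;
}
```

The raw syscall route mentioned above, syscall(SYS_set_mempolicy, ...), would avoid the libnuma link dependency at the cost of building the node mask by hand.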
The trouble is you're threading something with a lot of serial dependencies in it, so the amount of work each thread can do before it has to synchronize with the other threads is small, and you have to do thread synchronization thousands of times a second.
I was fooled by ChatGPT's fictional method: it told me that "echo 1 > /proc/sys/vm/numa_interleave" was a kernel-level setting to use instead of "numactl --interleave=all", but that file doesn't actually exist. When I realized this, I deleted the previous comments. The method in your pull request should be the right solution. I am excited about your improvement. I'm assembling a NUMA computer with 2 CPUs and 8 channels of memory (E5-2698Bv3 x2 / DDR3 1866 32G x8), and I'll test it later.
Does anybody know if I can rent one of these in a cloud to test?
I couldn't find any with that exact CPU config. I'm going to try multiple processes with: https://github.com/huggingface/text-generation-inference
If there is enough memory, it might be faster to load the model weights into memory on both NUMA nodes using numa_alloc_onnode from libnuma, then bind half the threads to node #1, and the other half to node #2. That way there isn't much crosstalk during inference, but it does consume 2x more memory. If batch size > num_nodes, then there isn't any need for cross-node traffic at all (each generation can run in parallel). Edit: In simpler terms: load the model into memory twice. Once on node A, once on node B. Would that be doable?
There should also be a high-performance storage device to save time loading the model.
So, what's new in here? @zrm @ggerganov
I was considering nabbing a 2nd gen EPYC system for running models, but without better NUMA support that's not going to be anywhere near as effective as it could be. I've got a TB of DDR4 RAM that needs to be used in something...
NUMA support has been merged for a while: #1556
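For anyone landing here later, usage looks something like this (model path, prompt and thread count are placeholders; the --numa modes are the ones from the program's help output):

```
# if the model was previously loaded without --numa, drop the page cache first (as root)
echo 3 > /proc/sys/vm/drop_caches

# spread execution (and first-touch page placement) evenly over all nodes
./main -m models/7B/ggml-model-q4_0.bin -t 32 --numa distribute -p "..."

# or defer to an externally supplied numactl policy
numactl --interleave=all ./main -m models/7B/ggml-model-q4_0.bin -t 32 --numa numactl -p "..."
```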
Should I also be running with no mmap when using numa?
@ggerganov
That means we'll need to know how many NUMA nodes exist in llm_load_tensors, and which cpuset to assign in the set_numa_thread_affinity function... I don't know if there's a best way to do this, whether it's passing variables through structures/functions, or having some kind of mapping in hparams/g_state maybe? I was hoping to figure out something on my own as a challenge, but I'm burning a lot of time trying to puzzle my way around. Any hints as to where/how to apply these strategies, or ways that my thinking is fundamentally wrong, would be very appreciated. I think making this approach general, as opposed to another #ifdef custom backend, would be a net positive. Most of the new AMD chips above entry level are starting to expose CCXs as NUMA nodes for NUMA-aware apps. It is shaping up to be an industry trend, with a lot of chiplet-based chips on the horizon. Also, if you want to test any massively parallel code branches, I'm happy to run them on my rig and report results.
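For the "how many nodes / which cpuset" part, here is a rough sketch of the kind of helper that could be used, assuming libnuma; the function names below are hypothetical, not existing llama.cpp/ggml symbols:

```c
// Hypothetical helpers showing how a worker thread could be pinned to the
// CPUs of one NUMA node, so the pages it first-touches are allocated on that
// node. Build with -lnuma -pthread.
#define _GNU_SOURCE
#include <numa.h>
#include <pthread.h>
#include <sched.h>

// Pin the calling thread to every CPU that belongs to `node`.
static int bind_current_thread_to_node(int node) {
    struct bitmask *cpus = numa_allocate_cpumask();
    if (numa_node_to_cpus(node, cpus) != 0) {
        numa_bitmask_free(cpus);
        return -1;
    }
    cpu_set_t set;
    CPU_ZERO(&set);
    for (unsigned int i = 0; i < cpus->size; i++) {
        if (numa_bitmask_isbitset(cpus, i)) {
            CPU_SET(i, &set);
        }
    }
    numa_bitmask_free(cpus);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

// Example policy: thread i works (and faults pages in) on node i % n_nodes.
static int node_for_thread(int ith) {
    int n_nodes = numa_num_configured_nodes();
    return n_nodes > 0 ? ith % n_nodes : 0;
}
```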
To do that, you can check this example from https://github.com/ggerganov/whisper.cpp/pull/1763/files. Here, we limit each memory buffer to a maximum of 1GB of data, but it can be easily modified so that each buffer contains a single tensor. Hope this is enough to get things started, though I don't have a full vision of how to implement the entire thing.
@slaren Pinging you for help. I've been at this for over two months now, and have failed in my attempts to even force through a dirty mechanism just to prove what the performance benefit of memory locality would be; I've yet to produce a usable branch. The gap between my knowledge of the codebase and the surgery required to pull this off has led me to failure in dozens of attempts now.
If you have any questions just ask. For llama.cpp and other projects using ggml-backend, the memory of the tensors is allocated in a …
Thanks @slaren.
I'm really struggling to understand the relationship between a model, context (which seems to have multiple meanings depending on which part of the code?), tensors, cgraphs, plans, layers, bufs/bufts and other structs and variables. I finally tried to cheese it by straight up creating one model/context object per NUMA node and attempting to reference the right model's data based on the pthread's CPU affinity, but couldn't reason my way through the different structs and the ways they are transformed as the model/context tuple is passed down from main.cpp, into llama.cpp and eventually into ggml.c and the scheduled threads. Also, how to keep things consistent when re-integrating the hidden/shared context at the end so results are useful. Laying it all out like this, it seems like it should be so simple, but I'm unable to crack it, and I feel like I'm still failing to ask the right questions somehow. Does any of this make sense?
Buffers that are used for model weights have the flag …
That's not really feasible because all the threads work on the same tensors, just different slices of them. (See lines 547 to 669 in 54ea069.)
When using ggml-backend, as is the case in llama.cpp, …
Does that mean that my current approach of attempting to ensure memory locality at the buffer/tensor level is in vain? Or that I need to figure out a way to pre-determine which slices of the tensors will be worked on per thread? Is that related to layer or row splitting in any way?
Is it possible to allocate a contiguous amount of memory and assign different slices of it to different NUMA nodes?
As far as I am aware, no. Allocating on different nodes will result in non-contiguous memory addresses.
This was actually the mechanism I tried to make use of on my very first attempt (ggml_backend_numa_buffer_type). I found it difficult to get over the initial learning curve of putting together new versions of all the nested objects, structs and function calls needed to even get a toy version of a new backend to compile. Maybe I should attempt this again if it's the approach most likely to succeed?
I think you can do it with …
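Speculating on the truncated suggestion above: Linux's mbind(2) does allow binding page-aligned slices of one contiguous mapping to different nodes, so the buffer stays contiguous in virtual address space while its halves live on different nodes. A standalone sketch (link with -lnuma), not tied to any llama.cpp code:

```c
// Sketch: one contiguous anonymous mapping whose first half is bound to
// NUMA node 0 and second half to node 1 via mbind(2).
#include <numaif.h>    // mbind(), MPOL_BIND, MPOL_MF_MOVE
#include <sys/mman.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    long page  = sysconf(_SC_PAGESIZE);
    size_t half  = 1024 * (size_t)page;   // toy size: 1024 pages per node
    size_t total = 2 * half;

    void *buf = mmap(NULL, total, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    // Bind the first half to node 0 and the second half to node 1.
    unsigned long node0 = 1UL << 0;
    unsigned long node1 = 1UL << 1;
    if (mbind(buf,                half, MPOL_BIND, &node0, sizeof(node0) * 8, MPOL_MF_MOVE) != 0 ||
        mbind((char *)buf + half, half, MPOL_BIND, &node1, sizeof(node1) * 8, MPOL_MF_MOVE) != 0) {
        perror("mbind");   // fails with EINVAL on single-node systems
    }
    // First touch now places each half's pages on its bound node, while `buf`
    // remains a single contiguous allocation.
    munmap(buf, total);
    return 0;
}
```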
If models are divisible by rows then I could potentially use …
The CPU backend expects the tensors to be in a contiguous buffer. The only alternative is to write a new backend and implement a new matrix multiplication algorithm that can take the data in a different way. You can simply add a …
That's definitely one side of the problem taken care of... the other side is how to marry up the pages from each NUMA node to threads scheduled on the same node. I didn't see any obvious way to communicate this information in the existing code paths. Would a tensor-to-page map be workable?
Take a look at the implementation of …
@bmtwl 2 more questions related to the NUMA support. Can we add a fine-grained binding in …? The current binding binds the threads to nodes (DISTRIBUTE), the current node (ISOLATE), or the cpuset numactl gives to llama.cpp (NUMACTL). I found this sometimes causes high CPU usage in …
Yes, I think this is a very good idea, as memory locality from the caches all the way up is the key to high performance. I've had a similar scheme in mind for a while now, but I don't see a simple way to add that without other changes farther up the stack, as thread scheduling appears to be ad hoc, assigning work to threads as work appears for threads to do. Maybe @slaren has ideas for a low-impact way to do this in the codebase as it exists today? I think there will be an evolution towards moving the higher-level NUMA allocation and scheduling decisions out of ggml.c and into llama.cpp. That context feels like the best place to globally ensure optimal data and execution locality, but I'm still trying to acquire "the knowledge" of this project to the point where I can really understand where and how the changes should be introduced to both perform well and fit into the coding philosophy, so they stay maintainable/extensible long term. At a high level I think this means that a scheduling pipeline needs to be written to take all these factors into account, really before any inference starts, but that seems like an ambitious goal. I'm going to keep focused on my buffer-to-thread memory locality goal for now.
llm.c made me curious about OMP: https://github.com/karpathy/llm.c/blob/6396e393e319f899bb61ba53f8b70c22cf3b038b/Makefile#L17 OpenMP appears to only (maybe?) be enabled for Darwin and (?) Raspberry Pi. I see no reference to ggml_mpi_init anywhere. (Line 16 in 4cc120c.)
Thread affinity in Slurm looks promising, but we're already stumbling on IPC, which is the issue at hand I guess.
I think MPI has been broken for a long time. I got it working off an old branch as one of my first experiments way back when, but the performance wasn't good enough to pursue further.
yeah. I was really referencing OMP.
#pragma omp parallel for collapse(2) is easy to diagnose. I'll try that with llm.c.
Update to this thread (I had posted a few updates in the related PR #6915): even without changing any CPU affinity code there is a huge shift in performance characteristics. I get a lot better efficiency with a low number of threads (a 40% speedup over master at 16 threads), probably because I'm better able to saturate the interconnects evenly. Ramping up the number of threads doesn't improve performance to match master with mmap, though. My branch with the in-progress changes is at https://github.com/bmtwl/llama.cpp/tree/numamovepages, but it is very messy and specific to my setup for now.
After #6915 is merged, there won't be a fixed relationship; it will be dynamic. You will need to write a different scheduling mechanism.
ith = 0 is the main thread; it does most of the work for everything... except matrix multiplies. Most of the code is single-threaded because it checks if ith != 0 and returns. But very little time is spent in those operations. During matrix multiplies, ith lets you figure out which section of the data the thread is supposed to work on. Before 6915, there's a formula that maps each ith to a specific section of the output matrix; it iterates through that output and writes the data. After 6915, the threads basically check out an index of work to do. It'll need to use ith to figure out which NUMA node it needs to do work from. current_chunk (the shared counter for the work queue) could be made into an array the size of the max number of NUMA nodes. Each NUMA node could have its own queue, based on whatever data is local to it.
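A sketch of that per-node counter idea, with hypothetical names rather than the actual ggml state (the real code would use ggml's own atomics and chunk bookkeeping):

```c
// One atomic "next chunk" counter per NUMA node: each worker drains the
// counter for its own node before helping other nodes, so most chunks are
// processed by threads local to their data.
#include <stdatomic.h>
#include <stdbool.h>

#define MAX_NUMA_NODES 8

typedef struct {
    atomic_int next_chunk[MAX_NUMA_NODES];  // next unclaimed chunk per node
    int        n_chunks  [MAX_NUMA_NODES];  // chunks whose data lives on that node
    int        n_nodes;
} chunk_queues;

// Returns true and fills *node/*chunk with a claimed work item, or false when
// every per-node queue is exhausted. `my_node` is the node the caller is
// pinned to.
static bool claim_chunk(chunk_queues *q, int my_node, int *node, int *chunk) {
    for (int k = 0; k < q->n_nodes; k++) {
        int n = (my_node + k) % q->n_nodes;          // try the local node first
        int c = atomic_fetch_add(&q->next_chunk[n], 1);
        if (c < q->n_chunks[n]) {
            *node  = n;                               // steal from others only
            *chunk = c;                               // once local work is done
            return true;
        }
    }
    return false;
}
```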
I want to mention the importance of the QPI/UPI links between sockets, because with memory interleave they are the actual limiting factor on the hardware side.
llama.cpp is memory bound, let's see what has a lot of memory bandwidth:
NVIDIA V100 32GB: 900GB/s
2S Epyc 9000 (12xDDR5-4800/S): 922GB/s
NVIDIA A100 40GB: 1555GB/s
2S Xeon Max (HBM): 2TB/s
NVIDIA A100 80GB: 2TB/s
8S Xeon Scalable v4 (8xDDR5-4800/S): 2.45TB/s
NUMA systems have a lot of it because there are memory channels (or HBM for Xeon Max) on each socket. Okay, but the cheapest thing there is ~$6000. What if I'm not rich?
(~$350 w/ 16GB, max ~128GB) common PC (2xDDR4-3200): 51GB/s
(~$450 w/ 8GB, ~$600 w/ 16GB) Mac Mini M1: 68GB/s
(~$600 w/ 8GB, ~$800 w/ 16GB) Mac Mini M2: 100GB/s
(~$200 w/ 64GB, max ~768GB) 2S Xeon E5 v1 (4xDDR3-1600/S): 102GB/s [no F16C so f16 models slower]
(~$250 w/ 64GB, max ~768GB) 2S Xeon E5 v2 (4xDDR3-1866/S): 119GB/s
(~$350 w/ 128GB, max ~3000GB) 2S Xeon E5 v4 (4xDDR4-2400/S): 154GB/s
Hmm. Xeon E5-2690 v1 for $9 each on eBay. Let's see how we do.
$ lscpu
...
CPU(s): 16
On-line CPU(s) list: 0-15
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 1
NUMA node(s): 1
$ ./main -m models/7B/ggml-model-q4_0.bin -n 512 -p "Building a website can be done in 10 simple steps:"
...
llama_print_timings: sample time = 406.79 ms / 512 runs ( 0.79 ms per token)
llama_print_timings: prompt eval time = 27899.73 ms / 271 tokens ( 102.95 ms per token)
llama_print_timings: eval time = 74773.93 ms / 510 runs ( 146.62 ms per token)
Not terrible for 11-year-old hardware. Let's try it with two sockets:
$ lscpu
...
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 2
NUMA node(s): 2
$ ./main -m models/7B/ggml-model-q4_0.bin -n 512 -p "Building a website can be done in 10 simple steps:"
...
llama_print_timings: sample time = 438.34 ms / 512 runs ( 0.86 ms per token)
llama_print_timings: prompt eval time = 27083.17 ms / 271 tokens ( 99.94 ms per token)
llama_print_timings: eval time = 129373.98 ms / 510 runs ( 253.67 ms per token)
Twice as many cores, twice as much memory bandwidth, and it's slower.
Oh, get_num_physical_cores() is broken: it's only returning 8 of the 16 physical cores because "cpu cores" in /proc/cpuinfo is per-socket. I submitted a pull request.
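The gist of the fix, as a hypothetical standalone helper rather than the actual patch: count unique (physical id, core id) pairs instead of trusting the per-socket "cpu cores" field:

```c
// Count physical cores across all sockets by collecting unique
// (physical id, core id) pairs from /proc/cpuinfo.
#include <stdbool.h>
#include <stdio.h>

int count_physical_cores(void) {
    FILE *f = fopen("/proc/cpuinfo", "r");
    if (!f) return 0;

    bool seen[64][256] = {{false}};   // [physical id][core id], sized for a sketch
    int  phys = -1, count = 0;
    char line[256];

    while (fgets(line, sizeof(line), f)) {
        int v;
        if (sscanf(line, "physical id : %d", &v) == 1) {
            phys = v;                              // socket of the current block
        } else if (sscanf(line, "core id : %d", &v) == 1 && phys >= 0) {
            if (phys < 64 && v < 256 && !seen[phys][v]) {
                seen[phys][v] = true;
                count++;                           // new (socket, core) pair
            }
        }
    }
    fclose(f);
    return count;   // e.g. 16 on the 2S E5-2690 box above, not 8
}
```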
$ ./main -m models/7B/ggml-model-q4_0.bin -n 512 -p "Building a website can be done in 10 simple steps:" -t 16
...
llama_print_timings: sample time = 451.48 ms / 512 runs ( 0.88 ms per token)
llama_print_timings: prompt eval time = 16092.04 ms / 271 tokens ( 59.38 ms per token)
llama_print_timings: eval time = 102018.05 ms / 510 runs ( 200.04 ms per token)
Well, the prompt eval time is better. Maybe it benefits from hyperthreading?
$ ./main -m models/7B/ggml-model-q4_0.bin -n 512 -p "Building a website can be done in 10 simple steps:" -t 32
...
llama_print_timings: sample time = 399.47 ms / 512 runs ( 0.78 ms per token)
llama_print_timings: prompt eval time = 14734.68 ms / 271 tokens ( 54.37 ms per token)
llama_print_timings: eval time = 97250.82 ms / 510 runs ( 190.69 ms per token)
Still something's not right.
$ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
node 0 size: 96609 MB
node 0 free: 96320 MB
node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
node 1 size: 64506 MB
node 1 free: 60183 MB
node distances:
node 0 1
0: 10 20
1: 20 10
There it is. The whole model is loaded into the memory of one node. Let's try node interleave.
# echo 3 > /proc/sys/vm/drop_caches
$ numactl --interleave=0-1 ./main -m models/7B/ggml-model-q4_0.bin -n 512 -p "Building a website can be done in 10 simple steps:" -t 32
...
llama_print_timings: sample time = 397.83 ms / 512 runs ( 0.78 ms per token)
llama_print_timings: prompt eval time = 14894.56 ms / 271 tokens ( 54.96 ms per token)
llama_print_timings: eval time = 57045.66 ms / 510 runs ( 111.85 ms per token)
That's an improvement. Now it's >30% faster than a single socket and basically the same speed as my Ryzen 5 5600G from 2021, for about half the price. Let's see what happens on a machine with 4 NUMA nodes (16C/32T):
$ ./main -m models/7B/ggml-model-q4_0.bin -n 512 -p "Building a website can be done in 10 simple steps:" -t 16
...
llama_print_timings: sample time = 456.06 ms / 512 runs ( 0.89 ms per token)
llama_print_timings: prompt eval time = 13954.33 ms / 271 tokens ( 51.49 ms per token)
llama_print_timings: eval time = 108925.89 ms / 510 runs ( 213.58 ms per token)
$ ./main -m models/7B/ggml-model-q4_0.bin -n 512 -p "Building a website can be done in 10 simple steps:" -t 32
...
llama_print_timings: sample time = 514.30 ms / 512 runs ( 1.00 ms per token)
llama_print_timings: prompt eval time = 14288.35 ms / 271 tokens ( 52.72 ms per token)
llama_print_timings: eval time = 109354.09 ms / 510 runs ( 214.42 ms per token)
# echo 3 > /proc/sys/vm/drop_caches
$ numactl --interleave=0-3 ./main -m models/7B/ggml-model-q4_0.bin -n 512 -p "Building a website can be done in 10 simple steps:" -t 16
...
llama_print_timings: sample time = 477.99 ms / 512 runs ( 0.93 ms per token)
llama_print_timings: prompt eval time = 14164.87 ms / 271 tokens ( 52.27 ms per token)
llama_print_timings: eval time = 67402.83 ms / 510 runs ( 132.16 ms per token)
$ numactl --interleave=0-3 ./main -m models/7B/ggml-model-q4_0.bin -n 512 -p "Building a website can be done in 10 simple steps:" -t 32
...
llama_print_timings: sample time = 489.53 ms / 512 runs ( 0.96 ms per token)
llama_print_timings: prompt eval time = 14511.16 ms / 271 tokens ( 53.55 ms per token)
llama_print_timings: eval time = 48623.21 ms / 510 runs ( 95.34 ms per token)
125% faster is alright.
I can submit a pull request that does the same thing (with a dependency on libnuma) if you want it.
But this is not the best we can do. Interleave spreads the model across all nodes randomly and there is still heavy, slow cross-node memory access; it's just better than all the cores contending for the memory of one node.
The better way is to explicitly load 1/Nth of the model on each node and then have a thread pool per node which is assigned the operations on that subset of the model.
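Structurally, that proposal might look something like the sketch below, assuming libnuma, with placeholder sizes and names rather than real llama.cpp integration; the cross-node reduction that would be needed at layer boundaries is elided:

```c
// Place 1/Nth of the weights on each node and give each node its own pool of
// threads pinned to that node, so the weights a thread reads are always local.
// Build with -lnuma -pthread.
#include <numa.h>
#include <pthread.h>
#include <stdio.h>

#define MAX_NODES        8
#define THREADS_PER_NODE 4   // placeholder

typedef struct {
    int    node;
    void  *shard;        // this node's 1/Nth of the weights
    size_t shard_size;
} node_shard;

static void *node_worker(void *p) {
    node_shard *s = (node_shard *)p;
    numa_run_on_node(s->node);   // pin this thread (and its allocations) to its node
    // ... run only the operations that read s->shard ...
    return NULL;
}

int main(void) {
    if (numa_available() < 0) { fprintf(stderr, "no NUMA support\n"); return 1; }

    const size_t model_size = 256u << 20;           // placeholder: 256 MiB of "weights"
    int n_nodes = numa_num_configured_nodes();
    if (n_nodes > MAX_NODES) n_nodes = MAX_NODES;

    node_shard shards[MAX_NODES];
    pthread_t  tids[MAX_NODES * THREADS_PER_NODE];
    int        n_tids = 0;

    for (int node = 0; node < n_nodes; node++) {
        shards[node].node       = node;
        shards[node].shard_size = model_size / n_nodes;
        // allocate this node's slice of the weights from node-local memory
        shards[node].shard      = numa_alloc_onnode(shards[node].shard_size, node);

        for (int t = 0; t < THREADS_PER_NODE; t++) {
            pthread_create(&tids[n_tids++], NULL, node_worker, &shards[node]);
        }
    }
    for (int i = 0; i < n_tids; i++) {
        pthread_join(tids[i], NULL);
    }
    for (int node = 0; node < n_nodes; node++) {
        numa_free(shards[node].shard, shards[node].shard_size);
    }
    return 0;
}
```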