
Added options to the --numa flag for fine-grained control over execution #5358

Closed
wants to merge 0 commits into from

Conversation

@bmtwl (Contributor) commented Feb 6, 2024

Added four options to the --numa CLI flag:

interleave: the current scheme as-is; execute equally on all available threads on all available nodes
isolate: only execute threads on the current NUMA node, which stops cross-node traffic
numactl: inherit the NUMA environment passed in via the numactl utility, allowing fine-grained execution control
mirror: mirror the GGUF to all NUMA nodes to improve system bandwidth for inference (not implemented yet, hidden behind #ifdefs)

(also added a couple of missing \n to the help text)
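
For a rough sense of how these modes could be represented internally, here is an illustrative C sketch. The enum values and the parsing helper below are hypothetical names chosen for this example, not the identifiers used in the patch; the only assumption taken from the change itself is that ggml_numa_init() accepts a uint32_t strategy value.

#include <stdint.h>
#include <string.h>

// Hypothetical strategy values for the --numa flag (illustrative only).
enum numa_strategy_example {
    NUMA_EXAMPLE_DISABLED   = 0,
    NUMA_EXAMPLE_INTERLEAVE = 1, // spread work across all nodes (previous --numa behaviour)
    NUMA_EXAMPLE_ISOLATE    = 2, // keep threads on the current node only
    NUMA_EXAMPLE_NUMACTL    = 3, // inherit the affinity set up by numactl
    NUMA_EXAMPLE_MIRROR     = 4, // mirror the model to every node (not implemented in this PR)
};

// Map the string given to --numa onto a numeric strategy value that could be
// passed through to ggml_numa_init(uint32_t).
static uint32_t parse_numa_arg_example(const char * arg) {
    if (strcmp(arg, "interleave") == 0) return NUMA_EXAMPLE_INTERLEAVE;
    if (strcmp(arg, "isolate")    == 0) return NUMA_EXAMPLE_ISOLATE;
    if (strcmp(arg, "numactl")    == 0) return NUMA_EXAMPLE_NUMACTL;
    if (strcmp(arg, "mirror")     == 0) return NUMA_EXAMPLE_MIRROR;
    return NUMA_EXAMPLE_DISABLED;
}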

@ggerganov (Owner)

Can you provide some sample commands that you use and the performance results that you observe? This way people can try to reproduce these findings and get a feeling for what improvements we are looking at.

ggml.h Outdated
GGML_API void ggml_numa_init(uint32_t numa);     // call once for better performance on NUMA systems
GGML_API bool ggml_is_numa(void);                // true if init detected that system has >1 NUMA node
GGML_API cpu_set_t ggml_get_numa_affinity(void); // get cpuset from numactl
@ggerganov (Owner)


No need to expose this in the public API. Also remove the <sched.h> header from ggml.h
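
A minimal sketch of one way to address that, assuming the helper moves into ggml.c as a translation-unit-local function so that neither <sched.h> nor ggml_get_numa_affinity() needs to appear in the public header (this is an illustration, not the actual patch):

// ggml.c (sketch): keep the numactl affinity lookup private to this file.
#define _GNU_SOURCE
#include <sched.h>

static cpu_set_t ggml_get_numa_affinity(void) {
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    // Inherit whatever affinity mask the process was started with
    // (for example the one installed by numactl).
    sched_getaffinity(0, sizeof(cpu_set_t), &cpuset);
    return cpuset;
}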

@bmtwl (Contributor, Author) commented Feb 6, 2024

Can you provide some sample commands that you use and the performance results that you observe? This way people can try to reproduce these findings and get a feeling for what improvements we are looking at.

I don't expect much in the way of large speedups until I start looking at ensuring memory locality, but there are still gains even with just this patch. The main advantage is that we are able to control where the threads execute with a high level of granularity, which may be very useful on larger systems with complicated interconnect structures.
Here is an example run with numactl forcing the patched branch to execute entirely on one NUMA node, versus the unpatched master branch running the same command (which is equivalent to "--numa interleave" after patching). Caches were dropped before each run:

numactl -N0 -m0 ./main -m /opt/text-generation-webui/models/miqu-70b-q5/miqu-1-70b.q5_K_M.gguf -p "Hello" -n 32 -t 32 --no-mmap -b 65535 -c 4096 -np 4096 -ns 65535 -cb --numa

llama_print_timings: load time = 21958.00 ms
llama_print_timings: sample time = 4.79 ms / 32 runs ( 0.15 ms per token, 6676.40 tokens per second)
llama_print_timings: prompt eval time = 269.72 ms / 2 tokens ( 134.86 ms per token, 7.42 tokens per second)
llama_print_timings: eval time = 6280.50 ms / 31 runs ( 202.60 ms per token, 4.94 tokens per second)
llama_print_timings: total time = 6564.18 ms / 33 tokens

./main -m /opt/text-generation-webui/models/miqu-70b-q5/miqu-1-70b.q5_K_M.gguf -p "Hello" -n 32 -t 32 --no-mmap -b 65535 -c 4096 -np 4096 -ns 65535 -cb --numa

llama_print_timings: load time = 19808.41 ms
llama_print_timings: sample time = 4.68 ms / 32 runs ( 0.15 ms per token, 6834.69 tokens per second)
llama_print_timings: prompt eval time = 372.62 ms / 2 tokens ( 186.31 ms per token, 5.37 tokens per second)
llama_print_timings: eval time = 8886.55 ms / 31 runs ( 286.66 ms per token, 3.49 tokens per second)
llama_print_timings: total time = 9272.88 ms / 33 tokens
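
Going by the eval-time figures, confining the run to a single node raises generation throughput from about 3.49 to about 4.94 tokens per second on this machine, roughly a 40% speedup.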
