Added numa options to allow finer grained control as well as plumbing for a new mirror mode that will require numa.h #5377

Merged · Feb 16, 2024 · 46 commits
Changes from 18 commits

Commits
d919c6d
Added numa options to allow finer grained control as well as plumbing…
Feb 6, 2024
65792fa
Reverted Makefile
Feb 6, 2024
592e451
Fixed include
Feb 6, 2024
a69d6e2
Removed sched.h from ggml.h, moved ggml_get_numa_affinity into ggml.c…
Feb 6, 2024
60b80b0
removed trailing whitespace
Feb 6, 2024
7aa974d
Added numa options to allow finer grained control as well as plumbing…
Feb 6, 2024
12789eb
Reverting Makefile
Feb 6, 2024
c43808c
Fixed a number of issues with the move from BOOL to ggml_numa_strateg…
Feb 7, 2024
3eccea1
Syncing to pr
Feb 7, 2024
61c37ba
Removing MIRROR_MODE code for this PR
Feb 7, 2024
d47f232
Removing last bit of MIRROR_MODE code for this PR
Feb 7, 2024
783b7ca
Removing unneeded branch in server.cpp example and moving get_numa_af…
Feb 7, 2024
f156112
Merge branch 'ggerganov:master' into master
bmtwl Feb 8, 2024
12c23b6
Fixed lingering init_llama_backend() bool calls in tests and examples
Feb 8, 2024
18fb9a5
Merge branch 'ggerganov:master' into master
bmtwl Feb 8, 2024
90668fb
Merge branch 'ggerganov:master' into master
bmtwl Feb 8, 2024
b65c863
Remote enum llama_numa_strategies
Feb 8, 2024
7bbe511
Revert bad merge with dynatemp flags
Feb 8, 2024
314174d
add missing enum ggml_numa_strategies declaration and revert sync pro…
Feb 8, 2024
c2c3166
add missing enum ggml_numa_strategies declaration
Feb 8, 2024
fecd66a
Merge branch 'ggerganov:master' into master
bmtwl Feb 8, 2024
e107c4c
fixed ggml_init_numa variable
Feb 8, 2024
16b91d1
Merge branch 'master' of https://github.com/bmtwl/llama.cpp
Feb 8, 2024
99a203d
Update ggml.h
bmtwl Feb 8, 2024
6d34ad7
Merge branch 'master' of https://github.com/bmtwl/llama.cpp
Feb 8, 2024
87f8d9e
Merge branch 'ggerganov:master' into master
bmtwl Feb 13, 2024
5a94209
Merge branch 'master' of https://github.com/bmtwl/llama.cpp
Feb 13, 2024
9d42825
Update READMEs with info about numa flags, change INTERLEAVE strategy…
Feb 13, 2024
e37b8f0
Merge branch 'ggerganov:master' into master
bmtwl Feb 13, 2024
0e05042
Merge branch 'ggerganov:master' into master
bmtwl Feb 14, 2024
0fb40ae
split numa init out from llama_backend_init and created llama_numa_in…
Feb 14, 2024
c590bce
Merge branch 'ggerganov:master' into master
bmtwl Feb 14, 2024
a47bb69
Merge branch 'ggerganov:master' into master
bmtwl Feb 14, 2024
7fb5427
Fix up some boolean vs enum comparisons
Feb 14, 2024
e237527
Added #ifdefs for non-Linux OS that don't have cpu_set_t datatype
Feb 14, 2024
dc828c4
Update ggml.h
bmtwl Feb 15, 2024
4ffe18e
Update ggml.c
bmtwl Feb 15, 2024
1585fec
Update ggml.c
bmtwl Feb 15, 2024
c847828
Update examples/server/server.cpp
bmtwl Feb 15, 2024
377b58f
Update common/common.cpp
bmtwl Feb 15, 2024
5de34f5
Merge branch 'ggerganov:master' into master
bmtwl Feb 15, 2024
da65211
unified ggml_numa_strategy enum and fixed text alignment in server.cp…
Feb 15, 2024
7d1f026
Update ggml.c
bmtwl Feb 15, 2024
a5c9a5d
Merge branch 'ggerganov:master' into master
bmtwl Feb 15, 2024
a3cf7bf
removed redundant else from cli argument processing of --numa
Feb 15, 2024
26ea983
whitespace
cebtenzzre Feb 15, 2024
23 changes: 16 additions & 7 deletions common/common.cpp
@@ -670,7 +670,16 @@ bool gpt_params_parse_ex(int argc, char ** argv, gpt_params & params) {
} else if (arg == "--no-mmap") {
params.use_mmap = false;
} else if (arg == "--numa") {
params.numa = true;
if (++i >= argc) {
invalid_param = true;
break;
} else {
std::string value(argv[i]);
/**/ if (value == "interleave" || value == "" ) { params.numa = GGML_NUMA_STRATEGY_INTERLEAVE; }
else if (value == "isolate") { params.numa = GGML_NUMA_STRATEGY_ISOLATE; }
else if (value == "numactl") { params.numa = GGML_NUMA_STRATEGY_NUMACTL; }
else { invalid_param = true; break; }
}
} else if (arg == "--verbose-prompt") {
params.verbose_prompt = true;
} else if (arg == "--no-display-prompt") {
@@ -926,7 +935,7 @@ void gpt_print_usage(int /*argc*/, char ** argv, const gpt_params & params) {
printf(" -tb N, --threads-batch N\n");
printf(" number of threads to use during batch and prompt processing (default: same as --threads)\n");
printf(" -td N, --threads-draft N");
printf(" number of threads to use during generation (default: same as --threads)");
printf(" number of threads to use during generation (default: same as --threads)\n");
printf(" -tbd N, --threads-batch-draft N\n");
printf(" number of threads to use during batch and prompt processing (default: same as --threads-draft)\n");
printf(" -p PROMPT, --prompt PROMPT\n");
@@ -958,8 +967,6 @@ void gpt_print_usage(int /*argc*/, char ** argv, const gpt_params & params) {
printf(" --repeat-penalty N penalize repeat sequence of tokens (default: %.1f, 1.0 = disabled)\n", (double)sparams.penalty_repeat);
printf(" --presence-penalty N repeat alpha presence penalty (default: %.1f, 0.0 = disabled)\n", (double)sparams.penalty_present);
printf(" --frequency-penalty N repeat alpha frequency penalty (default: %.1f, 0.0 = disabled)\n", (double)sparams.penalty_freq);
printf(" --dynatemp-range N dynamic temperature range (default: %.1f, 0.0 = disabled)\n", (double)sparams.dynatemp_range);
printf(" --dynatemp-exp N dynamic temperature exponent (default: %.1f)\n", (double)sparams.dynatemp_exponent);
printf(" --mirostat N use Mirostat sampling.\n");
printf(" Top K, Nucleus, Tail Free and Locally Typical samplers are ignored if used.\n");
printf(" (default: %d, 0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0)\n", sparams.mirostat);
@@ -996,7 +1003,7 @@ void gpt_print_usage(int /*argc*/, char ** argv, const gpt_params & params) {
printf(" --winogrande-tasks N number of tasks to use when computing the Winogrande score (default: %zu)\n", params.winogrande_tasks);
printf(" --multiple-choice compute multiple choice score over random tasks from datafile supplied with -f\n");
printf(" --multiple-choice-tasks N number of tasks to use when computing the multiple choice score (default: %zu)\n", params.winogrande_tasks);
printf(" --kl-divergence computes KL-divergence to logits provided via --kl-divergence-base");
printf(" --kl-divergence computes KL-divergence to logits provided via --kl-divergence-base\n");
printf(" --keep N number of tokens to keep from the initial prompt (default: %d, -1 = all)\n", params.n_keep);
printf(" --draft N number of tokens to draft for speculative decoding (default: %d)\n", params.n_draft);
printf(" --chunks N max number of chunks to process (default: %d, -1 = all)\n", params.n_chunks);
@@ -1013,7 +1020,10 @@ void gpt_print_usage(int /*argc*/, char ** argv, const gpt_params & params) {
if (llama_supports_mmap()) {
printf(" --no-mmap do not memory-map model (slower load but may reduce pageouts if not using mlock)\n");
}
printf(" --numa attempt optimizations that help on some NUMA systems\n");
printf(" --numa TYPE attempt optimizations that help on some NUMA systems\n");
printf(" - interleave: (default) spread execution evenly over all nodes\n");
printf(" - isolate: only spawn threads on CPUs on the node that execution started on\n");
printf(" - numactl: use the CPU map provided my numactl\n");
printf(" if run without this previously, it is recommended to drop the system page cache before using this\n");
printf(" see https://github.com/ggerganov/llama.cpp/issues/1437\n");
if (llama_supports_gpu_offload()) {
@@ -1639,7 +1649,6 @@ void dump_non_result_info_yaml(FILE * stream, const gpt_params & params, const l
fprintf(stream, "no_mmap: %s # default: false\n", !params.use_mmap ? "true" : "false");
fprintf(stream, "no_mul_mat_q: %s # default: false\n", !params.mul_mat_q ? "true" : "false");
fprintf(stream, "no_penalize_nl: %s # default: false\n", !sparams.penalize_nl ? "true" : "false");
fprintf(stream, "numa: %s # default: false\n", params.numa ? "true" : "false");
fprintf(stream, "ppl_output_type: %d # default: 0\n", params.ppl_output_type);
fprintf(stream, "ppl_stride: %d # default: 0\n", params.ppl_stride);
fprintf(stream, "presence_penalty: %f # default: 0.0\n", sparams.penalty_present);
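For context, a minimal sketch of how the value parsed from --numa would typically reach the backend. This caller is not part of the diff and is only an illustration; it assumes the gpt_params_parse() wrapper from common.h and the llama_backend_init() signature shown later in the llama.h hunk.

#include "common.h"
#include "llama.h"

int main(int argc, char ** argv) {
    gpt_params params;
    if (!gpt_params_parse(argc, argv, params)) {
        return 1; // invalid arguments, e.g. an unknown --numa value
    }

    // params.numa now holds a ggml_numa_strategies value
    // (GGML_NUMA_STRATEGY_DISABLED unless --numa was given)
    llama_backend_init(params.numa);

    // ... load the model and run inference ...

    llama_backend_free();
    return 0;
}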
2 changes: 1 addition & 1 deletion common/common.h
@@ -76,6 +76,7 @@ struct gpt_params {
float yarn_beta_slow = 1.0f; // YaRN high correction dim
int32_t yarn_orig_ctx = 0; // YaRN original context length
int32_t rope_scaling_type = LLAMA_ROPE_SCALING_UNSPECIFIED;
ggml_numa_strategies numa = GGML_NUMA_STRATEGY_DISABLED;

// // sampling parameters
struct llama_sampling_params sparams;
@@ -134,7 +135,6 @@ struct gpt_params {
bool logits_all = false; // return logits for all tokens in the batch
bool use_mmap = true; // use mmap for faster loads
bool use_mlock = false; // use mlock to keep model in memory
bool numa = false; // attempt optimizations that help on some NUMA systems
bool verbose_prompt = false; // print prompt tokens before generation
bool display_prompt = true; // print prompt before generation
bool infill = false; // use infill mode
2 changes: 1 addition & 1 deletion examples/batched.swift/Sources/main.swift
@@ -17,7 +17,7 @@ let n_parallel: Int = arguments.count > 3 && Int(arguments[3]) != nil ? Int(argu
let n_len: Int = 32

// init LLM
llama_backend_init(false)
llama_backend_init(GGML_NUMA_STRATEGY_DISABLED)
defer {
llama_backend_free()
}
2 changes: 1 addition & 1 deletion examples/llama-bench/llama-bench.cpp
@@ -1151,7 +1151,7 @@ int main(int argc, char ** argv) {
if (!params.verbose) {
llama_log_set(llama_null_log_callback, NULL);
}
bool numa = false;
enum ggml_numa_strategies numa = GGML_NUMA_STRATEGY_DISABLED;
llama_backend_init(numa);

// initialize printer
2 changes: 1 addition & 1 deletion examples/llama.android/app/src/main/cpp/llama-android.cpp
@@ -274,7 +274,7 @@ Java_com_example_llama_Llm_new_1batch(JNIEnv *, jobject, jint n_tokens, jint emb

extern "C"
JNIEXPORT void JNICALL
Java_com_example_llama_Llm_backend_1init(JNIEnv *, jobject, jboolean numa) {
Java_com_example_llama_Llm_backend_1init(JNIEnv *, jobject, jint numa) {
llama_backend_init(numa);
}

2 changes: 1 addition & 1 deletion examples/llama.swiftui/llama.cpp.swift/LibLlama.swift
@@ -51,7 +51,7 @@ actor LlamaContext {
}

static func create_context(path: String) throws -> LlamaContext {
llama_backend_init(false)
llama_backend_init(GGML_NUMA_STRATEGY_DISABLED)
var model_params = llama_model_default_params()

#if targetEnvironment(simulator)
2 changes: 1 addition & 1 deletion examples/quantize/quantize.cpp
@@ -237,7 +237,7 @@ int main(int argc, char ** argv) {
params.imatrix = &imatrix_data;
}

llama_backend_init(false);
llama_backend_init(GGML_NUMA_STRATEGY_DISABLED);

// parse command line arguments
const std::string fname_inp = argv[arg_idx];
19 changes: 15 additions & 4 deletions examples/server/server.cpp
@@ -1820,7 +1820,10 @@ static void server_print_usage(const char *argv0, const gpt_params &params,
{
printf(" --no-mmap do not memory-map model (slower load but may reduce pageouts if not using mlock)\n");
}
printf(" --numa attempt optimizations that help on some NUMA systems\n");
printf(" --numa TYPE attempt optimizations that help on some NUMA systems\n");
printf(" - interleave: (default) spread execution evenly over all nodes\n");
printf(" - isolate: only spawn threads on CPUs on the node that execution started on\n");
printf(" - numactl: use the CPU map provided my numactl\n");
if (llama_supports_gpu_offload()) {
printf(" -ngl N, --n-gpu-layers N\n");
printf(" number of layers to store in VRAM\n");
@@ -2227,9 +2230,17 @@ static void server_params_parse(int argc, char **argv, server_params &sparams,
{
params.use_mmap = false;
}
else if (arg == "--numa")
{
params.numa = true;
else if (arg == "--numa") {
if (++i >= argc) {
invalid_param = true;
break;
} else {
std::string value(argv[i]);
/**/ if (value == "interleave" || value == "" ) { params.numa = GGML_NUMA_STRATEGY_INTERLEAVE; }
else if (value == "isolate") { params.numa = GGML_NUMA_STRATEGY_ISOLATE; }
else if (value == "numactl") { params.numa = GGML_NUMA_STRATEGY_NUMACTL; }
else { invalid_param = true; break; }
}
}
else if (arg == "--embedding")
{
2 changes: 1 addition & 1 deletion examples/tokenize/tokenize.cpp
@@ -17,7 +17,7 @@ int main(int argc, char ** argv) {

const bool printing_ids = argc > 3 && std::string(argv[3]) == "--ids";

llama_backend_init(false);
llama_backend_init(GGML_NUMA_STRATEGY_DISABLED);

llama_model_params model_params = llama_model_default_params();
model_params.vocab_only = true;
73 changes: 57 additions & 16 deletions ggml.c
@@ -1912,9 +1912,12 @@ struct ggml_numa_node {
};

struct ggml_numa_nodes {
uint32_t numa_strategy;
struct ggml_numa_node nodes[GGML_NUMA_MAX_NODES];
uint32_t n_nodes;
uint32_t total_cpus; // hardware threads on system
uint32_t current_node; // node on which main process is executing
cpu_set_t cpuset; // cpuset from numactl
};

//
@@ -1948,7 +1951,16 @@ inline static void ggml_critical_section_end(void) {
atomic_fetch_sub(&g_state_barrier, 1);
}

void ggml_numa_init(void) {
static cpu_set_t ggml_get_numa_affinity(void) {
cpu_set_t cpuset;
pthread_t thread;
thread = pthread_self();
CPU_ZERO(&cpuset);
pthread_getaffinity_np(thread, sizeof(cpu_set_t), &cpuset);
return cpuset;
}

void ggml_numa_init(uint32_t numa_flag) {
if (g_state.numa.n_nodes > 0) {
fprintf(stderr, "ggml_numa_init: NUMA already initialized\n");

@@ -1960,6 +1972,13 @@ void ggml_numa_init(void) {
char path[256];
int rv;

// set numa scheme
g_state.numa.numa_strategy = numa_flag;

GGML_PRINT_DEBUG("numa strategy %u\n",g_state.numa.numa_strategy);

g_state.numa.cpuset = ggml_get_numa_affinity();

// enumerate nodes
while (g_state.numa.n_nodes < GGML_NUMA_MAX_NODES) {
rv = snprintf(path, sizeof(path), "/sys/devices/system/node/node%u", g_state.numa.n_nodes);
@@ -1978,11 +1997,17 @@ void ggml_numa_init(void) {

GGML_PRINT_DEBUG("found %u numa nodes, %u CPUs\n", g_state.numa.n_nodes, g_state.numa.total_cpus);

if (g_state.numa.n_nodes < 1 || g_state.numa.total_cpus < 1) {
// figure out which node we're on
uint current_cpu;
int getcpu_ret = getcpu(&current_cpu, &g_state.numa.current_node);

if (g_state.numa.n_nodes < 1 || g_state.numa.total_cpus < 1 || getcpu_ret != 0) {
g_state.numa.n_nodes = 0;
return;
}

GGML_PRINT_DEBUG("found our process on numa node %u, CPU %u\n", g_state.numa.current_node, current_cpu);

for (uint32_t n = 0; n < g_state.numa.n_nodes; ++n) {
struct ggml_numa_node * node = &g_state.numa.nodes[n];
GGML_PRINT_DEBUG("CPUs on node %u:", n);
@@ -2470,8 +2495,7 @@ size_t ggml_get_max_tensor_size(const struct ggml_context * ctx) {
size_t max_size = 0;

for (struct ggml_tensor * tensor = ggml_get_first_tensor(ctx); tensor != NULL; tensor = ggml_get_next_tensor(ctx, tensor)) {
size_t bytes = ggml_nbytes(tensor);
max_size = MAX(max_size, bytes);
max_size = MAX(max_size, ggml_nbytes(tensor));
}

return max_size;
@@ -11888,10 +11912,8 @@ GGML_CALL void ggml_rope_yarn_corr_dims(
int n_dims, int n_orig_ctx, float freq_base, float beta_fast, float beta_slow, float dims[2]
) {
// start and end correction dims
float start = floorf(ggml_rope_yarn_corr_dim(n_dims, n_orig_ctx, beta_fast, freq_base));
float end = ceilf(ggml_rope_yarn_corr_dim(n_dims, n_orig_ctx, beta_slow, freq_base));
dims[0] = MAX(0, start);
dims[1] = MIN(n_dims - 1, end);
dims[0] = MAX(0, floorf(ggml_rope_yarn_corr_dim(n_dims, n_orig_ctx, beta_fast, freq_base)));
dims[1] = MIN(n_dims - 1, ceilf(ggml_rope_yarn_corr_dim(n_dims, n_orig_ctx, beta_slow, freq_base)));
}

static void ggml_compute_forward_rope_f32(
@@ -16587,21 +16609,41 @@ static void set_numa_thread_affinity(int thread_n, int n_threads) {
return;
}

// run thread on node_num thread_n / (threads per node)
const int node_num = thread_n / ((n_threads + g_state.numa.n_nodes - 1) / g_state.numa.n_nodes);
struct ggml_numa_node * node = &g_state.numa.nodes[node_num];
int node_num;
int rv;
size_t setsize = CPU_ALLOC_SIZE(g_state.numa.total_cpus);

switch(g_state.numa.numa_strategy) {
case GGML_NUMA_STRATEGY_INTERLEAVE:
// run thread on node_num thread_n / (threads per node)
node_num = thread_n / ((n_threads + g_state.numa.n_nodes - 1) / g_state.numa.n_nodes);
break;
case GGML_NUMA_STRATEGY_ISOLATE:
// run thread on current_node
node_num = g_state.numa.current_node;
break;
case GGML_NUMA_STRATEGY_NUMACTL:
// use the cpuset that numactl gave us
rv = pthread_setaffinity_np(pthread_self(), setsize, &g_state.numa.cpuset);
if (rv) {
fprintf(stderr, "warning: pthread_setaffinity_np() failed: %s\n",strerror(rv));
}
return;
default:
return;
}

struct ggml_numa_node * node = &g_state.numa.nodes[node_num];

cpu_set_t * cpus = CPU_ALLOC(g_state.numa.total_cpus);
CPU_ZERO_S(setsize, cpus);
for (size_t i = 0; i < node->n_cpus; ++i) {
CPU_SET_S(node->cpus[i], setsize, cpus);
}

int rv = pthread_setaffinity_np(pthread_self(), setsize, cpus);
rv = pthread_setaffinity_np(pthread_self(), setsize, cpus);
if (rv) {
fprintf(stderr, "warning: pthread_setaffinity_np() failed: %s\n",
strerror(rv));
fprintf(stderr, "warning: pthread_setaffinity_np() failed: %s\n",strerror(rv));
}

CPU_FREE(cpus);
@@ -16622,8 +16664,7 @@ static void clear_numa_thread_affinity(void) {

int rv = pthread_setaffinity_np(pthread_self(), setsize, cpus);
if (rv) {
fprintf(stderr, "warning: pthread_setaffinity_np() failed: %s\n",
strerror(rv));
fprintf(stderr, "warning: pthread_setaffinity_np() failed: %s\n",strerror(rv));
}

CPU_FREE(cpus);
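A quick worked example of the GGML_NUMA_STRATEGY_INTERLEAVE mapping used in set_numa_thread_affinity() above. This is a standalone sketch, not part of the diff; the thread and node counts are chosen only for illustration.

#include <stdio.h>

int main(void) {
    // assumed values for illustration only
    const int n_threads = 8;
    const int n_nodes   = 2;

    // same ceiling division as the interleave case above:
    // threads per node = (8 + 2 - 1) / 2 = 4
    const int per_node = (n_threads + n_nodes - 1) / n_nodes;

    for (int thread_n = 0; thread_n < n_threads; ++thread_n) {
        // threads 0-3 land on node 0, threads 4-7 on node 1
        printf("thread %d -> node %d\n", thread_n, thread_n / per_node);
    }
    return 0;
}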
14 changes: 12 additions & 2 deletions ggml.h
@@ -647,6 +647,16 @@ extern "C" {
void * wdata;
};

// numa strategies
enum ggml_numa_strategies {
GGML_NUMA_STRATEGY_DISABLED = 0,
GGML_NUMA_STRATEGY_INTERLEAVE = 1,
GGML_NUMA_STRATEGY_ISOLATE = 2,
GGML_NUMA_STRATEGY_NUMACTL = 3,
GGML_NUMA_STRATEGY_MIRROR = 4,
GGML_NUMA_STRATEGY_MAX_VALUE = GGML_NUMA_STRATEGY_MIRROR,
};

// misc

GGML_API void ggml_time_init(void); // call this once at the beginning of the program
@@ -657,8 +667,8 @@ extern "C" {

GGML_API void ggml_print_backtrace(void);

GGML_API void ggml_numa_init(void); // call once for better performance on NUMA systems
GGML_API bool ggml_is_numa(void); // true if init detected that system has >1 NUMA node
GGML_API void ggml_numa_init(uint32_t numa); // call once for better performance on NUMA systems
GGML_API bool ggml_is_numa(void); // true if init detected that system has >1 NUMA node

GGML_API void ggml_print_object (const struct ggml_object * obj);
GGML_API void ggml_print_objects(const struct ggml_context * ctx);
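A minimal sketch of using the new enum directly at the ggml level. In practice llama_backend_init() makes this call for you (see the llama.cpp hunk below), so this is only an illustration and assumes the ggml.h declarations from this PR.

#include "ggml.h"
#include <stdio.h>

int main(void) {
    // any strategy other than GGML_NUMA_STRATEGY_DISABLED triggers the
    // node/CPU enumeration inside ggml_numa_init()
    ggml_numa_init(GGML_NUMA_STRATEGY_INTERLEAVE);

    if (ggml_is_numa()) {
        printf("more than one NUMA node detected\n");
    }
    return 0;
}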
10 changes: 5 additions & 5 deletions llama.cpp
@@ -974,7 +974,7 @@ struct llama_mmap {
int fd = fileno(file->fp);
int flags = MAP_SHARED;
// prefetch/readahead impairs performance on NUMA systems
if (numa) { prefetch = 0; }
if (numa > 0) { prefetch = 0; }
#ifdef __linux__
// advise the kernel to read the file sequentially (increases readahead)
if (posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL)) {
@@ -995,7 +995,7 @@ struct llama_mmap {
strerror(errno));
}
}
if (numa) {
if (numa > 0) {
// advise the kernel not to use readahead
// (because the next page might not belong on the same node)
if (posix_madvise(addr, file->size, POSIX_MADV_RANDOM)) {
@@ -10548,7 +10548,7 @@ bool llama_mlock_supported(void) {
return llama_supports_mlock();
}

void llama_backend_init(bool numa) {
void llama_backend_init(enum ggml_numa_strategies numa) {
ggml_time_init();

// needed to initialize f16 tables
@@ -10558,8 +10558,8 @@ void llama_backend_init(bool numa) {
ggml_free(ctx);
}

if (numa) {
ggml_numa_init();
if (numa > 0) {
ggml_numa_init(numa);
}

#ifdef GGML_USE_MPI
2 changes: 1 addition & 1 deletion llama.h
@@ -304,7 +304,7 @@ extern "C" {
// Initialize the llama + ggml backend
// If numa is true, use NUMA optimizations
// Call once at the start of the program
LLAMA_API void llama_backend_init(bool numa);
LLAMA_API void llama_backend_init(enum ggml_numa_strategies numa);

// Call once at the end of the program - currently only used for MPI
LLAMA_API void llama_backend_free(void);
2 changes: 1 addition & 1 deletion tests/test-autorelease.cpp
@@ -12,7 +12,7 @@ int main(int argc, char ** argv) {
auto * model_path = get_model_or_exit(argc, argv);

std::thread([&model_path]() {
llama_backend_init(false);
llama_backend_init(GGML_NUMA_STRATEGY_DISABLED);
auto * model = llama_load_model_from_file(model_path, llama_model_default_params());
auto * ctx = llama_new_context_with_model(model, llama_context_default_params());
llama_free(ctx);