ggml : add graph tensor allocator #2411
Conversation
What prevents CUDA and OpenCL from using it?
The CUDA backend has its own memory management for the compute buffer, and this change would interfere with it. At the very least, the code referenced at lines 3930 to 3932 (commit 1a94186) would no longer work.
There may be other cases. For OpenCL, after reviewing it again, it could work. Usually the OpenCL backend works in the same way as the CUDA backend, but in this case it seems that it doesn't use any kind of VRAM compute buffer, so it shouldn't be affected by this change.
…ing a view with an offset
Android users benefit a lot from this. For my usage, I use batch size 10, so this is fantastic.
Memory usage comparison (compute buffers only): [chart comparing master and PR; memory usage reduced]
Cool stuff 🦙
I've reviewed the ggml.c and llama.cpp changes. Will take a deeper look at ggml-alloc.c later and also try to run this with Metal. It's OK to merge as it is, since it is well isolated and backwards compatible.
ggml-alloc.h
GGML_API bool ggml_allocator_is_measure(struct ggml_allocator * alloc);
GGML_API void ggml_allocator_reset(struct ggml_allocator * alloc);
GGML_API void ggml_allocator_alloc_tensor(struct ggml_allocator * alloc, struct ggml_tensor * tensor);
GGML_API size_t ggml_allocator_alloc_graph_tensors(struct ggml_allocator * alloc, struct ggml_cgraph * graph);
Maybe go with a shorter prefix: ggml_alloc
ggml_alloc_alloc_tensor or just ggml_alloc_tensor?
How about ggml_alloc_new_tensor?
new_tensor doesn't sound right to me because it only allocates data for the tensor; it doesn't really create a new tensor. I went with the allocr prefix so that the difference between "allocator" and "allocate" is a bit clearer.
int n_tokens,
int n_past,
int n_threads,
const char * cgraph_fname) {
Not for this PR - just a note for later: the cgraph_fname export is no longer needed. I don't see it being useful anytime soon, so we should delete everything related to it to clean up a bit.
# define ggml_diag_mask_inf_inplace ggml_diag_mask_inf
# define ggml_soft_max_inplace ggml_soft_max
params.no_alloc = true;
#endif
What happens if the ops are inplace?
The problem with inplace ops is that the allocator needs to know which tensors are views of another tensor to be able to tell when it is safe to free a tensor - i.e. a tensor cannot be freed as long as there are views of it. inplace ops are effectively a view of their parent, and the allocator is not able to detect that.

In this case, however, it wouldn't really matter: since the context is no_alloc, the parent's data will be NULL, so the "inplace" op will also have a NULL pointer, and the result would be indistinguishable from a normal op. So these defines could be removed. The allocator could also try harder to determine whether an op is inplace by checking if its data is the same as one of its parents'. The code could also be simplified a bit if we simply added a pointer to the source of the view in ggml_tensor, though.

The allocator will already make all suitable ops inplace automatically to save memory.
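For illustration, a minimal sketch (not code from this PR) of that pointer-comparison heuristic; tensor_aliases_parent is a hypothetical helper, and it assumes the tensor stores its parents in a src[] array of size GGML_MAX_SRC as in current ggml:

static bool tensor_aliases_parent(const struct ggml_tensor * t) {
    if (t->data == NULL) {
        return false; // no_alloc context: data not assigned yet
    }
    for (int i = 0; i < GGML_MAX_SRC; i++) {
        if (t->src[i] != NULL && t->src[i]->data == t->data) {
            return true; // shares storage with a parent, so it is effectively a view
        }
    }
    return false; // data owned by this tensor (or allocated elsewhere)
}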
Android, loading a 7B model with 2048 context, batch 7. Master: [screenshot] PR: [screenshot] 👍
cleanup ggml-ci
Does this PR break Metal compilation?
❯ make clean && LLAMA_METAL=1 make -j
I llama.cpp build info:
I UNAME_S: Darwin
I UNAME_P: arm
I UNAME_M: arm64
I CFLAGS: -I. -O3 -std=c11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -pthread -DGGML_USE_K_QUANTS -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS
I LDFLAGS: -framework Accelerate
I CC: Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX: Apple clang version 14.0.3 (clang-1403.0.22.14.1)
rm -vf *.o *.so *.dll main quantize quantize-stats perplexity embedding benchmark-matmult save-load-state server simple vdot train-text-from-scratch embd-input-test build-info.h tests/test-double-float tests/test-grad0 tests/test-opt tests/test-quantize-fns tests/test-quantize-perf tests/test-sampling tests/test-tokenizer-0
common.o
ggml-alloc.o
ggml.o
grammar-parser.o
k_quants.o
llama.o
libembdinput.so
main
quantize
quantize-stats
perplexity
embedding
server
simple
vdot
train-text-from-scratch
embd-input-test
build-info.h
I llama.cpp build info:
I UNAME_S: Darwin
I UNAME_P: arm
I UNAME_M: arm64
I CFLAGS: -I. -O3 -std=c11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -pthread -DGGML_USE_K_QUANTS -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DGGML_METAL_NDEBUG
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL
I LDFLAGS: -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
I CC: Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX: Apple clang version 14.0.3 (clang-1403.0.22.14.1)
cc -I. -O3 -std=c11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -pthread -DGGML_USE_K_QUANTS -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DGGML_METAL_NDEBUG -c ggml.c -o ggml.o
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL -c llama.cpp -o llama.o
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL -c examples/common.cpp -o common.o
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL -c examples/grammar-parser.cpp -o grammar-parser.o
cc -I. -O3 -std=c11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -pthread -DGGML_USE_K_QUANTS -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DGGML_METAL_NDEBUG -c -o k_quants.o k_quants.c
cc -I. -O3 -std=c11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -pthread -DGGML_USE_K_QUANTS -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DGGML_METAL_NDEBUG -c ggml-metal.m -o ggml-metal.o
cc -I. -O3 -std=c11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -pthread -DGGML_USE_K_QUANTS -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DGGML_METAL_NDEBUG -c ggml-alloc.c -o ggml-alloc.o
examples/common.cpp:569:122: warning: format specifies type 'int' but the argument has type 'size_t' (aka 'unsigned long') [-Wformat]
fprintf(stdout, " --hellaswag-tasks N number of tasks to use when computing the HellaSwag score (default: %d)\n", params.hellaswag_tasks);
~~ ^~~~~~~~~~~~~~~~~~~~~~
%zu
llama.cpp:1828:50: error: use of undeclared identifier 'cur'
ggml_metal_get_tensor (lctx.ctx_metal, cur);
^
1 error generated.
make: *** [llama.o] Error 1
make: *** Waiting for unfinished jobs....
1 warning generated.
Yes.
I have submitted a fix in PR #2455; let me know if it works.
Awesome! Fixed the compilation error. Thanks for your great work!
@slaren This commit broke OpenCL memory management.
@Tungsten842 I do not see how this assert could happen with OpenCL, so it may be related to the library or fork that you are using. If you are convinced that there is a bug here, please open a new issue and provide instructions to reproduce it with the tools in this repository.
Fixes ggerganov/ggml#288
Allocates tensors based on their usage in the computation graph and replaces scratch buffers. Depending on the batch size and context length, this reduces the size of the compute buffers usually by 2x, and up to 10x when using a lower batch size.

For example, currently the size of the compute and scratch buffers for 7B with n_ctx=2048, n_batch=512 is 398 MB. With this change, it is 153 MB instead, or 38 MB with n_batch=128.

The implementation is mostly self-contained. It is used by creating a ggml_context with the no_alloc flag, and then calling ggml_allocator_alloc_graph_tensors to allocate the tensors (a usage sketch follows below).

Not compatible with the GPU backends, only with CPU. It could possibly be used with Metal, but it would require disabling concurrent execution.
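A hedged usage sketch based on the description above, not code from the PR. The allocator constructor is not shown in the quoted header, so ggml_allocator_new here is a hypothetical name; the other calls match the declarations quoted earlier in the review.

#include "ggml.h"
#include "ggml-alloc.h"

static void example(void) {
    // no_alloc: the context only records tensor metadata; tensor data
    // pointers stay NULL until the allocator assigns them.
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16*1024*1024, // metadata only, so this can be small
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ true,
    };
    struct ggml_context * ctx = ggml_init(params);

    // Build a graph as usual; no tensor data is allocated yet.
    struct ggml_tensor * a = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1024);
    struct ggml_tensor * b = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1024);
    struct ggml_cgraph gf = ggml_build_forward(ggml_add(ctx, a, b));

    // Hypothetical constructor: the compute buffer would come from the
    // application (possibly sized by a prior measure pass, cf.
    // ggml_allocator_is_measure above).
    static unsigned char buf[1024*1024];
    struct ggml_allocator * alloc = ggml_allocator_new(buf, sizeof(buf));

    // Walks the graph in execution order, placing each tensor in the
    // buffer and reusing memory once a tensor's last consumer has run.
    ggml_allocator_alloc_graph_tensors(alloc, &gf);

    ggml_free(ctx);
}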