ggml : add graph tensor allocator #2411

Merged
merged 13 commits into from
Jul 30, 2023


slaren
Collaborator

@slaren slaren commented Jul 26, 2023

Fixes ggerganov/ggml#288

Allocates tensors based on their usage in the computation graph and replaces scratch buffers. Depending on the batch size and context length, reduces the size of the compute buffers usually by 2x, up to 10x when using a lower batch size.

For example, currently the size of the compute and scratch buffers for 7B with n_ctx=2048, n_batch=512 is 398 MB. With this change, it is 153 MB instead, or 38 MB with n_batch=128.

The implementation is mostly self-contained. It is used by creating a ggml_context with the no_alloc flag, and then calling ggml_allocator_alloc_graph_tensors to allocate the tensors.
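To illustrate why allocating by graph usage can shrink the compute buffer, here is a minimal toy model, not the actual ggml implementation. It assumes each graph node produces one tensor and that we know the index of the last node that reads it; the names `naive_total`, `peak_live_bytes`, and `last_use` are hypothetical and do not exist in ggml. The peak of live bytes is only a lower bound on the buffer a real allocator would need, since fragmentation is ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model (hypothetical, not ggml API): node i produces one tensor of
 * size[i] bytes, and last_use[i] is the index of the last node reading it. */

/* Baseline: every tensor gets its own memory for the whole graph. */
static size_t naive_total(const size_t *size, int n) {
    size_t total = 0;
    for (int i = 0; i < n; i++) {
        total += size[i];
    }
    return total;
}

/* Usage-based: walk the nodes in execution order and release a tensor
 * as soon as its last consumer has run. Returns the peak of live bytes. */
static size_t peak_live_bytes(const size_t *size, const int *last_use, int n) {
    size_t live = 0;
    size_t peak = 0;
    for (int i = 0; i < n; i++) {
        live += size[i];               /* output of node i becomes live */
        if (live > peak) {
            peak = live;
        }
        for (int j = 0; j <= i; j++) { /* free tensors last read by node i */
            if (last_use[j] == i) {
                live -= size[j];
            }
        }
    }
    return peak;
}
```

For a chain of four 100-byte tensors where each one is read only by the next node, `naive_total` gives 400 bytes while `peak_live_bytes` gives 200, because at most two tensors are live at once.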

Not compatible with the GPU backends, only with CPU. It could possibly be used with Metal, but it would require disabling concurrent execution.

@ggerganov
Owner

Not compatible with the GPU backends

What prevents CUDA and OpenCL from using it?

@slaren
Collaborator Author

slaren commented Jul 27, 2023

The CUDA backend has its own memory management for the compute buffer, and this change would interfere with it. At least, this wouldn't work anymore because the data of all tensors is NULL until the graph is built and the allocator is called:

llama.cpp/ggml-cuda.cu

Lines 3930 to 3932 in 1a94186

const bool inplace = (tensor->src[0] != nullptr && tensor->src[0]->data == tensor->data) ||
tensor->op == GGML_OP_VIEW ||
force_inplace;

There may be other cases.
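One plausible reading of the problem with the quoted check, sketched under stated assumptions: if every tensor's data pointer is NULL until the allocator runs, then comparing a tensor's pointer with its source's pointer is true for any tensor that has a source. The struct `toy_tensor` and the function `looks_inplace` below are hypothetical stand-ins that only mirror the shape of the condition; they are not the CUDA backend code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a tensor; only the fields the check needs. */
struct toy_tensor {
    void *data;
    struct toy_tensor *src0;
    bool is_view;
};

/* Same shape as the quoted condition: a tensor is treated as inplace
 * when it shares its data pointer with its first source. */
static bool looks_inplace(const struct toy_tensor *t, bool force_inplace) {
    return (t->src0 != NULL && t->src0->data == t->data) ||
           t->is_view ||
           force_inplace;
}
```

With both pointers still NULL, `looks_inplace` returns true even though the op is not inplace; once the two tensors have distinct buffers it returns false.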

For OpenCL, after reviewing it again, it could work. Usually the OpenCL backend works in the same way as the CUDA backend, but in this case it seems that it doesn't use any kind of VRAM compute buffer, so it shouldn't be affected by this change.

@ghost

ghost commented Jul 27, 2023

For example, currently the size of the compute and scratch buffers for 7B with n_ctx=2048, n_batch=512 is 398 MB. With this change, it is 153 MB instead, or 38 MB with n_batch=128.

Android users benefit a lot from this. For my usage, I use batch size 10, so this is fantastic.

@slaren slaren marked this pull request as ready for review July 27, 2023 16:56
@slaren
Collaborator Author

slaren commented Jul 28, 2023

Memory usage comparison (compute buffers only):

| Model | n_ctx | n_batch | Master | PR | Factor |
|---|---|---|---|---|---|
| LLaMA2 70B | 512 | 512 | 565 MB | 145.35 MB | 3.9 |
| LLaMA2 70B | 1024 | 512 | 638 MB | 177.35 MB | 3.6 |
| LLaMA2 70B | 2048 | 512 | 784 MB | 305.35 MB | 2.6 |
| LLaMA2 70B | 4096 | 512 | 1077 MB | 561.35 MB | 1.9 |
| LLaMA2 70B | 8192 | 512 | 1662 MB | 1073.35 MB | 1.5 |
| LLaMA2 70B | 16384 | 512 | 2832 MB | 2097.35 MB | 1.4 |

Reduced n_batch:

| Model | n_ctx | n_batch | Master | PR | Factor |
|---|---|---|---|---|---|
| LLaMA2 70B | 2048 | 256 | 784 MB | 153.35 MB | 5.1 |
| LLaMA2 70B | 2048 | 128 | 784 MB | 77.34 MB | 10.1 |
| LLaMA2 70B | 2048 | 64 | 784 MB | 39.34 MB | 19.9 |
| LLaMA2 70B | 2048 | 32 | 784 MB | 20.34 MB | 38.5 |
| LLaMA2 70B | 2048 | 16 | 784 MB | 10.84 MB | 72.3 |
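The Factor column appears to be the Master size divided by the PR size. A small sketch of that arithmetic follows; the helper names `reduction_factor` and `mb_per_batch_token` are made up for illustration. The second helper reflects an observation from the reduced-batch rows, where the PR buffer size divided by n_batch stays in a fairly narrow range, which suggests (but does not prove) roughly linear scaling with n_batch at this context size.

```c
#include <assert.h>

/* Ratio of the old compute buffer size to the new one (both in MB). */
static double reduction_factor(double master_mb, double pr_mb) {
    return master_mb / pr_mb;
}

/* Compute buffer size per batch token, in MB (illustrative only). */
static double mb_per_batch_token(double pr_mb, int n_batch) {
    return pr_mb / (double) n_batch;
}
```

For the n_batch=128 row, 784 / 77.34 is about 10.1, which matches the table.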

Owner

@ggerganov ggerganov left a comment


Cool stuff 🦙

I've reviewed the ggml.c and llama.cpp changes.
Will take a deeper look in ggml-alloc.c later and also try to run this with Metal.

It's OK to merge as it is since it is well isolated and backwards compatible

ggml-alloc.h Outdated
GGML_API bool ggml_allocator_is_measure(struct ggml_allocator * alloc);
GGML_API void ggml_allocator_reset(struct ggml_allocator * alloc);
GGML_API void ggml_allocator_alloc_tensor(struct ggml_allocator * alloc, struct ggml_tensor * tensor);
GGML_API size_t ggml_allocator_alloc_graph_tensors(struct ggml_allocator * alloc, struct ggml_cgraph * graph);
Owner


Maybe go with a shorter prefix: ggml_alloc

Collaborator Author


ggml_alloc_alloc_tensor or just ggml_alloc_tensor?

Owner


How about ggml_alloc_new_tensor?

Collaborator Author

@slaren slaren Jul 30, 2023


new_tensor doesn't sound right to me because it only allocates data for the tensor, it doesn't really create a new tensor.

I went with the allocr prefix so that the difference between "allocator" and "allocate" is a bit clearer.

int n_tokens,
int n_past,
int n_threads,
const char * cgraph_fname) {
Owner


Not for this PR, just a note for later: the cgraph_fname export is no longer needed. I don't see it being useful anytime soon, so we should delete everything related to it to clean up a bit.

# define ggml_diag_mask_inf_inplace ggml_diag_mask_inf
# define ggml_soft_max_inplace ggml_soft_max
params.no_alloc = true;
#endif
Owner


What happens if the ops are inplace?

Collaborator Author

@slaren slaren Jul 28, 2023


The problem with inplace ops is that the allocator needs to know which tensors are views of another tensor in order to know when it is safe to free a tensor, i.e. a tensor cannot be freed as long as there are views of it. Inplace ops are effectively a view of their parent, and the allocator is not able to detect that.

In this case however, it wouldn't really matter since the context is no_alloc, the parent's data will be NULL, so the "inplace" op will also have a NULL pointer, and the result would be indistinguishable from a normal op. So these defines could be removed. The allocator could also try harder to determine if an op is inplace by checking if its data is the same as one of its parents. The code could also be simplified a bit if we simply added a pointer to the source of the view in ggml_tensor, though.

The allocator will already make all suitable ops inplace automatically to save memory.
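The rule described above (a tensor cannot be freed while views of it exist) can be sketched with two counters per tensor. This is a toy illustration under assumptions: the struct `toy_tensor`, its fields `n_children`, `n_views`, and `view_src`, and the helper functions are hypothetical and are not taken from ggml-alloc.c. The `view_src` pointer corresponds to the idea of storing a pointer to the source of the view in the tensor.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bookkeeping, not the ggml structs. */
struct toy_tensor {
    int n_children;              /* nodes that still need to read this tensor */
    int n_views;                 /* live views aliasing this tensor's memory */
    struct toy_tensor *view_src; /* non-NULL if this tensor is a view */
    bool freed;                  /* set when the backing memory is released */
};

static bool can_free(const struct toy_tensor *t) {
    return t->n_children == 0 && t->n_views == 0;
}

/* Register `view` as an alias of `src`. */
static void make_view(struct toy_tensor *view, struct toy_tensor *src) {
    view->view_src = src;
    src->n_views++;
}

/* Call after a node that read from `t` has finished executing. */
static void release_use(struct toy_tensor *t) {
    assert(t->n_children > 0);
    t->n_children--;
    if (!can_free(t)) {
        return;
    }
    if (t->view_src != NULL) {
        /* A view owns no memory; dropping it may unblock its source. */
        struct toy_tensor *src = t->view_src;
        src->n_views--;
        if (can_free(src)) {
            src->freed = true;
        }
    } else {
        t->freed = true;
    }
}
```

In this sketch a tensor with one direct consumer and one view is not released when the consumer finishes; it is released only after the last reader of the view has also finished.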

@ghost

ghost commented Jul 28, 2023

Android loading 7b model, 2048 context, batch 7:

Master:

llama_model_load_internal: mem required  = 4013.73 MB (+ 1024.00 MB per state)
llama_new_context_with_model: kv self size  = 1024.00 MB

PR:

llama_model_load_internal: mem required  = 3615.73 MB (+ 1024.00 MB per state)
llama_new_context_with_model: kv self size  = 1024.00 MB
llama_new_context_with_model: compute buffer total size =    3.42 MB

👍

@slaren slaren merged commit a113689 into master Jul 30, 2023
@slaren slaren deleted the ggml-allocator branch July 30, 2023 13:58
@laurids-reichardt

Does this PR break Metal compilation?

❯ make clean && LLAMA_METAL=1 make -j
I llama.cpp build info:
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -pthread -DGGML_USE_K_QUANTS -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS
I LDFLAGS:   -framework Accelerate
I CC:       Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX:      Apple clang version 14.0.3 (clang-1403.0.22.14.1)

rm -vf *.o *.so *.dll main quantize quantize-stats perplexity embedding benchmark-matmult save-load-state server simple vdot train-text-from-scratch embd-input-test build-info.h tests/test-double-float tests/test-grad0 tests/test-opt tests/test-quantize-fns tests/test-quantize-perf tests/test-sampling tests/test-tokenizer-0
common.o
ggml-alloc.o
ggml.o
grammar-parser.o
k_quants.o
llama.o
libembdinput.so
main
quantize
quantize-stats
perplexity
embedding
server
simple
vdot
train-text-from-scratch
embd-input-test
build-info.h
I llama.cpp build info:
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -pthread -DGGML_USE_K_QUANTS -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DGGML_METAL_NDEBUG
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL
I LDFLAGS:   -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
I CC:       Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX:      Apple clang version 14.0.3 (clang-1403.0.22.14.1)

cc  -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -pthread -DGGML_USE_K_QUANTS -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DGGML_METAL_NDEBUG   -c ggml.c -o ggml.o
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL -c llama.cpp -o llama.o
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL -c examples/common.cpp -o common.o
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL -c examples/grammar-parser.cpp -o grammar-parser.o
cc -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -pthread -DGGML_USE_K_QUANTS -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DGGML_METAL_NDEBUG   -c -o k_quants.o k_quants.c
cc -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -pthread -DGGML_USE_K_QUANTS -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DGGML_METAL_NDEBUG -c ggml-metal.m -o ggml-metal.o
cc  -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -pthread -DGGML_USE_K_QUANTS -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DGGML_METAL_NDEBUG   -c ggml-alloc.c -o ggml-alloc.o
examples/common.cpp:569:122: warning: format specifies type 'int' but the argument has type 'size_t' (aka 'unsigned long') [-Wformat]
    fprintf(stdout, "  --hellaswag-tasks N   number of tasks to use when computing the HellaSwag score (default: %d)\n", params.hellaswag_tasks);
                                                                                                                 ~~      ^~~~~~~~~~~~~~~~~~~~~~
                                                                                                                 %zu
llama.cpp:1828:50: error: use of undeclared identifier 'cur'
        ggml_metal_get_tensor   (lctx.ctx_metal, cur);
                                                 ^
1 error generated.
make: *** [llama.o] Error 1
make: *** Waiting for unfinished jobs....
1 warning generated.
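Separately from the error, the `-Wformat` warning in the log above suggests `%zu` for a `size_t` argument. A minimal sketch of that fix follows; the function `format_tasks` and the value passed to it are placeholders for illustration, not code from examples/common.cpp.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Formats a size_t with %zu, the conversion the compiler suggests
 * in place of %d. Returns the number of characters written. */
static int format_tasks(char *buf, size_t cap, size_t n_tasks) {
    return snprintf(buf, cap, "(default: %zu)", n_tasks);
}
```

Calling `format_tasks(buf, sizeof buf, 400)` with an arbitrary value of 400 writes `(default: 400)` into `buf`.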

@slaren
Collaborator Author

slaren commented Jul 30, 2023

Yes. cur should be changed to res. Additionally, a ggml_metal_get_tensor should be added for embeddings to support embeddings output. I can open a PR, but I cannot test it.

@slaren
Collaborator Author

slaren commented Jul 30, 2023

I have submitted a fix in PR #2455, let me know if it works.

@laurids-reichardt

Awesome! Fixed the compilation error. Thanks for your great work!

@Tungsten842
Contributor

@slaren This commit broke OpenCL memory management.
Before this commit I was able to offload over 37 layers to my GPU and use almost all 8 GB of VRAM.
After this commit I can only offload 13 layers, using only 3 GB of VRAM.
With more than 13 layers I get this error:

GGML_ASSERT: /build/c2lc9ww8i7y99ljdjhnlgbyjinwknsax-source/ggml-alloc.c:216: alloc->n_free_blocks < MAX_FREE_BLOCKS && "out of free blocks"
[nixos-alpha:60397] *** Process received signal ***
[nixos-alpha:60397] Signal: Aborted (6)
[nixos-alpha:60397] Signal code:  (-6)
[nixos-alpha:60397] [ 0] /nix/store/x33pcmpsiimxhip52mwxbb5y77dhmb21-glibc-2.37-8/lib/libc.so.6(+0x38d60)[0x7f60ceb72d60]
[nixos-alpha:60397] [ 1] /nix/store/x33pcmpsiimxhip52mwxbb5y77dhmb21-glibc-2.37-8/lib/libc.so.6(+0x87adc)[0x7f60cebc1adc]
[nixos-alpha:60397] [ 2] /nix/store/x33pcmpsiimxhip52mwxbb5y77dhmb21-glibc-2.37-8/lib/libc.so.6(gsignal+0x16)[0x7f60ceb72cb6]
[nixos-alpha:60397] [ 3] /nix/store/x33pcmpsiimxhip52mwxbb5y77dhmb21-glibc-2.37-8/lib/libc.so.6(abort+0xd7)[0x7f60ceb5c8ba]
[nixos-alpha:60397] [ 4] /nix/store/j16db245g445gfrx909wfynik4q6vifh-llama.cpp/lib/libllama.so(+0x66ca6)[0x7f60cf0cfca6]
[nixos-alpha:60397] [ 5] /nix/store/j16db245g445gfrx909wfynik4q6vifh-llama.cpp/lib/libllama.so(ggml_allocr_alloc_graph+0x30e)[0x7f60cf0d055e]
[nixos-alpha:60397] [ 6] /nix/store/j16db245g445gfrx909wfynik4q6vifh-llama.cpp/lib/libllama.so(llama_new_context_with_model+0x440)[0x7f60cf0857d0]
[nixos-alpha:60397] [ 7] llama[0x40b492]
[nixos-alpha:60397] [ 8] llama[0x406384]
[nixos-alpha:60397] [ 9] /nix/store/x33pcmpsiimxhip52mwxbb5y77dhmb21-glibc-2.37-8/lib/libc.so.6(+0x23ace)[0x7f60ceb5dace]
[nixos-alpha:60397] [10] /nix/store/x33pcmpsiimxhip52mwxbb5y77dhmb21-glibc-2.37-8/lib/libc.so.6(__libc_start_main+0x89)[0x7f60ceb5db89]
[nixos-alpha:60397] [11] llama[0x408e75]
[nixos-alpha:60397] *** End of error message ***

@slaren
Collaborator Author

slaren commented Aug 9, 2023

@Tungsten842 I do not see how this assert could happen with OpenCL, so it may be related to the library or fork that you are using. If you are convinced that there is a bug here, please open a new issue and provide instructions to reproduce it with the tools in this repository.
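For context on the assertion message quoted in the report, here is a toy sketch of a free list held in a fixed-size table, which is the kind of structure the condition `n_free_blocks < MAX_FREE_BLOCKS` suggests. Everything below is an assumption for illustration: the value of `MAX_FREE_BLOCKS`, the struct layout, and the function `toy_free` are not taken from ggml-alloc.c, and this version returns false where the real code asserts. It merges a freed range with one adjacent block only, which is a simplification.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_FREE_BLOCKS 4 /* toy value; the real value is not given in this thread */

struct free_block {
    size_t offset;
    size_t size;
};

struct toy_alloc {
    struct free_block blocks[MAX_FREE_BLOCKS];
    int n_free_blocks;
};

/* Returns a range to the free list. Adjacent ranges are merged so that
 * they do not consume a new table slot. Returns false if the table is full. */
static bool toy_free(struct toy_alloc *a, size_t offset, size_t size) {
    for (int i = 0; i < a->n_free_blocks; i++) {
        struct free_block *b = &a->blocks[i];
        if (b->offset + b->size == offset) { /* range directly follows block */
            b->size += size;
            return true;
        }
        if (offset + size == b->offset) {    /* range directly precedes block */
            b->offset = offset;
            b->size += size;
            return true;
        }
    }
    if (a->n_free_blocks >= MAX_FREE_BLOCKS) {
        return false;                        /* "out of free blocks" */
    }
    a->blocks[a->n_free_blocks++] = (struct free_block){ offset, size };
    return true;
}
```

In this model the table fills up only when many non-adjacent ranges are free at the same time; freeing a range that touches an existing block still succeeds on a full table because it is merged instead of appended.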

Successfully merging this pull request may close these issues.

ggml : improve memory management