ggml : add graph tensor allocator #2411
Conversation
What prevents CUDA and OpenCL from using it?
The CUDA backend has its own memory management for the compute buffer, and this change would interfere with it. At the very least, the code referenced at lines 3930 to 3932 (commit 1a94186) would no longer work.
There may be other cases. For OpenCL, after reviewing it again, it could work. Usually the OpenCL backend works in the same way as the CUDA backend, but in this case it seems that it doesn't use any kind of VRAM compute buffer, so it shouldn't be affected by this change.
…ing a view with an offset
Android users benefit a lot from this. For my usage, I use batch size 10, so this is fantastic.
Memory usage comparison (compute buffers only): [chart comparing master and PR; memory usage reduced]
Cool stuff 🦙
I've reviewed the ggml.c and llama.cpp changes. Will take a deeper look at ggml-alloc.c later and also try to run this with Metal. It's OK to merge as it is, since it is well isolated and backwards compatible.
ggml-alloc.h
GGML_API bool ggml_allocator_is_measure(struct ggml_allocator * alloc);
GGML_API void ggml_allocator_reset(struct ggml_allocator * alloc);
GGML_API void ggml_allocator_alloc_tensor(struct ggml_allocator * alloc, struct ggml_tensor * tensor);
GGML_API size_t ggml_allocator_alloc_graph_tensors(struct ggml_allocator * alloc, struct ggml_cgraph * graph);
Maybe go with a shorter prefix: ggml_alloc
ggml_alloc_alloc_tensor or just ggml_alloc_tensor?
How about ggml_alloc_new_tensor?
new_tensor doesn't sound right to me because it only allocates data for the tensor; it doesn't really create a new tensor. I went with the allocr prefix so that the difference between "allocator" and "allocate" is a bit clearer.
int n_tokens,
int n_past,
int n_threads,
const char * cgraph_fname) {
Not for this PR - just a note for later: the cgraph_fname export is no longer needed. I don't see it being useful anytime soon, so we should delete everything related to it to clean up a bit.
# define ggml_diag_mask_inf_inplace ggml_diag_mask_inf
# define ggml_soft_max_inplace ggml_soft_max
params.no_alloc = true;
#endif
What happens if the ops are inplace?
The problem with inplace ops is that the allocator needs to know which tensors are views of another tensor to be able to tell when it is safe to free a tensor - i.e. a tensor cannot be freed as long as there are views of it. inplace ops are effectively a view of their parent, and the allocator is not able to detect that.

In this case, however, it wouldn't really matter: since the context is no_alloc, the parent's data will be NULL, so the "inplace" op will also have a NULL pointer, and the result would be indistinguishable from a normal op. So these defines could be removed. The allocator could also try harder to determine whether an op is inplace by checking if its data is the same as one of its parents'. The code could also be simplified a bit if we simply added a pointer to the source of the view in ggml_tensor, though.

The allocator will already make all suitable ops inplace automatically to save memory.
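For illustration, a minimal sketch (not code from this PR) of that pointer-comparison heuristic; tensor_aliases_parent is a hypothetical helper, and it assumes the tensor stores its parents in a src[] array of size GGML_MAX_SRC as in current ggml:

static bool tensor_aliases_parent(const struct ggml_tensor * t) {
    if (t->data == NULL) {
        return false; // no_alloc context: data not assigned yet
    }
    for (int i = 0; i < GGML_MAX_SRC; i++) {
        if (t->src[i] != NULL && t->src[i]->data == t->data) {
            return true; // shares storage with a parent, so it is effectively a view
        }
    }
    return false; // data owned by this tensor (or allocated elsewhere)
}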
Android, loading a 7B model with 2048 context, batch 7. Master: [screenshot] PR: [screenshot] 👍
cleanup ggml-ci
Does this PR break Metal compilation?
❯ make clean && LLAMA_METAL=1 make -j
I llama.cpp build info:
I UNAME_S: Darwin
I UNAME_P: arm
I UNAME_M: arm64
I CFLAGS: -I. -O3 -std=c11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -pthread -DGGML_USE_K_QUANTS -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS
I LDFLAGS: -framework Accelerate
I CC: Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX: Apple clang version 14.0.3 (clang-1403.0.22.14.1)
rm -vf *.o *.so *.dll main quantize quantize-stats perplexity embedding benchmark-matmult save-load-state server simple vdot train-text-from-scratch embd-input-test build-info.h tests/test-double-float tests/test-grad0 tests/test-opt tests/test-quantize-fns tests/test-quantize-perf tests/test-sampling tests/test-tokenizer-0
common.o
ggml-alloc.o
ggml.o
grammar-parser.o
k_quants.o
llama.o
libembdinput.so
main
quantize
quantize-stats
perplexity
embedding
server
simple
vdot
train-text-from-scratch
embd-input-test
build-info.h
I llama.cpp build info:
I UNAME_S: Darwin
I UNAME_P: arm
I UNAME_M: arm64
I CFLAGS: -I. -O3 -std=c11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -pthread -DGGML_USE_K_QUANTS -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DGGML_METAL_NDEBUG
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL
I LDFLAGS: -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
I CC: Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX: Apple clang version 14.0.3 (clang-1403.0.22.14.1)
cc -I. -O3 -std=c11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -pthread -DGGML_USE_K_QUANTS -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DGGML_METAL_NDEBUG -c ggml.c -o ggml.o
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL -c llama.cpp -o llama.o
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL -c examples/common.cpp -o common.o
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL -c examples/grammar-parser.cpp -o grammar-parser.o
cc -I. -O3 -std=c11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -pthread -DGGML_USE_K_QUANTS -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DGGML_METAL_NDEBUG -c -o k_quants.o k_quants.c
cc -I. -O3 -std=c11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -pthread -DGGML_USE_K_QUANTS -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DGGML_METAL_NDEBUG -c ggml-metal.m -o ggml-metal.o
cc -I. -O3 -std=c11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -pthread -DGGML_USE_K_QUANTS -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DGGML_METAL_NDEBUG -c ggml-alloc.c -o ggml-alloc.o
examples/common.cpp:569:122: warning: format specifies type 'int' but the argument has type 'size_t' (aka 'unsigned long') [-Wformat]
fprintf(stdout, " --hellaswag-tasks N number of tasks to use when computing the HellaSwag score (default: %d)\n", params.hellaswag_tasks);
~~ ^~~~~~~~~~~~~~~~~~~~~~
%zu
llama.cpp:1828:50: error: use of undeclared identifier 'cur'
ggml_metal_get_tensor (lctx.ctx_metal, cur);
^
1 error generated.
make: *** [llama.o] Error 1
make: *** Waiting for unfinished jobs....
1 warning generated.
Yes.
I have submitted a fix in PR #2455; let me know if it works.
Awesome! Fixed the compilation error. Thanks for your great work!
@slaren This commit broke OpenCL memory management.
@Tungsten842 I do not see how this assert could happen with OpenCL, so it may be related to the library or fork that you are using. If you are convinced that there is a bug here, please open a new issue and provide instructions to reproduce it with the tools in this repository.
Fixes ggerganov/ggml#288
Allocates tensors based on their usage in the computation graph and replaces scratch buffers. Depending on the batch size and context length, this reduces the size of the compute buffers usually by 2x, and up to 10x when using a lower batch size.

For example, currently the size of the compute and scratch buffers for 7B with n_ctx=2048, n_batch=512 is 398 MB. With this change, it is 153 MB instead, or 38 MB with n_batch=128.

The implementation is mostly self-contained. It is used by creating a ggml_context with the no_alloc flag, and then calling ggml_allocator_alloc_graph_tensors to allocate the tensors (a usage sketch follows below).

Not compatible with the GPU backends, only with CPU. It could possibly be used with Metal, but it would require disabling concurrent execution.
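A hedged usage sketch based on the description above, not code from the PR. The allocator constructor is not shown in the quoted header, so ggml_allocator_new here is a hypothetical name; the other calls match the declarations quoted earlier in the review.

#include "ggml.h"
#include "ggml-alloc.h"

static void example(void) {
    // no_alloc: the context only records tensor metadata; tensor data
    // pointers stay NULL until the allocator assigns them.
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16*1024*1024, // metadata only, so this can be small
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ true,
    };
    struct ggml_context * ctx = ggml_init(params);

    // Build a graph as usual; no tensor data is allocated yet.
    struct ggml_tensor * a = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1024);
    struct ggml_tensor * b = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1024);
    struct ggml_cgraph gf = ggml_build_forward(ggml_add(ctx, a, b));

    // Hypothetical constructor: the compute buffer would come from the
    // application (possibly sized by a prior measure pass, cf.
    // ggml_allocator_is_measure above).
    static unsigned char buf[1024*1024];
    struct ggml_allocator * alloc = ggml_allocator_new(buf, sizeof(buf));

    // Walks the graph in execution order, placing each tensor in the
    // buffer and reusing memory once a tensor's last consumer has run.
    ggml_allocator_alloc_graph_tensors(alloc, &gf);

    ggml_free(ctx);
}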