Optimisation of per-token CPU activities for GPU inference #7456

Closed
agray3 opened this issue May 22, 2024 · 5 comments

agray3 commented May 22, 2024

When using a GPU backend, for each token evaluation there exists not only computation on the GPU but also significant CPU computation which can potentially be optimized.

Here are some timing measurements of the critical path for each token, for llama2 Q4_K_M 7B and 13B models on A100 and H100 GPUs, shown first as absolute times and then as the same data presented as a percentage breakdown in each case (charts not reproduced here).

CUDA Graph Execution is the time spent executing the compute graph on the GPU; it is responsible for around 85-90% of the time taken to evaluate each token.

The remaining 10-15% of the time is taken by CPU activities, the most dominant of which are discussed below.

GGML Graph Preparation: llama_build_graph and ggml_backend_sched_split_graph are related to the building/preparation of the compute graph in GGML format for each token, which is ultimately translated into a CUDA graph for execution. However, we know from the CUDA graph implementation (#6763) that only very minor adjustments are required across the majority of tokens. Therefore, it seems that most of this work is not required, and we should be able to cache/reuse components of the GGML graph across tokens, in a similar way to how we reuse each CUDA graph with only minor adjustments. For example, in build_llama() we could add some code to save state across tokens, rather than performing a full rebuild for every token.
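To make the caching idea concrete, here is a minimal, hypothetical sketch. The names (Graph, GraphCache, n_tokens, kv_head) are illustrative stand-ins rather than the actual ggml/llama.cpp API: the graph is fully rebuilt only when a structure-defining parameter changes, and per-token parameters are patched in place otherwise.

```cpp
// Hypothetical sketch of per-token graph caching. Names (Graph, GraphCache,
// n_tokens, kv_head) are illustrative stand-ins, not the real ggml/llama.cpp
// structures. The idea: keep the graph built for the previous token and,
// when only per-token parameters change, patch them instead of rebuilding.
#include <memory>

struct Graph {            // stand-in for a built compute graph (ggml_cgraph)
    int n_tokens = 0;     // batch size the graph was built for
    int kv_head  = 0;     // per-token KV cache write position
    // ... nodes omitted ...
};

struct GraphCache {
    std::unique_ptr<Graph> cached;

    // Return a graph ready for scheduling: rebuild only when a
    // structure-defining parameter differs from the cached one.
    Graph * get(int n_tokens, int kv_head) {
        const bool reusable = cached && cached->n_tokens == n_tokens;
        if (!reusable) {
            cached = std::make_unique<Graph>(); // full rebuild path
            cached->n_tokens = n_tokens;        // (the build_llama() equivalent)
        }
        cached->kv_head = kv_head;              // cheap per-token update
        return cached.get();
    }
};
```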

Sampling: llama_sampling_sample uses the CPU to perform sampling on the logits that have been evaluated on the GPU for each token. In principle, this sampling could be ported to the GPU.
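As a simple illustration of what GPU-side sampling could look like in the greedy case, here is a hypothetical CUDA sketch of an argmax over the logits performed entirely on the device, so the logits never need to be copied back to the host. This is not the existing llama_sampling_sample code, and full top-k/top-p/temperature sampling would need considerably more work.

```cpp
// Hypothetical CUDA sketch of greedy (argmax) sampling performed on the GPU,
// so logits need not be copied back to the host. Illustrative only.
#include <cfloat>

__global__ void argmax_logits(const float * logits, int n_vocab, int * out_id) {
    __shared__ float best_val[256];
    __shared__ int   best_idx[256];

    // each thread scans a strided slice of the vocabulary
    float v   = -FLT_MAX;
    int   idx = 0;
    for (int i = threadIdx.x; i < n_vocab; i += blockDim.x) {
        if (logits[i] > v) { v = logits[i]; idx = i; }
    }
    best_val[threadIdx.x] = v;
    best_idx[threadIdx.x] = idx;
    __syncthreads();

    // tree reduction within the block to find the global maximum
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride &&
            best_val[threadIdx.x + stride] > best_val[threadIdx.x]) {
            best_val[threadIdx.x] = best_val[threadIdx.x + stride];
            best_idx[threadIdx.x] = best_idx[threadIdx.x + stride];
        }
        __syncthreads();
    }
    if (threadIdx.x == 0) {
        *out_id = best_idx[0]; // sampled token id stays on the device
    }
}

// Launch with a single block of 256 threads, e.g.:
//   argmax_logits<<<1, 256>>>(d_logits, n_vocab, d_token_id);
```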

I will continue to investigate these optimization possibilities.

agray3 commented May 22, 2024

@ggerganov @slaren FYI here is the info I promised when we met last week.

mofosyne added the performance (Speed related topics) and research 🔬 labels May 22, 2024
slaren commented May 22, 2024

Interesting, thanks.

@freckletonj

I'm not sure if it's related, but running llama-3-70B-base (5-bit GGUF) I was getting ~2% GPU utilization on an A6000 while 32 CPU cores were pinned at 100%, yielding ~3 tok/s. Peak VRAM usage was only around 11GB.

github-actions bot added the stale label Jul 7, 2024
agray3 added a commit to agray3/llama.cpp that referenced this issue Jul 8, 2024
Introduces caching of GGML graph to avoid unnecessary full rebuild
between each token. KV cache parameters, which change with each token,
are updated directly in cached GGML graph.

Can be disabled with GGML_DISABLE_GRAPH_CACHING environment variable.

Refs ggml-org#7456
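As a rough illustration of the "KV cache parameters are updated directly in the cached GGML graph" step described in this commit, the following hypothetical sketch locates the K/V cache view nodes by name and patches their per-token offsets in place. Node, name and view_offs are illustrative stand-ins, not the real ggml node fields.

```cpp
// Hypothetical sketch: patch the cached graph's KV cache view nodes in place
// by matching node names, instead of rebuilding the graph.
#include <cstddef>
#include <cstring>
#include <vector>

struct Node {
    char   name[64];
    size_t view_offs;   // byte offset of this view into the KV cache buffer
};

// Update the k_cache_view* / v_cache_view* nodes for the current KV head.
void update_kv_views(std::vector<Node> & nodes, size_t k_offs, size_t v_offs) {
    for (auto & n : nodes) {
        if (std::strncmp(n.name, "k_cache_view", 12) == 0) {
            n.view_offs = k_offs;
        } else if (std::strncmp(n.name, "v_cache_view", 12) == 0) {
            n.view_offs = v_offs;
        }
    }
}
```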
agray3 commented Jul 8, 2024

See #8366, which addresses the GGML Graph Preparation part.

Nexesenex referenced this issue in a series of commits to Nexesenex/croco.cpp between Jul 8 and Jul 28, 2024, carrying the same GGML graph caching change ("ggml: avoid rebuild of GGML graph for each token", ggml-org#7456) together with incremental follow-ups: fix seg fault; restrict to nsplit=2; improve identification of K and V nodes for param updates; rework to directly update KV cache params using info from name; make n_embd_v_gqa_* dependent on layer.
github-actions bot added the stale label Aug 9, 2024
This issue was closed because it has been inactive for 14 days since being marked as stale.

Nexesenex continued to carry the change in further commits to Nexesenex/croco.cpp (Aug 26 – Sep 19, 2024), with additional fixes: Fix Cuda graphs caching merge mistake and indent; ubatch instead of u_batch.
agray3 added a commit to agray3/llama.cpp that referenced this issue Oct 11, 2024, with the same GGML graph caching change described above (disabled via the GGML_DISABLE_GRAPH_CACHING environment variable).
Nexesenex referenced this issue in further commits to Nexesenex/croco.cpp (Oct 12, 2024 – Feb 25, 2025), continuing to carry the GGML graph caching change and its accumulated fixes (plus removal of stale code).