Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ggml : reduce hash table reset cost #8698

Merged
merged 4 commits into from
Jul 27, 2024
Merged

ggml : reduce hash table reset cost #8698

merged 4 commits into from
Jul 27, 2024

Conversation

slaren
Copy link
Collaborator

@slaren slaren commented Jul 26, 2024

Reduces the reset cost of hash tables by using a bit table to indicate if a slot is in use, instead of a NULL pointer. In this way, the amount of memory that needs to be cleared when resetting a hash table is reduced by a factor of 64 (32 in 32-bit systems). The difference in performance is only measurable in small models, but with very small models like stories260K, it can be significant. Since the CUDA backend support async computation, the hash table resets could be done in parallel with the computation, hiding most of this latency, however, in the case of the stories260k model the reset was more expensive than the computation.

Other changes included in this PR:

  • Refactor of GGML_ASSERT to move most of the code to an external function call ggml_abort to reduce code size
  • Adds GGML_ABORT to abort the process with a message
  • Replaces GGML_ASSERT(false) with GGML_ABORT
  • Adds GGML_NORETURN to tag functions that not return, used in ggml_abort
  • As a consequence of the previous points, code after GGML_ABORT will generate an "unreachable code" warning in clang (gcc does not have this warning). I have fixed these warnings in a separate commit to simplify the review.
  • Adds -g to all the make release builds to include debug symbols. This should not affect performance, but will improve the accuracy of call stacks.
  • Moves some of the ggml hash functions to the header file to allow the compiler to inline them
  • Minor improvements in reliability and performance of hash tables and ggml_backend_sched
GPU Model Test t/s master t/s sl/hash-improvements Speedup
RTX 3090 Ti (CUDA) stories260K pp512 329655.68 344158.50 1.04
RTX 3090 Ti (CUDA) stories260K tg128 1817.33 2171.60 1.19
RTX 3090 Ti (CUDA) llama 1B Q8_0 tg128 374.48 388.12 1.04
RTX 3090 Ti (CUDA) llama 7B Q4_0 tg128 162.92 164.17 1.01
i9-13900K (CPU) stories260K pp512 20665.15 21438.21 1.04
i9-13900K (CPU) stories260K tg128 3104.72 4727.22 1.52
i9-13900K (CPU) llama 1B Q8_0 tg128 65.99 67.75 1.03
M3 Max (Metal) stories260K pp512 62813.44 63433.15 1.01
M3 Max (Metal) stories260K tg128 1389.56 1449.33 1.04
M3 Max (CPU) stories260K pp512 48704.42 49237.79 1.01
M3 Max (CPU) stories260K tg128 2296.40 2516.05 1.10

@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Jul 26, 2024
@slaren
Copy link
Collaborator Author

slaren commented Jul 26, 2024

It seems that older versions of gcc are not capable of understanding that code after GGML_ASSERT(false) is unreachable, which generates a different warning. I can see three options:

  • Disable -Wunreachable-code in clang
  • Disable -Wimplicit-fallthrough in gcc
  • Rollback the change to GGML_ASSERT and continue adding useless breaks and returns after GGML_ASSERT(false)

@slaren slaren force-pushed the sl/hash-improvements branch 2 times, most recently from e0dbcdd to 5fd4cef Compare July 26, 2024 02:48
Copy link
Owner

@ggerganov ggerganov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To solve the implicit fallthrough warnings, maybe we can add GGML_ABORT macro like this:

diff --git a/ggml/include/ggml.h b/ggml/include/ggml.h
index d0311f6c..527dadfb 100644
--- a/ggml/include/ggml.h
+++ b/ggml/include/ggml.h
@@ -272,7 +272,8 @@
 #define GGML_NORETURN _Noreturn
 #endif
 
-#define GGML_ASSERT(x) if (!(x)) ggml_abort(__FILE__, __LINE__, #x)
+#define GGML_ABORT(x) ggml_abort(__FILE__, __LINE__, #x)
+#define GGML_ASSERT(x) if (!(x)) GGML_ABORT(x)
 
 // used to copy the number of elements and stride in bytes of tensors into local variables.
 // main purpose is to reduce code duplication and improve readability.
diff --git a/ggml/src/ggml.c b/ggml/src/ggml.c
index 3fd9c4fe..a6cbb0ae 100644
--- a/ggml/src/ggml.c
+++ b/ggml/src/ggml.c
@@ -17924,7 +17924,7 @@ static void ggml_compute_backward(struct ggml_context * ctx, struct ggml_tensor
                         } break;
                     case GGML_UNARY_OP_TANH:
                         {
-                            GGML_ASSERT(false); // TODO: not implemented
+                            GGML_ABORT("not implemented");
                         }
                     case GGML_UNARY_OP_ELU:
                         {

@github-actions github-actions bot added testing Everything test related Nvidia GPU Issues specific to Nvidia GPUs Vulkan Issues specific to the Vulkan backend examples SYCL https://en.wikipedia.org/wiki/SYCL - GPU programming language Kompute https://github.com/KomputeProject/kompute/ labels Jul 26, 2024
@slaren slaren force-pushed the sl/hash-improvements branch 2 times, most recently from dccd09d to cdc02b7 Compare July 26, 2024 14:39
@slaren
Copy link
Collaborator Author

slaren commented Jul 26, 2024

While working on this, I also found that building the graphs in the build_llama etc functions, takes almost as much time as splitting and allocating the graph in ggml_backend_sched_alloc_graph, which is completely unreasonable. I tested a few changes to see how much this could be improved:

base:                        0.094 ms
no cb:                       0.074 ms (-0.020 ms)
no GGML_ASSERT:              0.063 ms (-0.011 ms)
no fmt name in expand:       0.048 ms (-0.015 ms)
remove format_name entirely: 0.032 ms (-0.016 ms)
remove expand (*):           0.028 ms

(*) add tensors to the list of nodes as they are created, rather than doing a DFS in ggml_build_forward_expand to build the list afterwards.

So the conclusion is that this could be improved significantly, reducing the need of complicated logic to reuse the graphs as in #8366. Simply removing the calls to ggml_format_name should reduce the cost of building the graphs significantly.

@slaren slaren force-pushed the sl/hash-improvements branch from 8ccf979 to ae5331d Compare July 27, 2024 01:48
@slaren slaren merged commit 2b1f616 into master Jul 27, 2024
58 of 59 checks passed
@slaren slaren deleted the sl/hash-improvements branch July 27, 2024 02:41
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Jul 27, 2024
* ggml : reduce hash table reset cost

* fix unreachable code warnings after GGML_ASSERT(false)

* GGML_ASSERT(false) -> GGML_ABORT("fatal error")

* GGML_ABORT use format string
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
examples ggml changes relating to the ggml tensor library for machine learning Kompute https://github.com/KomputeProject/kompute/ Nvidia GPU Issues specific to Nvidia GPUs SYCL https://en.wikipedia.org/wiki/SYCL - GPU programming language testing Everything test related Vulkan Issues specific to the Vulkan backend
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants