diff --git a/chapters/11-Machine-Code-Layout-Optimizations/11-1 Machine Code Layout.md b/chapters/11-Machine-Code-Layout-Optimizations/11-1 Machine Code Layout.md index 349083527e..7749d90df1 100644 --- a/chapters/11-Machine-Code-Layout-Optimizations/11-1 Machine Code Layout.md +++ b/chapters/11-Machine-Code-Layout-Optimizations/11-1 Machine Code Layout.md @@ -1,6 +1,6 @@ # Machine Code Layout Optimizations {#sec:secFEOpt} -The CPU Front-End (FE) is responsible for fetching and decoding instructions and delivering them to the out-of-order Back-End. As the newer processors get more execution "horsepower", CPU FE needs to be as powerful to keep the machine balanced. If the FE cannot keep up with supplying instructions, the BE will be underutilized, and the overall performance will suffer. That's why the FE is designed to always run well ahead of the actual execution to smooth out any hiccups that may occur and always have instructions ready to be executed. For example, Intel Skylake, released in 2016, can fetch up to 16 instructions per cycle. +The CPU Front-End (FE) is responsible for fetching and decoding instructions and delivering them to the out-of-order Back-End (BE). As newer processors gain more execution "horsepower", the CPU FE needs to become more capable to keep the machine balanced. If the FE cannot keep up with supplying instructions, the BE will be underutilized, and the overall performance will suffer. That's why the FE is designed to always run well ahead of the actual execution to smooth out any hiccups that may occur and always have instructions ready to be executed. For example, Intel Skylake, released in 2015, can fetch up to 16 bytes of instructions per cycle. Most of the time, inefficiencies in the CPU FE can be described as a situation when the Back-End is waiting for instructions to execute, but the FE is not able to provide them. As a result, CPU cycles are wasted without doing any actual useful work. Recall that modern CPUs can process multiple instructions every cycle, nowadays ranging from 4- to 8-wide. Situations when not all available slots are filled happen very often. This represents a source of inefficiency for applications in many domains, such as databases, compilers, web browsers, and many others. diff --git a/chapters/11-Machine-Code-Layout-Optimizations/11-3 Basic Block Placement.md b/chapters/11-Machine-Code-Layout-Optimizations/11-3 Basic Block Placement.md index 0199f9a72b..e40ca934ef 100644 --- a/chapters/11-Machine-Code-Layout-Optimizations/11-3 Basic Block Placement.md +++ b/chapters/11-Machine-Code-Layout-Optimizations/11-3 Basic Block Placement.md @@ -19,7 +19,7 @@ Two versions of machine code layout for the snippet of code above. Which layout is better? Well, it depends on whether `cond` is usually true or false. If `cond` is usually true, then we would better choose the default layout because otherwise, we would be doing two jumps instead of one. Also, in the general case, if `coldFunc` is a relatively small function, we would want to have it inlined. However, in this particular example, we know that `coldFunc` is an error-handling function and is likely not executed very often. By choosing layout @fig:BB_better, we maintain fall through between hot pieces of the code and convert the taken branch into not taken one. -There are a few reasons why the layout presented in Figure @fig:BB_better performs better. First of all, the layout in Figure @fig:BB_better makes better use of the instruction and $\mu$op-cache (DSB, see [@sec:uarchFE]).
With all hot code contiguous, there is no cache line fragmentation: all the cache lines in the L1I-cache are used by hot code. The same is true for the $\mu$op-cache since it caches based on the underlying code layout as well. Secondly, taken branches are also more expensive for the fetch unit. The Front-End of a CPU fetches contiguous chunks of bytes, so every taken jump means the bytes after the jump are useless. This reduces the maximum effective fetch throughput. Finally, on some architectures, not-taken branches are fundamentally cheaper than taken. For instance, Intel Skylake CPUs can execute two untaken branches per cycle but only one taken branch every two cycles.[^2] +There are a few reasons why the layout presented in Figure @fig:BB_better performs better. First of all, the layout in Figure @fig:BB_better makes better use of the instruction and $\mu$op-cache (DSB, see [@sec:uarchFE]). With all hot code contiguous, there is no cache line fragmentation: all the cache lines in the L1 I-cache are used by hot code. The same is true for the $\mu$op-cache since it caches based on the underlying code layout as well. Secondly, taken branches are also more expensive for the fetch unit. The Front-End of a CPU fetches contiguous chunks of bytes, so every taken jump means the bytes after the jump are useless. This reduces the maximum effective fetch throughput. Finally, on some architectures, not-taken branches are fundamentally cheaper than taken. For instance, Intel Skylake CPUs can execute two untaken branches per cycle but only one taken branch every two cycles.[^2] To suggest a compiler to generate an improved version of the machine code layout, one can provide a hint using `[[likely]]` and `[[unlikely]]` attributes, which have been available since C++20. The code that uses this hint will look like this: diff --git a/chapters/11-Machine-Code-Layout-Optimizations/11-4 Basic Block Alignment.md b/chapters/11-Machine-Code-Layout-Optimizations/11-4 Basic Block Alignment.md index c3e3f1201e..c3d86ea873 100644 --- a/chapters/11-Machine-Code-Layout-Optimizations/11-4 Basic Block Alignment.md +++ b/chapters/11-Machine-Code-Layout-Optimizations/11-4 Basic Block Alignment.md @@ -22,7 +22,7 @@ void benchmark_func(int* a) { │ 00000000004046a0 <_Z14benchmark_funcPi>: The code itself is pretty reasonable, but its layout is not perfect (see Figure @fig:Loop_default). Instructions that correspond to the loop are highlighted with yellow hachure. As well as for data caches, instruction cache lines are 64 bytes long. In Figure @fig:LoopLayout thick boxes denote cache line borders. Notice that the loop spans multiple cache lines: it begins on the cache line `0x80-0xBF` and ends in the cache line `0xC0-0xFF`. To fetch instructions that are executed in the loop, a processor needs to read two cache lines. These kinds of situations sometimes cause performance problems for the CPU Front-End, especially for the small loops like those presented in [@lst:LoopAlignment]. -To fix this, we can shift the loop instructions forward by 16 bytes using NOPs so that the whole loop will reside in one cache line. Figure @fig:Loop_better shows the effect of doing this with NOP instructions highlighted in blue. Interestingly, the performance impact is visible even if you run nothing but this hot loop in a microbenchmark. It is somewhat puzzling since the amount of code is tiny and it shouldn't saturate the L1I-cache size on any modern CPU. 
The reason for the better performance of the layout in Figure @fig:Loop_better is not trivial to explain and will involve a fair amount of microarchitectural details, which we don't discuss in this book. Interested readers can find more information in the article "[Code alignment issues](https://easyperf.net/blog/2018/01/18/Code_alignment_issues)" on the Easyperf blog.[^1] +To fix this, we can shift the loop instructions forward by 16 bytes using NOPs so that the whole loop resides in a single cache line. Figure @fig:Loop_better shows the effect of doing this, with the NOP instructions highlighted in blue. Interestingly, the performance impact is visible even if you run nothing but this hot loop in a microbenchmark. This is somewhat puzzling since the amount of code is tiny and comes nowhere near the L1 I-cache capacity of any modern CPU. The reason why the layout in Figure @fig:Loop_better performs better is not trivial to explain and involves a fair amount of microarchitectural detail, which we don't discuss in this book. Interested readers can find more information in the article "[Code alignment issues](https://easyperf.net/blog/2018/01/18/Code_alignment_issues)" on the Easyperf blog.[^1]
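In practice, you rarely need to insert the padding by hand: compilers expose alignment controls that achieve the same effect. The command below is only a sketch, assuming a GCC-like compiler (Clang accepts `-falign-functions` as well, while its support for `-falign-loops` depends on the version); the values are examples, not recommendations.

```bash
# Ask the compiler to pad function entries and loop headers so that small hot
# loops are less likely to straddle a cache line boundary.
$ g++ -O2 -falign-functions=64 -falign-loops=32 a.cpp
```

Keep in mind that extra padding grows the code size, so any such change should be confirmed by measurement rather than applied blindly.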
![default layout](../../img/cpu_fe_opts/LoopAlignment_Default.png){#fig:Loop_default width=100%} diff --git a/chapters/11-Machine-Code-Layout-Optimizations/11-6 Function Reordering.md b/chapters/11-Machine-Code-Layout-Optimizations/11-6 Function Reordering.md index 512ff1ecdb..608fb8f042 100644 --- a/chapters/11-Machine-Code-Layout-Optimizations/11-6 Function Reordering.md +++ b/chapters/11-Machine-Code-Layout-Optimizations/11-6 Function Reordering.md @@ -2,7 +2,7 @@ Following the principles described in previous sections, hot functions can be grouped together to further improve the utilization of caches in the CPU Front-End. When hot functions are grouped, they start sharing cache lines, which reduces the *code footprint*, the total number of cache lines a CPU needs to fetch. -Figure @fig:FunctionGrouping gives a graphical representation of reordering hot functions `foo`, `bar`, and `zoo`. The arrows on the image show the most frequent call pattern, i.e., `foo` calls `zoo`, which in turn calls `bar`. In the default layout (see Figure @fig:FuncGroup_default), hot functions are not adjacent to each other with some cold functions placed between them. Thus the sequence of two function calls (`foo` → `zoo` → `bar`) requires four cache line reads. +Figure @fig:FunctionGrouping gives a graphical representation of reordering hot functions `foo`, `bar`, and `zoo`. The arrows on the image show the most frequent call pattern, i.e., `foo` calls `zoo`, which in turn calls `bar`. In the default layout (see Figure @fig:FuncGroup_default), hot functions are not adjacent to each other with some cold functions placed between them. Thus the sequence of two function calls (`foo` → `zoo` → `bar`) requires four cache line reads. We can rearrange the order of the functions such that hot functions are placed close to each other (see Figure @fig:FuncGroup_better). In the improved version, the code of the `foo`, `bar`, and `zoo` functions fits in three cache lines. Also, notice that function `zoo` now is placed between `foo` and `bar` according to the order in which function calls are being made. When we call `zoo` from `foo`, the beginning of `zoo` is already in the I-cache. @@ -15,7 +15,7 @@ Reordering hot functions. Similar to previous optimizations, function reordering improves the utilization of I-cache and DSB-cache. This optimization works best when there are many small hot functions. -The linker is responsible for laying out all the functions of the program in the resulting binary output. While developers can try to reorder functions in a program themselves, there is no guarantee of the desired physical layout. For decades people have been using linker scripts to achieve this goal. Still, this is the way to go if you are using the GNU linker. The Gold linker (`ld.gold`) has an easier approach to this problem. To get the desired ordering of functions in the binary with the Gold linker, one can first compile the code with the `-ffunction-sections` flag, which will put each function into a separate section. Then use [`--section-ordering-file=order.txt`](https://manpages.debian.org/unstable/binutils/x86_64-linux-gnu-ld.gold.1.en.html) option to provide a file with a sorted list of function names that reflects the desired final layout. The same feature exists in the LLD linker, which is a part of the LLVM compiler infrastructure and is accessible via the `--symbol-ordering-file` option. +The linker is responsible for laying out all the functions of the program in the resulting binary output. 
While developers can try to reorder functions in a program themselves, there is no guarantee of the desired physical layout. For decades people have been using linker scripts to achieve this goal. This is still the way to go if you are using the GNU linker. The Gold linker (`ld.gold`) has an easier approach to this problem. To get the desired ordering of functions in the binary with the Gold linker, one can first compile the code with the `-ffunction-sections` flag, which will put each function into a separate section. Then use [`--section-ordering-file=order.txt`](https://manpages.debian.org/unstable/binutils/x86_64-linux-gnu-ld.gold.1.en.html) option to provide a file with a sorted list of function names that reflects the desired final layout. The same feature exists in the LLD linker, which is a part of the LLVM compiler infrastructure and is accessible via the `--symbol-ordering-file` option. An interesting approach to solving the problem of grouping hot functions was introduced in 2017 by engineers from Meta. They implemented a tool called [HFSort](https://github.com/facebook/hhvm/tree/master/hphp/tools/hfsort)[^1], that generates the section ordering file automatically based on profiling data [@HfSort]. Using this tool, they observed a 2\% performance speedup of large distributed cloud applications like Facebook, Baidu, and Wikipedia. HFSort has been integrated into Meta's HHVM, LLVM BOLT, and LLD linker[^2]. Since then, the algorithm has been superseded first by HFSort+, and most recently by Cache-Directed Sort (CDSort[^3]), with more improvements for workloads with a large code footprint. diff --git a/chapters/11-Machine-Code-Layout-Optimizations/11-7 PGO.md b/chapters/11-Machine-Code-Layout-Optimizations/11-7 PGO.md index 5354a0f164..5fe52c5f43 100644 --- a/chapters/11-Machine-Code-Layout-Optimizations/11-7 PGO.md +++ b/chapters/11-Machine-Code-Layout-Optimizations/11-7 PGO.md @@ -22,7 +22,7 @@ An alternative solution was pioneered by Google in 2016 with sample-based PGO. [ This approach has a few advantages over instrumented PGO. First of all, it eliminates one step from the PGO build workflow, namely step 1 since there is no need to build an instrumented binary. Secondly, profiling data collection runs on an already optimized binary, thus it has a much lower runtime overhead. This makes it possible to collect profiling data in a production environment for a longer time. Since this approach is based on hardware collection, it also enables new kinds of optimizations that are not possible with instrumented PGO. One example is branch-to-cmov conversion, which is a transformation that replaces conditional jumps with conditional moves to avoid the cost of a branch misprediction (see [@sec:BranchlessSelection]). To effectively perform this transformation, a compiler needs to know how frequently the original branch was mispredicted. This information is available with sample-based PGO on modern CPUs (Intel Skylake+). -The next innovative idea came from Meta in mid-2018, when it open-sourced its binary optimization tool called [BOLT](https://code.fb.com/data-infrastructure/accelerate-large-scale-applications-with-bolt/).[^9] BOLT works on the already compiled binary. It first disassembles the code, then it uses the profile information collected by a sampling profiler, such as Linux perf, to do various layout transformations and then relinks the binary again. 
[@BOLT] As of today, BOLT has more than 15 optimization passes, including basic block reordering, function splitting and reordering, and others. Similar to traditional PGO, primary candidates for BOLT optimizations are programs that suffer from instruction cache and iTLB misses. Since January 2022, BOLT has been a part of the LLVM project and is available as a standalone tool. +The next innovative idea came from Meta in mid-2018, when it open-sourced its binary optimization tool called [BOLT](https://code.fb.com/data-infrastructure/accelerate-large-scale-applications-with-bolt/).[^9] BOLT works on the already compiled binary. It first disassembles the code, then it uses the profile information collected by a sampling profiler, such as Linux perf, to do various layout transformations and then relinks the binary again. [@BOLT] As of today, BOLT has more than 15 optimization passes, including basic block reordering, function splitting and reordering, and others. Similar to traditional PGO, primary candidates for BOLT optimizations are programs that suffer from instruction cache and ITLB misses. Since January 2022, BOLT has been a part of the LLVM project and is available as a standalone tool. A few years after BOLT was introduced, Google open-sourced its binary relinking tool called [Propeller](https://github.com/google/llvm-propeller/blob/plo-dev/Propeller_RFC.pdf). It serves a similar purpose but instead of disassembling the original binary, it relies on linker input and thus can be distributed across several machines for better scaling and less memory consumption. Post-link optimizers such as BOLT and Propeller can be used in combination with traditional PGO (and LTO) and often provide an additional 5-10% performance speedup. Such techniques open up new kinds of binary rewriting optimizations that are based on hardware telemetry. diff --git a/chapters/11-Machine-Code-Layout-Optimizations/11-8 Reducing ITLB misses.md b/chapters/11-Machine-Code-Layout-Optimizations/11-8 Reducing ITLB misses.md index 330933fa13..ac20119ea8 100644 --- a/chapters/11-Machine-Code-Layout-Optimizations/11-8 Reducing ITLB misses.md +++ b/chapters/11-Machine-Code-Layout-Optimizations/11-8 Reducing ITLB misses.md @@ -2,7 +2,7 @@ Another important area of tuning FE efficiency is the virtual-to-physical address translation of memory addresses. Primarily those translations are served by TLB (see [@sec:TLBs]), which caches the most recently used memory page translations in dedicated entries. When TLB cannot serve the translation request, a time-consuming page walk of the kernel page table takes place to calculate the correct physical address for each referenced virtual address. Whenever you see a high percentage of ITLB overhead in the TMA summary, the advice in this section may become handy. -In general, relatively small applications are not susceptible to ITLB misses. For example, Golden Cove microarchitecture can cover memory space up to 1MB in its ITLB. If the machine code of your application fits in 1MB you should not be affected by ITLB misses. The problem starts to appear when frequently executed parts of an application are scattered around the memory. When many functions begin to frequently call each other, they start competing for the entries in the ITLB. One of the examples is the Clang compiler, which at the time of writing, has a code section of ~60MB. 
ITLB overhead running on a laptop with a mainstream Intel CoffeeLake processor is ~7%, which means that 7% of cycles are wasted handling ITLB misses: doing demanding page walks and populating TLB entries. +In general, relatively small applications are not susceptible to ITLB misses. For example, the Golden Cove microarchitecture can cover up to 1MB of memory in its ITLB. If the machine code of your application fits in 1MB, you should not be affected by ITLB misses. The problem starts to appear when frequently executed parts of an application are scattered around memory. When many functions begin to frequently call each other, they start competing for entries in the ITLB. One example is the Clang compiler, which, at the time of writing, has a code section of ~60MB. When running on a laptop with a mainstream Intel Coffee Lake processor, its ITLB overhead is ~7%, which means that 7% of cycles are wasted handling ITLB misses: performing page walks and populating TLB entries. Another set of large memory applications that frequently benefit from using huge pages include relational databases (e.g., MySQL, PostgreSQL, Oracle), managed runtimes (e.g., JavaScript V8, Java JVM), cloud services (e.g., web search), web tooling (e.g., node.js). Mapping code sections onto huge pages can reduce the number of ITLB misses by up to 50% [@IntelBlueprint], which yields speedups of up to 10% for some applications. However, as it is with many other features, huge pages are not for every application. Small programs with an executable file of only a few KB in size would be better off using regular 4KB pages rather than 2MB huge pages; that way, memory is used more efficiently. @@ -20,7 +20,7 @@ $ /path/to/clang++ a.cpp $ hugectl --text /path/to/clang++ a.cpp ``` -The second option is to remap the code section at runtime. This option does not require the code section to be aligned to a 2MB boundary, thus can work without recompiling the application. This is especially useful when you don’t have access to the source code. The idea behind this method is to allocate huge pages at the startup of the program and transfer the code section there. The reference implementation of that approach is implemented in the [iodlr](https://github.com/intel/iodlr)[^2]. One option would be to call that functionality from your `main` function. Another option, which is simpler, is to build the dynamic library and preload it in the command line: +The second option is to remap the code section at runtime. This option does not require the code section to be aligned to a 2MB boundary, and thus can work without recompiling the application. This is especially useful when you don’t have access to the source code. The idea behind this method is to allocate huge pages at the startup of the program and transfer the code section there. A reference implementation of this approach is available in the [iodlr](https://github.com/intel/iodlr)[^2] library. One option would be to call that functionality from your `main` function.
Another, simpler option is to build a dynamic library and preload it from the command line: ```bash $ LD_PRELOAD=/usr/lib64/liblppreload.so clang++ a.cpp diff --git a/chapters/11-Machine-Code-Layout-Optimizations/11-9 Code footprint.md b/chapters/11-Machine-Code-Layout-Optimizations/11-9 Code footprint.md index 4dd571f744..19cd7fe5dd 100644 --- a/chapters/11-Machine-Code-Layout-Optimizations/11-9 Code footprint.md +++ b/chapters/11-Machine-Code-Layout-Optimizations/11-9 Code footprint.md @@ -10,9 +10,9 @@ Currently, there are very few tools available that can reliably measure code foo $ perf-tools/do.py profile --profile-mask 100 -a ``` -, where `--profile-mask 100` initiates LBR sampling, and `-a` enables you to specify a program to run. This command will collect code footprint along with various other data. We don't show the output of the tool, curious readers are welcome to study documentation and experiment with the tool. +Here, `--profile-mask 100` initiates LBR sampling, and `-a` lets you specify the program to run. This command collects the code footprint along with various other data. We don't show the output of the tool here; curious readers are welcome to study the documentation and experiment with the tool. -We took a set of four benchmarks: Clang C++ compilation, Blender ray tracing, Cloverleaf hydrodynamics, and Stockfish chess engine; these workloads should be already familiar to you from [@sec:PerfMetricsCaseStudy] where we analyzed their performance characteristics. We ran them on one of Intel's Alderlake-based processors using the same commands we used in [@sec:PerfMetricsCaseStudy]. As expected, the code footprint numbers obtained by running the same benchmarks on a Skylake-based machine are very similar to those from the Alderlake run. Code footprint depends on the program and input data, and not on characteristics of a particular machine, so results should look similar across architectures. +We took a set of four benchmarks: Clang C++ compilation, Blender ray tracing, CloverLeaf hydrodynamics, and Stockfish chess engine; these workloads should already be familiar to you from [@sec:PerfMetricsCaseStudy] where we analyzed their performance characteristics. We ran them on one of Intel's Alder Lake-based processors using the same commands we used in [@sec:PerfMetricsCaseStudy]. As expected, the code footprint numbers obtained by running the same benchmarks on a Skylake-based machine are very similar to those from the Alder Lake run. Code footprint depends on the program and input data, and not on the characteristics of a particular machine, so results should look similar across architectures. Before we start looking at the results, let's spend some time on terminology. Different parts of a program's code may be exercised with different frequencies, so some will be hotter than others. The `perf-tools` package doesn't make this distinction and uses the term "non-cold code" to refer to all code that was executed at least once. This is called *two-way splitting* since it splits the code into cold and non-cold parts. Other tools (e.g., Meta's HHVM) use *three-way splitting* and distinguish between hot, warm, and cold code with an adjustable threshold between warm and hot. In this section, we use the term "hot code" to refer to the non-cold code.
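If `perf-tools` is not at hand, you can get a rough view of which code is non-cold with stock Linux `perf` by sampling branch stacks (LBR). This is only a sketch and does not compute the footprint numbers shown in the table below; `./a.out` is a placeholder for your workload:

```bash
# Sample taken branches (LBR) while the workload runs.
$ perf record -b -- ./a.out
# Aggregate the samples by symbol to see which functions were exercised.
$ perf report --sort symbol
```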
@@ -30,7 +30,7 @@ non-cold code footprint [KB] 5042 313 104 99 non-cold code 4KB-pages 6614 546 104 61 -Frontend Bound, Alderlake-P [%] 52.3 29.4 5.3 25.8 +Frontend Bound, Alder Lake-P [%] 52.3 29.4 5.3 25.8 ------------------------------------------------------------------------------- Table: Code footprint of the benchmarks used in the case study. {#tbl:code_footprint} @@ -41,7 +41,7 @@ A few interesting observations can be made by analyzing the code footprint data. Second, let's examine the `non-cold code 4KB-pages` row in the table. For Clang17, non-cold 5042 KB are spread over 6614 4KB pages, which gives us `5042 / (6614 * 4) = 19%` page utilization. This metric tells us how dense/sparse the hot parts of the code are. The closer each hot cache line is located to another hot cache line, the fewer pages are required to store the hot code. The higher the page utilization the better. Basic block placement and function reordering that we discussed earlier in this chapter are perfect examples of a transformation that improves page utilization. For other benchmarks, the percentages are Blender 14%, CloverLeaf 25%, and Stockfish 41%. -Now that we quantified the code footprints of the four applications, it's tempting to think about the size of L1-instruction and L2 caches and whether the hot code fits or not. On our Alderlake-based machine, the L1-I cache is only 32 KB, which is not enough to fully cover any of the benchmarks that we've analyzed. But remember, at the beginning of this section we said that a large code footprint doesn't immediately point to a problem. Yes, a large codebase puts more pressure on the CPU front-end, but an instruction access pattern is also crucial for performance. The same locality principles as for data accesses apply. That's why we accompanied it with the Frontend Bound metric from Topdown analysis. +Now that we quantified the code footprints of the four applications, it's tempting to think about the size of L1-instruction and L2 caches and whether the hot code fits or not. On our Alder Lake-based machine, the L1 I-cache is only 32 KB, which is not enough to fully cover any of the benchmarks that we've analyzed. But remember, at the beginning of this section we said that a large code footprint doesn't immediately point to a problem. Yes, a large codebase puts more pressure on the CPU front-end, but an instruction access pattern is also crucial for performance. The same locality principles as for data accesses apply. That's why we accompanied it with the Frontend Bound metric from Topdown analysis. For Clang17, the 5 MB of non-cold code causes a huge 52.3% Frontend Bound performance bottleneck: more than half of the cycles are wasted waiting for instructions. From all the presented benchmarks, it benefits the most from PGO-type optimizations. CloverLeaf doesn't suffer from inefficient instruction fetch; 75% of its branches are backward jumps, which suggests that those could be relatively small loops executed over and over again. Stockfish, while having roughly the same non-cold code footprint as CloverLeaf, poses a far greater challenge for the CPU Front-end (25.8%). It has a lot more indirect jumps and function calls. Finally, Blender has even more indirect jumps and calls than Stockfish. We stop our analysis at this point as further investigations are outside the scope of this case study. 
For readers who are interested in continuing the analysis, we suggest drilling down into the Frontend Bound category according to the TMA methodology and looking at metrics such as `ICache_Misses`, `ITLB_Misses`, `DSB coverage`, and others.
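As a starting point for that drill-down, here is a sketch of how such metrics can be collected on Linux; the exact metric group names depend on your CPU model and perf version, and `./a.out` stands in for the workload:

```bash
# Level-2 Topdown breakdown with a recent Linux perf (Intel CPUs expose
# TMA metric groups such as TopdownL1/TopdownL2).
$ perf stat -M TopdownL2 -- ./a.out
# Alternatively, pmu-tools' toplev can walk deeper levels of the TMA tree.
$ toplev.py -l3 --no-multiplex ./a.out
```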