From cca2fde9f3d792f8a59a56a56f9e4cea7694039c Mon Sep 17 00:00:00 2001
From: nick black
Date: Fri, 6 Sep 2024 10:38:36 -0400
Subject: [PATCH 1/2] chapter 1 section 5 edits. mostly reductions.

---
 .../1-Introduction/1-6 What is in the book.md | 29 +++++++++----------
 1 file changed, 14 insertions(+), 15 deletions(-)

diff --git a/chapters/1-Introduction/1-6 What is in the book.md b/chapters/1-Introduction/1-6 What is in the book.md
index cb79bc8664..f321e3ff2b 100644
--- a/chapters/1-Introduction/1-6 What is in the book.md
+++ b/chapters/1-Introduction/1-6 What is in the book.md
@@ -3,32 +3,31 @@
This book is written to help developers better understand the performance of their applications, learn to find inefficiencies, and eliminate them.

* Why did my change cause a 2x performance drop?
-* Our customers complain about the slowness of my application, where should I start?
-* Why my hand-written compression algorithm performs two times slower than the conventional one?
+* Our customers complain about the slowness of my application. How should I investigate?
+* Why does my bespoke compression algorithm perform slower than the conventional one?
* Have I optimized my program to its full potential?
* What performance analysis tools are available on my platform?
-* What are the techniques to reduce the number of cache misses and branch mispredictions?
+* What are techniques to reduce the number of cache misses and branch mispredictions?

Hopefully, by the end of this book, you will be able to answer those questions.

-The book is split into two parts. The first part (chapters 2-7) teaches you how to find performance problems, and the second part (chapters 8-13) teaches you how to fix them. Here is the outline of the book chapters:
+The book is split into two parts. The first part (chapters 2--7) teaches you how to find performance problems, and the second part (chapters 8--13) teaches you how to fix them.

-* Chapter 1 is an introduction that you're reading right now.
-* Chapter 2 discusses how to conduct fair performance experiments and analyze their results. It introduces the best practices for performance testing and comparing results.
-* Chapter 3 provides the basics of CPU microarchitecture. You will see how theoretical ideas find their implementation by taking a closer look at Intel's Goldencove microarchitecture.
+* Chapter 2 discusses fair performance experiments and their analysis. It introduces the best practices for performance testing and comparing results.
+* Chapter 3 introduces CPU microarchitecture, with a close look at Intel's Goldencove microarchitecture.
* Chapter 4 covers terminology and metrics used in performance analysis. At the end of the chapter, we present a case study that features various performance metrics collected on four real-world applications.
-* Chapter 5 explores the most popular performance analysis approaches. We describe how profiling tools work and what sort of data you can collect by using them.
-* Chapter 6 examines features provided by modern Intel, AMD, and ARM-based CPUs to support and enhance performance analysis. It shows how they work and what problems they help to solve.
-* Chapter 7 gives an overview of the most popular tools available on major platforms, including Linux, Windows, and MacOS, running on x86- and ARM-based processors.
+* Chapter 5 explores the most popular performance analysis approaches. We describe how profiling tools work and what sort of data they can collect.
+* Chapter 6 examines features provided by modern Intel, AMD, and ARM CPUs to support and enhance performance analysis. It shows how they work and what problems they help to solve.
+* Chapter 7 gives an overview of the most popular tools available on Linux, Windows, and MacOS.
* Chapter 8 is about optimizing memory accesses, cache-friendly code, data structure reorganization, and other techniques.
* Chapter 9 is about optimizing computations; it explores data dependencies, function inlining, loop optimizations, and vectorization.
* Chapter 10 is about branchless programming, which is used to avoid branch misprediction.
-* Chapter 11 is about machine code layout optimizations, such as basic block placement, function splitting, profile-guided optimizations, and others.
-* Chapter 12 contains optimization topics not specifically related to any of the categories covered in the previous four chapters, but are still important enough to find their place in this book. In this chapter, we will discuss CPU-specific optimizations, examine several microarchitecture-related performance problems, explore techniques used for optimizing low-latency applications, and give you advice on tuning your system for the best performance.
-* Chapter 13 discusses techniques for analyzing multithreaded applications. It digs into some of the most important challenges of optimizing the performance of multithreaded applications. We provide a case study of five real-world multithreaded applications, where we explain why their performance doesn't scale with the increasing number of CPU threads. We also discuss cache coherency issues, such as "False Sharing" and a few tools that are designed to analyze multithreaded applications.
+* Chapter 11 is about machine code layout optimizations, such as basic block placement, function splitting, and profile-guided optimizations.
+* Chapter 12 contains optimization topics not considered in the previous four chapters, but still important enough to find their place in this book. In this chapter, we will discuss CPU-specific optimizations, examine several microarchitecture-related performance problems, explore techniques used for optimizing low-latency applications, and give you advice on tuning your system for the best performance.
+* Chapter 13 discusses techniques for analyzing multithreaded applications. It digs into some of the most important challenges of optimizing multithreaded applications. We provide a case study of five real-world multithreaded applications, where we explain why their performance doesn't scale with the number of CPU threads. We also discuss cache coherency issues (e.g. "false sharing") and a few tools that are designed to analyze multithreaded applications.

-Examples provided in this book are primarily based on open-source software: Linux as the operating system, the LLVM-based Clang compiler for C and C++ languages, and various open-source applications and benchmarks[^1] that you can build and run. The reason is not only the popularity of these projects but also the fact that their source code is open, which enables us to better understand the underlying mechanism of how they work. This is especially useful for learning the concepts presented in this book. This doesn't mean that we will never showcase proprietary tools. For example, we extensively use Intel® VTune™ Profiler.
+Examples provided in this book are primarily based on open source software: Linux as the operating system, the LLVM-based Clang compiler for C and C++ languages, and various open source applications and benchmarks[^1] that you can build and run. The reason is not only the popularity of these projects but also the fact that their source code is open, which enables us to better understand the underlying mechanism of how they work. This is especially useful for learning the concepts presented in this book. This doesn't mean that we will never showcase proprietary tools. For example, we extensively use Intel® VTune™ Profiler.

-Prior compiler experience helps a lot in performance-related work. Sometimes it's possible to obtain attractive speedups by forcing the compiler to generate desired machine code through various hints. You will find many such examples throughout the book. Luckily, most of the time, you don't have to be a compiler expert to drive performance improvements in your application. The majority of optimizations can be done at a source code level without the need to dig down into compiler sources.
+Prior compiler experience helps a lot in performance work. Sometimes it's possible to obtain attractive speedups by forcing the compiler to generate desired machine code through various hints. You will find many such examples throughout the book. Luckily, most of the time, you don't have to be a compiler expert to drive performance improvements in your application. The majority of optimizations can be done at a source code level without the need to dig down into compiler sources.

[^1]: Some people don't like when their application is called a "benchmark". They think that benchmarks are something that is synthesized, and contrived, and does a poor job of representing real-world scenarios. In this book, we use the terms "benchmark", "workload", and "application" interchangeably and don't mean to offend anyone.

From 59beefcbd7250d1aceccde03ec7a35f89ce52c0d Mon Sep 17 00:00:00 2001
From: Denis Bakhvalov
Date: Tue, 10 Sep 2024 12:25:26 -0400
Subject: [PATCH 2/2] Denis fixes++

---
 chapters/1-Introduction/1-6 What is in the book.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/chapters/1-Introduction/1-6 What is in the book.md b/chapters/1-Introduction/1-6 What is in the book.md
index f321e3ff2b..cf3c807da5 100644
--- a/chapters/1-Introduction/1-6 What is in the book.md
+++ b/chapters/1-Introduction/1-6 What is in the book.md
@@ -4,7 +4,7 @@ This book is written to help developers better understand the performance of the

* Why did my change cause a 2x performance drop?
* Our customers complain about the slowness of my application. How should I investigate?
-* Why does my bespoke compression algorithm perform slower than the conventional one?
+* Why does my handwritten compression algorithm perform slower than the conventional one?
* Have I optimized my program to its full potential?
* What performance analysis tools are available on my platform?
* What are techniques to reduce the number of cache misses and branch mispredictions?
@@ -17,7 +17,7 @@ The book is split into two parts. The first part (chapters 2--7) teaches you how
* Chapter 3 introduces CPU microarchitecture, with a close look at Intel's Goldencove microarchitecture.
* Chapter 4 covers terminology and metrics used in performance analysis. At the end of the chapter, we present a case study that features various performance metrics collected on four real-world applications.
* Chapter 5 explores the most popular performance analysis approaches. We describe how profiling tools work and what sort of data they can collect.
-* Chapter 6 examines features provided by modern Intel, AMD, and ARM CPUs to support and enhance performance analysis. It shows how they work and what problems they help to solve.
+* Chapter 6 examines features provided by modern Intel, AMD, and ARM-based CPUs to support and enhance performance analysis. It shows how they work and what problems they help to solve.
* Chapter 7 gives an overview of the most popular tools available on Linux, Windows, and MacOS.
* Chapter 8 is about optimizing memory accesses, cache-friendly code, data structure reorganization, and other techniques.
* Chapter 9 is about optimizing computations; it explores data dependencies, function inlining, loop optimizations, and vectorization.
@@ -26,7 +26,7 @@ The book is split into two parts. The first part (chapters 2--7) teaches you how
* Chapter 12 contains optimization topics not considered in the previous four chapters, but still important enough to find their place in this book. In this chapter, we will discuss CPU-specific optimizations, examine several microarchitecture-related performance problems, explore techniques used for optimizing low-latency applications, and give you advice on tuning your system for the best performance.
* Chapter 13 discusses techniques for analyzing multithreaded applications. It digs into some of the most important challenges of optimizing multithreaded applications. We provide a case study of five real-world multithreaded applications, where we explain why their performance doesn't scale with the number of CPU threads. We also discuss cache coherency issues (e.g. "false sharing") and a few tools that are designed to analyze multithreaded applications.

-Examples provided in this book are primarily based on open source software: Linux as the operating system, the LLVM-based Clang compiler for C and C++ languages, and various open source applications and benchmarks[^1] that you can build and run. The reason is not only the popularity of these projects but also the fact that their source code is open, which enables us to better understand the underlying mechanism of how they work. This is especially useful for learning the concepts presented in this book. This doesn't mean that we will never showcase proprietary tools. For example, we extensively use Intel® VTune™ Profiler.
+Examples provided in this book are primarily based on open-source software: Linux as the operating system, the LLVM-based Clang compiler for C and C++ languages, and various open-source applications and benchmarks[^1] that you can build and run. The reason is not only the popularity of these projects but also the fact that their source code is open, which enables us to better understand the underlying mechanism of how they work. This is especially useful for learning the concepts presented in this book. This doesn't mean that we will never showcase proprietary tools. For example, we extensively use Intel® VTune™ Profiler.

Prior compiler experience helps a lot in performance work. Sometimes it's possible to obtain attractive speedups by forcing the compiler to generate desired machine code through various hints. You will find many such examples throughout the book. Luckily, most of the time, you don't have to be a compiler expert to drive performance improvements in your application. The majority of optimizations can be done at a source code level without the need to dig down into compiler sources.