
Chapter 2 edits #69

Merged: 5 commits, Sep 13, 2024
@@ -9,7 +9,7 @@ Performance problems are often harder to reproduce and root cause than most func

Conducting fair performance experiments is an essential step towards getting accurate and meaningful results. You need to ensure you're looking at the right problem and are not debugging some unrelated issue. Designing performance tests and configuring the environment are both important components in the process of evaluating performance.

Because of the measurement bias, performance evaluations often involve statistical methods, which deserves a whole book just for itself. There are many corner cases and a huge amount of research done in this field. We will not dive into statistical methods for evaluating performance measurements. Instead, we only discuss high-level ideas and give basic directions to follow. We encourage you to research deeper on your own.
Because of the measurement bias, performance evaluations often involve statistical methods, which deserve a whole book just for themselves. There are many corner cases and a huge amount of research done in this field. We will not dive into statistical methods for evaluating performance measurements. Instead, we only discuss high-level ideas and give basic directions to follow. We encourage you to research deeper on your own.

In this chapter, we:

10 changes: 5 additions & 5 deletions chapters/2-Measuring-Performance/2-1 Noise In Modern Systems.md
@@ -2,7 +2,7 @@

There are many features in hardware and software that are designed to increase performance, but not all of them have deterministic behavior. Let's consider Dynamic Frequency Scaling (DFS), a feature that allows a CPU to increase its frequency far above the base frequency, making it run significantly faster. DFS is also frequently referred to as *turbo* mode. Unfortunately, a CPU cannot stay in turbo mode for a long time; otherwise, it risks overheating, so it later decreases its frequency to stay within its thermal limits. DFS usually depends a lot on the current system load and external factors, such as core temperature, which makes it hard to predict its impact on performance measurements.

Figure @fig:FreqScaling shows a typical example where DFS can cause variance in performance. In our scenario, we started two runs of a benchmark, one right after another on a "cold" processor.[^1] During the first second, the first iteration of the benchmark was running on the maximum turbo frequency of 4.4 Ghz but later the CPU had to decrease its frequency below 4 Ghz. The second run did not have the advantage of boosting the CPU frequency and did not enter the turbo mode. Even though we ran the exact same version of the benchmark two times, the environment in which they ran was not the same. As you can see, the first run is 200 milliseconds faster than the second run due to the fact that it was running with a higher CPU frequency in the beginning. Such a scenario can frequently happen when you benchmark software on a laptop since laptops have limited heat dissipation.
Figure @fig:FreqScaling shows a typical example where DFS can cause variance in performance. In our scenario, we started two runs of a benchmark, one right after another on a "cold" processor.[^1] During the first second, the first iteration of the benchmark was running on the maximum turbo frequency of 4.4 GHz but later the CPU had to decrease its frequency below 4 GHz. The second run did not have the advantage of boosting the CPU frequency and did not enter the turbo mode. Even though we ran the exact same version of the benchmark two times, the environment in which they ran was not the same. As you can see, the first run is 200 milliseconds faster than the second run due to the fact that it was running with a higher CPU frequency in the beginning. Such a scenario can frequently happen when you benchmark software on a laptop since laptops have limited heat dissipation.

![Variance in performance caused by dynamic frequency scaling: the first run is 200 milliseconds faster than the second.](../../img/measurements/FreqScaling.jpg){#fig:FreqScaling width=90%}
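
As a minimal illustration, the following Python sketch samples the current frequency of core 0 from Linux sysfs while a benchmark runs, making frequency scaling effects like those in Figure @fig:FreqScaling visible. It assumes a Linux system that exposes `scaling_cur_freq`, and the `./my_benchmark` command is a placeholder:

```python
#!/usr/bin/env python3
# Sample the frequency of core 0 (reported by sysfs in kHz) every 100 ms
# while the benchmark is running. Assumes Linux; "./my_benchmark" is a
# placeholder for the workload under test.
import subprocess
import time

FREQ_PATH = "/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq"

def read_freq_ghz() -> float:
    with open(FREQ_PATH) as f:
        return int(f.read()) / 1e6  # kHz -> GHz

samples = []
proc = subprocess.Popen(["./my_benchmark"])  # placeholder benchmark
start = time.time()
while proc.poll() is None:
    samples.append((time.time() - start, read_freq_ghz()))
    time.sleep(0.1)

for t, ghz in samples:
    print(f"{t:6.1f} s  {ghz:.2f} GHz")
```

Plotting these samples for two back-to-back runs should reveal whether the second run was held at a lower frequency.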

@@ -12,15 +12,15 @@ Frequency Scaling is an example of how a hardware feature can cause variations i

You're probably thinking about including a dry run before taking measurements. That certainly helps; unfortunately, measurement bias can persist through the runs as well. The [@Mytkowicz09] paper demonstrates that UNIX environment size (i.e., the total number of bytes required to store the environment variables) or the link order (the order of object files that are given to the linker) can affect performance in unpredictable ways. There are numerous other ways in which memory layout may affect performance measurements.[^2]

Having consistent performance requires running all iterations of the benchmark with the same conditions. It is impossible to achieve 100% consistent results on every run of a benchmark, but perhaps you can get close by carefully controling the environment. Eliminating non-determinism in a system is helpful for well-defined, stable performance tests, e.g., microbenchmarks.
Having consistent performance requires running all iterations of the benchmark with the same conditions. It is impossible to achieve 100% consistent results on every run of a benchmark, but perhaps you can get close by carefully controlling the environment. Eliminating nondeterminism in a system is helpful for well-defined, stable performance tests, e.g., microbenchmarks.

Consider a situation when you implemented a code change and want to know the relative speedup ratio by benchmarking the "before" and "after" versions of the program. This is a scenario in which you can control most of the variability in a system, including HW configuration, OS settings, background processes, etc. Disabling features with non-deterministic performance impact will help you get a more consistent and accurate comparison. You can find examples of such features and how to disable them in Appendix A. Also, there are tools that can set up the environment to ensure benchmarking results with a low variance; one such tool is [temci](https://github.com/parttimenerd/temci)[^14].
Consider a situation when you implemented a code change and want to know the relative speedup ratio by benchmarking the "before" and "after" versions of the program. This is a scenario in which you can control most of the variability in a system, including HW configuration, OS settings, background processes, etc. Disabling features with nondeterministic performance impact will help you get a more consistent and accurate comparison. You can find examples of such features and how to disable them in Appendix A. Also, there are tools that can set up the environment to ensure benchmarking results with a low variance; one such tool is [temci](https://github.com/parttimenerd/temci)[^14].
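
For example, on a Linux machine with the `intel_pstate` driver, preparing the environment for such a comparison might look like the minimal sketch below (run as root). The exact knobs differ between platforms, the `./my_benchmark` command is a placeholder, and Appendix A lists more options:

```python
#!/usr/bin/env python3
# Reduce run-to-run variance before a "before/after" comparison:
# disable turbo, force the performance governor, and pin the benchmark
# to a single core. Assumes Linux with the intel_pstate driver; run as root.
import glob
import subprocess

def write(path: str, value: str) -> None:
    with open(path, "w") as f:
        f.write(value)

# Disable turbo mode so the frequency stays at or below the base frequency.
write("/sys/devices/system/cpu/intel_pstate/no_turbo", "1")

# Use the "performance" governor on every core to avoid frequency ramps.
for gov in glob.glob("/sys/devices/system/cpu/cpu*/cpufreq/scaling_governor"):
    write(gov, "performance")

# Pin the benchmark to core 3 to avoid thread migrations.
subprocess.run(["taskset", "-c", "3", "./my_benchmark"], check=True)
```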

However, it is not possible to replicate the exact same environment and eliminate bias completely: there could be different temperature conditions, power delivery spikes, unexpected system interrupts, etc. Chasing all potential sources of noise and variation in a system can be a never-ending story. Sometimes it cannot be achieved, for example, when you're benchmarking a large distributed cloud service.

You should not eliminate system non-deterministic behavior when you want to measure real-world performance impact of your change. Users of your application are likely to have all the features enabled since these features provide better performance. Yes, these features may contribute to performance instabilities, but they are designed to improve the overall performance of the system. In fact, your customers probably do not care about non-deterministic performance as long as it helps to run as fast as possible. So, when you analyze the performance of a production application, you should try to replicate the target system configuration, which you are optimizing for. Introducing any artificial tuning to the system will diverge results from what users of your service will see in practice.[^3]
You should not eliminate system nondeterministic behavior when you want to measure real-world performance impact of your change. Users of your application are likely to have all the features enabled since these features provide better performance. Yes, these features may contribute to performance instabilities, but they are designed to improve the overall performance of the system. In fact, your customers probably do not care about nondeterministic performance as long as it helps to run as fast as possible. So, when you analyze the performance of a production application, you should try to replicate the target system configuration, which you are optimizing for. Introducing any artificial tuning to the system will change results from what users of your service will see in practice.[^3]

[^1]: By a cold processor, we mean a CPU that has stayed in idle mode for a while, allowing its temperature to drop.
[^2]: One approach to enable statistically sound performance analysis was presented in [@Curtsinger13]. This work showed that it's possible to eliminate measurement bias that comes from memory layout by repeatedly randomizing the placement of code, stack, and heap objects at runtime. Sadly, these ideas didn't go much further, and right now, this project is almost abandoned.
[^3]: Another downside of disabling non-deterministic performance features is that it makes a benchmark run longer. This is especially important for CI/CD performance testing when there are time limits for how long it should take to run the whole benchmark suite.
[^3]: Another downside of disabling nondeterministic performance features is that it makes a benchmark run longer. This is especially important for CI/CD performance testing when there are time limits for how long it should take to run the whole benchmark suite.
[^14]: Temci - [https://github.com/parttimenerd/temci](https://github.com/parttimenerd/temci).
@@ -4,7 +4,7 @@

We just discussed why you should monitor performance in production. On the other hand, it is still beneficial to set up continuous "in-house" testing to catch performance problems early, even though not every performance regression can be caught in a lab.

Software vendors constantly seek ways to accelerate the pace of delivering their products to the market. Many companies deploy newly written code every couple of months or weeks. Unfortunately, software products don't get better performance with each new release. Performance defects tend to leak into production software at an alarming rate [@UnderstandingPerfRegress]. A large number of code changes pose a challenge to thoroughly analyze their performance impact.
Software vendors constantly seek ways to accelerate the pace of delivering their products to the market. Many companies deploy newly written code every couple of months or weeks. Unfortunately, software products don't get better performance with each new release. Performance defects tend to leak into production software at an alarming rate [@UnderstandingPerfRegress]. A large number of code changes pose a challenge to thorough analysis of their performance impact.

Performance regressions are defects that make the software run slower compared to the previous version. Catching performance regressions (or improvements) requires detecting the commit that has changed the performance of the program. From database systems to search engines to compilers, performance regressions are commonly experienced by almost all large-scale software systems during their continuous evolution and deployment life cycle. It may be impossible to entirely avoid performance regressions during software development, but with proper testing and diagnostic tools, the likelihood of such defects silently leaking into production code can be reduced significantly.

@@ -14,7 +14,7 @@ It is useful to track the performance of your application with charts, like the

Let's consider some potential solutions for detecting performance regressions. The first option that comes to mind is having humans look at the graphs. For the chart in Figure @fig:PerfRegress, humans will likely catch the performance regression that happened on August 7th, but it's not obvious that they will detect the later, smaller regressions. People tend to lose focus quickly and can miss regressions, especially on a busy chart. In addition, it is a time-consuming and boring job that must be performed daily.

There is another interesting performance drop on August 3rd. A developer will also likely catch it, however, most of us would be tempted to dismiss it since performance recovered the next day. But are we sure that it was merely a glitch in measurements? What if this was a real regression that was compensated by an optimization on August 4th? If we could fix the regression *and* keep the optimization, we would have a performance score of around 4500. Do not dismiss such cases. One way to proceed here would be to repeat the measurements for the dates Aug 02 - Aug 04 and inspect code changes during that period.
There is another interesting performance drop on August 3rd. A developer will also likely catch it, however, most of us would be tempted to dismiss it since performance recovered the next day. But are we sure that it was merely a glitch in measurements? What if this was a real regression that was compensated by an optimization on August 4th? If we could fix the regression *and* keep the optimization, we would have a performance score of around 4500. Do not dismiss such cases. One way to proceed here would be to repeat the measurements for the dates Aug 02--Aug 04 and inspect code changes during that period.

The second option is to have a threshold, say, 2%. Every code modification that has performance within that threshold is considered noise, and everything above the threshold is considered a regression. It is somewhat better than the first option but still has its own drawbacks. Fluctuations in performance tests are inevitable: sometimes, even a harmless code change can trigger performance variation in a benchmark.[^3] Choosing the right value for the threshold is extremely hard, and it does not guarantee a low rate of both false-positive and false-negative alarms. Setting the threshold too low might lead to analyzing a bunch of small regressions that were caused not by a change in source code but by random noise. Setting the threshold too high might lead to filtering out real performance regressions.
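
A minimal sketch of the threshold approach, with made-up numbers, might look like this:

```python
# Threshold-based regression check: compare the new result against the
# previous baseline and alert on anything slower than THRESHOLD.
THRESHOLD = 0.02  # treat slowdowns within 2% as noise

def check_regression(baseline_sec: float, current_sec: float) -> None:
    slowdown = (current_sec - baseline_sec) / baseline_sec
    if slowdown > THRESHOLD:
        print(f"ALERT: {slowdown:.1%} regression "
              f"({baseline_sec:.2f}s -> {current_sec:.2f}s)")
    else:
        print(f"OK: {slowdown:+.1%} is within the {THRESHOLD:.0%} threshold")

check_regression(5.0, 5.05)  # within the threshold: treated as noise
check_regression(5.0, 11.0)  # immediate alert, even if it is a one-off outlier
```

Notice that the check reacts immediately, which is both its strength (fast feedback) and its weakness (a single noisy run can raise a false alarm).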

@@ -26,9 +26,9 @@ It's worth mentioning that tracking performance results over time requires that

Another option that recently became popular uses a statistical approach to identify performance regressions. It leverages an algorithm called "Change Point Detection" (CPD, see [@ChangePointAnalysis]), which utilizes historical data and identifies points in time where performance has changed. Many performance monitoring systems embraced the CPD algorithm, including several open-source projects. You can search the web to find the one that better suits your needs.

The notable advantage of CPD is that it does not require setting thresholds. The algorithm evaluates a large window of recent results, which allows it to ignore outliers as noise and produce fewer false positives. The downside for CPD is the lack of immediate feedback. For example, consider a benchmark `B` that has the following historical measurements of running time: 5 sec, 6 sec, 5 sec, 5 sec, 7 sec. If the next benchmark result comes at 11 seconds, then the threshold would likely be exceeded and an alert would be generated immediately. However, in the case of using the CPD algorithm, it wouldn't do anything at this point. If in the next run, performance restores back to 5 seconds, then it would likely dismiss it as a false positive and not generate an alert. Conversely, if the next run or two resulted in 10 sec and 12 sec respectively, only then would the CPD algorithm trigger an alert.
The notable advantage of CPD is that it does not require setting thresholds. The algorithm evaluates a large window of recent results, which allows it to ignore outliers as noise and produce fewer false positives. The downside for CPD is the lack of immediate feedback. For example, consider a benchmark `B` that has the following historical measurements of running time: 5 sec, 6 sec, 5 sec, 5 sec, 7 sec. If the next benchmark result comes at 11 seconds, then the threshold would likely be exceeded and an alert would be generated immediately. However, in the case of using the CPD algorithm, it wouldn't do anything at this point. If in the next run, performance is restored to 5 seconds, then it would likely dismiss it as a false positive and not generate an alert. Conversely, if the next run or two resulted in 10 sec and 12 sec respectively, only then would the CPD algorithm trigger an alert.
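
Production-grade CPD implementations (for example, those based on the E-divisive family of algorithms) are statistically rigorous; the toy Python sketch below only mimics the delayed-feedback behavior described above by flagging a change once several consecutive results differ substantially from the historical mean:

```python
# Toy change-point-style check: report a change only when the last few
# results are consistently far from the historical mean. This is NOT a real
# CPD algorithm; it only illustrates why feedback arrives later.
from statistics import mean

def change_detected(history, recent, min_recent=3, ratio=1.5):
    if len(recent) < min_recent:
        return False
    base = mean(history)
    return (all(r > base * ratio for r in recent) or
            all(r < base / ratio for r in recent))

history = [5, 6, 5, 5, 7]                      # seconds
print(change_detected(history, [11]))          # False: a single outlier is ignored
print(change_detected(history, [11, 10, 12]))  # True: the shift is persistent
```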

There is no clear answer to which approach is better. If your development flow requires immediate feedback, e.g., evaluating a pull request before it gets merged, then using thresholds is a better choice. Also, if you can remove a lot of noise from your system and achieve stable performance results, then using thresholds is more appropriate. In a very quiet system, the 11-second measurement mentioned before likely indicates a real performance regression, thus we need to flag it as early as possible. In contrast, if you have a lot of noise in your system, e.g., you run distributed macro-benchmarks, then that 11-second result may just be a false positive. In this case, you may be better off using Change Point Detection.
There is no clear answer to which approach is better. If your development flow requires immediate feedback, e.g., evaluating a pull request before it gets merged, then using thresholds is a better choice. Also, if you can remove a lot of noise from your system and achieve stable performance results, then using thresholds is more appropriate. In a very quiet system, the 11 second measurement mentioned before likely indicates a real performance regression, thus we need to flag it as early as possible. In contrast, if you have a lot of noise in your system, e.g., you run distributed macro-benchmarks, then that 11 second result may just be a false positive. In this case, you may be better off using Change Point Detection.

A typical CI performance tracking system should automate the following actions:

@@ -23,7 +23,7 @@ Let's describe the terms indicated on the image:

By looking at the box plot in Figure @fig:BoxPlot, we can sense that our code change has a positive impact on performance since "after" samples are generally faster than "before". However, there are some "before" measurements that are faster than "after". Box plots allow comparisons of multiple distributions on the same chart. The benefits of using box plots for visualizing performance distributions are described in a blog post by Stefan Marr.[^13]
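
For instance, a chart similar to Figure @fig:BoxPlot can be produced with a few lines of Python; the measurements below are made up for illustration, and `matplotlib` is assumed to be available:

```python
# Draw side-by-side box plots of the "before" and "after" distributions.
import matplotlib.pyplot as plt

before = [2.10, 2.15, 2.08, 2.25, 2.11, 2.31, 2.09]  # seconds, illustrative
after  = [1.90, 1.88, 1.95, 1.92, 2.12, 1.89, 1.91]

plt.boxplot([before, after])
plt.xticks([1, 2], ["before", "after"])
plt.ylabel("running time, s")
plt.savefig("boxplot.png")
```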

Performance speedups can be calculated by taking a ratio between the two means. In some cases, you can use other metrics to calculate speedups, including median, min, and 95th percentile, depending on which one is more representable for your distribution.
Performance speedups can be calculated by taking a ratio between the two means. In some cases, you can use other metrics to calculate speedups, including median, min, and 95th percentile, depending on which one is more representative for your distribution.

*Standard deviation* quantifies how much the values in a dataset deviate from the mean on average. A low standard deviation indicates that the data points are close to the mean, while a high standard deviation indicates that they are spread out over a wider range. Unless distributions have low standard deviation, do not calculate speedups. If the standard deviation in the measurements is on the same order of magnitude as the mean, the average is not a representative metric. Consider taking steps to reduce noise in your measurements. If that is not possible, present your results as a combination of the key metrics such as mean, median, standard deviation, percentiles, min, max, etc.
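
The Python sketch below (again with made-up measurements) follows this advice: it reports the key statistics for both distributions and quotes a single speedup number only when the relative standard deviation is small:

```python
# Summarize "before"/"after" measurements; compute a mean-based speedup only
# when both distributions are tight relative to their means.
import math
import statistics as st

def summarize(name, samples):
    s = sorted(samples)
    p95 = s[min(len(s) - 1, math.ceil(0.95 * len(s)) - 1)]  # nearest-rank p95
    print(f"{name}: mean={st.mean(s):.3f}s median={st.median(s):.3f}s "
          f"stdev={st.stdev(s):.3f}s min={s[0]:.3f}s p95={p95:.3f}s")
    return st.mean(s), st.stdev(s)

before = [2.10, 2.15, 2.08, 2.25, 2.11]  # seconds, illustrative
after  = [1.90, 1.88, 1.95, 1.92, 1.89]

mean_b, sd_b = summarize("before", before)
mean_a, sd_a = summarize("after", after)

if sd_b / mean_b < 0.05 and sd_a / mean_a < 0.05:
    print(f"speedup: {mean_b / mean_a:.2f}x")
else:
    print("too noisy; report mean/median/stdev/percentiles instead of one number")
```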
