(Micro)Benchmarking reliability and consistency #753

Open

NthPortal opened this issue Dec 6, 2020 · 8 comments

Motivation

A frequent request on scala/scala PRs (particularly for collections changes) is that the changes be benchmarked; however, contributors running benchmarks on their personal computers face enough obstacles that many, perhaps most, results would generously be classified as "questionable".

Background

The following table lists some common causes of performance/timing variance, and whether a particular type of machine avoids them.

Cause of variance          Laptop   Overclocked Desktop   Normally-clocked Desktop
Boost clock speed change   no [1]   no                    yes
Thermal throttling         no       no                    yes [2]
Background tasks           no       no                    no

[1] While it is theoretically possible to turn off overclocking/boost-clocking on a laptop, the CPU may also clock down due to even brief changes in battery/power state.
[2] A normally-clocked desktop with good ventilation and cooling shouldn't thermally throttle, but neither of those is guaranteed in a person's home (sometimes cats sit on computers, for example).

The only type of machine that avoids any of these issues is a normally-clocked desktop, and not everyone has one of those (many of us only have laptops).

Additionally, all personal computers suffer from the problem that background tasks (if not foreground tasks) are almost certainly running on them at all times. Benchmarks can take a long time to run, and even if someone can manage not to use their computer for an hour or two while benchmarks run, they probably don't want to close their web browser, 3+ chat applications (all Electron, so basically also web browsers), and half a dozen other running programs and services. If they can't spare potentially multiple hours of their computer being tied up, it's even worse: foreground tasks then take arbitrary and inconsistent amounts of CPU time.

Ideal Setup

For benchmarking to be reliable, it should be done on a dedicated machine that runs nothing else, and on which cron/scheduled jobs never run while a benchmark is running.
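
(For reference, the benchmarks in question are JMH microbenchmarks, like those under test/benchmarks in scala/scala. The following is a minimal sketch of the kind of collections benchmark being discussed; the class, method, and JMH settings are illustrative rather than taken from the repository.)

```scala
import java.util.concurrent.TimeUnit

import org.openjdk.jmh.annotations._
import org.openjdk.jmh.infra.Blackhole

// Hypothetical collections microbenchmark; real ones follow the same JMH pattern.
@BenchmarkMode(Array(Mode.AverageTime))
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
@Warmup(iterations = 10)
@Measurement(iterations = 10)
@Fork(2)
class ListPrependBenchmark {
  @Param(Array("10", "1000", "100000"))
  var size: Int = _

  var xs: List[Int] = _

  // Build the input list once per trial so allocation isn't part of the measurement.
  @Setup(Level.Trial)
  def setup(): Unit = xs = List.tabulate(size)(identity)

  @Benchmark
  def prepend(bh: Blackhole): Unit = bh.consume(0 :: xs)
}
```

With multiple forks plus warmup and measurement iterations per parameter value, even a small suite like this occupies a machine for a long time, which is exactly what makes tying up a personal computer so painful.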


How do we reliably benchmark library changes?

lrytz (Member) commented Dec 6, 2020

Thanks for bringing this up! For more technical aspects, see also #338.

We have one machine that we use for compiler benchmarks. It's not that busy; maybe we can find a good way to make it available to contributors, e.g. allow them to take the machine offline in Jenkins and ssh to it.

lrytz (Member) commented Dec 8, 2020

@retronym says he'll look into this.

retronym (Member) commented

I spent some time trying to get our Jenkins instance to have a parameterized job that could run specified benchmarks on our benchmarking server. Jenkins seemed to actively resist this and wouldn't save my job configs, so I had to park the attempt. I'll try again...

retronym (Member) commented

@adriaanm That Jenkins ticket mentions disabling the notifications plugin as a workaround; after that, the save/apply actions on the job config UI worked again. I notice we're running 1.13 of the plugin, but previously we were running a custom build you'd created:

[screenshot: the installed version of the Jenkins notifications plugin]

Can you provide context for that custom version? Is this something that you're working on now or something you worked on previously?

Ichoran commented Dec 16, 2020

There are a variety of ways to solve this experimentally instead of with quiet hardware. For instance, you can halve the number of iterations and run the whole thing twice. If any of the head-to-head comparisons aren't stable, you disbelieve the whole lot and do it all again. Usually, in my experience, they are pretty stable even on a laptop as long as you're not doing a million other things at the same time. (Watching video + compiling + benchmarking is probably a bad idea. Editing code and benchmarking is probably fine.)

You do always have to run the benchmarks head-to-head at roughly the same time, and not expect them to be stable over days/months/whatever. If you're trying to search for performance regressions, then you do want the quiet-machine approach. But for regular PRs, I don't think it's necessary.
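
That check can even be scripted. Here's a rough sketch of the "halve the iterations, run everything twice, distrust unstable comparisons" idea, assuming each run was saved with JMH's CSV reporter (-rf csv -rff <file>); the argument order, key format, and 10% tolerance are arbitrary choices, not something prescribed here:

```scala
import scala.io.Source

object StabilityCheck {

  // Parse a JMH CSV report into (benchmark + params) -> score.
  def scores(path: String): Map[String, Double] = {
    val lines    = Source.fromFile(path).getLines().toList
    val header   = lines.head.split(",").map(_.replace("\"", "").trim)
    val nameIdx  = header.indexOf("Benchmark")
    val scoreIdx = header.indexOf("Score")
    val paramIdx = header.zipWithIndex.collect { case (h, i) if h.startsWith("Param") => i }
    lines.tail.map { line =>
      val cols = line.split(",").map(_.replace("\"", "").trim)
      val key  = (nameIdx +: paramIdx).map(cols(_)).mkString(":")
      key -> cols(scoreIdx).toDouble
    }.toMap
  }

  // args: baseline and PR results from run 1, then baseline and PR results from run 2
  def main(args: Array[String]): Unit = {
    val Array(base1, pr1, base2, pr2) = args.map(scores)
    val common = Seq(base1, pr1, base2, pr2).map(_.keySet).reduce(_ intersect _)
    val unstable = common.filter { name =>
      val ratio1 = pr1(name) / base1(name)
      val ratio2 = pr2(name) / base2(name)
      math.abs(ratio1 - ratio2) / ratio1 > 0.10 // the two runs disagree by > 10%
    }
    if (unstable.isEmpty) println("head-to-head comparisons look stable")
    else unstable.foreach(n => println(s"unstable, rerun everything: $n"))
  }
}
```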

Note that a bigger problem is different architectures. It's often the case that code that is faster on one architecture is slower on another. So you can have different people making different decisions about high-performance code, each based on accurate microbenchmarking on different hardware.

SethTisue (Member) commented

an older ticket with a bunch of benchmarking advice: #606
