
Investigate variance in benchmarking results #251

Open

bachase opened this issue Feb 21, 2025 · 6 comments
Assignees
Labels
bug Something isn't working

Comments

@bachase
Collaborator

bachase commented Feb 21, 2025

Taking a look at the latest benchmarks, it looks like PyTKET is suddenly slower with these updates. There wasn't a change to the PyTKET version though, so I'm not sure what could be causing this.

Image

Originally posted by @jordandsullivan in #250 (comment)

@bachase bachase self-assigned this Feb 21, 2025
@bachase bachase added the bug Something isn't working label Feb 21, 2025
@bachase
Collaborator Author

bachase commented Feb 21, 2025

Haven't found a clear issue yet, but I noticed a few things that could be uncontrolled sources of variance:

  1. Multiple benchmark actions could be running in parallel (look into controlling job concurrency).
  2. The docker container built for each run pulls the latest compatible libraries. Even if the main compiler dependency versions are unchanged, it's possible that transitive dependencies impact performance. For example, if qbraid changed how it translates circuits between compiler libraries, that would change the outputs.
  3. Qiskit-Aer simulation has some support for parallelization (see options here). I don't believe these are enabled by default or in our configurations, but I'm still working out what gets parallelized at the Python level versus what might be happening in the C++ layer (see the sketch below).
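
As a concrete illustration of item 3, here is a minimal sketch, assuming the qiskit_aer package is installed (the circuit and shot count are just for illustration, not the repo's benchmark code), of capping Qiskit-Aer's OpenMP threading when constructing the simulator:

```python
# Minimal sketch, not the ucc benchmark script: cap Qiskit-Aer's OpenMP
# parallelism so a statevector simulation cannot grab every available core.
from qiskit import QuantumCircuit, transpile
from qiskit_aer import AerSimulator

# max_parallel_threads=0 (the Aer default) means "use all available threads"
# when Aer is built with OpenMP support; 1 forces single-threaded simulation.
sim = AerSimulator(max_parallel_threads=1)

qc = QuantumCircuit(2)
qc.h(0)
qc.cx(0, 1)
qc.measure_all()

result = sim.run(transpile(qc, sim), shots=1000).result()
print(result.get_counts())
```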

@bachase
Collaborator Author

bachase commented Feb 21, 2025

I was able to recreate a bit of this locally by focusing on the large recent change in pytket running time. Locally, I found that varying the parallelism of the script changes this result. Looking into the Qiskit-Aer options mentioned previously, it does look like, if OpenMP is enabled, the simulation by default tries to use all available threads (see here). For these benchmarks, I believe this only matters for the statevector simulation of HOV, which repeats several shots.

To test this on the GitHub runner, I made a change to set the max threads to 1 when constructing the AerSimulator instances (https://github.com/unitaryfund/ucc/tree/bc-test-AER-1-thread). Running the benchmarks shows this reverts back toward the prior timing.

Baseline (current master branch)

pytket,qv,15000,16.386236906051636,15000

With max threads set to 1

pytket,qv,15000,8.439182996749878,15000

which is closer to the older averages

pytket,qv,15000,4.371891975402832,15000

but still off by a factor of two.

I'll next try separating the timing runs from the expectation runs to see if that normalizes things better (maybe something else is causing contention on these processes). I would expect the performance isn't so sensitive that we'd need to pin to specific cores.

@bachase
Collaborator Author

bachase commented Feb 24, 2025

Following on the above investigations (which can now be tracked in https://github.com/unitaryfund/ucc/tree/bc-test-AER-1-thread), it looks like both adding the expectation value runs AND varying the parallelism specified to parallel non-trivially impact the runtime of the benchmarks. For the expectation value case, this is in part due to the default parallelization within qiskit-aer. For parallel, I'm still working to understand the cause, but the GitHub runner is virtualized and might have non-trivial sharing/throttling mechanisms, and even if not, the OS might be doing some unexpected scheduling of processes/threads while parallel is running.
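
As a quick diagnostic, something like the following could be logged at the start of each run to see what CPU resources the runner actually exposes (a hypothetical sketch, not something currently in the benchmark scripts):

```python
# Hypothetical diagnostic sketch: log what CPU resources the GitHub runner
# exposes to the benchmark process. Not part of the current benchmark scripts.
import os
import platform

print("platform:", platform.platform())
print("cpu_count:", os.cpu_count())

# On Linux, the scheduler affinity mask can be narrower than cpu_count()
# if the runner restricts which cores this process may use.
if hasattr(os, "sched_getaffinity"):
    print("usable cores:", len(os.sched_getaffinity(0)))

# Load average hints at contention from other processes on the same host.
if hasattr(os, "getloadavg"):
    print("load average (1/5/15 min):", os.getloadavg())
```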

Below, I'll summarize these observations and then make some recommendations on next steps.

Observations

Limiting qiskit-aer to one thread (f1e0b5a)

You can see below that the average compile time for pytket drops back down a bit, as do cirq and qiskit.

Image

Separating timing from exp value runs (ab42832)

In this commit, parallel is still used in the benchmarking shell script, but it is separated into one call for just the timing benchmarks and then another call for the expected value benchmarks. You can see an even larger change (the last dot) for pytket and a bit more for cirq.
Image

No parallelism in benchmarking runs (7a5778b)

This explicitly uses parallel -j 1 in the benchmarking script. You can see all benchmark runs now drop down for ucc and qiskit.

Image

Recommendation

The above shows the sensitivity of the benchmarks to some of the configuration and GitHub runner setup. I think there are some immediate changes we can make to unblock near-term reporting and milestones, and then some longer-term ones to make this more manageable and sustainable.

Now

For the next/nearest milestone, we could

  1. Revert to no parallelization in the benchmarking runs to eliminate any resource contention. This will lengthen the duration of a benchmark workflow.
  2. Explicitly set max_parallel_threads=1 for Qiskit-Aer. Even though we aren't focused on the timing of these simulations for the expectation value benchmark, I'd rather set them to 1 for now to avoid the issue recurring, pending some of the longer-term changes below.
  3. Ensure only one concurrent job for the GitHub workflow. I'll need to confirm, but I believe that without this, multiple benchmark runs could be scheduled IF our GitHub plan were to change (or GitHub were to grant more resources).
  4. Use the existing locked dependencies when doing a benchmark run, rather than having the benchmark job pull the latest available. This will prevent unintended changes in dependent libraries and (as described below) start making the benchmarking runs more reproducible.
  5. Related to Update benchmark plot resolution to monthly #200, we should consider scrubbing and rerunning some of the historical benchmarks to consistently use this no-parallelism approach. Perhaps we can do this for the versions we intend to plot in #200. Or, if we keep things as they are, we need to caveat the issue in the plot when showing it.

Next

This issue uncovered a few challenges in making benchmark runs reproducible. Most of the above discusses the parallelism and configuration of the GitHub runner. But this issue also shows that the dependency configuration is not static (given the benchmarking job always pulls the latest available). The way we store the results is also not simply reproducible, since we append results to a date-based file for the date of each run. In other words, there are two things to improve: making the benchmarking hardware more consistent, and making the specification of a benchmark and the storage of its results reproducible.

I can include/expand on this in #235, but briefly:

  1. Get consistent results from the benchmark machine. On the one hand, we could switch to a bare-metal system where we have more control over what else is running, the ability to pin processes to cores, and overall management to make the system reproducible. But that feels like too much for this type of larger-scale benchmarking test. Instead, we should consider averaging or multiple-run approaches (maybe pytest_bench?) to control for this. We can also continue to investigate the GH runner setup to ensure we've configured it for consistency.
  2. Ensure the benchmark configuration and results are reproducible. Associate benchmark results with the git hash of the commit on which they are run and the system details of the GH runner used, instead of the date the benchmark is run (see the sketch after this list). In tandem with using the locked dependencies above, this ensures each run is idempotent/reproducible up to the uncontrollable aspects of the runner hardware. This will make it easier to support workflows that compare performance across releases (since each release is a specific git hash). Likewise, it will make it easier to compare the impact of a PR against the current baseline. Combined with dependabot, this would isolate the impact of upgrading a compiler version before including it in the top-line results, ensuring we can catch an issue before it happens.
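
To make the second point concrete, here is a minimal sketch of recording that metadata alongside results; the field names and output path are hypothetical, not an existing format in the repo:

```python
# Hypothetical sketch: key benchmark results by commit hash and runner details
# rather than by the date the benchmark happened to run.
import json
import platform
import subprocess
from datetime import datetime, timezone

def run_metadata() -> dict:
    commit = subprocess.run(
        ["git", "rev-parse", "HEAD"], capture_output=True, text=True, check=True
    ).stdout.strip()
    return {
        "commit": commit,
        "runner_os": platform.platform(),
        "python": platform.python_version(),
        "machine": platform.machine(),
        # Timestamp kept for reference only; the commit hash is the key.
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

if __name__ == "__main__":
    meta = run_metadata()
    # e.g. results keyed by commit instead of appended to a date-based file
    with open(f"results_{meta['commit'][:8]}.json", "w") as f:
        json.dump(meta, f, indent=2)
```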

Some of these steps might be premature given how early UCC is in developing an optimization and performance strategy. But if that is an area of focus, it's worth investing time to improve the benchmarking approach.

@jordandsullivan
Collaborator

Thanks for investigating @bachase. I am still not clear on why updating PyTKET was the change where this showed up.

The parallelism has been set to the same number since we began using the Actions workflow, so again I'm not sure why this would be an issue here.

@bachase
Collaborator Author

bachase commented Feb 25, 2025

@jordandsullivan I admit I don't think I've fully explained it, but here's my best attempt, focusing on just that change in PyTKET compile times.

To confirm we are talking about the same observation, the question is: Why did the benchmark's average compilation time increase so much for PyTKET around Feb 19, given the benchmarks have been using the same version (1.40.0) of PyTKET since Jan 28?

Looking at benchmark runs around this date, the jump started in #250 as you had originally noticed and flagged.

The main change in #250 was to upgrade to the latest versions of the dependencies in the poetry.lock file. At the bottom of this comment are the libraries that changed between those runs, which I extracted from the GitHub Actions workflow logs. So no change to the PyTKET version was involved. It's possible a transitive dependency changed, but I didn't dig that far.

Instead, I tried to recreate this locally and noticed the parallelism was impacting the duration of the benchmarks. I confirmed that this also happened on the GitHub runners. The successive results in the comment above show how changing different aspects of parallelization dropped the pytket runtimes back toward their earlier average.

The biggest impact does look to be from having the expectation value benchmark runs enabled. You can see this drift upwards starting around Feb 11 or so as well. So I think that does match your instinct here, that something other than the use of parallel is the main issue. But still, your question remains: why did PyTKET get impacted so much? My best guess is that it gets unlucky and happens to run at the same time as one of the expected value benchmarks. Or the upgrade from qiskit 1.3.2 to 1.3.3 impacted something about the noise simulations? But I got to the point where debugging feels like diminishing returns, so I haven't gone much deeper there.

But following up on your instinct here, rather than my original proposal to remove all parallelism, perhaps we focus on just having the expected value benchmarks run separately from the timing benchmarks? That seems to control for most of this variance.


openqasm3: 1.0.0 → 1.0.1
roman-numerals-py: added at 3.0.0
stevedore: 5.4.0 → 5.4.1
matplotlib: 3.10.0 (version unchanged, only reordered in the lock output)
mdit-py-plugins: 0.4.2 (version unchanged, only reordered in the lock output)
pyqasm: 0.2.0 → 0.2.1
qiskit: 1.3.2 → 1.3.3
sphinx: 8.1.3 → 8.2.0
qbraid: 0.9.3 → 0.9.4
ruff: 0.9.6 → 0.9.7

@jordandsullivan
Collaborator

Yep, I think running the expected value benchmarks separately (or successively, making sure they do not get run in parallel with the compile time/gate count benchmarks) would be very logical.
