Always run vf2 scoring serially #11112

mtreinish · 2023-10-25T16:47:03Z

Summary

The VF2Layout and VF2PostLayout passes' Rust scoring function has the option to use a parallel iterator for computing the score of a layout given a sufficiently large circuit (both in number of 2q interactions or qubits). Previously the scoring would be multithreaded if there were > 50 qubits or > 50 2q interactions in the circuit. However, as recent testing has shown in most cases the performance of doing this operation in parallel is much worse than serial execution. To address this issue, this commit just always uses the serial version. The parallel option is left in place for now just in case someone was manually calling the scoring function with the parallel option set to True.

Details and comments

qiskit-bot · 2023-10-25T16:47:09Z

One or more of the the following people are requested to review this:

@Qiskit/terra-core

mtreinish · 2023-10-25T16:48:32Z

For comparison these are the performance results from this PR compared to main running in parallel:

I generated that using this script:

import time

from qiskit.transpiler.preset_passmanagers import common 
from qiskit.transpiler.passes import VF2PostLayout
from qiskit.transpiler.preset_passmanagers import plugin
from qiskit.transpiler.passmanager_config import PassManagerConfig
from qiskit.providers.fake_provider import FakeSherbrooke
from qiskit import QuantumCircuit


backend = FakeSherbrooke()
pm_config = PassManagerConfig.from_backend(backend, seed_transpiler=42)

pm = common.generate_unroll_3q(backend.target)
pm += plugin.PassManagerStagePluginManager().get_passmanager_stage("layout", "default", pm_config, optimization_level=3)

call_limit, max_trials = common.get_vf2_limits(optimization_level=3)
vf2_pass = VF2PostLayout(target=backend.target, call_limit=call_limit, max_trials=max_trials, seed=42, strict_direction=False)

times = []
for n in range(2, 127):
    qc = QuantumCircuit(n, n - 1)
    qc.x(n-1)
    qc.h(range(n))
    qc.cx(range(n-1), n-1)
    qc.h(range(n-1))
    qc.barrier()
    qc.measure(range(n-1), range(n-1))
    to_post_layout = pm.run(qc)
    start = time.perf_counter()
    vf2_pass(to_post_layout)
    stop = time.perf_counter()
    times.append(stop - start)
    print(stop - start)
print(times)

coveralls · 2023-10-25T17:17:29Z

Pull Request Test Coverage Report for Build 6643541599

0 of 0 changed or added relevant lines in 0 files are covered.
13 unchanged lines in 3 files lost coverage.
Overall coverage increased (+0.002%) to 86.915%

Files with Coverage Reduction	New Missed Lines	%
qiskit/extensions/quantum_initializer/squ.py	2	84.15%
crates/qasm2/src/lex.rs	5	91.92%
crates/accelerate/src/vf2_layout.rs	6	83.64%

Totals
Change from base Build 6642110371:	0.002%
Covered Lines:	73967
Relevant Lines:	85103

💛 - Coveralls

As part of the VF2Layout and VF2PostLayout passes when there are a large number of matches found we're spending an inordinate amount of time in scoring rebuilding the same views over and over again of the interaction graph for each scoring call. For example, in one test cProfile showed that with Qiskit#11112 when running transpile() on a 65 Bernstein Vazirani circuit with a secret of all 1s for FakeSherbrooke with optimization_level=3 we were calling vf2_utils.score_layout() 161,761 times which took a culmulative time of 14.33 secs. Of that time though we spent 5.865 secs building the edge list view. These views are fixed for a given interaction graph which doesn't change during the duration of the run() method on these passes. To remove this inefficiency this commit moves the construction of the views to the beginning of the passes and just reuses them by reference for each scoring call, avoiding the reconstruction overhead.

The VF2Layout and VF2PostLayout passes' Rust scoring function has the option to use a parallel iterator for computing the score of a layout given a sufficiently large circuit (both in number of 2q interactions or qubits). Previously the scoring would be multithreaded if there were > 50 qubits or > 50 2q interactions in the circuit. However, as recent testing has shown in most cases the performance of doing this operation in parallel is much worse than serial execution. To address this issue, this commit just always uses the serial version. The parallel option is left in place for now just in case someone was manually calling the scoring function with the parallel option set to ``True``.

As part of the VF2Layout and VF2PostLayout passes when there are a large number of matches found we're spending an inordinate amount of time in scoring rebuilding the same views over and over again of the interaction graph for each scoring call. For example, in one test cProfile showed that with Qiskit#11112 when running transpile() on a 65 Bernstein Vazirani circuit with a secret of all 1s for FakeSherbrooke with optimization_level=3 we were calling vf2_utils.score_layout() 161,761 times which took a culmulative time of 14.33 secs. Of that time though we spent 5.865 secs building the edge list view. These views are fixed for a given interaction graph which doesn't change during the duration of the run() method on these passes. To remove this inefficiency this commit moves the construction of the views to the beginning of the passes and just reuses them by reference for each scoring call, avoiding the reconstruction overhead.

kevinhartman · 2023-10-26T17:46:42Z

This looks good. Technically changing the default behavior in vf2_utils would be a breaking change right? Though the impact is non-functional, and even still, going from parallel to single threaded wouldn't really have caller implications. Still might be good to have a release note.

mtreinish · 2023-10-26T18:26:52Z

Yeah, I'm not sure it's a breaking change, vf2_utils is not really part of our documented public API it's only really useful for the vf2 passes themselves. For the passes themselves it's just an internal implementation detail, I don't think the API contract for VF2Layout or VF2PostLayout ever explicitly stated that it ran in multiple threads. That being said I'll write a quick other section release note to document this just in case.

mtreinish · 2023-10-26T18:30:52Z

I added the release note in: a868032

kevinhartman

LGTM

* Always run vf2 scoring serially The VF2Layout and VF2PostLayout passes' Rust scoring function has the option to use a parallel iterator for computing the score of a layout given a sufficiently large circuit (both in number of 2q interactions or qubits). Previously the scoring would be multithreaded if there were > 50 qubits or > 50 2q interactions in the circuit. However, as recent testing has shown in most cases the performance of doing this operation in parallel is much worse than serial execution. To address this issue, this commit just always uses the serial version. The parallel option is left in place for now just in case someone was manually calling the scoring function with the parallel option set to ``True``. * Add release note (cherry picked from commit 0764a33)

As part of the VF2Layout and VF2PostLayout passes when there are a large number of matches found we're spending an inordinate amount of time in scoring rebuilding the same views over and over again of the interaction graph for each scoring call. For example, in one test cProfile showed that with Qiskit#11112 when running transpile() on a 65 Bernstein Vazirani circuit with a secret of all 1s for FakeSherbrooke with optimization_level=3 we were calling vf2_utils.score_layout() 161,761 times which took a culmulative time of 14.33 secs. Of that time though we spent 5.865 secs building the edge list view. These views are fixed for a given interaction graph which doesn't change during the duration of the run() method on these passes. To remove this inefficiency this commit moves the construction of the views to the beginning of the passes and just reuses them by reference for each scoring call, avoiding the reconstruction overhead.

* Always run vf2 scoring serially The VF2Layout and VF2PostLayout passes' Rust scoring function has the option to use a parallel iterator for computing the score of a layout given a sufficiently large circuit (both in number of 2q interactions or qubits). Previously the scoring would be multithreaded if there were > 50 qubits or > 50 2q interactions in the circuit. However, as recent testing has shown in most cases the performance of doing this operation in parallel is much worse than serial execution. To address this issue, this commit just always uses the serial version. The parallel option is left in place for now just in case someone was manually calling the scoring function with the parallel option set to ``True``. * Add release note (cherry picked from commit 0764a33) Co-authored-by: Matthew Treinish <mtreinish@kortar.org>

* Reuse VF2 scoring views for all scoring As part of the VF2Layout and VF2PostLayout passes when there are a large number of matches found we're spending an inordinate amount of time in scoring rebuilding the same views over and over again of the interaction graph for each scoring call. For example, in one test cProfile showed that with #11112 when running transpile() on a 65 Bernstein Vazirani circuit with a secret of all 1s for FakeSherbrooke with optimization_level=3 we were calling vf2_utils.score_layout() 161,761 times which took a culmulative time of 14.33 secs. Of that time though we spent 5.865 secs building the edge list view. These views are fixed for a given interaction graph which doesn't change during the duration of the run() method on these passes. To remove this inefficiency this commit moves the construction of the views to the beginning of the passes and just reuses them by reference for each scoring call, avoiding the reconstruction overhead. * Add EdgeList Rust pyclass to avoid repeated conversion This commit adds a new pyclass written in rust that wraps a rust Vec. Previously the scoring function also used an dict->IndexMap conversion, but the mapping structure wasn't necessary and added additional overhead, so it was converted to a list/Vec to speed up the execution even further. By using this new pyclass as the input to the rust scoring function we avoid converting the edge list from a list to an Vec on each call which will reduce the overhead even further.

* Reuse VF2 scoring views for all scoring As part of the VF2Layout and VF2PostLayout passes when there are a large number of matches found we're spending an inordinate amount of time in scoring rebuilding the same views over and over again of the interaction graph for each scoring call. For example, in one test cProfile showed that with #11112 when running transpile() on a 65 Bernstein Vazirani circuit with a secret of all 1s for FakeSherbrooke with optimization_level=3 we were calling vf2_utils.score_layout() 161,761 times which took a culmulative time of 14.33 secs. Of that time though we spent 5.865 secs building the edge list view. These views are fixed for a given interaction graph which doesn't change during the duration of the run() method on these passes. To remove this inefficiency this commit moves the construction of the views to the beginning of the passes and just reuses them by reference for each scoring call, avoiding the reconstruction overhead. * Add EdgeList Rust pyclass to avoid repeated conversion This commit adds a new pyclass written in rust that wraps a rust Vec. Previously the scoring function also used an dict->IndexMap conversion, but the mapping structure wasn't necessary and added additional overhead, so it was converted to a list/Vec to speed up the execution even further. By using this new pyclass as the input to the rust scoring function we avoid converting the edge list from a list to an Vec on each call which will reduce the overhead even further. (cherry picked from commit a67fe87)

* Reuse VF2 scoring views for all scoring As part of the VF2Layout and VF2PostLayout passes when there are a large number of matches found we're spending an inordinate amount of time in scoring rebuilding the same views over and over again of the interaction graph for each scoring call. For example, in one test cProfile showed that with #11112 when running transpile() on a 65 Bernstein Vazirani circuit with a secret of all 1s for FakeSherbrooke with optimization_level=3 we were calling vf2_utils.score_layout() 161,761 times which took a culmulative time of 14.33 secs. Of that time though we spent 5.865 secs building the edge list view. These views are fixed for a given interaction graph which doesn't change during the duration of the run() method on these passes. To remove this inefficiency this commit moves the construction of the views to the beginning of the passes and just reuses them by reference for each scoring call, avoiding the reconstruction overhead. * Add EdgeList Rust pyclass to avoid repeated conversion This commit adds a new pyclass written in rust that wraps a rust Vec. Previously the scoring function also used an dict->IndexMap conversion, but the mapping structure wasn't necessary and added additional overhead, so it was converted to a list/Vec to speed up the execution even further. By using this new pyclass as the input to the rust scoring function we avoid converting the edge list from a list to an Vec on each call which will reduce the overhead even further. (cherry picked from commit a67fe87) Co-authored-by: Matthew Treinish <mtreinish@kortar.org>

mtreinish added stable backport potential The bug might be minimal and/or import enough to be port to stable performance Changelog: None Do not include in changelog labels Oct 25, 2023

mtreinish added this to the 0.45.0 milestone Oct 25, 2023

mtreinish requested a review from a team as a code owner October 25, 2023 16:47

mtreinish force-pushed the serial-vf2-score branch from 8c741af to f7375c1 Compare October 25, 2023 21:31

mtreinish mentioned this pull request Oct 25, 2023

Reuse VF2 scoring views for all scoring #11115

Merged

Add release note

a868032

mtreinish assigned kevinhartman Oct 26, 2023

kevinhartman approved these changes Oct 26, 2023

View reviewed changes

kevinhartman enabled auto-merge October 26, 2023 19:42

kevinhartman added this pull request to the merge queue Oct 26, 2023

Merged via the queue into Qiskit:main with commit 0764a33 Oct 27, 2023

mergify bot mentioned this pull request Oct 27, 2023

Always run vf2 scoring serially (backport #11112) #11128

Merged

mtreinish deleted the serial-vf2-score branch November 1, 2023 18:54

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Always run vf2 scoring serially #11112

Always run vf2 scoring serially #11112

mtreinish commented Oct 25, 2023

qiskit-bot commented Oct 25, 2023

mtreinish commented Oct 25, 2023

coveralls commented Oct 25, 2023

kevinhartman commented Oct 26, 2023

mtreinish commented Oct 26, 2023

mtreinish commented Oct 26, 2023

kevinhartman left a comment

Always run vf2 scoring serially #11112

Always run vf2 scoring serially #11112

Conversation

mtreinish commented Oct 25, 2023

Summary

Details and comments

qiskit-bot commented Oct 25, 2023

mtreinish commented Oct 25, 2023

coveralls commented Oct 25, 2023

Pull Request Test Coverage Report for Build 6643541599

💛 - Coveralls

kevinhartman commented Oct 26, 2023

mtreinish commented Oct 26, 2023

mtreinish commented Oct 26, 2023

kevinhartman left a comment

Choose a reason for hiding this comment