Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Always run vf2 scoring serially #11112

Merged
merged 2 commits into from
Oct 27, 2023
Merged

Conversation

mtreinish
Copy link
Member

Summary

The VF2Layout and VF2PostLayout passes' Rust scoring function has the option to use a parallel iterator for computing the score of a layout given a sufficiently large circuit (both in number of 2q interactions or qubits). Previously the scoring would be multithreaded if there were > 50 qubits or > 50 2q interactions in the circuit. However, as recent testing has shown in most cases the performance of doing this operation in parallel is much worse than serial execution. To address this issue, this commit just always uses the serial version. The parallel option is left in place for now just in case someone was manually calling the scoring function with the parallel option set to True.

Details and comments

@mtreinish mtreinish added stable backport potential The bug might be minimal and/or import enough to be port to stable performance Changelog: None Do not include in changelog labels Oct 25, 2023
@mtreinish mtreinish added this to the 0.45.0 milestone Oct 25, 2023
@mtreinish mtreinish requested a review from a team as a code owner October 25, 2023 16:47
@qiskit-bot
Copy link
Collaborator

One or more of the the following people are requested to review this:

  • @Qiskit/terra-core

@mtreinish
Copy link
Member Author

For comparison these are the performance results from this PR compared to main running in parallel:

threading_comparison

I generated that using this script:

import time

from qiskit.transpiler.preset_passmanagers import common 
from qiskit.transpiler.passes import VF2PostLayout
from qiskit.transpiler.preset_passmanagers import plugin
from qiskit.transpiler.passmanager_config import PassManagerConfig
from qiskit.providers.fake_provider import FakeSherbrooke
from qiskit import QuantumCircuit


backend = FakeSherbrooke()
pm_config = PassManagerConfig.from_backend(backend, seed_transpiler=42)

pm = common.generate_unroll_3q(backend.target)
pm += plugin.PassManagerStagePluginManager().get_passmanager_stage("layout", "default", pm_config, optimization_level=3)

call_limit, max_trials = common.get_vf2_limits(optimization_level=3)
vf2_pass = VF2PostLayout(target=backend.target, call_limit=call_limit, max_trials=max_trials, seed=42, strict_direction=False)

times = []
for n in range(2, 127):
    qc = QuantumCircuit(n, n - 1)
    qc.x(n-1)
    qc.h(range(n))
    qc.cx(range(n-1), n-1)
    qc.h(range(n-1))
    qc.barrier()
    qc.measure(range(n-1), range(n-1))
    to_post_layout = pm.run(qc)
    start = time.perf_counter()
    vf2_pass(to_post_layout)
    stop = time.perf_counter()
    times.append(stop - start)
    print(stop - start)
print(times)

@coveralls
Copy link

Pull Request Test Coverage Report for Build 6643541599

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 13 unchanged lines in 3 files lost coverage.
  • Overall coverage increased (+0.002%) to 86.915%

Files with Coverage Reduction New Missed Lines %
qiskit/extensions/quantum_initializer/squ.py 2 84.15%
crates/qasm2/src/lex.rs 5 91.92%
crates/accelerate/src/vf2_layout.rs 6 83.64%
Totals Coverage Status
Change from base Build 6642110371: 0.002%
Covered Lines: 73967
Relevant Lines: 85103

💛 - Coveralls

mtreinish added a commit to mtreinish/qiskit-core that referenced this pull request Oct 25, 2023
As part of the VF2Layout and VF2PostLayout passes when there are a large
number of matches found we're spending an inordinate amount of time in
scoring rebuilding the same views over and over again of the interaction
graph for each scoring call. For example, in one test cProfile showed
that with Qiskit#11112 when running transpile() on a 65 Bernstein Vazirani
circuit with a secret of all 1s for FakeSherbrooke with
optimization_level=3 we were calling vf2_utils.score_layout() 161,761
times which took a culmulative time of 14.33 secs. Of that time though
we spent 5.865 secs building the edge list view.

These views are fixed for a given interaction graph which doesn't change
during the duration of the run() method on these passes. To remove this
inefficiency this commit moves the construction of the views to the
beginning of the passes and just reuses them by reference for each
scoring call, avoiding the reconstruction overhead.
The VF2Layout and VF2PostLayout passes' Rust scoring function has the
option to use a parallel iterator for computing the score of a layout
given a sufficiently large circuit (both in number of 2q interactions or
qubits). Previously the scoring would be multithreaded if there were >
50 qubits or > 50 2q interactions in the circuit. However, as recent
testing has shown in most cases the performance of doing this operation
in parallel is much worse than serial execution. To address this issue,
this commit just always uses the serial version. The parallel option is
left in place for now just in case someone was manually calling the
scoring function with the parallel option set to ``True``.
mtreinish added a commit to mtreinish/qiskit-core that referenced this pull request Oct 25, 2023
As part of the VF2Layout and VF2PostLayout passes when there are a large
number of matches found we're spending an inordinate amount of time in
scoring rebuilding the same views over and over again of the interaction
graph for each scoring call. For example, in one test cProfile showed
that with Qiskit#11112 when running transpile() on a 65 Bernstein Vazirani
circuit with a secret of all 1s for FakeSherbrooke with
optimization_level=3 we were calling vf2_utils.score_layout() 161,761
times which took a culmulative time of 14.33 secs. Of that time though
we spent 5.865 secs building the edge list view.

These views are fixed for a given interaction graph which doesn't change
during the duration of the run() method on these passes. To remove this
inefficiency this commit moves the construction of the views to the
beginning of the passes and just reuses them by reference for each
scoring call, avoiding the reconstruction overhead.
@kevinhartman
Copy link
Contributor

This looks good. Technically changing the default behavior in vf2_utils would be a breaking change right? Though the impact is non-functional, and even still, going from parallel to single threaded wouldn't really have caller implications. Still might be good to have a release note.

@mtreinish
Copy link
Member Author

Yeah, I'm not sure it's a breaking change, vf2_utils is not really part of our documented public API it's only really useful for the vf2 passes themselves. For the passes themselves it's just an internal implementation detail, I don't think the API contract for VF2Layout or VF2PostLayout ever explicitly stated that it ran in multiple threads. That being said I'll write a quick other section release note to document this just in case.

@mtreinish
Copy link
Member Author

I added the release note in: a868032

Copy link
Contributor

@kevinhartman kevinhartman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@kevinhartman kevinhartman added this pull request to the merge queue Oct 26, 2023
Merged via the queue into Qiskit:main with commit 0764a33 Oct 27, 2023
mergify bot pushed a commit that referenced this pull request Oct 27, 2023
* Always run vf2 scoring serially

The VF2Layout and VF2PostLayout passes' Rust scoring function has the
option to use a parallel iterator for computing the score of a layout
given a sufficiently large circuit (both in number of 2q interactions or
qubits). Previously the scoring would be multithreaded if there were >
50 qubits or > 50 2q interactions in the circuit. However, as recent
testing has shown in most cases the performance of doing this operation
in parallel is much worse than serial execution. To address this issue,
this commit just always uses the serial version. The parallel option is
left in place for now just in case someone was manually calling the
scoring function with the parallel option set to ``True``.

* Add release note

(cherry picked from commit 0764a33)
mtreinish added a commit to mtreinish/qiskit-core that referenced this pull request Oct 27, 2023
As part of the VF2Layout and VF2PostLayout passes when there are a large
number of matches found we're spending an inordinate amount of time in
scoring rebuilding the same views over and over again of the interaction
graph for each scoring call. For example, in one test cProfile showed
that with Qiskit#11112 when running transpile() on a 65 Bernstein Vazirani
circuit with a secret of all 1s for FakeSherbrooke with
optimization_level=3 we were calling vf2_utils.score_layout() 161,761
times which took a culmulative time of 14.33 secs. Of that time though
we spent 5.865 secs building the edge list view.

These views are fixed for a given interaction graph which doesn't change
during the duration of the run() method on these passes. To remove this
inefficiency this commit moves the construction of the views to the
beginning of the passes and just reuses them by reference for each
scoring call, avoiding the reconstruction overhead.
github-merge-queue bot pushed a commit that referenced this pull request Oct 27, 2023
* Always run vf2 scoring serially

The VF2Layout and VF2PostLayout passes' Rust scoring function has the
option to use a parallel iterator for computing the score of a layout
given a sufficiently large circuit (both in number of 2q interactions or
qubits). Previously the scoring would be multithreaded if there were >
50 qubits or > 50 2q interactions in the circuit. However, as recent
testing has shown in most cases the performance of doing this operation
in parallel is much worse than serial execution. To address this issue,
this commit just always uses the serial version. The parallel option is
left in place for now just in case someone was manually calling the
scoring function with the parallel option set to ``True``.

* Add release note

(cherry picked from commit 0764a33)

Co-authored-by: Matthew Treinish <mtreinish@kortar.org>
github-merge-queue bot pushed a commit that referenced this pull request Oct 27, 2023
* Reuse VF2 scoring views for all scoring

As part of the VF2Layout and VF2PostLayout passes when there are a large
number of matches found we're spending an inordinate amount of time in
scoring rebuilding the same views over and over again of the interaction
graph for each scoring call. For example, in one test cProfile showed
that with #11112 when running transpile() on a 65 Bernstein Vazirani
circuit with a secret of all 1s for FakeSherbrooke with
optimization_level=3 we were calling vf2_utils.score_layout() 161,761
times which took a culmulative time of 14.33 secs. Of that time though
we spent 5.865 secs building the edge list view.

These views are fixed for a given interaction graph which doesn't change
during the duration of the run() method on these passes. To remove this
inefficiency this commit moves the construction of the views to the
beginning of the passes and just reuses them by reference for each
scoring call, avoiding the reconstruction overhead.

* Add EdgeList Rust pyclass to avoid repeated conversion

This commit adds a new pyclass written in rust that wraps a rust
Vec. Previously the scoring function also used an dict->IndexMap
conversion, but the mapping structure wasn't necessary and added
additional overhead, so it was converted to a list/Vec to speed up the
execution even further. By using this new pyclass as the input to the
rust scoring function we avoid converting the edge list from a list to
an Vec on each call which will reduce the overhead even further.
mergify bot pushed a commit that referenced this pull request Oct 27, 2023
* Reuse VF2 scoring views for all scoring

As part of the VF2Layout and VF2PostLayout passes when there are a large
number of matches found we're spending an inordinate amount of time in
scoring rebuilding the same views over and over again of the interaction
graph for each scoring call. For example, in one test cProfile showed
that with #11112 when running transpile() on a 65 Bernstein Vazirani
circuit with a secret of all 1s for FakeSherbrooke with
optimization_level=3 we were calling vf2_utils.score_layout() 161,761
times which took a culmulative time of 14.33 secs. Of that time though
we spent 5.865 secs building the edge list view.

These views are fixed for a given interaction graph which doesn't change
during the duration of the run() method on these passes. To remove this
inefficiency this commit moves the construction of the views to the
beginning of the passes and just reuses them by reference for each
scoring call, avoiding the reconstruction overhead.

* Add EdgeList Rust pyclass to avoid repeated conversion

This commit adds a new pyclass written in rust that wraps a rust
Vec. Previously the scoring function also used an dict->IndexMap
conversion, but the mapping structure wasn't necessary and added
additional overhead, so it was converted to a list/Vec to speed up the
execution even further. By using this new pyclass as the input to the
rust scoring function we avoid converting the edge list from a list to
an Vec on each call which will reduce the overhead even further.

(cherry picked from commit a67fe87)
github-merge-queue bot pushed a commit that referenced this pull request Oct 27, 2023
* Reuse VF2 scoring views for all scoring

As part of the VF2Layout and VF2PostLayout passes when there are a large
number of matches found we're spending an inordinate amount of time in
scoring rebuilding the same views over and over again of the interaction
graph for each scoring call. For example, in one test cProfile showed
that with #11112 when running transpile() on a 65 Bernstein Vazirani
circuit with a secret of all 1s for FakeSherbrooke with
optimization_level=3 we were calling vf2_utils.score_layout() 161,761
times which took a culmulative time of 14.33 secs. Of that time though
we spent 5.865 secs building the edge list view.

These views are fixed for a given interaction graph which doesn't change
during the duration of the run() method on these passes. To remove this
inefficiency this commit moves the construction of the views to the
beginning of the passes and just reuses them by reference for each
scoring call, avoiding the reconstruction overhead.

* Add EdgeList Rust pyclass to avoid repeated conversion

This commit adds a new pyclass written in rust that wraps a rust
Vec. Previously the scoring function also used an dict->IndexMap
conversion, but the mapping structure wasn't necessary and added
additional overhead, so it was converted to a list/Vec to speed up the
execution even further. By using this new pyclass as the input to the
rust scoring function we avoid converting the edge list from a list to
an Vec on each call which will reduce the overhead even further.

(cherry picked from commit a67fe87)

Co-authored-by: Matthew Treinish <mtreinish@kortar.org>
@mtreinish mtreinish deleted the serial-vf2-score branch November 1, 2023 18:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Changelog: None Do not include in changelog performance stable backport potential The bug might be minimal and/or import enough to be port to stable
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants