
pauli.evolve() needs a major speedup #8177

Closed
ajavadia opened this issue Jun 14, 2022 · 17 comments · Fixed by #8199
ajavadia (Member) commented Jun 14, 2022

What should we add?

@aeddins-ibm found that pauli.evolve() can be quite slow, and that we can get a substantial performance improvement by using private methods for special cases.

Since pauli.evolve() is the main entry point / user interface, it should be fast. There are probably some simple things that can be done to dramatically speed it up, but it needs some profiling to see what's causing the slowdown.

Here's a code snippet showing that pauli.evolve() is 10,000x slower than calling _evolve_cx() on individual instructions. Some of this is expected, since CX is a special case of a Clifford, but a) we should have faster ways of dealing with special cases such as self-adjoint Cliffords, and b) is there some other overhead from validation/type-checking?

from qiskit import QuantumCircuit
from qiskit.quantum_info.operators.symplectic.base_pauli import _evolve_cx
from qiskit.quantum_info import Clifford, Pauli, PauliList, random_pauli_list, SparsePauliOp
import time

n = 65
paulis_per_instance = 100
instances = 100

qc = QuantumCircuit(n)
qc.cx(range(n - 1), range(1, n))
cliff = Clifford(qc)

result1 = random_pauli_list(n, paulis_per_instance * instances)
result2 = result1.copy()

start = time.time()
result1 = result1.evolve(cliff, frame='s')
print('time using evolve:', time.time() - start)

start = time.time()
for gate in qc.data:
    _evolve_cx(result2, gate[1][0]._index, gate[1][1]._index)
print('time using _evolve_cx:', time.time() - start)

print('same answer?', result1 == result2)
time using evolve: 33.13418793678284
time using _evolve_cx: 0.0023658275604248047
same answer? True
ShellyGarion self-assigned this Jun 14, 2022
ajavadia (Member Author) commented:

@chriseclectic observed that we should do pauli.evolve(qc) and not pauli.evolve(cliff), which will be just as fast as calling the private method.
I'm still wondering if evolution by clifford can be sped up when the clifford is made up of lots of CX or CZ which are easy to invert.

aeddins-ibm (Contributor) commented:

Ali separately suggested the bottleneck might be in taking the adjoint of the Clifford. Just wanted to point out that the test above was run with frame='s', so the adjoint was not taken.

So then it could just be that the nested for-loops over individual Pauli objects need to be replaced with vectorized operations using PauliList objects / numpy arrays. (...hopefully the indexing is not too painful to figure out 😅)
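
For example, a minimal sketch of the vectorized idea (illustrative only: pauli_list is any PauliList, tableau is a hypothetical (2n, 2n) 0/1 array of generator images, and phases are ignored here):

import numpy as np

# Stack the X and Z bits of every Pauli into one (num_paulis, 2n) 0/1 matrix,
# then a single GF(2) matrix product maps all of them through the Clifford at
# once, replacing the per-Pauli Python loop. Phase bookkeeping still needs
# separate handling, as the later snippets in this thread show.
symplectic = np.hstack([pauli_list.x, pauli_list.z]).astype(np.uint8)
evolved_bits = symplectic.dot(tableau) % 2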

ShellyGarion (Member) commented:

Here is an updated version of the code with some additions:

from qiskit import QuantumCircuit
from qiskit.quantum_info.operators.symplectic.base_pauli import _evolve_cx
from qiskit.quantum_info import Clifford, Pauli, PauliList, random_pauli_list, SparsePauliOp
import time

n = 65
paulis_per_instance = 100
instances = 100

qc = QuantumCircuit(n)
qc.cx(range(n - 1), range(1, n))
cliff = Clifford(qc)

result1 = random_pauli_list(n, paulis_per_instance * instances)
result2 = result1.copy()
result3 = result1.copy()
result4 = result1.copy()

start = time.time()
result1 = result1.evolve(cliff, frame='s')
time1 = time.time() - start
print('time using evolve s:', time1)

start = time.time()
result2 = result2.evolve(cliff, frame='h')
time2 = time.time() - start
print('time using evolve h:', time2)

start = time.time()
result3 = result3.evolve(qc)
time3 = time.time() - start
print('time using evolve qc:', time3)

start = time.time()
for gate in qc.data:
    _evolve_cx(result4, gate[1][0]._index, gate[1][1]._index)
time4 = time.time() - start
print('time using _evolve_cx:', time4)

print('same answer?', result1 == result4, result2 == result3)

outputs:

time using evolve s: 70.19226288795471
time using evolve h: 65.28276705741882
time using evolve qc: 0.006928682327270508
time using _evolve_cx: 0.003778696060180664
same answer? True True

ShellyGarion (Member) commented:

In this case, the adjoint calculation is not much of a bottleneck: there is only one Clifford (and 100 Paulis), so it is computed only once, hence there is little difference between the h and s options.

The main problem is that _evolve_clifford has two nested for loops (and I'm not sure how these can be avoided).
https://github.com/Qiskit/qiskit-terra/blob/a0b9a846f93a1e580c876b8fb37ae5d10436f594/qiskit/quantum_info/operators/symplectic/base_pauli.py#L281
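
Schematically, the structure is something like this (a rough sketch, not the exact source; x_image / z_image stand in for the Clifford's images of single-qubit X and Z):

pauli_list = []
for x_row, z_row in zip(self.x, self.z):      # outer loop: one pass per Pauli
    pauli = Pauli("I" * self.num_qubits)      # identity built by string parsing
    for qubit in range(self.num_qubits):      # inner loop: one pass per qubit
        if x_row[qubit]:
            pauli = pauli.compose(x_image[qubit])
        if z_row[qubit]:
            pauli = pauli.compose(z_image[qubit])
    pauli_list.append(pauli)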

As @chriseclectic suggested, pauli.evolve(qc) is much faster (almost as fast as using _evolve_cx), since it updates the Clifford tableau directly after each gate using _append_circuit
https://github.com/Qiskit/qiskit-terra/blob/a0b9a846f93a1e580c876b8fb37ae5d10436f594/qiskit/quantum_info/operators/symplectic/clifford_circuits.py#L22

@aeddins-ibm - do you think that using pauli.evolve(qc) gives the desired speedup for your use-case?

ajavadia (Member Author) commented:

Nested for loops are known to be slow in Python. To make matters worse, this implementation starts with pauli_list = [] and keeps growing it, which requires repeated memory allocations.

It will be faster to allocate the full array size up front. It will also be faster to use vectorized numpy operations rather than for loops.

Also, the initialization of identity Paulis as Pauli('I' * n) will be slow, since it has to parse the string to build the initial array; the identity Pauli can be initialized directly.
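
For example, a minimal sketch of the direct initialization using PauliList.from_symplectic (one possible approach, not necessarily the final fix):

import numpy as np
from qiskit.quantum_info import Pauli, PauliList

n = 65

# String-parsed identity: re-parses an n-character label on every call.
ident_parsed = Pauli("I" * n)

# Direct construction from zeroed symplectic arrays skips the parsing step.
zeros = np.zeros((1, n), dtype=bool)
ident_direct = PauliList.from_symplectic(z=zeros, x=zeros)[0]

assert ident_parsed == ident_direct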

aeddins-ibm (Contributor) commented:

Thanks Shelly, running with evolve(circuit) is adequate for my use case. But I also agree we should speed up this code, since most users will probably expect to call evolve with a Clifford argument.

No promises but I have some thoughts on how to vectorize this, at least partly. I'll make an attempt and report back.

alexanderivrii (Contributor) commented:

If I recall correctly, there was a PR by @jakelishman to vectorize some very related code. @jakelishman, was your PR merged?

jakelishman (Member) commented Jun 16, 2022

I think the PR you may be thinking of was to do with Clifford.compose? That's #7483. We didn't merge that at the time because of #7269 heavily modifying the Clifford class, but it seems like we ended up not doing anything with StabilizerTable any time soon. We could merge #7483 for Terra 0.21, potentially. That said, #7483 doesn't touch Pauli.evolve, so that PR won't change this one, but I imagine much of the vectorisation strategy could be ported.

Just for the future:

To make matters worse, this implementation starts with pauli_list = [] and keeps growing it, which requires repeated memory allocations.

This is generally nowhere near as bad as you might expect. There are more memory allocations involved than with pauli_list = [None]*length and mutating, but it's not quadratic complexity, because Python (like any modern growable-array implementation) amortises it to linear time by over-allocating each growth by a constant ratio. Here are three ways of constructing the simplest possible list, with zero cost to build any of its elements:

In [2]: %%timeit
   ...: l = []
   ...: for _ in [None]*1_000_000:
   ...:     l.append(None)
48.9 ms ± 1.13 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [3]: %%timeit
   ...: l = [None]*1_000_000
   ...: for i in range(1_000_000):
   ...:     l[i] = None
38.4 ms ± 545 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [4]: %%timeit
   ...: l = [None for _ in [None]*1_000_000]
26.6 ms ± 1.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

[None]*1_000_000 is essentially the fastest iterable for construction and iteration - for comparison, that takes 3.22ms on my machine. Of the three approaches, the list comprehension is the fastest, and mutating preallocated indices is marginally faster than appending, but they're all linear. For 10,000,000 elements instead, the timings are 445ms, 386ms and 222ms respectively. If you're spending more than 22ns to build each element (which is around the time taken to access a single class attribute), the list construction time is unlikely to matter too much.

(Of course, Numpy arrays and vectorisation blow all this out of the water. It's just that I think people dismiss list.append more than they perhaps ought to! list.insert, on the other hand, is terrible for performance because it does copy every time.)
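
For comparison, a small NumPy sketch of that last point (no timings quoted; %timeit it yourself to see the gap on your machine):

import numpy as np

# The container itself is one C-level allocation, and element-wise work stays
# in C as long as it is expressed as whole-array operations rather than a
# per-element Python loop.
arr = np.zeros(1_000_000, dtype=np.int8)
arr ^= 1  # flips every element without any Python-level iteration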

ShellyGarion removed their assignment Jun 16, 2022
aeddins-ibm (Contributor) commented Jun 16, 2022

-- Deleted comment since the code I had pasted still has a bug. Didn't try enough test cases, apologies. --

aeddins-ibm (Contributor) commented Jun 16, 2022

Sorry about the error above; I think this version is correct. The bug was that I was confused about how boolean types were handled inside matrix multiplication. Now it converts the arrays to integers first as a dumb fix, which has some cost (perhaps it can be avoided somehow). There is also some memory cost from the einsum call.

Assuming it's correct, does this seem like an appropriate replacement for the for loops? (It will still need to be modified to handle qargs)

    idx = np.concatenate((self.x, self.z), axis=1) # axes: [pauli_list, input q (or p)]
    x_out = idx.dot(adj.table.X.astype(int))%2
    z_out = idx.dot(adj.table.Z.astype(int))%2
    phase_out = idx.dot(adj.table.phase.astype(int))
    
    uptri = np.zeros((2*self.num_qubits, 2*self.num_qubits), dtype=bool)
    uptri[np.triu_indices_from(uptri)] = True
    commutations = np.einsum('qQ, pQ, qp -> qp', adj.table.Z.astype(int), adj.table.X.astype(int), uptri.astype(int))
    phase_out += np.einsum('lq,lp,qp -> l', idx, idx, commutations)
    
    return PauliList.from_symplectic(z=z_out, x=x_out, phase=(self.phase+2*phase_out)%4 )

ajavadia (Member Author) commented:

@jakelishman thanks - you are right, that doesn't look too bad.

@aeddins-ibm did you test the above on your example? How is the performance?

aeddins-ibm (Contributor) commented Jun 16, 2022

Have to apologize again, as there is still an error with the phase (this is a learning exercise for me). I think it's because I did not yet account for the extra phase from commuting the X gates in the input Pauli object (self) past the Z gates from the stabilizer table (I also needed to add self.phase at the end, but that's easier -- edited above to include this).

Regarding timing, as written above it was ballpark ~100 times faster than the old evolve, but still ~100 times slower than calling _evolve_cx in a loop. Being somewhere in the middle is perhaps not surprising, given that (as others have pointed out) it is more expensive to process the large N-qubit stabilizer matrix than to process the smaller matrices of the constituent 2-qubit gates.

aeddins-ibm (Contributor) commented:

I'm out of time to keep working on this for now. I did not find out why the phase is not computed correctly, though I did only just now discover that pauli.phase is not the same as pauli._phase, so that may be related.

aeddins-ibm (Contributor) commented Jun 17, 2022

OK, I understood the difference between phase and _phase at the last second, so here's the updated code after all. It could probably be simplified (maybe by avoiding the astype(int) calls) and it definitely needs updating to support qargs, but hopefully that's not too bad.
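
The distinction, as far as I can tell: _phase is the internal exponent of (-i) in the Z-then-X representation, while the public phase property removes the (-i) factors contributed by Y terms. A tiny illustration:

from qiskit.quantum_info import Pauli

p = Pauli("Y")
print(p.phase)   # 0   -> public group phase: 'Y' carries no leading factor
print(p._phase)  # [1] -> internally Y is stored as (-i) * Z * X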

Here's the updated code (written as a stand-alone function below; to use it as a class method, replace the pauli argument with self and delete the self = pauli line):

import numpy as np
from qiskit.quantum_info import PauliList


def _evolve_clifford_new(pauli, other, qargs=None, frame="h"):
    """Heisenberg picture evolution of a Pauli by a Clifford."""
    self = pauli

    # Get action of Pauli's from Clifford
    if frame == "s":
        adj = other
    else:
        adj = other.adjoint()

    idx = np.concatenate((self.x, self.z), axis=1)  # axes: [pauli_list, input q (or p)]
    x_out = idx.dot(adj.table.X.astype(int)) % 2
    z_out = idx.dot(adj.table.Z.astype(int)) % 2
    # this sums phase/2 over all Pauli rows, but we want to sum _phase/2:
    _phase_out = 2 * idx.dot(adj.table.phase.astype(int))
    _phase_out += idx.dot(np.sum(adj.table.X.astype(int) * adj.table.Z.astype(int), axis=1))

    # extra phase from commuting all X rightward past all Z:
    uptri = np.zeros((2 * self.num_qubits, 2 * self.num_qubits), dtype=int)
    uptri[np.triu_indices_from(uptri, k=1)] = 1
    commutations = np.einsum('qQ, pQ, qp -> qp', adj.table.Z.astype(int), adj.table.X.astype(int), uptri.astype(int))
    _phase_out += 2 * np.einsum('lq,lp,qp -> l', idx.astype(int), idx.astype(int), commutations.astype(int))
    out = PauliList.from_symplectic(z_out, x_out)
    out._phase = (self._phase + _phase_out) % 4
    return out

Here's the test showing it gives correct results for random inputs, with a speed improvement of ~100x compared to the for-loop-based evolve:

from qiskit.quantum_info import random_pauli_list, random_clifford
import time

n = 7
paulis_per_instance = 13
instances = 10

agreed = []
for _ in range(10):
    cliff = random_clifford(n)
    result1 = random_pauli_list(n, paulis_per_instance * instances)
    result2 = result1.copy()

    start = time.time()
    result1 = _evolve_clifford_new(result1, cliff, frame='s')
    print('time using new evolve:', time.time() - start)

    start = time.time()
    result2 = result2.evolve(cliff, frame='s')
    print('time using evolve:', time.time() - start)

    print('agreed?', result1 == result2)

Output:

time using new evolve: 0.001493215560913086
time using evolve: 0.1083841323852539
agreed? True
time using new evolve: 0.0009107589721679688
time using evolve: 0.10019421577453613
agreed? True
time using new evolve: 0.0008382797241210938
time using evolve: 0.11111712455749512
agreed? True
time using new evolve: 0.0006940364837646484
time using evolve: 0.09914803504943848
agreed? True
time using new evolve: 0.0006570816040039062
time using evolve: 0.10026311874389648
agreed? True
time using new evolve: 0.000453948974609375
time using evolve: 0.10394597053527832
agreed? True
time using new evolve: 0.0007090568542480469
time using evolve: 0.10787606239318848
agreed? True
time using new evolve: 0.0007600784301757812
time using evolve: 0.10610795021057129
agreed? True
time using new evolve: 0.0008609294891357422
time using evolve: 0.10302495956420898
agreed? True
time using new evolve: 0.001020193099975586
time using evolve: 0.10412406921386719
agreed? True

aeddins-ibm (Contributor) commented:

This version also handles qargs. I will try to make a PR with this in place of the for-loop-based _evolve_clifford method; pasting it here first in case I can't figure out how to do that.

import numpy as np


def _evolve_clifford_new(pauli, other, qargs=None, frame="h"):
    """Heisenberg picture evolution of a Pauli by a Clifford."""
    self = pauli

    # Get action of Pauli's from Clifford
    if frame == "s":
        adj = other
    else:
        adj = other.adjoint()

    if qargs is None:
        qargs_ = slice(None)
        num_act = self.num_qubits
    else:
        qargs_ = list(qargs)
        num_act = len(qargs_)

    ret = self.copy()
    ret._x[:, qargs_] = False
    ret._z[:, qargs_] = False

    idx = np.concatenate((self.x[:, qargs_], self.z[:, qargs_]), axis=1)  # axes: [pauli_list, input q (or p)]
    ret._x[:, qargs_] = idx.dot(adj.table.X.astype(int)) % 2
    ret._z[:, qargs_] = idx.dot(adj.table.Z.astype(int)) % 2
    # this sums phase/2 over all Pauli rows, but we want to sum _phase/2:
    ret._phase += 2 * idx.dot(adj.table.phase.astype(int))

    # extra phase from commuting all X rightward past all Z:
    uptri = np.zeros((2 * num_act, 2 * num_act), dtype=int)
    uptri[np.triu_indices_from(uptri, k=1)] = 2
    np.fill_diagonal(uptri, 1)  # combines with earlier sum of phase/2 to total sum of _phase/2
    commutations = np.einsum('qQ, pQ, qp -> qp', adj.table.Z, adj.table.X, uptri)
    ret._phase += np.einsum('lq,lp,qp -> l', idx, idx, commutations)
    ret._phase = ret._phase % 4
    return ret
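
A quick usage sketch (illustrative only, assuming the function above is in scope; the Clifford acts only on the listed qubits):

from qiskit.quantum_info import random_clifford, random_pauli_list

n, k = 7, 3
cliff_small = random_clifford(k)
paulis = random_pauli_list(n, 20)

new = _evolve_clifford_new(paulis, cliff_small, qargs=[0, 2, 5], frame="s")
old = paulis.evolve(cliff_small, qargs=[0, 2, 5], frame="s")
print("agreed?", new == old)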

ajavadia (Member Author) commented:

nice work @aeddins-ibm !

aeddins-ibm (Contributor) commented Jun 20, 2022

Modified the code to be simpler. Basically, the code I posted above eliminated the for loops, but seemed to scale worse because it builds an N x N matrix, so I reverted to the old nested-for structure (which does not need that matrix) but replaced one of the two for loops with a PauliList operation (below for reference). This seems to roughly match or outperform the nested-for code in all cases, including the important case of many qubits, and is maybe more readable.

Just need to figure out the tox / lint stuff, I think, then I'll try the PR again.

def _evolve_clifford(self, other, qargs=None, frame="h"):
    """Heisenberg picture evolution of a Pauli by a Clifford."""

    if frame == "s":
        adj = other
    else:
        adj = other.adjoint()

    if qargs is None:
        qargs_ = slice(None)
    else:
        qargs_ = list(qargs)

    is_pauli_list = len(self.x.shape) > 1

    ret = self.copy()
    ret._x[:, qargs_] = False
    ret._z[:, qargs_] = False

    # pylint: disable=cyclic-import
    from qiskit.quantum_info.operators.symplectic.pauli_list import PauliList

    idx = np.concatenate((self._x[:, qargs_], self._z[:, qargs_]), axis=1)
    for idx_, row in zip(idx.T, PauliList.from_symplectic(z=adj.table.Z, x=adj.table.X, phase=2 * adj.table.phase)):
        if idx_.any():
            if is_pauli_list:
                ret[idx_] = ret[idx_].compose(row, qargs=qargs)
            else:
                ret = ret.compose(row, qargs=qargs)

    return ret

mergify bot closed this as completed in #8199 on Jun 28, 2022
mergify bot added a commit that referenced this issue Jun 28, 2022
* Vectorize base_pauli._evolve_clifford

Per issue #8177. Seems to give a ~100x speed-up by vectorizing the for loops.

I am not sure about the memory cost of the vectorization (einsum, etc).

* support Pauli.evolve in addition to PauliList.evolve

- use ._x instead of .x to get correct array dimensionality
- use 'uint8' dtype for phases (not doing so raised an error for Pauli but not for PauliList)

* Simpler evolve computation

Revised the evolve computation to be more like the original nested-for code before this PR, but replacing one of the two for-loops with PauliList operations. In earlier commits in this PR, I tried to avoid both for-loops using einsum, but that seemed to have worse scaling with number of qubits.

Some clever way to eliminate the remaining for loop could be nice, but at least this code seems comparable to or better than the old for-loop code in speed for all cases. Avoiding a scaling problem at higher qubit numbers seems important, since that's what users might want Cliffords for in the first place.

* formatting (black)

* change x to _x

for compatibility with the BasePauli class

* fix rare error for evolving a Pauli

- the preceding commit in this PR changed an `x` to `_x` to keep the linter happy, but this introduced a rare bug when evolving a Pauli (but not a PauliList, AFAICT).
- made a less ambiguous check using `isinstance` to avoid this bug

* avoid calling isinstance()

- avoid calling isinstance(self, PauliList) inside the evolve method of BasePauli, since that is problematic.
- the replacement has an extra call to np.sum, but this does not seem to significantly impact the overall execution time.

* add release note

* Reword release note

Co-authored-by: Jake Lishman <jake.lishman@ibm.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>