pauli.evolve() needs a major speedup #8177
@chriseclectic observed that we should do […]
Ali separately suggested the bottleneck might be in taking the adjoint of the Clifford. Just wanted to point out the test above was with […]. So then it could just be that the nested for-loops over individual Pauli objects need to be replaced with vectorized operations using PauliList objects / numpy arrays. (...hopefully the indexing is not too painful to figure out 😅)
Here is an updated version of the code, with some additions:
outputs:
In this case, the adjoint calculation is not much of a bottleneck since there is only one Clifford (and 100 Paulis), so the calculation is only done once; hence there is not much difference between the options. The main problem is that […]. As @chriseclectic suggested, […]. @aeddins-ibm - do you think that using […]?
Nested for loops are known to be slow in Python. To make matters worse, this implementation starts with […]. It will be faster to allocate the full array size first. It will also be faster to use vectorized numpy operations rather than for loops. Also, the initialization of identity Paulis as […]
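As a rough illustration of the preallocation point (plain NumPy with made-up sizes, not the actual Qiskit internals): growing the symplectic arrays row by row forces repeated allocations, while allocating the full boolean array once avoids them.

```python
import numpy as np

num_paulis, num_qubits = 1000, 50

# Slow pattern: build rows one at a time and stack at the end.
rows = []
for _ in range(num_paulis):
    rows.append(np.zeros(num_qubits, dtype=bool))
z_slow = np.array(rows)

# Faster pattern: allocate the full array once (all-False rows, as for identity Paulis).
z_fast = np.zeros((num_paulis, num_qubits), dtype=bool)

assert np.array_equal(z_slow, z_fast)
```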
Thanks Shelly, running with evolve(circuit) is adequate for my use case. But I also agree we should speed up this code since most users will probably expect evolve to prefer a […]. No promises, but I have some thoughts on how to vectorize this, at least partly. I'll make an attempt and report back.
If I recall correctly, there was a PR by @jakelishman to vectorize some very related code. @jakelishman, was your PR merged?
I think the PR you may be thinking of was to do with […]. Just for the future:
This is generally nowhere near as bad as you might expect. There are more memory allocations involved than doing […]:

```
In [2]: %%timeit
   ...: l = []
   ...: for _ in [None]*1_000_000:
   ...:     l.append(None)
48.9 ms ± 1.13 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [3]: %%timeit
   ...: l = [None]*1_000_000
   ...: for i in range(1_000_000):
   ...:     l[i] = None
38.4 ms ± 545 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [4]: %%timeit
   ...: l = [None for _ in [None]*1_000_000]
26.6 ms ± 1.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
(Of course, Numpy arrays and vectorisation blow all this out of the water. It's just that I think people dismiss […].)
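For comparison, a quick stand-in showing the NumPy version of the same "fill a million slots" task (not from the thread; timings will vary by machine and are only illustrative):

```python
import timeit
import numpy as np

loop_stmt = (
    "l = [None] * 1_000_000\n"
    "for i in range(1_000_000):\n"
    "    l[i] = 0"
)
numpy_stmt = "a = np.zeros(1_000_000, dtype=np.int64)"

print("python loop :", timeit.timeit(loop_stmt, number=10))
print("numpy zeros :", timeit.timeit(numpy_stmt, globals={"np": np}, number=10))
```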
-- Deleted comment since the code I had pasted still has a bug. Didn't try enough test cases, apologies. --
Sorry about the error above; I think this is correct now. The bug was that I was confused about how boolean types were handled inside matrix multiplication. Now it converts the arrays to integers first as a dumb fix, which has some cost (perhaps it can be avoided somehow). There is also some memory cost of the […]. Assuming it's correct, does this seem like an appropriate replacement for the for loops? (It will still need to be modified to handle […])
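The boolean-matmul pitfall mentioned above can be reproduced in isolation (plain NumPy, illustrative only): with bool operands the reduction inside `@` behaves like a logical OR, so the mod-2 arithmetic needed for symplectic Pauli updates is lost unless the arrays are cast to integers first.

```python
import numpy as np

a = np.array([[True, True]])
b = np.array([[True], [True]])

print(a @ b)                                          # [[ True]]: OR of products, not a count
print((a.astype(np.uint8) @ b.astype(np.uint8)) % 2)  # [[0]]: the mod-2 (XOR-like) answer
```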
@jakelishman thanks - you are right, that doesn't look too bad. @aeddins-ibm did you test the above for your example? How is performance?
Have to apologize again, as there is still an error with the phase (this is a learning exercise for me). I think it's because I did not yet account for the extra phase from commuting the X gates in the input Pauli object (self) past the Z gates from the stabilizer table (I also needed to add self.phase at the end, but that's easier -- edited above to include this). Regarding timing, as written above it was ballpark ~100 times faster than the old evolve, but still ~100 times slower than calling _evolve_cx in a loop. Being somewhere in the middle is perhaps not surprising given that (as others have pointed out) it is more expensive to process the large N-qubit stabilizer matrix than to process the smaller matrices of the constituent 2-qubit gates.
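A small sanity check of that phase bookkeeping (plain NumPy, not Qiskit code): commuting an X past a Z flips the sign, which is exactly the kind of factor the symplectic phase calculation has to track.

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]])
Z = np.array([[1, 0], [0, -1]], dtype=complex)

assert np.allclose(X @ Z, -1j * Y)  # XZ = -iY
assert np.allclose(Z @ X, +1j * Y)  # ZX = +iY, so swapping the order costs a sign
```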
I'm out of time to keep working on this for the time being. I did not find why the phase is not computed correctly, though I did only just now discover that […]
OK, understood the difference between […]. Here's the updated code (written as a stand-alone function below; as a class method, replace the pauli argument with self and delete self=pauli):
Here's the test showing it gives correct results for random inputs, with a speed improvement of ~100x compared to the for-loop-based evolve:
Output:
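The test itself isn't reproduced above, but a minimal sketch of such a randomized check could look like the following; `fast_evolve` is a placeholder name for the candidate vectorized implementation (here it just delegates to the existing evolve() so the snippet runs).

```python
from qiskit.quantum_info import random_clifford, random_pauli_list

def fast_evolve(paulis, cliff):
    # Placeholder: swap in the vectorized implementation under test.
    return paulis.evolve(cliff)

for seed in range(20):
    cliff = random_clifford(6, seed=seed)
    paulis = random_pauli_list(6, size=10, seed=seed)
    expected = paulis.evolve(cliff)
    result = fast_evolve(paulis, cliff)
    # Compare element by element, including phases.
    assert all(a == b for a, b in zip(result, expected))
```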
This version also handles […]
Nice work @aeddins-ibm!
Made the code simpler. Basically, the code I posted above eliminated the for loops but seemed to scale worse because of building an N x N matrix, so I reverted to the old nested-for code (which did not need that matrix) and replaced one of the two for loops with a PauliList operation (below for reference). This seems to roughly match or outperform the nested-for code in all cases, including the important case of many qubits, and is maybe more readable. Just need to figure out the tox / lint stuff, I think, and then I will try the PR again.
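The code referenced above isn't reproduced here, but the general pattern (keep the outer loop over qubits, vectorize across all Paulis with a boolean mask) can be sketched in plain NumPy with stand-in data:

```python
import numpy as np

rng = np.random.default_rng(0)
num_paulis, num_qubits = 100, 20

x = rng.integers(0, 2, size=(num_paulis, num_qubits)).astype(bool)        # Paulis' X bits
z = np.zeros((num_paulis, num_qubits), dtype=bool)                        # Z bits to update
image_z = rng.integers(0, 2, size=(num_qubits, num_qubits)).astype(bool)  # stand-in Clifford data

for q in range(num_qubits):   # outer loop over qubits stays
    mask = x[:, q]            # which Paulis act on qubit q
    z[mask] ^= image_z[q]     # XOR the image row into all of those Paulis at once
```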
* Vectorize base_pauli._evolve_clifford: per issue #8177. Seems to give a ~100x speed-up by vectorizing the for loops. I am not sure about the memory cost of the vectorization (einsum, etc.).
* Support Pauli.evolve in addition to PauliList.evolve: use `._x` instead of `.x` to get the correct array dimensionality; use 'uint8' dtype for phases (not doing so raised an error for Pauli but not for PauliList).
* Simpler evolve computation: revised the evolve computation to be more like the original nested-for code before this PR, but replacing one of the two for-loops with PauliList operations. In earlier commits in this PR, I tried to avoid both for-loops using einsum, but that seemed to have worse scaling with the number of qubits. Some clever way to eliminate the remaining for loop could be nice, but at least this code seems comparable to or better than the old for-loop code in speed for all cases. Avoiding introducing a scaling problem at higher qubit number seems important since that's what users might want Cliffords for in the first place.
* Formatting (black).
* Change `x` to `_x` for compatibility with the BasePauli class.
* Fix rare error when evolving a Pauli: the preceding commit in this PR changed an `x` to `_x` to keep the linter happy, but this introduced a rare bug when evolving the Pauli type (but not PauliList, AFAICT); made a less ambiguous test using `isinstance` to avoid this bug.
* Avoid calling isinstance(): avoid calling `isinstance(self, PauliList)` inside the evolve method of BasePauli, since that is problematic; the replacement has an extra call to np.sum, but this does not seem to significantly impact the overall execution time.
* Add release note.
* Reword release note.

Co-authored-by: Jake Lishman <jake.lishman@ibm.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
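On the "uint8 dtype for phases" item: as I understand it, the group phase is stored as an integer exponent mod 4, and accumulating into a bool array fails, which a tiny NumPy example shows (illustrative only, not the Qiskit code):

```python
import numpy as np

phase_bool = np.zeros(3, dtype=bool)
try:
    phase_bool += 2               # in-place add of an int into a bool array
except TypeError as err:          # numpy refuses to cast the result back to bool
    print("bool phases:", err)

phase = np.zeros(3, dtype=np.uint8)
phase += 2
phase %= 4                        # keep the exponent in {0, 1, 2, 3}
print("uint8 phases:", phase)
```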
What should we add?

@aeddins-ibm found that `pauli.evolve()` can be quite slow, and we can get a substantial performance improvement from using private methods for special cases. Since `pauli.evolve()` is the main entry point / user interface, it should be fast. There are probably some simple things that can be done to dramatically speed it up, but it needs some profiling to see what's causing the slowdown.

Here's code that shows `pauli.evolve()` is 10,000x slower than calling `_evolve_cx()` on individual instructions. Some of this is expected since CX is a special case of a Clifford. But a) we should have faster ways of dealing with special cases of self-adjoint Cliffords, and b) is there some other overhead from validation/type-checking?
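The benchmark code isn't captured above, but a rough stand-in along the same lines compares evolving by a Clifford against evolving by the circuit itself, which (per the discussion in this thread) goes through the fast per-gate updates such as `_evolve_cx`. Exact numbers depend on the Qiskit version and machine:

```python
import time
from qiskit import QuantumCircuit
from qiskit.quantum_info import Clifford, Pauli

num_qubits = 50
qc = QuantumCircuit(num_qubits)
for i in range(num_qubits - 1):
    qc.cx(i, i + 1)

cliff = Clifford(qc)
pauli = Pauli("Z" * num_qubits)

t0 = time.perf_counter()
pauli.evolve(cliff)   # evolve by the full Clifford (the slow path reported here)
t1 = time.perf_counter()
pauli.evolve(qc)      # evolve by the circuit, gate by gate
t2 = time.perf_counter()
print(f"evolve(Clifford): {t1 - t0:.4f} s   evolve(circuit): {t2 - t1:.4f} s")
```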