Multishift QZ with AED #421
Codecov Report
@@ Coverage Diff @@
## master #421 +/- ##
==========================================
+ Coverage 83.33% 83.40% +0.06%
==========================================
Files 1820 1838 +18
Lines 170849 188066 +17217
==========================================
+ Hits 142378 156850 +14472
- Misses 28471 31216 +2745
|
a150ae2 to 5e0dc04
Two updates:
This implementation should be done now. Let me know if you need anything else, such as documentation or performance tests. |
3a9bddb to fb06515
I've done some performance tests on my laptop. The following table shows execution time in seconds for different pencil sizes.
|
from 12x to 40x speedup!!!! |
Yeah, pretty massive speedup. It's the result of years of research culminating in one implementation (because DHGEQZ is quite old). Compared to more recent libraries like NLAFET this is only about twice as fast though. |
fd22927 to 21e993a
I'll make some time to look at it over the weekend, but NW=0 should not be possible. The only thing I can think of is that NWR is not set in ILAENV in that particular test. Do you still know which specific test failed? |
Note to self: possibly related to the XGGEV routines |
Yes. That's exactly it. ILAENV returns 0 for this test. The first test in |
Is it the responsibility of the routine to check whether the value returned by ILAENV is valid? |
Should be solved now. |
Thanks! You're the best! Now I see how we need to set the parameters in the specific "err" files. Just for my personal record:
|
Perhaps it would be a good idea to initialize that array to the values in IPARMQ to avoid issues with it in the future. |
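The idea of falling back to known-good defaults when a parameter query returns 0 can be sketched as follows. This is an illustrative Python sketch only, not LAPACK code: `query_param`, `safe_param`, `DEFAULTS`, and the numeric values are hypothetical placeholders and do not reflect the actual values set in IPARMQ.

```python
# Hypothetical placeholder defaults; NOT the real IPARMQ values.
DEFAULTS = {"NW": 2, "NWR": 2}


def query_param(name, table):
    """Stand-in for an ILAENV-style lookup; returns 0 when the entry is unset."""
    return table.get(name, 0)


def safe_param(name, table):
    """Return the queried value, falling back to a default if it is not positive."""
    value = query_param(name, table)
    return value if value > 0 else DEFAULTS[name]
```

With a guard like this, an unset entry in a test's parameter table would yield the default rather than a zero window size.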
Probably.. The problem is that we have different values being set for the same |
Hi Thijs, what is an easy pencil? And what is a hard pencil? I am trying to time this as well. Julien. |
These pencils are also discussed in the paper (and in other papers, for that matter). The easy pencil is generated in Hessenberg-triangular form with randomly drawn entries. The hard pencil has the same form, but with entries A_ij = i + j and B_ij = 3i + 2j. You may need to adapt the parameters to achieve optimal speedup. I'll dig up the test code for you later. |
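For concreteness, the two test pencils described above can be sketched in plain Python. This is a minimal sketch, not the actual test code: the function names are hypothetical, indices are assumed to be 1-based as in Fortran, and the hard pencil is assumed to keep the same sparsity pattern as the easy one (A upper Hessenberg, B upper triangular). The distribution of the random entries is not stated above; a uniform draw from [-1, 1] is an assumption.

```python
import random


def easy_pencil(n, seed=0):
    """Random pencil (A, B): A upper Hessenberg, B upper triangular.

    Assumption: entries are drawn uniformly from [-1, 1].
    """
    rng = random.Random(seed)
    # 0-based indices here; A keeps entries on and above the first subdiagonal.
    a = [[rng.uniform(-1.0, 1.0) if j >= i - 1 else 0.0 for j in range(n)]
         for i in range(n)]
    b = [[rng.uniform(-1.0, 1.0) if j >= i else 0.0 for j in range(n)]
         for i in range(n)]
    return a, b


def hard_pencil(n):
    """Pencil with A_ij = i + j and B_ij = 3i + 2j for 1-based i, j.

    Assumption: same sparsity pattern as the easy pencil.
    """
    a = [[float(i + j) if j >= i - 1 else 0.0 for j in range(1, n + 1)]
         for i in range(1, n + 1)]
    b = [[float(3 * i + 2 * j) if j >= i else 0.0 for j in range(1, n + 1)]
         for i in range(1, n + 1)]
    return a, b
```

The matrices are returned as nested lists, so they would still need to be written out (for example in Matrix Market format) before being fed to a Fortran driver.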
@langou, this is the code I used for testing: https://gist.github.com/thijssteel/6b2b7509fdc0a86fed5cc925ca268963. It should be linked with mmio.f for Matrix Market files. Command line arguments:
|
Thanks for sharing @thijssteel ! This is helpful! Cheers, Julien. |
Hi! I've just reverted my commits related to the GGEV subroutines. We (@langou and I) recently realized we actually needed to test GGEV3. Sorry for this confusion. |
Some good news to report: MathWorks took the QZ code from Thijs that is to be released in 3.10 and ran it against their test suite, and the QZ code passes. So this is really great! Big thanks to MathWorks for taking the time to export the pre-release of LAPACK 3.10, compile it, sneak it under the hood of Matlab, and run their test suite. Everything is running fine. So thanks to MathWorks, and thanks to Thijs. |
Hi @thijssteel. What kind of BLAS did you use to obtain these results? And what was your processor? Thanks. |
Intel MKL (I can't remember the version, but it probably doesn't matter that much). Machine has an Intel Xeon E5-2697 v3 CPU with That's the compute server I have access to; my PC has an Intel i7-8750H and 16GB of RAM. I can't remember which one it was. |
I did some experiments on my laptop too. My objective was to run this branch without optimizing the parameters. For that I used:
My system:
Results:
We still arrive at 12x speedup, and that is awesome!! |
Interesting comparison. My takeaway is that @thijssteel's laptop is faster than @weslleyspereira's laptop ;) (More seriously, thanks for the timing, @weslleyspereira, it's great.) |
I want to note that the original 40× speedup is somewhat unrealistic. With well-tuned parameters, that easy pencil just keeps doing AED and no sweeps are required. I reverted that commit because I don't want to optimise for such a special case. |
At the request of @langou, I did some additional experiments for GGES3 on my laptop.
My system:
Results (best time of 3 runs of
|
Thanks for the run, @weslleyspereira! The GGES3 parameters for this experiment are JOBVSL = ’N’, JOBVSR = ’N’, SORT = ’N’. In short, taking N = 4,000 as a reference, GGES3 is 3.7x faster in v3.10 than in v3.5. This is because GGHD3 (released in v3.6, Nov 2015) is 2.4x faster than GGHRD (in v3.5), and LAQZ0 (to be released in v3.10, June 2021) is 14.1x faster than HGEQZ (in v3.9). In v3.5, GGHRD and HGEQZ took about 54.3% and 44.9% of the GGES3 time, respectively. In v3.10, the corresponding steps (GGHD3 and LAQZ0) take about 82.0% and 11.7% of the GGES3 time, respectively. So, now that QZ is so fast, the reduction to Hessenberg-triangular form has become the major bottleneck of GGES3. All these experiments were done without any tuning, using the default parameters. |
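As a sanity check on the arithmetic above, the overall speedup can be recombined from the per-step shares and speedups. This is a rough sketch: the helper name `combined_speedup` is hypothetical, and it assumes the remaining share of the v3.5 runtime (about 0.8%) is unchanged between versions.

```python
def combined_speedup(fractions, speedups):
    """Overall speedup when each part of the old runtime gets its own speedup.

    fractions: share of the old runtime per part; the remainder is assumed unchanged.
    speedups: speedup factor per part, in the same order as fractions.
    """
    rest = 1.0 - sum(fractions)
    new_time = rest + sum(f / s for f, s in zip(fractions, speedups))
    return 1.0 / new_time


# Figures quoted above for N = 4,000:
# reduction step: 54.3% of the old time, 2.4x faster;
# QZ step: 44.9% of the old time, 14.1x faster.
overall = combined_speedup([0.543, 0.449], [2.4, 14.1])
print(f"overall speedup: {overall:.2f}x")
```

Under that assumption the result lands between 3.7x and 3.8x, in line with the measured 3.7x.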
Yes. That's it! I should also say that DLAQZ0 uses all my cores during almost all its execution. But we may need to use a proper profiler to get more information about core occupancy and balancing. |
I don't think OpenBLAS can utilize a variable number of cores. It's just a switch between threading and not threading. Using MKL reveals that it doesn't scale all that well (up to about 4 cores). Most of the multiplications involve thin matrices (this may improve with other parameters). Also, don't forget about the eigenvector calculation. I haven't timed it myself, but given the lack of level 3 BLAS calls, it can easily be the major bottleneck. |
…d-QZ Multishift QZ with AED
(PR reopened because I had to change branches)
This PR adds an implementation of the multishift QZ algorithm with AED.
It is loosely based on my implementation of the rational QZ algorithm (https://github.com/thijssteel/multishift-multipole-rqz).
It features:
It does not feature:
All things considered, this won't significantly reduce the runtime of generalised Schur form calculations because the HT reduction is now dominant, but I hope to be able to have a really fast implementation of that soon.
Depends on #420.