Multishift QZ with AED #421

thijssteel · 2020-07-03T19:50:21Z

(PR reopened because I had to change branches)

This PR adds an implementation of the multishift QZ algorithm with AED.
It is loosely based on my implementation of the rational QZ algorithm (https://github.com/thijssteel/multishift-multipole-rqz).

It features:

Aggressive early deflation
Multishift QZ sweeps using optimal packing of the bulges
A new heuristic to select the number of positions in the sweep windows

It does not feature:

A windowed deflation of infinite eigenvalues (that is only useful if many infinite eigenvalues are to be deflated and in that case you should probably do some preprocessing anyway).

All things considered, this won't significantly reduce the runtime of generalised Schur form calculations because the HT reduction is now dominant, but I hope to have a really fast implementation of that soon.

depends on #420

codecov · 2020-07-03T20:02:35Z

Codecov Report

Merging #421 (3bbb3e8) into master (6281084) will increase coverage by 0.06%.
The diff coverage is 86.13%.

@@            Coverage Diff             @@
##           master     #421      +/-   ##
==========================================
+ Coverage   83.33%   83.40%   +0.06%     
==========================================
  Files        1820     1838      +18     
  Lines      170849   188066   +17217     
==========================================
+ Hits       142378   156850   +14472     
- Misses      28471    31216    +2745

Impacted Files	Coverage Δ
SRC/ilaenv.f	`23.07% <0.00%> (-0.09%)`	⬇️
SRC/iparmq.f	`0.00% <0.00%> (ø)`
SRC/claqz0.f	`65.61% <65.61%> (ø)`
SRC/cgges3.f	`93.71% <66.66%> (+0.42%)`	⬆️
SRC/dgges3.f	`91.58% <66.66%> (+0.67%)`	⬆️
SRC/sgges3.f	`89.50% <66.66%> (+0.72%)`	⬆️
SRC/zgges3.f	`93.71% <66.66%> (+0.42%)`	⬆️
SRC/zlaqz0.f	`68.32% <68.32%> (ø)`
SRC/dggev3.f	`93.75% <75.00%> (+0.45%)`	⬆️
SRC/cggev3.f	`93.10% <80.00%> (+0.60%)`	⬆️
... and 1856 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

thijssteel · 2020-08-05T14:38:05Z

Two updates:

I've decided to use the same parameters as for the QR algorithm. This is a bit cleaner, but because of the optimal packing of the bulges, those parameters are suboptimal for QZ.
I mentioned that the HT reduction is now dominant. That was slightly wrong because I was linking with the MKL version of LAPACK and they disable the blocked HT reduction for some reason. A proper implementation is already much faster.

This implementation should be done now, let me know if you need anything else like documentation or performance tests.

thijssteel · 2020-11-12T08:15:42Z

I've done some performance tests on my laptop. The following table shows execution time in seconds for different pencil sizes.

N	DLAQZ0 (easy pencil)	DLAQZ0 (hard pencil)	DHGEQZ (easy pencil)	DHGEQZ (hard pencil)
1000	1.2477440	1.8427166	4.4717226	5.2714647
1414	1.8305675	3.4307614	13.5002084	15.8388140
2000	3.2328878	6.4804934	41.9715897	49.1257091
2828	3.9365313	15.0546696	118.7894478	135.7473488
4000	8.8735433	31.2473274	365.1408321	400.7008308

langou · 2020-11-12T15:13:31Z

from 12x to 40x speedup!!!!
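
For reference, the 12x and 40x figures match the N = 4000 row of the table above (hard and easy pencil respectively). A minimal Python sketch that recomputes the DHGEQZ/DLAQZ0 ratios from the reported timings; the variable and function names are illustrative only:

```python
# Timings in seconds, copied from the table above.
# Columns: N, DLAQZ0 easy, DLAQZ0 hard, DHGEQZ easy, DHGEQZ hard.
TIMINGS = [
    (1000, 1.2477440, 1.8427166, 4.4717226, 5.2714647),
    (1414, 1.8305675, 3.4307614, 13.5002084, 15.8388140),
    (2000, 3.2328878, 6.4804934, 41.9715897, 49.1257091),
    (2828, 3.9365313, 15.0546696, 118.7894478, 135.7473488),
    (4000, 8.8735433, 31.2473274, 365.1408321, 400.7008308),
]

def speedups(rows):
    """Return {N: (easy_speedup, hard_speedup)} as DHGEQZ time / DLAQZ0 time."""
    return {n: (hq_easy / lq_easy, hq_hard / lq_hard)
            for n, lq_easy, lq_hard, hq_easy, hq_hard in rows}

for n, (easy, hard) in speedups(TIMINGS).items():
    print(f"N={n:5d}  easy: {easy:5.1f}x  hard: {hard:5.1f}x")
```

On these numbers the ratio grows with N for both pencils, so the headline speedup applies to the largest size tested rather than to every row.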

thijssteel · 2020-11-12T15:29:35Z

Yeah, pretty massive speedup. It's the result of years of research culminating in one implementation (because DHGEQZ is quite old). Compared to more recent libraries like NLAFET this is only about twice as fast though.

thijssteel · 2021-03-17T21:58:15Z

I'll make some time to look at it over the weekend, but NW=0 should not be possible.

The only thing I can think of is that NWR is not set in ILAENV in that particular test. Do you still know what specific test failed?

thijssteel · 2021-03-17T22:00:14Z

Note to self: possibly related to the XGGEV routines

weslleyspereira · 2021-03-17T22:06:23Z

> I'll make some time to look at it over the weekend, but NW=0 should not be possible.
>
> The only thing I can think of is that NWR is not set in ILAENV in that particular test. Do you still know what specific test failed?

Yes. That's exactly it. ILAENV returns 0 for this test. The first test in cgd.in and the first test in zgd.in both fail. They are not the only ones, but the first is a good place to look.

thijssteel · 2021-03-17T22:09:40Z

Is it the responsibility of the routine to check whether the value returned by ILAENV is valid?

thijssteel · 2021-03-18T08:11:34Z

should be solved now

weslleyspereira · 2021-03-18T12:43:17Z

Thanks! You're the best! Now I see how we need to set the parameters in the specific "err" files.

Just for my personal record:

Outside the test suite: ILAENV(ISPEC=12,...,17) calls IPARMQ(ISPEC)
Inside the test suite: ILAENV(ISPEC=12,...,17) accesses the array IPARMS(ISPEC)

thijssteel · 2021-03-18T12:50:55Z

Perhaps it would be a good idea to initialize that array to the values in IPARMQ to avoid issues with it in the future.
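
One way to picture this suggestion is the following toy Python model (not LAPACK code). The dictionary `IPARMQ_DEFAULTS`, its values, and the functions `init_iparms` and `ilaenv_test` are hypothetical stand-ins; the real defaults live in IPARMQ. The point is only that a table pre-filled from the defaults does not hand back a zero window size unless a test explicitly sets one.

```python
# Toy model of the test-suite ILAENV lookup. All names and values here are
# hypothetical placeholders, not the real IPARMQ defaults.
IPARMQ_DEFAULTS = {12: 75, 13: 2, 14: 14, 15: 2, 16: 0, 17: 10}

def init_iparms(size=100):
    """Build the IPARMS table, pre-filled from the defaults for ISPEC 12..17."""
    iparms = [0] * (size + 1)      # index 0 unused, to mimic 1-based Fortran
    for ispec, value in IPARMQ_DEFAULTS.items():
        iparms[ispec] = value
    return iparms

def ilaenv_test(ispec, iparms):
    """Test-suite flavour of ILAENV: a plain table lookup."""
    return iparms[ispec]

iparms = init_iparms()
iparms[15] = 4                     # a test input file may still override a value
nw = ilaenv_test(13, iparms)       # non-zero even though no test set ISPEC=13
```

With an all-zero table, the same lookup returns 0, which is the situation that produced NW=0 in the failing tests.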

weslleyspereira · 2021-03-18T13:35:04Z

Probably. The problem is that different values are set for the same ISPEC in different places. See:

https://github.com/thijssteel/lapack/blob/0b8015ebeb342abeaa2dcf8d350c0223ad2e80ed/TESTING/EIG/cchkee.f#L1248

https://github.com/thijssteel/lapack/blob/0b8015ebeb342abeaa2dcf8d350c0223ad2e80ed/TESTING/EIG/cchkee.f#L1387

https://github.com/thijssteel/lapack/blob/0b8015ebeb342abeaa2dcf8d350c0223ad2e80ed/TESTING/EIG/cchkee.f#L1808

langou · 2021-03-27T15:16:58Z

> I've done some performance tests on my laptop. The following table shows execution time in seconds for different pencil sizes.
>
> N	DLAQZ0 (easy pencil)	DLAQZ0 (hard pencil)	DHGEQZ (easy pencil)	DHGEQZ (hard pencil)
> 1000	1.2477440	1.8427166	4.4717226	5.2714647
> 1414	1.8305675	3.4307614	13.5002084	15.8388140
> 2000	3.2328878	6.4804934	41.9715897	49.1257091
> 2828	3.9365313	15.0546696	118.7894478	135.7473488
> 4000	8.8735433	31.2473274	365.1408321	400.7008308

Hi Thijs, what is an easy pencil? And what is a hard pencil? I am trying to time this as well. Julien.

thijssteel · 2021-03-27T15:55:41Z

These pencils are also discussed in the paper (and in other papers for that matter).

The easy pencil is generated in Hessenberg-triangular form (A upper Hessenberg, B upper triangular) with randomly drawn entries. The hard pencil has the same structure but with entries A_ij = i + j and B_ij = 3i + 2j.
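
As a rough illustration of these two pencils, here is a minimal pure-Python sketch. It assumes 1-based indices i, j in the formulas, a uniform distribution on [-1, 1] for the random entries, and that the hard pencil keeps only the Hessenberg part of A and the upper triangular part of B. None of these details are stated in the thread, and the function names are hypothetical.

```python
import random

def easy_pencil(n, seed=0):
    """Random pencil (A, B): A upper Hessenberg, B upper triangular."""
    rng = random.Random(seed)
    # Row index i and column index j are 0-based here; entry (i, j) of an
    # upper Hessenberg matrix may be non-zero only when i <= j + 1.
    a = [[rng.uniform(-1.0, 1.0) if i <= j + 1 else 0.0 for j in range(n)]
         for i in range(n)]
    b = [[rng.uniform(-1.0, 1.0) if i <= j else 0.0 for j in range(n)]
         for i in range(n)]
    return a, b

def hard_pencil(n):
    """Pencil with A_ij = i + j and B_ij = 3i + 2j (1-based), same structure."""
    a = [[float((i + 1) + (j + 1)) if i <= j + 1 else 0.0 for j in range(n)]
         for i in range(n)]
    b = [[float(3 * (i + 1) + 2 * (j + 1)) if i <= j else 0.0 for j in range(n)]
         for i in range(n)]
    return a, b
```

Nested lists keep the sketch dependency-free; for the sizes in the timing table one would store the matrices in column-major arrays and pass them to the LAPACK routine instead.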

You may need to adapt the parameters to achieve optimal speedup.

I'll dig up the test code later for you.

thijssteel · 2021-03-31T08:55:23Z

@langou this is the code I used for testing: https://gist.github.com/thijssteel/6b2b7509fdc0a86fed5cc925ca268963

It should be linked with mmio.f for Matrix Market file I/O.

Command line arguments:

Algorithm: 1 for DLAQZ0, 2 for DHGEQZ
Matrixtype: 1 for random, 2 for i+j, 3 for matrix market
N/File: when matrixtype is 1 or 2, an integer denoting the size of the pencil; when matrixtype is 3, the filenames of two Matrix Market files

langou · 2021-03-31T13:10:39Z

Thanks for sharing @thijssteel ! This is helpful! Cheers, Julien.

weslleyspereira · 2021-03-31T21:24:21Z

Hi! I've just reverted my commits related to the GGEV subroutines. We (@langou and I) recently realized we actually needed to test GGEV3. Sorry for this confusion.

langou · 2021-04-08T15:13:00Z

Some good news to report: MathWorks took Thijs's QZ code, which is to be released in 3.10, ran it through their test suite, and it passes. So this is really great! Big thanks to MathWorks for taking the time to export the pre-release of LAPACK 3.10, compile it, sneak it under the hood of MATLAB, and run their test suite. Everything is running fine. So thanks to MathWorks, and thanks to Thijs.

weslleyspereira · 2021-04-08T17:35:05Z

> I've done some performance tests on my laptop. The following table shows execution time in seconds for different pencil sizes.
>
> N	DLAQZ0 (easy pencil)	DLAQZ0 (hard pencil)	DHGEQZ (easy pencil)	DHGEQZ (hard pencil)
> 1000	1.2477440	1.8427166	4.4717226	5.2714647
> 1414	1.8305675	3.4307614	13.5002084	15.8388140
> 2000	3.2328878	6.4804934	41.9715897	49.1257091
> 2828	3.9365313	15.0546696	118.7894478	135.7473488
> 4000	8.8735433	31.2473274	365.1408321	400.7008308

Hi @thijssteel. What kind of BLAS did you use to obtain these results? And what was your processor? Thanks.

thijssteel · 2021-04-08T17:52:01Z

Intel MKL (I can't remember the version, but it probably doesn't matter that much). The machine has an Intel Xeon E5-2697 v3 CPU with 14 cores and 128 GB of RAM. The parameters were also adapted to achieve this performance (see a commit in the tree that I reverted).

That's the compute server I have access to; my PC has an Intel i7-8750H and 16 GB of RAM. I can't remember which one it was.

weslleyspereira · 2021-04-09T19:42:32Z

I did some experiments on my laptop too. My objective was to run this branch without optimizing the parameters.

For that I used:

@thijssteel's code: https://gist.github.com/thijssteel/6b2b7509fdc0a86fed5cc925ca268963.
mmio.f from https://math.nist.gov/MatrixMarket/mmio/f/mmio.f.
OpenBLAS (https://github.com/xianyi/OpenBLAS/tree/f9aaf22fc3aa63fd74c0f268826235a65d12cf4c) built with cmake -DCMAKE_BUILD_TYPE=Release. No additional configuration.
This branch (Multishift QZ with AED #421) built with cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_Fortran_FLAGS=-fopenmp

My system:

Intel(R) Core(TM) i7-10710U CPU @ 1.10GHz (12 cores)
7.8 GB RAM
Ubuntu 18.04.5 LTS (5.4.0-70-generic)
GNU compilers, GCC version 7.5.0

Results:

N	DLAQZ0 (easy pencil)	DLAQZ0 (hard pencil)	DHGEQZ (easy pencil)	DHGEQZ (hard pencil)
1000	1.5187637	2.5955450	6.8730151	6.6085497
2000	4.3298647	9.8546768	54.9083396	55.6486221
3000	19.8760378	35.1632956	198.6228615	194.1338587
4000	34.8435903	88.3386668	439.0105713	484.1612056

All the results represent the best time in 3 runs.

We still arrive at a 12x speedup (easy pencil, N = 4000), and that is awesome!!

langou · 2021-04-09T20:25:30Z

Interesting comparison. My takeaway is that @thijssteel's laptop is faster than @weslleyspereira's laptop ;)

( More seriously, thanks for the timing @weslleyspereira, it's great. )

thijssteel · 2021-04-09T20:25:43Z

I want to note that the original 40× speedup is somewhat unrealistic. With well-tuned parameters, that easy pencil just keeps doing AED and no sweeps are required. I reverted that commit because I don't want to optimise for such a special case.

weslleyspereira · 2021-04-12T18:50:59Z

At the request of @langou, I did some additional experiments with GGES3 on my laptop.

Adaptation to @thijssteel's code: https://gist.github.com/weslleyspereira/15929a3a363683ddac5647ee2fe73723.
OpenBLAS (https://github.com/xianyi/OpenBLAS/tree/f9aaf22fc3aa63fd74c0f268826235a65d12cf4c) built with cmake -DCMAKE_BUILD_TYPE=Release. No additional configuration.
weslleyspereira/LAPACK branches:
- DGGES3 using DGGHD3 and DLAQZ0: https://github.com/weslleyspereira/lapack/tree/multishift-aed-QZ-performance
- DGGES3 using DGGHRD and DHGEQZ: https://github.com/weslleyspereira/lapack/tree/no-multishift-aed-QZ-performance
Input: general matrices with random entries, USER_SEED = 1

My system:

Intel(R) Core(TM) i7-10710U CPU @ 1.10GHz (12 cores)
7.8 GB RAM
Ubuntu 18.04.5 LTS (5.4.0-70-generic)
GNU compilers, GCC version 7.5.0

Results (best time of 3 runs of DGGES3):

N	DGGES3	DGGHD3	DLAQZ0
1000	3.1482918	1.7105347 (54.3%)	1.3828994 (43.9%)
2000	17.7040713	12.8302498 (72.5%)	4.3809202 (24.7%)
3000	69.7460524	46.6079671 (66.8%)	10.2170884 (14.6%)
4000	145.9439852	119.7468701 (82.0%)	17.1359081 (11.7%)

N	DGGES3	DGGHRD	DHGEQZ
1000	5.3760809	2.0468199 (38.1%)	3.2798471 (61.0%)
2000	57.9165448	27.2491951 (47.0%)	30.0143812 (51.8%)
3000	215.3769933	108.1476715 (50.2%)	105.4639979 (49.0%)
4000	537.6017731	291.7817445 (54.3%)	241.6236139 (44.9%)

langou · 2021-04-12T20:24:47Z

Thanks for the run @weslleyspereira !

GGES3 parameters for this experiment are: JOBVSL = 'N', JOBVSR = 'N', SORT = 'N'.

So in short, taking N=4,000 as a reference, GGES3 is 3.7x faster in v3.10 than v3.5.

This is because GGHD3 (released in v3.6, Nov 2015) is 2.4x faster than GGHRD (in v3.5).

And then LAQZ0 (to be released in v3.10, June 2021) is 14.1x faster than HGEQZ (in v3.9).

In v3.5, the times in GGHRD and HGEQZ were about 54.3% and 44.9% resp. of GGES3.

In v3.10, the times in GGHD3 and LAQZ0 are about 82.0% and 11.7% resp. of GGES3.

So, now that QZ is so fast, the reduction to Hessenberg-triangular form has become the major bottleneck of GGES3.

All these experiments are done without any tuning and using the default parameters.
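
The ratios quoted above can be rechecked from the N = 4000 rows of the two tables. A small Python sketch, where the dictionary and function names are just for illustration:

```python
# Timings in seconds at N = 4000, copied from the two tables above.
NEW = {"GGES3": 145.9439852, "GGHD3": 119.7468701, "LAQZ0": 17.1359081}
OLD = {"GGES3": 537.6017731, "GGHRD": 291.7817445, "HGEQZ": 241.6236139}

def ratio(old, new):
    """Speedup factor: old time divided by new time."""
    return old / new

def share(part, total):
    """Percentage of the total driver time spent in one phase."""
    return 100.0 * part / total

print(f"GGES3 overall:   {ratio(OLD['GGES3'], NEW['GGES3']):.1f}x")
print(f"GGHD3 vs GGHRD:  {ratio(OLD['GGHRD'], NEW['GGHD3']):.1f}x")
print(f"LAQZ0 vs HGEQZ:  {ratio(OLD['HGEQZ'], NEW['LAQZ0']):.1f}x")
print(f"GGHD3 share:     {share(NEW['GGHD3'], NEW['GGES3']):.1f}%")
print(f"LAQZ0 share:     {share(NEW['LAQZ0'], NEW['GGES3']):.1f}%")
```

Since the reduction now accounts for roughly four fifths of the driver time at this size, further gains in the QZ step alone can shorten the total by at most the remaining fifth.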

weslleyspereira · 2021-04-12T20:31:31Z

Yes. That's it! I should also say that DLAQZ0 uses all my cores during almost all its execution. But we may need to use a proper profiler to get more information about core occupancy and balancing.

thijssteel · 2021-04-12T20:39:53Z

I don't think OpenBLAS can utilize a variable number of cores. It's just a switch between threading and not threading. Using MKL reveals that DLAQZ0 doesn't scale all that well (only up to about 4 cores). Most of the multiplications involve thin matrices (this may improve with other parameters).

Also don't forget about the eigenvector calculation. I haven't timed it myself, but given the lack of level 3 BLAS calls, it can easily be the major bottleneck.

thijssteel force-pushed the multishift-aed-QZ branch from a150ae2 to 5e0dc04 Compare July 5, 2020 11:47

thijssteel force-pushed the multishift-aed-QZ branch 2 times, most recently from 3a9bddb to fb06515 Compare September 17, 2020 08:52

julielangou added this to the Next Release milestone Nov 12, 2020

langou mentioned this pull request Jan 28, 2021

Convergence errors in ggev and gges with complex double element types #475

Closed

thijssteel force-pushed the multishift-aed-QZ branch from fd22927 to 21e993a Compare February 10, 2021 15:31

langou mentioned this pull request Feb 13, 2021

Add GSVD with QR factorizations, 2-by-1 CS decomposition #406

Open

thijssteel added 18 commits February 14, 2021 11:25

c21d77d  add double precision QZ code
8ce4a70  remove some extra subroutines
7b0afdf  some more cleanup
49d9be2  add the new QZ solver to dggev3
701a890  also add the files to cmake, not only make
81211e4  attempt to improve code coverage
9f3c00c  attempt to fix coverage (again)
a6e2f3e  use ilaenv, better errors and sweet sweet recusion
03fe5c4  add new files to cmake list
433a5b5  fix dumb mistake
ffef71f  fix the tests
4a26b58  kapot
2d276d9  fix a few bugs
1034f10  update deflation criterium in AED to reflect dlaqr3
e310909  change the parameters a little to increase coverage
b7e5cfa  some small fixes
4f7bee6  some more improvements
a7c0cbd  some formatting + solve overflow in dlaqz2
0b8015e  fix error exit tests
3bbb3e8  Revert "Merge pull request #1 from weslleyspereira/try-dggev-with-mul…
         …tishift-aed" This reverts commit 77a97c4, reversing changes made to 93fd62f.

langou approved these changes Apr 15, 2021

langou merged commit f97e867 into Reference-LAPACK:master Apr 15, 2021

christoph-conrads pushed a commit to christoph-conrads/lapack that referenced this pull request May 23, 2021

Merge pull request Reference-LAPACK#421 from thijssteel/multishift-ae…

649e4e4

…d-QZ Multishift QZ with AED

langou mentioned this pull request Oct 11, 2021

DGEES gives different results in versions 3.9 and 3.10 #628

Closed

2 tasks

eprovst mentioned this pull request Mar 17, 2023

Prefer blocked LAPACK routines for generalized eigenvalue problems JuliaLang/julia#49037

Merged
