Multishift QZ with AED #421

Merged (50 commits) on Apr 15, 2021

thijssteel
Collaborator

@thijssteel thijssteel commented Jul 3, 2020

(PR reopened because I had to change branches)

This PR adds an implementation of the multishift QZ algorithm with AED.
It is loosely based on my implementation of the rational QZ algorithm (https://github.com/thijssteel/multishift-multipole-rqz).

It features:

  • Aggressive early deflation
  • Multishift QZ sweeps using optimal packing of the bulges
  • A new heuristic to select the number of positions in the sweep windows

It does not feature:

  • A windowed deflation of infinite eigenvalues (that is only useful if many infinite eigenvalues are to be deflated and in that case you should probably do some preprocessing anyway).

All things considered, this won't significantly reduce the runtime of generalised Schur form calculations because the HT reduction is now dominant, but I hope to be able to have a really fast implementation of that soon.

depends on #420

@codecov

codecov bot commented Jul 3, 2020

Codecov Report

Merging #421 (3bbb3e8) into master (6281084) will increase coverage by 0.06%.
The diff coverage is 86.13%.

@@            Coverage Diff             @@
##           master     #421      +/-   ##
==========================================
+ Coverage   83.33%   83.40%   +0.06%     
==========================================
  Files        1820     1838      +18     
  Lines      170849   188066   +17217     
==========================================
+ Hits       142378   156850   +14472     
- Misses      28471    31216    +2745     
Impacted Files    Coverage Δ
SRC/ilaenv.f      23.07% <0.00%> (-0.09%) ⬇️
SRC/iparmq.f      0.00% <0.00%> (ø)
SRC/claqz0.f      65.61% <65.61%> (ø)
SRC/cgges3.f      93.71% <66.66%> (+0.42%) ⬆️
SRC/dgges3.f      91.58% <66.66%> (+0.67%) ⬆️
SRC/sgges3.f      89.50% <66.66%> (+0.72%) ⬆️
SRC/zgges3.f      93.71% <66.66%> (+0.42%) ⬆️
SRC/zlaqz0.f      68.32% <68.32%> (ø)
SRC/dggev3.f      93.75% <75.00%> (+0.45%) ⬆️
SRC/cggev3.f      93.10% <80.00%> (+0.60%) ⬆️
... and 1856 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@thijssteel thijssteel force-pushed the multishift-aed-QZ branch from a150ae2 to 5e0dc04 Compare July 5, 2020 11:47
@thijssteel
Collaborator Author

Two updates:

  • I've decided to use the same parameters as for the QR algorithm. This is a bit cleaner, but because of the optimal packing, those values are suboptimal for QZ.

  • I mentioned that the HT reduction is now dominant. That was slightly wrong because I was linking with the MKL version of LAPACK and they disable the blocked HT reduction for some reason. A proper implementation is already much faster.

This implementation should be done now. Let me know if you need anything else, such as documentation or performance tests.

@thijssteel thijssteel force-pushed the multishift-aed-QZ branch 2 times, most recently from 3a9bddb to fb06515 Compare September 17, 2020 08:52
@thijssteel
Collaborator Author

I've done some performance tests on my laptop. The following table shows execution time in seconds for different pencil sizes.

N     DLAQZ0 (easy pencil)  DLAQZ0 (hard pencil)  DHGEQZ (easy pencil)  DHGEQZ (hard pencil)
1000  1.2477440             1.8427166             4.4717226             5.2714647
1414  1.8305675             3.4307614             13.5002084            15.8388140
2000  3.2328878             6.4804934             41.9715897            49.1257091
2828  3.9365313             15.0546696            118.7894478           135.7473488
4000  8.8735433             31.2473274            365.1408321           400.7008308

@langou
Contributor

langou commented Nov 12, 2020

from 12x to 40x speedup!!!!
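
For reference, the 12x and 40x figures roughly match the N = 4000 row of the timing table earlier in this thread (hard and easy pencil respectively). A small Python sketch that recomputes the DHGEQZ / DLAQZ0 ratios from the posted timings; the names `TIMINGS` and `speedups` are only illustrative:

```python
# Timings in seconds, copied from the performance table posted in this thread:
# N -> (DLAQZ0 easy, DLAQZ0 hard, DHGEQZ easy, DHGEQZ hard)
TIMINGS = {
    1000: (1.2477440, 1.8427166, 4.4717226, 5.2714647),
    1414: (1.8305675, 3.4307614, 13.5002084, 15.8388140),
    2000: (3.2328878, 6.4804934, 41.9715897, 49.1257091),
    2828: (3.9365313, 15.0546696, 118.7894478, 135.7473488),
    4000: (8.8735433, 31.2473274, 365.1408321, 400.7008308),
}


def speedups(timings):
    """Return {N: (easy speedup, hard speedup)} as DHGEQZ time / DLAQZ0 time."""
    return {
        n: (old_easy / new_easy, old_hard / new_hard)
        for n, (new_easy, new_hard, old_easy, old_hard) in timings.items()
    }


if __name__ == "__main__":
    for n, (easy, hard) in sorted(speedups(TIMINGS).items()):
        print(f"N={n:5d}  easy: {easy:5.1f}x  hard: {hard:5.1f}x")
```

The ratios grow with N, so the headline numbers only apply to the largest pencils in the table.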

@thijssteel
Collaborator Author

Yeah, pretty massive speedup. It's the result of years of research culminating in one implementation (because DHGEQZ is quite old). Compared to more recent libraries like NLAFET this is only about twice as fast though.

@thijssteel
Collaborator Author

I'll make some time to look at it over the weekend, but NW=0 should not be possible.

The only thing I can think of is that NWR is not set in ILAENV in that particular test. Do you still know which specific test failed?

@thijssteel
Collaborator Author

thijssteel commented Mar 17, 2021

Note to self: possibly related to the XGGEV routines

@weslleyspereira
Collaborator

> I'll make some time to look at it in the weekend, but NW=0 should not be possible.
>
> The only thing i can think of is that NWR is not set in ILAENV in that particular test. Do you still know what specific test failed?

Yes. That's exactly it. ILAENV returns 0 for this test. The first tests in cgd.in and zgd.in fail. They are not the only ones, but the first is a good place to look.

@thijssteel
Collaborator Author

Is it the responsibility of the routine to check whether the value returned by ILAENV is valid?

@thijssteel
Collaborator Author

Should be solved now.

@weslleyspereira
Collaborator

Thanks! You're the best! Now I see how we need to set the parameters in the specific "err" files.

Just for my personal record:

  • Outside the test suite: ILAENV(ISPEC=12,...,17) calls IPARMQ(ISPEC)
  • Inside the test suite: ILAENV(ISPEC=12,...,17) accesses the array IPARMS(ISPEC)
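
As a rough illustration of the two lookup paths above and of how a 0 can come back, here is a toy Python model. It is not LAPACK code: the default values, the dictionary standing in for IPARMS, the assumption that an unset entry reads as 0, and every function name are made up for this sketch, and ISPEC = 13 is simply used as the example index.

```python
# Toy model of the two lookup paths described above. This is not LAPACK code:
# the numeric defaults and every name below are hypothetical.

def iparmq_model(ispec):
    """Stand-in for IPARMQ: returns a usable, nonzero value for ISPEC = 12..17."""
    hypothetical_defaults = {12: 75, 13: 4, 14: 14, 15: 2, 16: 2, 17: 10}
    return hypothetical_defaults[ispec]


def ilaenv_library(ispec):
    """Outside the test suite: ISPEC = 12..17 is forwarded to IPARMQ."""
    if not 12 <= ispec <= 17:
        raise ValueError("only ISPEC = 12..17 is modelled here")
    return iparmq_model(ispec)


def make_ilaenv_testsuite(iparms):
    """Inside the test suite: ISPEC = 12..17 is read from an IPARMS-like table."""
    def ilaenv(ispec):
        if not 12 <= ispec <= 17:
            raise ValueError("only ISPEC = 12..17 is modelled here")
        # Assumption: an entry the test input never set reads as 0.
        return iparms.get(ispec, 0)
    return ilaenv


if __name__ == "__main__":
    unset = make_ilaenv_testsuite({})
    seeded = make_ilaenv_testsuite({i: iparmq_model(i) for i in range(12, 18)})
    print("table never filled:", unset(13))
    print("table seeded from the IPARMQ stand-in:", seeded(13))
```

In this model a caller that reads an unfilled table sees 0, which is the kind of value behind the NW=0 failure; filling the entries in the test input removes it.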

@thijssteel
Collaborator Author

Perhaps it would be a good idea to initialize that array to the values in IPARMQ to avoid issues with it in the future.

@langou
Contributor

langou commented Mar 27, 2021

> I've done some performance tests on my laptop. The following table shows execution time in seconds for different pencil sizes. [...]

Hi Thijs, what is an easy pencil? And what is a hard pencil? I am trying to time this as well. Julien.

@thijssteel
Collaborator Author

These pencils are also discussed in the paper (and in other papers for that matter).

The easy pencil is generated directly in Hessenberg-triangular form (A upper Hessenberg, B upper triangular) with randomly drawn entries. The hard pencil has the same structure but with entries A_ij = i + j and B_ij = 3i + 2j.
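
A minimal Python sketch of how such pencils could be built, for readers who want to reproduce the setup without the original test code. Several details are assumptions not stated in this thread: indices are taken as 1-based (as in Fortran), entries outside the Hessenberg or triangular pattern are set to zero, and the random entries are drawn from a standard normal distribution. The function names are illustrative.

```python
import random


def hard_pencil(n):
    """Pencil (A, B) with A_ij = i + j and B_ij = 3i + 2j, using 1-based i, j.

    A keeps only the upper Hessenberg part (zero where i > j + 1) and
    B keeps only the upper triangular part (zero where i > j).
    """
    idx = range(1, n + 1)
    a = [[float(i + j) if i <= j + 1 else 0.0 for j in idx] for i in idx]
    b = [[float(3 * i + 2 * j) if i <= j else 0.0 for j in idx] for i in idx]
    return a, b


def easy_pencil(n, seed=0):
    """Pencil (A, B) with random entries in the same Hessenberg-triangular pattern.

    The standard normal distribution is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    idx = range(1, n + 1)
    a = [[rng.gauss(0.0, 1.0) if i <= j + 1 else 0.0 for j in idx] for i in idx]
    b = [[rng.gauss(0.0, 1.0) if i <= j else 0.0 for j in idx] for i in idx]
    return a, b


if __name__ == "__main__":
    a, b = hard_pencil(4)
    for row in a:
        print(row)
    for row in b:
        print(row)
```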

You may need to adapt the parameters to achieve optimal speedup.

I'll dig up the test code later for you.

@thijssteel
Collaborator Author

@langou this is the code I used for testing: https://gist.github.com/thijssteel/6b2b7509fdc0a86fed5cc925ca268963

It should be linked with mmio.f for matrix market files.

Command line arguments:

  • Algorithm: 1 for DLAQZ0, 2 for DHGEQZ
  • Matrixtype: 1 for random, 2 for i+j, 3 for matrix market
  • N/File: when matrixtype is 1 or 2, an integer denoting the size of the pencil; when matrixtype is 3, the filenames of two matrix market files
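
To make the argument layout concrete, here is a hypothetical Python parser that mirrors the three arguments listed above. It is not the code from the gist (which is a separate test program); the function name, the returned dictionary, and the error messages are assumptions for illustration only.

```python
def parse_args(argv):
    """Interpret [algorithm, matrixtype, N or two filenames] as listed above.

    Hypothetical helper: it only mirrors the argument description and does
    not run any solver.
    """
    algorithms = {"1": "DLAQZ0", "2": "DHGEQZ"}
    matrix_types = {"1": "random", "2": "i+j", "3": "matrix market"}

    if len(argv) < 3:
        raise ValueError("expected: algorithm matrixtype N|file1 file2")
    if argv[0] not in algorithms:
        raise ValueError("algorithm must be 1 or 2")
    if argv[1] not in matrix_types:
        raise ValueError("matrixtype must be 1, 2 or 3")

    parsed = {"algorithm": algorithms[argv[0]], "matrix_type": matrix_types[argv[1]]}
    if argv[1] == "3":
        if len(argv) != 4:
            raise ValueError("matrixtype 3 needs two matrix market filenames")
        parsed["files"] = (argv[2], argv[3])
    else:
        parsed["n"] = int(argv[2])  # size of the pencil
    return parsed


if __name__ == "__main__":
    print(parse_args(["1", "2", "1000"]))
```

For example, the arguments `1 2 1000` would select DLAQZ0 on the i+j pencil of size 1000.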

@langou
Contributor

langou commented Mar 31, 2021

Thanks for sharing @thijssteel ! This is helpful! Cheers, Julien.

…tishift-aed"

This reverts commit 77a97c4, reversing
changes made to 93fd62f.
@weslleyspereira
Collaborator

Hi! I've just reverted my commits related to the GGEV subroutines. We (@langou and I) recently realized we actually needed to test GGEV3. Sorry for this confusion.

@langou
Contributor

langou commented Apr 8, 2021

Some good news to report: MathWorks took Thijs's QZ code, which is to be released in 3.10, ran it against their test suite, and it passes. So this is really great! Big thanks to MathWorks for taking the time to export the pre-release of LAPACK 3.10, compile it, sneak it under the hood of MATLAB, and run their test suite. Everything is running fine. So thanks to MathWorks, and thanks to Thijs.

@weslleyspereira
Collaborator

> I've done some performance tests on my laptop. The following table shows execution time in seconds for different pencil sizes. [...]

Hi @thijssteel. What kind of BLAS did you use to obtain these results? And what was your processor? Thanks.

@thijssteel
Collaborator Author

thijssteel commented Apr 8, 2021

Intel MKL (can't remember the version, but it probably doesn't matter that much). The machine has an Intel Xeon E5-2697 v3 CPU with 14 cores and 128 GB of RAM. The parameters were also adapted to achieve this performance (see a commit in the tree that I reverted).

That's the compute server I have access to; my PC has an Intel i7-8750H and 16 GB of RAM. I can't remember which one it was.

@weslleyspereira
Collaborator

weslleyspereira commented Apr 9, 2021

I did some experiments on my laptop too. My objective was to run this branch without optimizing the parameters.

For that I used:

My system:

  • Intel(R) Core(TM) i7-10710U CPU @ 1.10GHz (12 cores)
  • 7.8 GB RAM
  • Ubuntu 18.04.5 LTS (5.4.0-70-generic)
  • GNU compilers, GCC version 7.5.0

Results:

N     DLAQZ0 (easy pencil)  DLAQZ0 (hard pencil)  DHGEQZ (easy pencil)  DHGEQZ (hard pencil)
1000  1.5187637             2.5955450             6.8730151             6.6085497
2000  4.3298647             9.8546768             54.9083396            55.6486221
3000  19.8760378            35.1632956            198.6228615           194.1338587
4000  34.8435903            88.3386668            439.0105713           484.1612056

  • All the results represent the best time in 3 runs.

We still arrive at 12x speedup, and that is awesome!!

@langou
Contributor

langou commented Apr 9, 2021

Interesting comparison. My takeaway is that @thijssteel's laptop is faster than @weslleyspereira's laptop ;)

( More seriously, thanks for the timing @weslleyspereira, it's great. )

@thijssteel
Collaborator Author

I want to note that the original 40× speedup is somewhat unrealistic. With well-tuned parameters, that easy pencil just keeps doing AED and no sweeps are required. I reverted that commit because I don't want to optimise for such a special case.

@weslleyspereira
Collaborator

At the request of @langou, I did some additional experiments for GGES3 on my laptop.

My system:

  • Intel(R) Core(TM) i7-10710U CPU @ 1.10GHz (12 cores)
  • 7.8 GB RAM
  • Ubuntu 18.04.5 LTS (5.4.0-70-generic)
  • GNU compilers, GCC version 7.5.0

Results (best time of 3 runs of DGGES3):

N     DGGES3        DGGHD3               DLAQZ0
1000  3.1482918     1.7105347 (54.3%)    1.3828994 (43.9%)
2000  17.7040713    12.8302498 (72.5%)   4.3809202 (24.7%)
3000  69.7460524    46.6079671 (66.8%)   10.2170884 (14.6%)
4000  145.9439852   119.7468701 (82.0%)  17.1359081 (11.7%)

N     DGGES3        DGGHRD               DHGEQZ
1000  5.3760809     2.0468199 (38.1%)    3.2798471 (61.0%)
2000  57.9165448    27.2491951 (47.0%)   30.0143812 (51.8%)
3000  215.3769933   108.1476715 (50.2%)  105.4639979 (49.0%)
4000  537.6017731   291.7817445 (54.3%)  241.6236139 (44.9%)

@langou
Contributor

langou commented Apr 12, 2021

Thanks for the run @weslleyspereira !

GGES3 parameters for this experiment are: JOBVSL = 'N', JOBVSR = 'N', SORT = 'N'.

So in short, taking N=4,000 as a reference, GGES3 is 3.7x faster in v3.10 than v3.5.

This is because GGHD3 (released in v3.6, Nov 2015) is 2.4x faster than GGHRD (in v3.5).

And then LAQZ0 (will be released in v3.10, June 2021) is 14.1x faster than HGEQZ (in v3.9).

In v3.5, the times in GGHRD and HGEQZ were about 54.3% and 44.9% resp. of GGES3.

In v3.10, the times in GGHD3 and LAQZ0 are about 82.0% and 11.7% resp. of GGES3.
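
The ratios and percentages quoted here can be recomputed from the N = 4000 rows of the two DGGES3 tables earlier in this thread. A short Python sketch; the dictionary keys and function names are chosen only for this illustration:

```python
# Timings in seconds for N = 4000, copied from the two DGGES3 tables in this thread.
NEW = {"GGES3": 145.9439852, "reduction": 119.7468701, "QZ": 17.1359081}   # DGGHD3 + DLAQZ0
OLD = {"GGES3": 537.6017731, "reduction": 291.7817445, "QZ": 241.6236139}  # DGGHRD + DHGEQZ


def ratios(old, new):
    """Speedup of each phase: old time divided by new time."""
    return {key: old[key] / new[key] for key in old}


def shares(times):
    """Percentage of the total GGES3 time spent in each phase."""
    return {key: 100.0 * times[key] / times["GGES3"] for key in ("reduction", "QZ")}


if __name__ == "__main__":
    for key, value in ratios(OLD, NEW).items():
        print(f"{key:9s} speedup: {value:.1f}x")
    print("old shares (%):", {k: round(v, 1) for k, v in shares(OLD).items()})
    print("new shares (%):", {k: round(v, 1) for k, v in shares(NEW).items()})
```

The reduction and QZ shares do not add up to 100% because GGES3 also spends time in other steps.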

So, now that QZ is so fast, the reduction to Hessenberg-triangular form has become the major bottleneck of GGES3.

All these experiments are done without any tuning and using the default parameters.

@weslleyspereira
Collaborator

Yes. That's it! I should also say that DLAQZ0 uses all my cores during almost all its execution. But we may need to use a proper profiler to get more information about core occupancy and balancing.

@thijssteel
Collaborator Author

thijssteel commented Apr 12, 2021

I don't think OpenBLAS can utilize a variable number of cores. It's just a switch between threading and not threading. Using MKL reveals that the routine doesn't scale all that well (only up to about 4 cores). Most of the multiplications involve thin matrices (this may improve with other parameters).

Also don't forget about the eigenvector calculation. I haven't timed it myself, but given the lack of level 3 BLAS calls, it can easily be the major bottleneck.

@langou langou merged commit f97e867 into Reference-LAPACK:master Apr 15, 2021
christoph-conrads pushed a commit to christoph-conrads/lapack that referenced this pull request May 23, 2021