
Integrate MKL option for performance improvement on Intel Architectures. #543

Merged (10 commits) on Aug 29, 2018

Conversation

@costat (Contributor) commented Jun 13, 2018

This PR integrates new MKL features into SU2 to accelerate performance on Intel architectures. Changes 1) and 2) below are protected by a "HAVE_MKL" preprocessor flag, and the performance improvement from change 2) additionally requires enabling the "DIRECT_CALL_MKL_SEQ" compiler flag. These changes require MKL 2019 or newer, since JIT GEMM was first introduced in that release.

  1. Integrate MKL JIT GEMM to accelerate MatrixMatrix and MatrixVector products.
  2. Use LAPACK DGETRF + DGETRS in place of the Gaussian elimination in the ILU preconditioner when MKL is present.
  3. Use memcpy in the ILU's Gaussian elimination. Source/destination overlap is not a concern there, and memcpy is faster.

The changes improve Broadwell performance by up to 18% and Skylake performance by up to 28%. These improvements were measured on the Inviscid_ONERA_M6 tutorial.
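For reference, a minimal sketch of the MKL 2019 JIT GEMM API that change 1) builds on. This is not the exact SU2 integration: the 5x5 block size, the function name, and the variable names are illustrative assumptions.

```cpp
#include <mkl.h>

// Sketch: JIT-compile a dgemm kernel for one fixed small block size
// (the sizes here are assumptions, not SU2's actual block dimensions).
void jit_gemm_sketch(double* A, double* B, double* C) {
  const MKL_INT m = 5, n = 5, k = 5;

  // Create a "jitter" holding a kernel specialized for these sizes,
  // layouts, and scalars (alpha = 1, beta = 0).
  void* jitter = nullptr;
  mkl_jit_status_t status = mkl_jit_create_dgemm(
      &jitter, MKL_COL_MAJOR, MKL_NOTRANS, MKL_NOTRANS,
      m, n, k, 1.0, m, k, 0.0, m);
  if (status == MKL_JIT_ERROR) return;  // could not create the jitter

  // Fetch the generated kernel. If status was MKL_NO_JIT, this returns
  // a pointer to a standard (non-JIT-ed) gemm path instead.
  dgemm_jit_kernel_t kernel = mkl_jit_get_dgemm_ptr(jitter);

  // The kernel is created once and called many times; skipping the
  // per-call argument checking of a generic dgemm is the speed-up.
  kernel(jitter, A, B, C);

  mkl_jit_destroy(jitter);  // release the generated code
}
```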
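Change 2) swaps the hand-rolled elimination for LAPACK's factor/solve pair. A hedged sketch through MKL's LAPACKE interface follows; the row-major layout and the helper's name and signature are assumptions, and the actual PR may call the Fortran-style dgetrf_/dgetrs_ symbols instead.

```cpp
#include <mkl_lapacke.h>
#include <vector>

// Sketch: solve A * x = b for one small dense block via LU.
// A is n x n (row-major); b holds the RHS on input, x on output.
bool lu_solve_sketch(int n, std::vector<double>& A, std::vector<double>& b) {
  std::vector<lapack_int> ipiv(n);  // pivot indices from the factorization

  // DGETRF: factor A = P * L * U in place.
  lapack_int info =
      LAPACKE_dgetrf(LAPACK_ROW_MAJOR, n, n, A.data(), n, ipiv.data());
  if (info != 0) return false;  // info > 0 means a zero pivot (singular A)

  // DGETRS: forward/back substitution with the stored factors.
  // 'N' = no transpose, one right-hand side, ldb = nrhs in row-major.
  info = LAPACKE_dgetrs(LAPACK_ROW_MAJOR, 'N', n, 1, A.data(), n,
                        ipiv.data(), b.data(), 1);
  return info == 0;
}
```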

PR Checklist

Put an X by all that apply. You can fill this out after submitting the PR. If you have any questions, don't hesitate to ask! We want to help. These are a guide for you to know what the reviewers will be looking for in your contribution.

  • I am submitting my contribution to the develop branch.
  • My contribution generates no new compiler warnings (try with the '-Wall -Wextra -Wno-unused-parameter -Wno-empty-body' compiler flags).
  • My contribution is commented and consistent with SU2 style.
  • I have added a test case that demonstrates my contribution, if necessary.

@vdweide (Contributor) left a comment

A few comments/questions.

  1. It does not look like the configure script has changed, so I suppose you define -DHAVE_MKL as an additional compiler flag during configure. Would it be possible to add this option to the configure script, combined with a check that the MKL can actually be used? That would be more user-friendly in my opinion.

  2. In the calls to the actual MKL routines there is a cast to (double *), for obvious reasons. This means the MKL path cannot be used in discrete adjoint mode. I think there should be an explicit protection against using the MKL when running in discrete adjoint mode, to avoid any trouble (one possible guard is sketched after this list).

  3. I don't think this is possible, but do you see a way the MKL could be used for the discrete adjoint, i.e. when su2double is not equal to a double?

  4. Later on we may actually consider single precision as well. I suppose that would be a rather trivial change.

  5. You give speed-ups for the inviscid ONERA M6. What is the speed-up for a RANS case?
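Regarding point 2, one possible shape for such a guard is sketched below. This is an assumption about the fix, not the PR's actual code: SU2's AD builds define CODI_REVERSE_TYPE (and CODI_FORWARD_TYPE for forward mode) when su2double is not a plain double, and NativeMatrixMatrixProduct, kernel, jitter, and the block_* arrays are hypothetical stand-ins.

```cpp
// Sketch only: take the MKL fast path when su2double is a plain double.
// In an AD build the (double *) cast would be invalid, and the CoDiPack
// tape must see every operation, so fall back to the native loops.
#if defined(HAVE_MKL) && !defined(CODI_REVERSE_TYPE) && !defined(CODI_FORWARD_TYPE)
  kernel(jitter, block_a, block_b, block_c);            // JIT-ed dgemm
#else
  NativeMatrixMatrixProduct(block_a, block_b, block_c); // hypothetical fallback
#endif
```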

@talbring (Member) commented:

@costat: Are you still planning to address the comments/questions from @vdweide? Otherwise we cannot make any progress on reviewing/accepting this PR.

@costat (Contributor, Author) commented Jul 24, 2018

@talbring @vdweide Yes I am still planning to address these comments and questions. Just before @vdweide commented I went on a long vacation, and have just returned. I'll follow up shortly.

@talbring (Member) commented:

Any news here?

@costat (Contributor, Author) commented Aug 14, 2018

  1. & 2. I believe I have addressed these comments in my subsequent commits.
  2. I will need some time to look into this, but it would be orthogonal to what was done in this work.
  3. This is a trivial change, and I'll be happy to make it when single precision support is added. The JIT features support both single and double precision.
  4. Could you suggest a RANS case for me to run to check performance?

@vdweide (Contributor) left a comment

1: Has been addressed, but as far as I can see you made the changes directly in configure. However, configure is generated automatically from configure.ac by autoconf, so you should make the changes in configure.ac and regenerate configure from it.
2: Looks good.
3: I know this is indeed something different, but it would be very valuable for the high-order DG solver. If you can make the MKL work for the discrete adjoint, a factor-of-10 speed-up can be obtained. But to do so, we need to involve a few more people.
4: That's what I thought as well.
5: The viscous M6 case, TestCases/rans/oneram6, would be a good start. If you would like to test a larger case, we can provide you one.

@vdweide (Contributor) commented Aug 28, 2018

I have one more question. Why do you explicitly add the MKL libraries to LIBS? Isn't it sufficient to use the compiler flag -mkl=sequential? That looks a bit easier to me.

@costat (Contributor, Author) commented Aug 28, 2018

You're correct, that is easier, but MKL does not require the Intel compiler, and I didn't want to restrict the compiler choice for the MKL-enabled version. As far as I know, the -mkl flag is unique to the Intel compiler.

@vdweide (Contributor) commented Aug 29, 2018

The configure looks good to me now. Did you run the test for the viscous ONERA M6? It would be good to know what the speed-up is here.

Anyway, if you can merge with the latest develop version, it can be merged in as far as I am concerned.

@costat (Contributor, Author) commented Aug 29, 2018

I did run the viscous ONERA M6 case. The speed-up with the default config was marginal (2-3%). I also tried with multigrid enabled, in which case the MKL version was 10% faster. Overall, the bottlenecks for the RANS case were different from those in the inviscid ONERA case. I'll look at what can be done here next.

@costat (Contributor, Author) commented Aug 29, 2018

I've merged the latest develop version. Thanks for your review.

@vdweide merged commit 2dc7be3 into su2code:develop on Aug 29, 2018

@vdweide (Contributor) commented Aug 29, 2018

Looks all good to me. Merging in.

@vdweide (Contributor) commented Aug 29, 2018

@costat, the remaining question is whether we can do something with the MKL for the discrete adjoint. Are you still interested in looking into this?

@costat (Contributor, Author) commented Aug 29, 2018

@vdweide Yes, definitely.

@costat deleted the integrate_mkl branch on August 29, 2018 at 17:13.
CatarinaGarbacz pushed a commit that referenced this pull request on Aug 26, 2019: "Integrate MKL option for performance improvement on Intel Architectures."