Integrate MKL option for performance improvement on Intel Architectures. #543
Conversation
… Products. 2) Use LAPACK DGETRF + DGETRS in place of Gaussian Elimination when MKL is present. With DirectCall enabled this is much faster. 3) Use memcpy in GE. Source/dest overlap is not a concern and this is faster.
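The DGETRF + DGETRS substitution described in change 2) can be sketched as follows. This is a hedged illustration, not the PR's actual code: the function name `SolveLinearSystem` and the row-major layout are assumptions, and the `HAVE_MKL` branch uses the standard LAPACKE interface. The fallback branch also shows the memcpy-based row swap from change 3), which is safe here because source and destination rows never overlap.

```cpp
#include <cmath>
#include <cstring>
#include <vector>

#ifdef HAVE_MKL
#include "mkl.h"  // provides LAPACKE_dgetrf/dgetrs
#endif

// Solve A*x = b in place (A is n x n, row-major; matA is overwritten
// with its LU factors, rhs with the solution). Returns false if singular.
bool SolveLinearSystem(int n, std::vector<double>& matA, std::vector<double>& rhs) {
#ifdef HAVE_MKL
  // LAPACK path: factor with dgetrf, then back-substitute with dgetrs.
  std::vector<lapack_int> ipiv(n);
  lapack_int info = LAPACKE_dgetrf(LAPACK_ROW_MAJOR, n, n, matA.data(), n, ipiv.data());
  if (info != 0) return false;
  info = LAPACKE_dgetrs(LAPACK_ROW_MAJOR, 'N', n, 1, matA.data(), n,
                        ipiv.data(), rhs.data(), 1);
  return info == 0;
#else
  // Portable fallback: Gaussian elimination with partial pivoting.
  for (int k = 0; k < n; ++k) {
    int piv = k;
    for (int i = k + 1; i < n; ++i)
      if (std::fabs(matA[i*n + k]) > std::fabs(matA[piv*n + k])) piv = i;
    if (matA[piv*n + k] == 0.0) return false;
    if (piv != k) {
      // Rows never overlap, so memcpy through a temporary buffer is safe.
      std::vector<double> tmp(n);
      std::memcpy(tmp.data(), &matA[k*n], n * sizeof(double));
      std::memcpy(&matA[k*n], &matA[piv*n], n * sizeof(double));
      std::memcpy(&matA[piv*n], tmp.data(), n * sizeof(double));
      std::swap(rhs[k], rhs[piv]);
    }
    for (int i = k + 1; i < n; ++i) {
      double f = matA[i*n + k] / matA[k*n + k];
      for (int j = k; j < n; ++j) matA[i*n + j] -= f * matA[k*n + j];
      rhs[i] -= f * rhs[k];
    }
  }
  for (int i = n - 1; i >= 0; --i) {
    for (int j = i + 1; j < n; ++j) rhs[i] -= matA[i*n + j] * rhs[j];
    rhs[i] /= matA[i*n + i];
  }
  return true;
#endif
}
```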
A few comments/questions.
- It does not look like the configure script has changed, so I suppose you define -DHAVE_MKL as an additional compiler flag during configure. Would it be possible to add this option to the configure script, combined with a check that MKL can actually be used? That would be more user-friendly in my opinion.
- In the calls to the actual MKL routines there is a cast to (double *), for obvious reasons. This means it cannot be used in discrete adjoint mode. I think there should be explicit protection against using MKL in discrete adjoint mode to avoid any trouble.
- I don't think this is possible, but do you see a way the MKL could be used for the discrete adjoint, i.e. when su2double is not equal to double?
- Later on we may actually consider single precision as well. I suppose that would be a rather trivial change.
- You give speed-ups for the inviscid ONERA M6. What is the speed-up for a RANS case?
…igh enough for features used in HAVE_MKL regions.
Any news here?
1: Has been addressed, but as far as I can see you made the changes directly in configure. However, configure is generated automatically from configure.ac by autoconf, so you should make the changes in configure.ac and regenerate configure.
2: Looks good.
3: (What you also called 2.) I know this is indeed something different, but it would be very valuable for the high order DG solver. If you can make MKL work for the discrete adjoint, a factor 10 speed-up can be obtained. But in order to do so, we need to involve a few more people.
4: (What you called 3.) That's what I thought as well.
5: (What you called 4.) The viscous M6 case, TestCases/rans/oneram6, would be a good start. If you would like to test on a larger case, we can provide you with one.
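For point 1, the option could live in configure.ac along the lines below. This is only a sketch using standard autoconf macros (AC_ARG_ENABLE, AC_CHECK_HEADER, AC_DEFINE); the option name, variable names, and error message are assumptions, not the PR's actual configure.ac changes.

```m4
# Hypothetical configure.ac fragment: opt-in flag plus a sanity check.
AC_ARG_ENABLE(mkl,
    AS_HELP_STRING([--enable-mkl], [build with Intel MKL support (default = no)]),
    [build_mkl=$enableval], [build_mkl="no"])

if test "$build_mkl" = "yes"; then
    AC_CHECK_HEADER([mkl.h],
        [AC_DEFINE([HAVE_MKL], [1], [Intel MKL is available])],
        [AC_MSG_ERROR([--enable-mkl was given but mkl.h was not found])])
fi
```

Running autoconf on configure.ac then regenerates configure with the new option, so users get both `--enable-mkl` and an early failure when MKL is not usable.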
I have one more question. Why do you explicitly add the MKL libraries to LIBS? Isn't it sufficient to use the compiler flag -mkl=sequential? That looks a bit easier to me.
You're right that it is easier, but MKL does not require the Intel compiler, and I didn't want to restrict the compiler choice for the MKL-enabled version. As far as I know, the -mkl flag is unique to the Intel compiler.
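To illustrate the trade-off discussed above: `-mkl=sequential` is an Intel-compiler shorthand, while the compiler-agnostic route spells out the libraries in LIBS. The sketch below uses Intel's commonly documented sequential LP64 link line; the exact paths and `MKLROOT` are assumptions that depend on the local installation.

```sh
# Intel compiler only (icc/icpc): the shorthand.
#   CXXFLAGS="... -mkl=sequential"

# Compiler-agnostic alternative, e.g. for g++ (LP64, sequential threading):
LIBS="-L${MKLROOT}/lib/intel64 -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread -lm -ldl"
```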
The configure script looks good to me now. Did you run the test for the viscous ONERA M6? It would be good to know what the speed-up is there. In any case, once you merge with the latest develop version, it can be merged in as far as I am concerned.
I did run the viscous ONERA M6 case. The speed-up with the default config was marginal (2-3%). I also tried with multigrid enabled, in which case the MKL version was 10% faster. Overall, the bottlenecks for the RANS case were different from those of the inviscid ONERA case. I'll look at what can be done there next.
I've merged the latest develop version. Thanks for your review.
Looks all good to me. Merging in. |
@costat, the remaining question is whether we can do something with the MKL for the discrete adjoint. Are you still interested in looking into this? |
@vdweide Yes, definitely. |
Integrate MKL option for performance improvement on Intel Architectures.
This PR integrates new MKL features into SU2 to accelerate performance on Intel architectures. Changes 1) and 2) below are protected by a "HAVE_MKL" preprocessor flag. To observe the performance improvement from change 2), the "DIRECT_CALL_MKL_SEQ" compiler flag must be enabled. These changes require MKL 2019 or newer, since JIT GEMM was first introduced in that release.
The changes improve Broadwell performance by up to 18% and Skylake performance by up to 28%. These improvements were measured on the Inviscid_ONERA_M6 tutorial.
PR Checklist
Put an X by all that apply. You can fill this out after submitting the PR. If you have any questions, don't hesitate to ask! We want to help. These are a guide for you to know what the reviewers will be looking for in your contribution.