
Integrate MKL option for performance improvement on Intel Architectures. #543

Merged (10 commits) on Aug 29, 2018

Conversation

@costat (Contributor) commented Jun 13, 2018

This PR integrates new MKL features into SU2 to accelerate performance on Intel architectures. Changes 1) and 2) below are protected by a "HAVE_MKL" preprocessor flag, and the performance improvement from change 2) additionally requires enabling the "DIRECT_CALL_MKL_SEQ" compiler flag. These changes require MKL 2019 or newer, since JIT GEMM was first introduced in that release.

  1. Integrate MKL JIT GEMM to accelerate MatrixMatrix and MatrixVector products.
  2. Use LAPACK DGETRF + DGETRS in place of the Gaussian elimination in the ILU preconditioner when MKL is present.
  3. Use memcpy in the ILU's Gaussian elimination. Source/destination overlap is not a concern there, and memcpy is faster.

The changes improve Broadwell performance by up to 18% and Skylake performance by up to 28%. These improvements were measured on the Inviscid_ONERA_M6 tutorial.
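For reference, a minimal sketch of the MKL 2019 JIT GEMM API that change 1) builds on. This is not the exact SU2 integration: the 5x5 block size, the function name, and the variable names are illustrative assumptions.

```cpp
#include <mkl.h>

// Sketch: JIT-compile a dgemm kernel for one fixed small block size
// (the sizes here are assumptions, not SU2's actual block dimensions).
void jit_gemm_sketch(double* A, double* B, double* C) {
  const MKL_INT m = 5, n = 5, k = 5;

  // Create a "jitter" holding a kernel specialized for these sizes,
  // layouts, and scalars (alpha = 1, beta = 0).
  void* jitter = nullptr;
  mkl_jit_status_t status = mkl_jit_create_dgemm(
      &jitter, MKL_COL_MAJOR, MKL_NOTRANS, MKL_NOTRANS,
      m, n, k, 1.0, m, k, 0.0, m);
  if (status == MKL_JIT_ERROR) return;  // could not create the jitter

  // Fetch the generated kernel. If status was MKL_NO_JIT, this returns
  // a pointer to a standard (non-JIT-ed) gemm path instead.
  dgemm_jit_kernel_t kernel = mkl_jit_get_dgemm_ptr(jitter);

  // The kernel is created once and called many times; skipping the
  // per-call argument checking of a generic dgemm is the speed-up.
  kernel(jitter, A, B, C);

  mkl_jit_destroy(jitter);  // release the generated code
}
```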
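Change 2) swaps the hand-rolled elimination for LAPACK's factor/solve pair. A hedged sketch through MKL's LAPACKE interface follows; the row-major layout and the helper's name and signature are assumptions, and the actual PR may call the Fortran-style dgetrf_/dgetrs_ symbols instead.

```cpp
#include <mkl_lapacke.h>
#include <vector>

// Sketch: solve A * x = b for one small dense block via LU.
// A is n x n (row-major); b holds the RHS on input, x on output.
bool lu_solve_sketch(int n, std::vector<double>& A, std::vector<double>& b) {
  std::vector<lapack_int> ipiv(n);  // pivot indices from the factorization

  // DGETRF: factor A = P * L * U in place.
  lapack_int info =
      LAPACKE_dgetrf(LAPACK_ROW_MAJOR, n, n, A.data(), n, ipiv.data());
  if (info != 0) return false;  // info > 0 means a zero pivot (singular A)

  // DGETRS: forward/back substitution with the stored factors.
  // 'N' = no transpose, one right-hand side, ldb = nrhs in row-major.
  info = LAPACKE_dgetrs(LAPACK_ROW_MAJOR, 'N', n, 1, A.data(), n,
                        ipiv.data(), b.data(), 1);
  return info == 0;
}
```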

PR Checklist

Put an X by all that apply. You can fill this out after submitting the PR. If you have any questions, don't hesitate to ask! We want to help. These are a guide for you to know what the reviewers will be looking for in your contribution.

  • I am submitting my contribution to the develop branch.
  • My contribution generates no new compiler warnings (try with the '-Wall -Wextra -Wno-unused-parameter -Wno-empty-body' compiler flags).
  • My contribution is commented and consistent with SU2 style.
  • I have added a test case that demonstrates my contribution, if necessary.

@vdweide (Contributor) left a comment

A few comments/questions.

  1. It does not look like the configure script has changed, so I suppose you define -DHAVE_MKL as an additional compiler flag during configure. Would it be possible to add this option to the configure script, combined with a check that the MKL can actually be used? That would be more user-friendly in my opinion.

  2. In the calls to the actual MKL routines there is a cast to (double *), for obvious reasons. This means the MKL path cannot be used in discrete adjoint mode. I think there should be an explicit protection against using the MKL when running in discrete adjoint mode, to avoid any trouble (one possible guard is sketched after this list).

  3. I don't think this is possible, but do you see a way the MKL could be used for the discrete adjoint, i.e. when su2double is not equal to a double?

  4. Later on we may actually consider single precision as well. I suppose that would be a rather trivial change.

  5. You give speed-ups for the inviscid ONERA M6. What is the speed-up for a RANS case?
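Regarding point 2, one possible shape for such a guard is sketched below. This is an assumption about the fix, not the PR's actual code: SU2's AD builds define CODI_REVERSE_TYPE (and CODI_FORWARD_TYPE for forward mode) when su2double is not a plain double, and NativeMatrixMatrixProduct, kernel, jitter, and the block_* arrays are hypothetical stand-ins.

```cpp
// Sketch only: take the MKL fast path when su2double is a plain double.
// In an AD build the (double *) cast would be invalid, and the CoDiPack
// tape must see every operation, so fall back to the native loops.
#if defined(HAVE_MKL) && !defined(CODI_REVERSE_TYPE) && !defined(CODI_FORWARD_TYPE)
  kernel(jitter, block_a, block_b, block_c);            // JIT-ed dgemm
#else
  NativeMatrixMatrixProduct(block_a, block_b, block_c); // hypothetical fallback
#endif
```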

@talbring (Member) commented:

@costat: Are you still planning to address the comments/questions from @vdweide? Otherwise we cannot make any progress on reviewing/accepting this PR.

@costat (Contributor, Author) commented Jul 24, 2018

@talbring @vdweide Yes I am still planning to address these comments and questions. Just before @vdweide commented I went on a long vacation, and have just returned. I'll follow up shortly.

@talbring (Member) commented:

Any news here?

@costat (Contributor, Author) commented Aug 14, 2018

  1. & 2. I believe I have addressed these comments in my subsequent commits.
  2. I will need some time to look into this, but it would be orthogonal to what was done in this work.
  3. This is a trivial change, and I'll be happy to make it when single precision support is added. The JIT features support both single and double precision.
  4. Could you suggest a RANS case for me to run to check performance?

@vdweide (Contributor) left a comment

1: Has been addressed, but as far as I can see you made the changes directly in configure. However, configure is generated automatically from configure.ac by autoconf, so you should make the changes in configure.ac and regenerate configure from it.
2: Looks good.
3: I know this is indeed something different, but it would be very valuable for the high-order DG solver. If you can make the MKL work for the discrete adjoint, a factor-of-10 speed-up can be obtained. But to do so, we need to involve a few more people.
4: That's what I thought as well.
5: The viscous M6 case, TestCases/rans/oneram6, would be a good start. If you would like to test a larger case, we can provide you one.

@vdweide (Contributor) commented Aug 28, 2018

I have one more question. Why do you explicitly add the MKL libraries to LIBS? Isn't it sufficient to use the compiler flag -mkl=sequential? That looks a bit easier to me.

@costat (Contributor, Author) commented Aug 28, 2018

You're correct, that is easier, but MKL does not require the Intel compiler, and I didn't want to restrict the compiler choice for the MKL-enabled version. As far as I know, the -mkl flag is unique to the Intel compiler.

@vdweide (Contributor) commented Aug 29, 2018

The configure looks good to me now. Did you run the test for the viscous ONERA M6? It would be good to know what the speed-up is here.

Anyway, if you can merge with the latest develop version, it can be merged in as far as I am concerned.

@costat (Contributor, Author) commented Aug 29, 2018

I did run the viscous ONERA M6 case. The speed-up with the default config was marginal (2-3%). I also tried with multigrid enabled, in which case the MKL version was 10% faster. Overall, the bottlenecks for the RANS case were different from those in the inviscid ONERA case. I'll look at what can be done here next.

@costat (Contributor, Author) commented Aug 29, 2018

I've merged the latest develop version. Thanks for your review.

@vdweide merged commit 2dc7be3 into su2code:develop on Aug 29, 2018

@vdweide (Contributor) commented Aug 29, 2018

Looks all good to me. Merging in.

@vdweide (Contributor) commented Aug 29, 2018

@costat, the remaining question is whether we can do something with the MKL for the discrete adjoint. Are you still interested in looking into this?

@costat (Contributor, Author) commented Aug 29, 2018

@vdweide Yes, definitely.

@costat deleted the integrate_mkl branch on August 29, 2018 at 17:13.
CatarinaGarbacz pushed a commit that referenced this pull request on Aug 26, 2019: "Integrate MKL option for performance improvement on Intel Architectures."