
Improve PRKs #6162

Open · 3 of 17 tasks

ben-albrecht opened this issue May 3, 2017 · 4 comments

@ben-albrecht (Member) commented May 3, 2017

Here is a meta-issue to track progress on the Chapel implementations of Intel's Parallel Research Kernels (PRKs).

Resources

General

Implementations

Stencil

Transpose

  • Rewrite distributed implementation to reflect reference version
    • Current implementation is naive blockDist
  • Enable multilocale performance testing

Synch_p2p

  • Current implementation does not reflect reference version

DGEMM

DGEMM is distributed in its current state, but it is not SUMMA. Note that the PRK spec does not specify an algorithm, but the MPI1 implementation is based on SUMMA (a sketch of the current naive approach appears below).

Maintaining multiple implementations would be useful (see @e-kayrakli's comment below)
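For context, a minimal sketch of the kind of naive blockDist multiply described above (`order` and the loop structure are illustrative, not the actual benchmark code):

```chapel
use BlockDist;

config const order = 512;

const Space = {0..#order, 0..#order};
const D = Space dmapped Block(boundingBox=Space);

var A, B, C: [D] real;

// Each task iterates over its locale's chunk of C, but the reads of
// A[i,k] and B[k,j] may each be a fine-grained remote access
forall (i, j) in D do
  for k in 0..#order do
    C[i, j] += A[i, k] * B[k, j];
```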

PIC

  • Distributed implementation
  • Performance testing

Sparse

  • Performance testing

NStream

  • Performance testing

AMR

A variation of Stencil that spawns subgrids to emulate adaptive mesh refinement

  • Implement

Branch

A very simple kernel that tests branch performance

  • Implement

Random

  • Implement

Reduce

Note: "Reduce" may be a misnomer as it seemingly does a element-wise vector addition where vectors are at specific parts of the memory.

  • Implement
@e-kayrakli mentioned this issue May 3, 2017
ben-albrecht added a commit that referenced this issue May 25, 2017
Add new PRKs

Adds four new PRKs (Parallel Research Kernels): DGEMM, NStream,
PIC, and Sparse.

Issue #6162 has overall notes for improving the PRKs; some of them
apply to the PRKs in this PR as well. Also, #6152 and #6153 recently
updated the existing Transpose and Stencil implementations.

Open questions / random notes:
- All but PIC are naively distributed (there is no optimization
  whatsoever). Should we keep them that way, or avoid creating
  distributed versions without properly optimizing them? PIC is not
  distributed at all.
- PIC code style needs a revision. I translated it from the OpenMP
  version, trying to do things the Chapel way as much as I could, but
  the variable names are almost identical to the OpenMP variables. I
  think that's not the convention the Chapel community is adopting, so
  those should be changed to camelCase where appropriate. (Possibly in
  a cleanup PR)
- PIC uses random_draw.c from the PRK repo almost as-is. I needed to
  make small adjustments so that variables are always declared at the
  beginning of blocks. It would be cool to have it implemented in
  Chapel as well, but I'd say that is very low priority at this point.
- DGEMM: I used unbounded ranges in deeply nested loops. I wonder
  whether they have any performance cost compared to bounded ranges
  (see the sketch after this commit message).
- DGEMM: In the CHIUW paper this version performed at ~60% of the
  OpenMP version's rate, but using C arrays for the tile arrays makes
  Chapel performance almost identical to OpenMP. I wonder if that
  implementation of DGEMM would be frowned upon.
- On a similar note, should we have different flavors of the PRKs,
  such as a coforall-based NStream?

Testing:
This PR only adds new tests to studies/prk. That folder was tested
locally with the standard linux64 and standard GASNet configs.

Todo:
- [x] Cosmetic changes/more Chapel-ification in PIC
- [x] Investigate a TODO comment regarding unexpected behavior in PIC

[Contributed by @e-kayrakli]
[Reviewed by @ben-albrecht]
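As an aside on the unbounded-range question in the commit message above, here is a toy comparison of the two loop forms in Chapel (not code from the PR):

```chapel
var A: [0..#8] int;

// Bounded range: both endpoints are known when the loop starts
for i in 0..7 do
  A[i] = i;

// Unbounded range: `0..` has no high bound; zippering it with the
// bounded iterand A determines when the loop stops
for (a, i) in zip(A, 0..) do
  a = i;
```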
ben-albrecht added a commit that referenced this issue Jun 8, 2017
Trivial housekeeping in PRK

Addresses some of the trivial issues in #6162:

- Header comments with contributors
- Dir name updates
- Change `validate` flag to `correctness`

test/studies/prk passes with standard linux64 config

[Reviewed by @ben-albrecht]
@e-kayrakli (Contributor) commented:

I have been working on Transpose recently and wanted to capture what is missing in the current implementation:

The PRK specification and the reference MPI1 implementation use column-major arrays for both matrices and column-wise data decomposition; the output array is then accessed in column-major order while the input is accessed in row-major order. The current Transpose implementation in Chapel does things rather haphazardly in this context. Given that there is no native column-major layout in Chapel (yet?), I think the arrays can be distributed with a row-major decomposition and the access orders can be reversed (row-major on the output array) to emulate something close to the reference implementation and the specs. A rough sketch of that idea follows.
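A minimal sketch, assuming a row-wise decomposition via Block's `targetLocales` argument (`Input`, `Output`, and `order` are illustrative names, not the benchmark code):

```chapel
use BlockDist;

config const order = 64;

const Space = {0..#order, 0..#order};

// Decompose by rows only: lay the target locales out as numLocales x 1
const rowTargets = reshape(Locales, {0..#numLocales, 0..0});
const D = Space dmapped Block(boundingBox=Space, targetLocales=rowTargets);

var Input, Output: [D] real;

// Write Output in row-major order; the transposed reads of Input then
// gather columns, roughly mirroring the reference scheme with the
// access orders reversed
forall (i, j) in D do
  Output[i, j] = Input[j, i];
```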

@e-kayrakli (Contributor) commented:

@ben-albrecht, looking at the issue again, I think there are a few things that can be added:

  • Missing PRKs, for completeness (some may be more important than others, like AMR):
    • AMR: A variation of Stencil that spawns subgrids to emulate adaptive mesh refinement
    • Branch: Very simple one that tests branch performance
    • Random: Another simple one
    • Reduce: At least a straightforward implementation should be simple. ("Reduce" may be a misnomer, as it seemingly does an element-wise vector addition where the vectors live at specific locations in memory.)
  • More clarification for DGEMM: DGEMM is distributed in its current state, but it is not SUMMA. Note that the PRK spec does not specify an algorithm, but the MPI1 implementation is based on SUMMA. FWIW, in a more proof-of-concept implementation I observed significant speedups, but not-so-good scalability, with a more naive approach where remote data is localized in bulk (sketched below). I think it is generally good to have multiple versions (including the current one, to see fine-grained communication performance), especially for something as important as matrix multiplication.
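A minimal sketch of that bulk-localization idea, under assumed names (`order`, `localB`); the actual proof-of-concept may differ:

```chapel
use BlockDist;

config const order = 256;

const Space = {0..#order, 0..#order};
const D = Space dmapped Block(boundingBox=Space);

var A, B, C: [D] real;

coforall loc in Locales do on loc {
  // One aggregated transfer of B per locale instead of per-element
  // remote reads: trading memory for communication (reads of A may
  // still be remote in this simplistic version)
  const localB: [Space] real = B;

  // Compute only the locally-owned block of C
  forall (i, j) in D.localSubdomain() do
    for k in 0..#order do
      C[i, j] += A[i, k] * localB[k, j];
}
```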

I don't think I can modify the original post, so you can interpret these however you wish and update it.

@ben-albrecht (Member, Author) commented:

@e-kayrakli - Updated. Let me know if you see anything that could be updated further.

@caizixian (Member) commented:

Sorry, I wasn't aware of the existence of this issue.
FWIW, the performance trend of Transpose as of 1.17.1 can be found in #11031.
