# Improve PRKs #6162
Comments
**Add new PRKs**

Adds four new PRKs (Parallel Research Kernels): DGEMM, Nstream, PIC, and Sparse.

Issue #6162 has overall notes for improving PRKs; some apply to the PRKs in this PR as well. Also, #6152 and #6153 recently updated the existing Transpose and Stencil implementations.

Open questions / random notes:

- All but PIC are naively distributed (there is no optimization whatsoever). Should we keep it that way, or should we avoid creating distributed versions without properly optimizing them? PIC is not distributed at all.
- PIC code style needs a revision. I translated it from the OpenMP version, trying to do things the Chapel way as much as I could, but the variable names are almost identical to the OpenMP variables. That is not the convention the Chapel community is adopting, so those should be changed to camelCase where appropriate (possibly in a cleanup PR).
- PIC uses random_draw.c from the PRK repo almost as-is. I needed to make small adjustments so that variables are always declared at the beginning of blocks. It would be nice to have it implemented in Chapel as well, but I'd say that is very low priority at this point.
- DGEMM: I used unbounded ranges in deeply nested loops. I wonder if that has any performance problems compared to bounded ranges?
- DGEMM: In the CHIUW paper this version performed at ~60% of OpenMP, but using C arrays for the tile arrays makes Chapel performance almost identical to OpenMP. I wonder if that implementation of DGEMM would be frowned upon.
- On a similar note, should we have different flavors of PRKs, such as a coforall-based Nstream? (A sketch of that idea follows this description.)

Testing: This PR only adds new tests to studies/prk. That folder was tested locally with the standard linux64 and standard GASNet configs.

Todo:

- [x] Cosmetic changes/more Chapel-ification in PIC
- [x] Investigate a TODO comment regarding unexpected behavior in PIC

[Contributed by @e-kayrakli]

[Reviewed by @ben-albrecht]
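A minimal, hypothetical sketch of what a coforall-based Nstream (triad) flavor could look like, for discussion only; the names `m`, `alpha`, and `computeChunk` are placeholders and are not taken from the actual PR:

```chapel
// Hypothetical coforall-based Nstream triad: tasks are created explicitly
// and each task updates a manually computed chunk of the vectors.
config const m = 1_000_000,   // vector length (name is a placeholder)
             alpha = 3.0;

var A, B, C: [0..#m] real;

B = 2.0;
C = 1.0;

coforall tid in 0..#here.maxTaskPar {
  // each task walks only its own block of indices
  for i in computeChunk(0..#m, tid, here.maxTaskPar) do
    A[i] = B[i] + alpha * C[i];
}

// naive block partitioning of a range among nTasks tasks (this helper is a
// placeholder -- Chapel's RangeChunk module offers similar functionality)
proc computeChunk(r: range, tid: int, nTasks: int) {
  const lo = r.low + (r.size * tid) / nTasks;
  const hi = r.low + (r.size * (tid+1)) / nTasks - 1;
  return lo..hi;
}
```

The same effect could also be had with a `forall` over a chunked iterator; the point of the sketch is only to show what an explicit-task (coforall) flavor would look like next to the promoted/forall version.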
**Trivial housekeeping in PRK**

Addresses some of the trivial issues in #6162:

- Header comments with contributors
- Dir name updates
- Change `validate` flag to `correctness`

test/studies/prk passes with the standard linux64 config.

[Reviewed by @ben-albrecht]
I have been working on Transpose recently and wanted to capture what is missing in the current implementation: the PRK specification and the reference MPI1 implementation use column-major arrays for both matrices and a column-wise data decomposition. The output array is then accessed in column-major order, while the input is accessed in row-major order. The current Transpose implementation in Chapel does things rather haphazardly in this context. Given that there is no native column-major layout in Chapel (yet?), I think the arrays can be distributed with a row-major decomposition and the access orders can be reversed (row-major on the output array) to emulate something close to the reference implementation and the specs.
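A minimal sketch of that idea, assuming a Block-distributed (row-major) layout for both matrices; the array and variable names here are placeholders rather than the ones in the current code:

```chapel
use BlockDist;

config const order = 1024;

// row-major, block-distributed domain shared by input and output matrices
const Dom = {0..#order, 0..#order}
              dmapped Block(boundingBox={0..#order, 0..#order});

var A, B: [Dom] real;   // A: input, B: output (transpose of A)

// initialize the input matrix
forall (i, j) in Dom do
  A[i, j] = i*order + j;

// Iterate in row-major order over the *output* array B while reading the
// input A column-by-column -- the reverse of the reference code's access
// pattern, compensating for the row-major decomposition.
forall (i, j) in Dom do
  B[i, j] = A[j, i];
```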
@ben-albrecht, looking at the issue again I think there are a few things that can be added:

I don't think I can modify the original post, so you can interpret these however you wish and update it.
@e-kayrakli - Updated. Let me know if you see anything that could be updated further.
Sorry, I wasn't aware of the existence of this issue. |
Here is a meta-issue to track progress on the implementations of Intel's Parallel Research Kernels in Chapel.
### Resources

### General
- `contributed by` header comments, giving credit to authors and contributors (Trivial housekeeping in PRK #6405)
- `--correctness` flag rather than `--validate`, for clarity (Trivial housekeeping in PRK #6405); see the sketch below
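As a rough illustration of these two conventions (the header wording, flag default, and output line are assumptions, not taken from an actual kernel):

```chapel
/* Chapel port of the <kernel name> Parallel Research Kernel.

   Contributed by: <author(s)>
   Based on the Intel PRK reference implementation.
*/

// Replaces the older `validate` flag name: when true, only check the
// result and suppress the timing/rate report.
config const correctness = false;

proc main() {
  // ... kernel setup, timed iterations, and verification go here ...
  if !correctness then
    writeln("Rate (MB/s): ..., Avg time (s): ...");
}
```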
### Implementations

**Stencil**

- Hoist `chpl__getPrivatizedCopy` (Hoist chpl__getPrivatizedCopy #6184)

**Transpose**
**Synch_p2p**

**DGEMM**
- DGEMM is distributed in its current state, but it is not SUMMA. Note that the PRK specs do not specify an algorithm, but the MPI1 implementation is based on SUMMA.
- Maintaining multiple implementations would be useful (see @e-kayrakli's comment below).
**PIC**

**Sparse**

**NStream**

**AMR**

- A variation of Stencil that spawns subgrids to emulate adaptive mesh refinement
**Branch**

- A very simple kernel that tests branch performance

**Random**

**Reduce**

- Note: "Reduce" may be a misnomer, as it seemingly performs an element-wise vector addition where the vectors reside at specific parts of memory.