
Improve PRKs #6162

Open · 3 of 17 tasks

ben-albrecht opened this issue May 3, 2017 · 4 comments

@ben-albrecht (Member) commented May 3, 2017

Here is a meta-issue to track progress on the Chapel implementations of Intel's Parallel Research Kernels (PRKs).

Resources

General

Implementations

Stencil

Transpose

  • Rewrite distributed implementation to reflect reference version
    • Current implementation is naive blockDist
  • Enable multilocale performance testing

Synch_p2p

  • Current implementation does not reflect reference version

DGEMM

DGEMM is distributed in its current state, but it is not SUMMA. Note that the PRK spec does not specify an algorithm, but the MPI1 implementation is based on SUMMA (a sketch of the current naive approach appears below).

Maintaining multiple implementations would be useful (see @e-kayrakli's comment below)
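For context, a minimal sketch of the kind of naive blockDist multiply described above (`order` and the loop structure are illustrative, not the actual benchmark code):

```chapel
use BlockDist;

config const order = 512;

const Space = {0..#order, 0..#order};
const D = Space dmapped Block(boundingBox=Space);

var A, B, C: [D] real;

// Each task iterates over its locale's chunk of C, but the reads of
// A[i,k] and B[k,j] may each be a fine-grained remote access
forall (i, j) in D do
  for k in 0..#order do
    C[i, j] += A[i, k] * B[k, j];
```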

PIC

  • Distributed implementation
  • Performance testing

Sparse

  • Performance testing

NStream

  • Performance testing

AMR

A variation of Stencil that spawns subgrids to emulate adaptive mesh refinement

  • Implement

Branch

A very simple kernel that tests branch performance

  • Implement

Random

  • Implement

Reduce

Note: "Reduce" may be a misnomer as it seemingly does a element-wise vector addition where vectors are at specific parts of the memory.

  • Implement
@e-kayrakli mentioned this issue May 3, 2017
ben-albrecht added a commit that referenced this issue May 25, 2017
Add new PRKs

Adds four new PRKs (Parallel Research Kernels): DGEMM, NStream,
PIC, and Sparse.

Issue #6162 has overall notes for improving the PRKs; some of them
apply to the PRKs in this PR as well. Also, #6152 and #6153 recently
updated the existing Transpose and Stencil implementations.

Open questions / random notes:
- All but PIC are naively distributed (there is no optimization
  whatsoever). Should we keep them that way, or avoid creating
  distributed versions without properly optimizing them? PIC is not
  distributed at all.
- PIC code style needs a revision. I translated it from the OpenMP
  version, trying to do things the Chapel way as much as I could, but
  the variable names are almost identical to the OpenMP variables. I
  think that's not the convention the Chapel community is adopting, so
  those should be changed to camelCase where appropriate. (Possibly in
  a cleanup PR)
- PIC uses random_draw.c from the PRK repo almost as-is. I needed to
  make small adjustments so that variables are always declared at the
  beginning of blocks. It would be cool to have it implemented in
  Chapel as well, but I'd say that is very low priority at this point.
- DGEMM: I used unbounded ranges in deeply nested loops. I wonder
  whether they have any performance cost compared to bounded ranges
  (see the sketch after this commit message).
- DGEMM: In the CHIUW paper this version performed at ~60% of the
  OpenMP version's rate, but using C arrays for the tile arrays makes
  Chapel performance almost identical to OpenMP. I wonder if that
  implementation of DGEMM would be frowned upon.
- On a similar note, should we have different flavors of the PRKs,
  such as a coforall-based NStream?

Testing:
This PR only adds new tests to studies/prk. That folder was tested
locally with the standard linux64 and standard GASNet configs.

Todo:
- [x] Cosmetic changes/more Chapel-ification in PIC
- [x] Investigate a TODO comment regarding unexpected behavior in PIC

[Contributed by @e-kayrakli]
[Reviewed by @ben-albrecht]
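As an aside on the unbounded-range question in the commit message above, here is a toy comparison of the two loop forms in Chapel (not code from the PR):

```chapel
var A: [0..#8] int;

// Bounded range: both endpoints are known when the loop starts
for i in 0..7 do
  A[i] = i;

// Unbounded range: `0..` has no high bound; zippering it with the
// bounded iterand A determines when the loop stops
for (a, i) in zip(A, 0..) do
  a = i;
```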
ben-albrecht added a commit that referenced this issue Jun 8, 2017
Trivial housekeeping in PRK

Addresses some of the trivial issues in #6162:

- Header comments with contributors
- Dir name updates
- Change `validate` flag to `correctness`

test/studies/prk passes with standard linux64 config

[Reviewed by @ben-albrecht]
@e-kayrakli (Contributor) commented:

I have been working on Transpose recently and wanted to capture what is missing in the current implementation:

The PRK specification and the reference MPI1 implementation use column-major arrays for both matrices and column-wise data decomposition; the output array is then accessed in column-major order while the input is accessed in row-major order. The current Transpose implementation in Chapel does things rather haphazardly in this context. Given that there is no native column-major layout in Chapel (yet?), I think the arrays can be distributed with a row-major decomposition and the access orders can be reversed (row-major on the output array) to emulate something close to the reference implementation and the specs. A rough sketch of that idea follows.
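A minimal sketch, assuming a row-wise decomposition via Block's `targetLocales` argument (`Input`, `Output`, and `order` are illustrative names, not the benchmark code):

```chapel
use BlockDist;

config const order = 64;

const Space = {0..#order, 0..#order};

// Decompose by rows only: lay the target locales out as numLocales x 1
const rowTargets = reshape(Locales, {0..#numLocales, 0..0});
const D = Space dmapped Block(boundingBox=Space, targetLocales=rowTargets);

var Input, Output: [D] real;

// Write Output in row-major order; the transposed reads of Input then
// gather columns, roughly mirroring the reference scheme with the
// access orders reversed
forall (i, j) in D do
  Output[i, j] = Input[j, i];
```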

@e-kayrakli (Contributor) commented:

@ben-albrecht, looking at the issue again, I think there are a few things that can be added:

  • Missing PRKs, for completeness (some may be more important than others, like AMR):
    • AMR: A variation of Stencil that spawns subgrids to emulate adaptive mesh refinement
    • Branch: Very simple one that tests branch performance
    • Random: Another simple one
    • Reduce: At least a straightforward implementation should be simple. ("Reduce" may be a misnomer, as it seemingly does an element-wise vector addition where the vectors live at specific locations in memory.)
  • More clarification for DGEMM: DGEMM is distributed in its current state, but it is not SUMMA. Note that the PRK spec does not specify an algorithm, but the MPI1 implementation is based on SUMMA. FWIW, in a more proof-of-concept implementation I observed significant speedups, but not-so-good scalability, with a more naive approach where remote data is localized in bulk (sketched below). I think it is generally good to have multiple versions (including the current one, to see fine-grained communication performance), especially for something as important as matrix multiplication.
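A minimal sketch of that bulk-localization idea, under assumed names (`order`, `localB`); the actual proof-of-concept may differ:

```chapel
use BlockDist;

config const order = 256;

const Space = {0..#order, 0..#order};
const D = Space dmapped Block(boundingBox=Space);

var A, B, C: [D] real;

coforall loc in Locales do on loc {
  // One aggregated transfer of B per locale instead of per-element
  // remote reads: trading memory for communication (reads of A may
  // still be remote in this simplistic version)
  const localB: [Space] real = B;

  // Compute only the locally-owned block of C
  forall (i, j) in D.localSubdomain() do
    for k in 0..#order do
      C[i, j] += A[i, k] * localB[k, j];
}
```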

I don't think I can modify the original post, so you can interpret these however you wish and update it.

@ben-albrecht (Member, Author) commented:

@e-kayrakli - Updated. Let me know if you see anything that could be updated further.

@caizixian (Member) commented:

Sorry, I wasn't aware of the existence of this issue.
FWIW, the performance trend of Transpose as of 1.17.1 can be found in #11031.
