Hoist chpl__getPrivatizedCopy #6184

benharsh · 2017-05-05T22:16:18Z

Consider the following program:

use BlockDist;

proc main() {
  var dom = {1..10};
  var space = dom dmapped Block(dom);
  var A : [space] int;

  for i in 1..10 {
    A[i] += 1;
  }

  writeln(A);
}

When compiled with --no-local, the A[i] expression generates a call to the runtime function chpl__getPrivatizedCopy. We currently do not hoist this call out of the loop, and that turns out to have a surprisingly large impact on the Stencil PRK, which contains more than one call to chpl__getPrivatizedCopy.

On 16 nodes with ugni-qthreads, we observe a roughly 15% improvement in performance when hand-modifying the code to hoist the chpl__getPrivatizedCopy call.

Build     MFlop/s
Nightly   81860.5
HandOpt   93988.9

The performance improvement varies depending on problem size and other potential hand-optimizations.
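
For readers unfamiliar with the hand-optimization described above, here is a minimal C sketch of hoisting a loop-invariant lookup. It is not the Chapel-generated code: get_privatized, private_objects, table_storage, and lookup_calls are hypothetical stand-ins used only to show the shape of the transformation and to count how often the lookup runs.

```c
#include <assert.h>

/* Hypothetical stand-ins for illustration; not the Chapel runtime. */
static int table_storage[10];                          /* the "array" data */
static void *private_objects[1] = { table_storage };   /* pid -> object */
static long lookup_calls = 0;                          /* counts lookups */

/* Plays the role of a getter such as chpl__getPrivatizedCopy. */
static void *get_privatized(int pid) {
  assert(pid == 0);  /* only one slot exists in this sketch */
  lookup_calls++;
  return private_objects[pid];
}

/* Before: the lookup is repeated on every iteration. */
static void increment_unhoisted(int pid, int n) {
  for (int i = 0; i < n; i++) {
    int *a = get_privatized(pid);
    a[i] += 1;
  }
}

/* After: pid does not change in the loop, so the lookup is done once. */
static void increment_hoisted(int pid, int n) {
  int *a = get_privatized(pid);
  for (int i = 0; i < n; i++) {
    a[i] += 1;
  }
}
```

Calling each function with n = 10 performs ten lookups in the first version and one in the second; the array updates are identical.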


ronawho · 2017-05-06T01:13:19Z

@benharsh I had a pretty easy time getting chpl__getPrivatizedCopy hoisted for something like:

use PrivatizationWrappers;

proc myproc()  {
  var privatizedIdx = 5;
  var newValue = new C(privatizedIdx);
  insertPrivatized(newValue, privatizedIdx);

  for 1..10 {
    var a = getPrivatized(privatizedIdx);
  }
}

myproc();

but for the code you gave me it's going to be a lot trickier to do. The privatized class ends up getting passed around to a bunch of different functions and is used in different branches, so I think the defUse analysis and the domination analysis are preventing hoisting. If you have time next week, we should sit down and go through the generated code together.

An easier workaround might be to just move the runtime support for privatization into a header so the backend can inline and optimize it. Here's a first stab at that: master...ronawho:inline-privatization

It's not getting quite the performance boost you saw, but it's close at roughly 90,000-92,000 MFlop/s:

==> stencil-blockdist.dat <==
# master
05/05/17 	79833.6	0.243646
05/05/17 	79927.7	0.243359
05/05/17 	79953.6	0.24328

# inline privatization
05/05/17 	90730.2	0.214384
05/05/17 	89620.7	0.217038
05/05/17 	90291.2	0.215427

==> stencil-stencildist.dat <==
# master
05/05/17 	81120.2	0.239782
05/05/17 	80971.4	0.240222
05/05/17 	81020.4	0.240077

# inline privatization
05/05/17 	91998.7	0.211428
05/05/17 	91408.4	0.212794
05/05/17 	91287.2	0.213076

benharsh · 2017-05-06T15:52:33Z

Thanks for looking into this! It seems like the most practical next step would be to make the header modifications. I'll be interested to see what other benchmarks are impacted.

We can still look over the generated code if you want.

ronawho · 2017-05-06T23:20:08Z

Agreed, I think it makes sense to go ahead with the header mods, though longer term the LICM changes are probably still a good idea.

I didn't see many other major perf changes. Here are the .dat files that had diffs: https://gist.github.com/ronawho/40223c26c32c362dc012b3a14d66f18e

benharsh · 2017-05-08T23:12:58Z

This is likely a simpler program to work with:

use BlockDist;

proc main() {
  var dom = {1..10};
  var space = dom dmapped Block(dom);
  var A : [space] int;

  for i in 1..10 do local {
    A.localAccess[i] += 1;
  }

  writeln(A);
}

If you inline LocBlockArr.this, the generated code is much easier to follow. We'll eventually want to be using localAccess somehow in the PRK anyways.


Move runtime privatization support into chpl-privatization.h [reviewed by @benharsh]

Move runtime privatization support from the .c file to the header. chpl_getPrivatizedClass() can be called frequently, so we want to allow the backend compiler to fully optimize and inline calls to it. Moving the privatization source code into the header has a sizable performance impact for the stencil PRK, improving performance by about 15% for 16-node-xc. There are also some minor improvements for fft and lulesh. This is motivated by #6184, though it's not quite enough to close that issue yet.


benharsh · 2017-05-10T23:39:06Z

Another variant to consider:

use BlockDist;

proc main() {
  var dom = {1..10};
  var space = dom dmapped Block(dom);
  var A : [space] int;

  forall i in space do local {
    A.localAccess[i] += 1;
  }

  writeln(A);
}

The forall will create coforall functions and pass A as an argument, which currently thwarts some LICM optimizations.


Move chpl_getPrivatizedClass() into chpl-privatization.h [reviewed by @benharsh, @dmk42, and @gbtitus]

This is a second attempt at #6198, but it only moves chpl_getPrivatizedClass() instead of the entire privatization implementation. chpl_getPrivatizedClass() is a getter for chpl_privateObjects, so the header also needs an extern declaration of chpl_privateObjects. chpl_getPrivatizedClass() can be called frequently, so we want to allow the backend compiler to fully optimize and inline calls to it. This has a sizable performance impact for the stencil PRK, improving performance by about 15% for 16-node-xc. There are also some minor improvements for fft and lulesh. This is motivated by #6184, though it's not quite enough to close that issue yet.
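
As a rough illustration of the pattern that commit message describes (a getter defined in the header, backed by an extern table defined in one .c file), here is a single-file C sketch. The sketch_ names, the fixed table size of 8, and the exact types are assumptions made for illustration; the real contents of chpl-privatization.h may differ.

```c
#include <assert.h>
#include <stdint.h>

/* ---- Part that would live in the header ---- */

/* Table of privatized objects, defined in exactly one .c file.
   Modeled on the chpl_privateObjects name from the commit message;
   the type here is an assumption. */
extern void **sketch_privateObjects;

/* Because the body is visible in the header, the backend C compiler
   can inline this getter at every call site. */
static inline void *sketch_getPrivatizedClass(int64_t i) {
  return sketch_privateObjects[i];
}

/* ---- Part that would stay in the .c file ---- */

static void *sketch_storage[8];  /* hypothetical fixed capacity */
void **sketch_privateObjects = sketch_storage;

/* Hypothetical setter, included only so the sketch can be exercised. */
void sketch_newPrivatizedClass(void *obj, int64_t i) {
  assert(i >= 0 && i < 8);
  sketch_storage[i] = obj;
}
```

With the getter's body in the header, a call inside a loop reduces to a load from sketch_privateObjects, which the C compiler may then inline and optimize further. The getter itself does no bounds checking, so callers are expected to pass a valid index.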

benharsh assigned benharsh and ronawho and unassigned benharsh May 5, 2017

ronawho added area: Compiler type: Performance labels May 5, 2017

ben-albrecht mentioned this issue May 8, 2017: Improve PRKs #6162 (Open, 17 tasks)

ronawho mentioned this issue May 9, 2017: Move runtime privatization support into chpl-privatization.h #6198 (Merged)

ronawho mentioned this issue May 10, 2017: Move chpl_getPrivatizedClass() into chpl-privatization.h #6212 (Merged)

This was referenced Jun 6, 2017: Improve loop invariant code motion for PRK-DGEMM #6388 (Open); Improve Loop Invariant Code Motion #6411 (Open)

ronawho mentioned this issue Jan 13, 2018: Introducing Quiescent State-Based Reclamation to Chapel #8182 (Merged, 6 tasks)

ronawho removed their assignment Jan 19, 2018

ronawho mentioned this issue Apr 28, 2018: Performance with --llvm lags for prk-stencil with --no-local #8060 (Closed)

mppf mentioned this issue Nov 17, 2023: privatization meta-issue #23877 (Open)
