Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hoist chpl__getPrivatizedCopy #6184

Open
benharsh opened this issue May 5, 2017 · 5 comments
Open

Hoist chpl__getPrivatizedCopy #6184

benharsh opened this issue May 5, 2017 · 5 comments

Comments

@benharsh
Copy link
Member

benharsh commented May 5, 2017

Consider the following program:

use BlockDist;

proc main() {
  var dom = {1..10};
  var space = dom dmapped Block(dom);
  var A : [space] int;

  for i in 1..10 {
    A[i] += 1;
  }

  writeln(A);
}

When compiled with --no-local, the A[i] expression will generate a call to the runtime function chpl__getPrivatizedCopy. We currently do not hoist this call, and it surprisingly turns out to have an impact on the Stencil PRK. In the Stencil PRK, there is more than one call to chpl__getPrivatizedCopy.

On 16-nodes ugni-qthreads we observe a 15% improvement in performance when hand-modifying the code to manually hoist the chpl__getPrivatizedCopy call.

MFlop/s
Nightly 81860.5
HandOpt 93988.9

The performance improvement varies depending on problem size and other potential hand-optimizations.

@ronawho
Copy link
Contributor

ronawho commented May 6, 2017

@benharsh I had a pretty easy time getting chpl__getPrivatizedCopy hoisted for something like:

use PrivatizationWrappers;

proc myproc()  {
  var privatizedIdx = 5;
  var newValue = new C(privatizedIdx);
  insertPrivatized(newValue, privatizedIdx);

  for 1..10 {
    var a = getPrivatized(privatizedIdx);
  }
}

myproc();

but for the code you gave me it's going to be a lot trickier to do. The privatized class ends up getting passed around to a bunch of different functions and is used in different branches, so I think the defUse analysis and the domination analysis are preventing hoisting. If you have time next week, we should sit down and go through the generated code together.

An easier workaround might be to just move the runtime support for privatization into a header so the backend can inline and optimize it. Here's a first stab at that: master...ronawho:inline-privatization

It's not getting quite the performance boost you saw, but it's close at ~90-92 MFlop/s:

==> stencil-blockdist.dat <==
# master
05/05/17 	79833.6	0.243646
05/05/17 	79927.7	0.243359
05/05/17 	79953.6	0.24328

# inline privatization
05/05/17 	90730.2	0.214384
05/05/17 	89620.7	0.217038
05/05/17 	90291.2	0.215427

==> stencil-stencildist.dat <==
# master
05/05/17 	81120.2	0.239782
05/05/17 	80971.4	0.240222
05/05/17 	81020.4	0.240077

# inline privatization
05/05/17 	91998.7	0.211428
05/05/17 	91408.4	0.212794
05/05/17 	91287.2	0.213076

@benharsh
Copy link
Member Author

benharsh commented May 6, 2017

Thanks for looking into this! It seems like the most practical next step would be to make the header modifications. I'll be interested to see what other benchmarks are impacted.

We can still look over the generated code if you want.

@ronawho
Copy link
Contributor

ronawho commented May 6, 2017

Agreed, I think it makes sense to go ahead with the header mods, though longer term the LICM changes are probably still a good idea.

I didn't see many other major perf changes, here's the .dat files that had diffs https://gist.github.com/ronawho/40223c26c32c362dc012b3a14d66f18e

@ben-albrecht ben-albrecht mentioned this issue May 8, 2017
17 tasks
@benharsh
Copy link
Member Author

benharsh commented May 8, 2017

This is likely a simpler program to work with:

use BlockDist;

proc main() {
  var dom = {1..10};
  var space = dom dmapped Block(dom);
  var A : [space] int;

  for i in 1..10 do local {
    A.localAccess[i] += 1;
  }

  writeln(A);
}

If you inline LocBlockArr.this, the generated code is much easier to follow. We'll eventually want to be using localAccess somehow in the PRK anyways.

ronawho added a commit to ronawho/chapel that referenced this issue May 9, 2017
Move runtime privatization support from the .c file to the header.
chpl_getPrivatizedClass() can be called frequently, so we want to allow the
backend compiler to fully optimize/inline calls to it.

Moving the privatization source code into the header has a pretty big
performance impact for the stencil PRK, improving performance by about 15% for
16-node-xc. There's also some minor improvements for fft, and lulesh.

This is motivated by chapel-lang#6184, though
it's not quite enough to close that issue yet.
ronawho added a commit that referenced this issue May 9, 2017
Move runtime privatization support into chpl-privatization.h

[reviewed by @benharsh]

Move runtime privatization support from the .c file to the header.
chpl_getPrivatizedClass() can be called frequently, so we want to allow the
backend compiler to fully optimize/inline calls to it.

Moving the privatization source code into the header has a pretty big
performance impact for the stencil PRK, improving performance by about 15% for
16-node-xc. There's also some minor improvements for fft, and lulesh.

This is motivated by #6184, though 
it's not quite enough to close that issue yet.
ronawho added a commit to ronawho/chapel that referenced this issue May 10, 2017
This is a second attempt at chapel-lang#6198, but only moves chpl_getPrivatizedClass()
instead of the entire privatization implementation. chpl_getPrivatizedClass()
is just a getter for chpl_privateObjects, so we also need to extern to
chpl_privateObjects.

chpl_getPrivatizedClass() can be called frequently, so we want to allow the
backend compiler to fully optimize/inline calls to it. This has a pretty big
performance impact for the stencil PRK, improving performance by about 15% for
16-node-xc. There's also some minor improvements for fft, and lulesh.

This is motivated by chapel-lang#6184, though it's not quite enough to close that issue
yet.
ronawho added a commit to ronawho/chapel that referenced this issue May 10, 2017
This is a second attempt at chapel-lang#6198, but only moves chpl_getPrivatizedClass()
instead of the entire privatization implementation. chpl_getPrivatizedClass()
is just a getter for chpl_privateObjects, so we also need to extern to
chpl_privateObjects.

chpl_getPrivatizedClass() can be called frequently, so we want to allow the
backend compiler to fully optimize/inline calls to it. This has a pretty big
performance impact for the stencil PRK, improving performance by about 15% for
16-node-xc. There's also some minor improvements for fft, and lulesh.

This is motivated by chapel-lang#6184, though it's not quite enough to close that issue
yet.
@benharsh
Copy link
Member Author

Another variant to consider:

use BlockDist;

proc main() {
  var dom = {1..10};
  var space = dom dmapped Block(dom);
  var A : [space] int;

  forall i in space do local {
    A.localAccess[i] += 1;
  }

  writeln(A);
}

The forall will create coforall functions and pass A as an argument, which currently thwarts some LICM optimizations.

ronawho added a commit that referenced this issue May 11, 2017
Move chpl_getPrivatizedClass() into chpl-privatization.h

[reviewed by @benharsh, @dmk42, and @gbtitus]

This is a second attempt at #6198, but only moves chpl_getPrivatizedClass()
instead of the entire privatization implementation. chpl_getPrivatizedClass()
is a getter for chpl_privateObjects, so we also need to extern to
chpl_privateObjects.

chpl_getPrivatizedClass() can be called frequently, so we want to allow the
backend compiler to fully optimize/inline calls to it. This has a pretty big
performance impact for the stencil PRK, improving performance by about 15% for
16-node-xc. There's also some minor improvements for fft, and lulesh.

This is motivated by #6184, though it's not quite enough to close that issue
yet.
@ronawho ronawho removed their assignment Jan 19, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants