Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OpenBLAS hangs in triangular matrix multiplication on Excavator #841

Closed
andreasnoack opened this issue Apr 18, 2016 · 15 comments
Closed

OpenBLAS hangs in triangular matrix multiplication on Excavator #841

andreasnoack opened this issue Apr 18, 2016 · 15 comments

Comments

@andreasnoack
Copy link
Contributor

I'm sorry for the incomplete bug report but I don't have access to an Excavator machine. The issue was reported in https://groups.google.com/forum/#!forum/julia-stats and forcing OpenBLAS to use Piledriver kernels fixed the issue.

If you have an Excavator machine I can get access to, I can offer to prepare a better bug report.

@xianyi
Copy link
Collaborator

xianyi commented Apr 19, 2016

I only has a steamroller machine. I think it can ran excavator codes.

@andreasnoack
Copy link
Contributor Author

andreasnoack commented Apr 21, 2016

Two observations: with the build of OpenBLAS shipped with Julia, the following small Fortran program that hangs on an Excavator machine

program test

double precision :: a(2,2), b(2,1)

a(1,1) = 1
a(1,2) = 2
a(2,2) = 3
b(1,1) = 1
b(2,1) = 1

call dtrmm64_('L', 'U', 'N', 'N', 2_8, 1_8, 1.0d0, a, 2_8, b, 2_8)

write(*,*) 'done'

end program

when executed with the default coretype. If I set OPENBLAS_CORETYPE then it works. The weird thing is that if I don't set anything, the core type returned by openblas_get_corename is Excavator, but if I set OPENBLAS_CORETYPE=Excavator then it is Prescott.

Next, I tried to build OpenBLAS from source with DYNAMIC_ARCH=1 on the Excavator machine. There, the tests hang and when I try to compile my test program I get and test program also hangs (after building with lpthread. Thanks for stating the obvious @martin-frbg )

xianyi@carrizo:~$ gfortran test.f90 OpenBLAS/libopenblas.a
OpenBLAS/libopenblas.a(memory.o): In function `openblas_fork_handler':
memory.c:(.text+0x1a0): undefined reference to `pthread_atfork'
OpenBLAS/libopenblas.a(blas_server.o): In function `blas_thread_init':
blas_server.c:(.text+0x434): undefined reference to `pthread_create'
OpenBLAS/libopenblas.a(blas_server.o): In function `goto_set_num_threads':
blas_server.c:(.text+0x8db): undefined reference to `pthread_create'
OpenBLAS/libopenblas.a(blas_server.o): In function `blas_thread_shutdown_':
blas_server.c:(.text+0xb03): undefined reference to `pthread_join'
collect2: error: ld returned 1 exit status

Update: The test program is written for Julia's special OpenBLAS build with symbol renaming and 64 bit integers but when I try to link with the freshly built OpenBLAS I've, of course, adjusted the symbol names and integer sizes.

@martin-frbg
Copy link
Collaborator

At the risk of stating the obvious, it looks like your test program needs -lpthread to link.
Part of the confusion with coretypes probably stems from force_coretype in driver/others/dynamic.c
where the "strncasecmp" loop that tries to match cpu names to numbers runs only to element 21, where the relatively recently added Excavator is number 22.

@martin-frbg
Copy link
Collaborator

Again based on my still limited understanding of the codebase, would it be worth trying to force OPENBLAS_CORETYPE to "Steamroller", the generation directly preceding the Excavator cpu, in the hope of finding a more modern "working" baseline than Prescott/Piledriver ?
The KERNEL file for the EXCAVATOR target appears to be practically identical to the earlier one for STEAMROLLER (before wernsaar optimized trsm,cdot and cscal for that architecture) and the definitions in param.h appear to match as well.

@andreasnoack
Copy link
Contributor Author

OPENBLAS_CORETYPE=Steamroller works fine as well. The main issue is that most users won't (and shouldn't need to) know about OPENBLAS_CORETYPE.

@martin-frbg
Copy link
Collaborator

Sure. But knowing that Steamroller kernel works should help to track down the problem, seeing that there seem to be much fewer differences in configuration between these two than between Excavator and Piledriver. My next step if I had the hardware would be to simply copy KERNEL.STEAMROLLER over KERNEL.EXCAVATOR and add "|| defined(EXCAVATOR)" to those optimized function codes in kernel/x86_64 that already have "#if ... defined (STEAMROLLER)" - about 20 of them, but a small change in each.

@wernsaar
Copy link
Contributor

Hi,

I have built latest OpenBLAS develop on our STEAMROLLER machine.
All tests succeded

Then I modified Makefile.rule
TARGET = EXCAVATOR

The Current File KERNEL.EXCAVATOR is identical to KERNEL.PILEDRIVER

All tests succeded

Then I copied KERNEL.STEAMROLLER to KERNEL.EXCAVATOR

All tests succeded

The best configuration is, to copy KERNEL.STEAMROLLER to KERNEL.EXCAVATOR

But as Martin says, there are some files ( #if ... defined(STEAMROLLER)
|| defined(EXCAVATOR) ) , that should be modified for better performance.

I recommand, do not use gcc 5.2.x.

Best regards
Werner

On 04/23/2016 03:06 PM, Martin Kroeker wrote:

Sure. But knowing that Steamroller kernel works should help to track
down the problem, seeing that there seem to be much fewer differences
in configuration between these two than between Excavator and
Piledriver. My next step if I had the hardware would be to simply copy
KERNEL.STEAMROLLER over KERNEL.EXCAVATOR and add "||
defined(EXCAVATOR)" to those optimized function codes in kernel/x86_64
that already have "#if ... defined (STEAMROLLER)" - about 20 of them,
but a small change in each.


You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub
#841 (comment)

@andreasnoack
Copy link
Contributor Author

The program doesn't hang if I build with TARGET=EXCAVATOR. It's only when I build with DYNAMIC_ARCH=1.

@martin-frbg
Copy link
Collaborator

I do not see much difference between DYNAMIC_ARCH with and without OPENBLAS_CORETYPE set - in either case, gotoblas_dynamic_init() from driver/others/dynamic.c will try to read the environment variable first. Only when it is not set, get_coretype() is called (which from what you posted above apparently identifies the Excavator cpu correctly). So would your test code hang as well once the loop in force_coretype() is fixed to return the correct number "22" for Excavator instead of falling back to Prescott ?

@andreasnoack
Copy link
Contributor Author

Yes. I've extended the loop but the error is still there for DYNAMIC_ARCH=1 but not without. @wernsaar Did you try with DYNAMIC_ARCH=1?

@wernsaar
Copy link
Contributor

No,

I simpy built for TARGET=EXCAVATOR

On 04/25/2016 04:28 AM, Andreas Noack wrote:

Yes. I've extended the loop but the error is still there for
|DYNAMIC_ARCH=1| but not without. @wernsaar
https://github.com/wernsaar Did you try with |DYNAMIC_ARCH=1|?


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub
#841 (comment)

@martin-frbg
Copy link
Collaborator

Could you get a backtrace at the time of the hang (perhaps something along the lines of #716, though it does not segfault in your case) ? Or just try running single-threaded first (which I suspect will work fine even with DYNAMIC_ARCH=1).

Somewhat unrelated, glancing through the changes made to fix #716 by ensuring that getenv is called not more than once (which would appear to still be violated in the dynamic case ?), I notice that driver/others/parameter.c has a few cases where an #ifdef with a long list of cpu models does contain STEAMROLLER but not EXCAVATOR, but this probably only affects the efficiency with which cpu cache sizes are determined.

@wernsaar
Copy link
Contributor

Hi

I have solved the problem.

I will push the chances soon.

Best regards
Werner

On 04/25/2016 09:10 AM, Martin Kroeker wrote:

Could you get a backtrace at the time of the hang (perhaps something
along the lines of #716
#716, though it does not
segfault in your case) ? Or just try running single-threaded first
(which I suspect will work fine even with DYNAMIC_ARCH=1).

Somewhat unrelated, glancing through the changes made to fix #716
#716 by ensuring that
getenv is called not more than once (which would appear to still be
violated in the dynamic case ?), I notice that
driver/others/parameter.c has a few cases where an #ifdef with a long
list of cpu models does contain STEAMROLLER but not EXCAVATOR, but
this probably only affects the efficiency with which cpu cache sizes
are determined.


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub
#841 (comment)

@wernsaar
Copy link
Contributor

Hi,

I pushed the bug fixes and enhancements for EXCAVATOR.

Please checkout latest develop branch and test on a real EXCAVATOR machine.

Best regards
Werner

@andreasnoack
Copy link
Contributor Author

Great. I've just tried it and I can confirm that it fixed the issue. Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants