
Race Condition in Multithreaded OpenBLAS on IBM OpenPower 8 #1071

Closed
grisuthedragon opened this issue Jan 20, 2017 · 34 comments

@grisuthedragon
Contributor

grisuthedragon commented Jan 20, 2017

Following the discussion about compiling HPL on the POWER8 platform, I tried several of my codes with OpenBLAS from the development branch on an IBM OpenPower 8.
The machine is an IBM Model 8335-GTB (2x10 cores, 8-way SMT, 160 virtual cores) with CentOS 7.3, gcc 5.4, glibc 2.17 and kernel 4.8.

I compiled the current OpenBLAS development version ( ab2033f ) using

make USE_OPENMP=1 USE_THREAD=1 

and linked it against my code (Fortran, without any OpenMP-parallelized parts). The code solves a Sylvester equation AXB+CXD=F (A, B, C, D, F, X are 1024x1024 matrices) and relies on the following BLAS operations (determined using the profiling feature of OpenBLAS):

  • dswap
  • dscal
  • dnrm2
  • idamax
  • drot
  • dgemv
  • dger
  • dtrmv
  • dgemm
  • dlaswp (see below in the posts)

The code solves the equation, computing X 25 times, and checks the forward error. When I execute the code with OMP_NUM_THREADS=20 set (to avoid over-subscription of the cores), more than half of the performed computations are wrong (ferr := || X_true - X_computed || / || X_true ||):

Run:   0 	 ferr = 6.12559e-12
Run:   1 	 ferr = 6.12559e-12
Run:   2 	 ferr = 0.0187795
Run:   3 	 ferr = 0.63829
Run:   4 	 ferr = 6.12559e-12
Run:   5 	 ferr = 0.0902158
Run:   6 	 ferr = 0.0154748
Run:   7 	 ferr = 4367.12
Run:   8 	 ferr = 4.22541
Run:   9 	 ferr = 6.12559e-12
Run:  10 	 ferr = 6.12559e-12
Run:  11 	 ferr = 0.00973169
Run:  12 	 ferr = 30.6905
Run:  13 	 ferr = 6.12559e-12
Run:  14 	 ferr = 12.3574
Run:  15 	 ferr = 334.143
Run:  16 	 ferr = 574.977
Run:  17 	 ferr = 0.0585436
Run:  18 	 ferr = 258.05
Run:  19 	 ferr = 401.401
Run:  20 	 ferr = 493.521
Run:  21 	 ferr = 0.112386
Run:  22 	 ferr = 0.0393082
Run:  23 	 ferr = 0.258171
Run:  24 	 ferr = 543.353

If I run the same code with OMP_NUM_THREADS=1, or with the reference BLAS implementation and OMP_NUM_THREADS unset (to ensure that really nothing in my code depends on threading), I obtain a forward error ferr of approx. 10^-12 for all runs of the benchmark. I already checked what happens if I restrict the number of threads to 20 at compile time (make ... NUM_THREADS=20); this still yields wrong results, but more of the runs are correct. Disabling threading in OpenBLAS completely gives correct results as well.

From this observation I concluded that something in the threading goes wrong, i.e. there is a race condition.

Interestingly, if I use the PGI for OpenPower compiler suite, which ships a BLAS implementation based on OpenBLAS (I think a slightly modified one so that it can be compiled with the PGI compiler on the ppc64le architecture), the same error appears. This means the bug is not in the GNU OpenMP implementation, because PGI uses its own separate one.

Unfortunately, I do not have a minimal non-working example for this bug yet, because the code mentioned above is part of current research.

@martin-frbg
Collaborator

I wonder if valgrind/helgrind is available for power8, or if you might have access to some comparable thread debugger? I made an amateurish attempt to fix the races I came upon in #1052, but may well have missed something ifdef'd for non-x86 platforms (conversely, you may even want to try a version preceding 87c7d10 of Jan 8 in case I managed to break something for you).

@grisuthedragon
Contributor Author

grisuthedragon commented Jan 21, 2017

Helgrind is available, and I have prepared a smaller example (solving for a 256x256 matrix) for it; otherwise a run would take days. I noticed that even for this small problem I get wrong results in <10% of the cases. I compiled the current OpenBLAS again with debug symbols enabled and configured for 20 OpenMP threads, and I get the following helgrind output:

https://gist.github.com/grisuthedragon/43a36d248454fd02c52ea18aa1b2614f

(It is too long to be posted here, hence the gist link.)

The error log also shows some entries from the HDF5 library, but even examples where the data is not read via HDF5 compute wrong results.

PS: I also tried 87c7d10, but this does not change anything. Since I see the same errors in PGI's OpenBLAS (released Nov 16, 2016), the change in 87c7d10 does not influence this behavior. I also tested on a 16-core x86-64 Haswell Xeon, and everything works fine, at least as far as I can tell after 1000 runs of the code.

@martin-frbg
Collaborator

Thanks for the helgrind.log - at first glance it looks as if at least part of the problem was in libgomp, but I now see from the helgrind manual that it simply cannot keep track of OMP's threading primitives unless gcc, including libgomp, was specially compiled with the --disable-linux-futex option. Looking further, according to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55561 another option seems to be to compile everything with -fsanitize=thread to let gcc itself do the checking, but this again would need a libgomp specifically recompiled with --disable-linux-futex (that bugzilla entry seems to have last been edited around the time of gcc 4.9, so it is unclear whether your 5.4 contains any improvements over that stage). I am not sure how to proceed.

@martin-frbg
Collaborator

IFF the reduced testcase runs quickly enough, you could try commenting out the DGEMVNKERNEL and/or DGEMMKERNEL line(s) in KERNEL.POWER8 to see if the issue crept in with the more recent optimizations.

@brada4
Contributor

brada4 commented Jan 21, 2017

helgrind points a finger at interface/lapack/laswp.c.
Is it possible to edit that file so that line 82,
if (nthreads == 1) {
becomes
if (1) {
i.e. disable threading in just this one function?

The rest of the output looks like a heap of false positives, as in the gcc bugzilla...

@grisuthedragon
Contributor Author

@martin-frbg Unfortunately, the thread sanitizer is only available on x86_64 as far as I know. But I recompiled gcc with --disable-linux-futex and will do a helgrind run again. Commenting out the specialized kernels is on my to-do list as well.
@brada4 This does not change anything, but it leads me to a new idea.

In general, if I remember correctly, I ran into a similar error on a Power 7 (Big Endian) two years ago, but back then I stopped using OpenBLAS on that machine and switched to ATLAS.

@grisuthedragon
Contributor Author

grisuthedragon commented Jan 22, 2017

Turning on --disable-linux-futex slows down the computations incredibly and makes helgrind run about 10x slower. But running the original large problem without helgrind then gives the correct results (100 runs).

Then I tried the following, this time with the compiler with linux-futex enabled: I compiled OpenBLAS without LAPACK support and used Netlib LAPACK, to reduce the number of possible error sources a bit. This does not yield correct results, but I can conclude that the dlaswp function seems to be OK. On top of this setup I took a look at how the GEMM call is done and noticed that the old threading model can still be turned on. So I compiled OpenBLAS with

make USE_OPENMP=1 USE_THREAD=1 NUM_THREADS=20 NO_LAPACK=1 USE_SIMPLE_THREADED_LEVEL3=1

Now I get the correct results as well (checked for 1000 runs of the benchmark). From this point of view the problem must be somewhere inside the new GEMM threading in combination with OpenMP. I will not say GNU OpenMP, because the PGI version shows the same strange behavior.

PS: The OpenBLAS version delivered with the 2016.10 community edition of the PGI compiler suite is 0.2.18.

@martin-frbg
Collaborator

Thanks for the extensive testing. Now you seem to have pushed the problem back into generic code, when I was hoping it was something specific to the POWER platform? But if it is indeed the gemm threading, then hopefully the existing gemm checks and benchmarks are sufficient to expose the problem. (Not sure if I am able to tackle this problem though...)

@martin-frbg
Collaborator

martin-frbg commented Jan 22, 2017

Hmm. A quick check on Haswell, running helgrind on "benchmark/dgemm.goto 100 100 1" indeed yields complaints about a race between two instances of inner_thread without any locks held. (One at line 433 of level3_thread.c called from blas_thread_server, the other at line 494 of the same file, called from dgemm_thread_nn via exec_blas). And they go away when USE_SIMPLE_THREADED_LEVEL3 is set. Food for thought at least...

@brada4
Contributor

brada4 commented Jan 23, 2017

@martin-frbg the other martin's helgrind output looked as if laswp reads past where it is expected to.

@martin-frbg
Collaborator

@brada4 Not sure we can trust the original log, where helgrind had no idea what OMP was doing behind its back. Also, grisuthedragon has apparently narrowed it down to GEMM already by removing the OpenBLAS dlaswp from the equation, and a thread data race demonstrably occurs in level3_thread.c even on x86_64 when the more elaborate multithreading code is used. (It is possible that the likelihood of the race occurring depends on the implementation of YIELDING for the platform - at least I remember from #990 that doing away with sched_yield() on x86_64 seemed to increase the likelihood of lockups - hopefully now fixed through #995.)
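(Side note for anyone reading along: YIELDING is a macro from OpenBLAS's common.h that the busy-wait loops in the threading code use. Paraphrased from memory rather than quoted, the default definition is roughly

#ifndef YIELDING
#define YIELDING    sched_yield()
#endif

while some targets override it with a short run of nop instructions instead, so how eagerly a waiting thread gives up its time slice differs from platform to platform.)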

@grisuthedragon
Contributor Author

So, as promised, here is the helgrind output of the small example with --disable-linux-futex in the compiler but with the Power8 optimizations enabled:

https://gist.github.com/grisuthedragon/87df1ae18702fb30d1ee546c08085362

As in the small Haswell experiment of @martin-frbg, the inner_thread routine is involved.

Due to the disabled futex, one run takes up to 6 hours -.-

@brada4
Contributor

brada4 commented Jan 26, 2017

I looked through your log - it looks like all false positives; i.e., a one-cent optimisation might be possible by letting the threads of the next step in the (in->kernel->out) processing path start on warm data, but it does not explain the numerical mayhem.

Let's try another thing - all 0s, all 1s, checkerboard (all combinations of those passed to dgemm). Any off-pattern stripes would mean one of the 3 functions has an off-by-one error choosing strides; try something you can easily see on the screen.
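A rough sketch of such a test (hypothetical, written against the C interface; the matrix size, patterns and tolerance here are arbitrary choices of mine, not anything taken from OpenBLAS):

#include <cblas.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

/* Fill an n x n column-major matrix with all 0s, all 1s or a checkerboard. */
static void fill(double *M, int n, int pattern)
{
    for (int j = 0; j < n; j++)
        for (int i = 0; i < n; i++)
            M[i + (size_t)j * n] = (pattern == 0) ? 0.0
                                 : (pattern == 1) ? 1.0
                                 : (double)((i + j) & 1);
}

int main(void)
{
    const int n = 256;
    double *A = malloc((size_t)n * n * sizeof(double));
    double *B = malloc((size_t)n * n * sizeof(double));
    double *C = malloc((size_t)n * n * sizeof(double));
    double *R = malloc((size_t)n * n * sizeof(double));

    for (int pa = 0; pa < 3; pa++)
        for (int pb = 0; pb < 3; pb++) {
            fill(A, n, pa);
            fill(B, n, pb);

            /* exact reference via a naive triple loop */
            for (int j = 0; j < n; j++)
                for (int i = 0; i < n; i++) {
                    double s = 0.0;
                    for (int k = 0; k < n; k++)
                        s += A[i + (size_t)k * n] * B[k + (size_t)j * n];
                    R[i + (size_t)j * n] = s;
                }

            cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                        n, n, n, 1.0, A, n, B, n, 0.0, C, n);

            /* any stripe that breaks the expected pattern shows up here */
            for (int j = 0; j < n; j++)
                for (int i = 0; i < n; i++)
                    if (fabs(C[i + (size_t)j * n] - R[i + (size_t)j * n]) > 1e-8)
                        printf("patterns (%d,%d): mismatch at (%d,%d): %g vs %g\n",
                               pa, pb, i, j, C[i + (size_t)j * n], R[i + (size_t)j * n]);
        }

    free(A); free(B); free(C); free(R);
    return 0;
}

Built against libopenblas (plus -lm), any off-pattern stripe in the output would point at a stride/addressing problem rather than a race.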

@martin-frbg
Collaborator

Wouldn't an error in stride calculation lead to rather obvious and unavoidable errors on any platform running more than a single thread, in contrast to thread data races, where "merely" the likelihood of bad access patterns increases with the number of threads?
(I have managed to reduce the number of warnings from my helgrind testcase a bit already, but further changes have only introduced lockups so far.)

@brada4
Contributor

brada4 commented Jan 26, 2017

Could be base-0 vs. base-1 numbering etc... You never know.

@grisuthedragon
Contributor Author

So guys, I started looking into the gemm scheduler, but I do not get how this thing really works. Is there any paper/preprint/whatever publication describing the "new" scheduling? I only found the papers by Goto and Robert van de Geijn about the optimization of the kernels.

@martin-frbg
Collaborator

I looked already but did not find any, nor was there any explanation to be found in the various old versions of libgoto I had archived. The gotoblas mailing list archive is gone from the utexas website and is not available at archive.org either. On arxiv.org I did come across a very recent whitepaper by Smith & van de Geijn (labeled FLAME working note 83) that touches on how Goto's algorithm tries to optimize cache use but it does not address the topic of parallelization at all.

@brada4
Contributor

brada4 commented Feb 24, 2017

It does not look complex by any means:
driver/level3/gemm_thread_variable.c
If the GEMM parallelisation were at fault, don't you think it would have shown up on Haswell already, where nobody has complained yet?

@grisuthedragon
Contributor Author

grisuthedragon commented May 17, 2017

After a few weeks I ran into problems related to this issue again. This time I have a piece of code which reproduces the error. The attached code below computes the QR decomposition of a matrix using the tile-QR approach and is parallelized using OpenMP 4. Details about the algorithm can be found here; a stubbed sketch of the task structure is at the end of this comment.

System Details:

  • CentOS 7.3, little endian
  • IBM Power8 (SL822C) (20 Cores, 160 Threads)
  • GCC 5.4.1

For the first experiment I compiled the current OpenBLAS using

make USE_OPENMP=1 

and compiled the attached code by

gfortran -fopenmp tileqr.f90 path/to/openblas/libopenblas.a 

As the result of a test run I get:

$ OMP_THREAD_LIMIT=20 OMP_NUM_THREADS=20 ./a.out 10240 10240 256 32
LAPACK   -- Time:   46.312    Gflops/s:   30.913
Tiled    -- Time:    6.366    Gflops/s:  224.905
Max diff between R from tiled   and LAPACK is: 0.9E+05

which obviously did not return the correct result.

Using a single threaded OpenBLAS compiled with

make USE_THREAD=0 USE_OPENMP=0

I get

$  OMP_THREAD_LIMIT=20 OMP_NUM_THREADS=20 ./a.out 10240 10240 256 32
LAPACK   -- Time:   73.346    Gflops/s:   19.519
Tiled    -- Time:    5.364    Gflops/s:  266.911
 
Max diff between R from tiled   and LAPACK is: 0.2+226

which is wrong as well. Running the same code with RefBLAS and RefLAPACK, the maximum difference is on the order of 10^(-16). The same holds true if I use IBM ESSL as the BLAS library together with RefLAPACK. From this point of view it seems that the OpenBLAS functions, even in single-threaded operation mode, are not thread safe.

tileqr.f90.txt
The code only works if m % b = 0 and n % b = 0, which is guaranteed above. Furthermore, the performance of DGEQRT from LAPACK seems to be poor.
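For anyone who does not want to dig through the Fortran: the task structure of the tile QR is roughly the following. This is a heavily stubbed, hypothetical C/OpenMP sketch of the dependency pattern only; the kernel names are placeholders, not the routines used in tileqr.f90, and the tile count is arbitrary.

#include <stdio.h>
#include <stdlib.h>

#define NT 4        /* number of tile rows/columns (placeholder value) */
#define TS 256      /* tile size, as in the runs above */

/* Placeholder tile kernels - stand-ins for the factorization and update
 * steps, not real BLAS/LAPACK calls. */
static void factor_diag(double *Akk)                          { (void)Akk; }
static void update_row (const double *Akk, double *Akn)       { (void)Akk; (void)Akn; }
static void factor_sub (double *Akk, double *Amk)             { (void)Akk; (void)Amk; }
static void update_sub (const double *Amk, double *Akn, double *Amn)
{ (void)Amk; (void)Akn; (void)Amn; }

int main(void)
{
    static double *A[NT][NT];
    for (int i = 0; i < NT; i++)
        for (int j = 0; j < NT; j++)
            A[i][j] = calloc((size_t)TS * TS, sizeof(double));

    #pragma omp parallel
    #pragma omp single
    for (int k = 0; k < NT; k++) {
        #pragma omp task depend(inout: A[k][k])
        factor_diag(A[k][k]);                 /* QR of the diagonal tile */

        for (int n = k + 1; n < NT; n++) {
            #pragma omp task depend(in: A[k][k]) depend(inout: A[k][n])
            update_row(A[k][k], A[k][n]);     /* apply Q^T to the k-th tile row */
        }
        for (int m = k + 1; m < NT; m++) {
            #pragma omp task depend(inout: A[k][k]) depend(inout: A[m][k])
            factor_sub(A[k][k], A[m][k]);     /* annihilate tile (m,k) against (k,k) */

            for (int n = k + 1; n < NT; n++) {
                #pragma omp task depend(in: A[m][k]) depend(inout: A[k][n]) depend(inout: A[m][n])
                update_sub(A[m][k], A[k][n], A[m][n]);   /* propagate the update */
            }
        }
    }   /* all tasks have completed at the implicit barrier here */

    for (int i = 0; i < NT; i++)
        for (int j = 0; j < NT; j++)
            free(A[i][j]);
    puts("task graph executed");
    return 0;
}

Compiled with -fopenmp this only runs the (empty) task graph, but it shows which tiles each task reads and writes, and therefore where the BLAS/LAPACK kernels inside the real implementation get called concurrently from several OpenMP threads.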

@brada4
Contributor

brada4 commented May 17, 2017

It is an accuracy problem, not thread safety.

@grisuthedragon
Contributor Author

grisuthedragon commented May 18, 2017

@brada4 On the one hand, it works on x86-64 with different OpenBLAS and MKL versions, so this cannot be an accuracy problem. On the other hand, I am computing a QR decomposition, which is built on top of orthogonal/unitary transformations, so from the numerical point of view the differences between LAPACK and the new implementation cannot be explained that way. Furthermore, the fact that the single- and multi-threaded OpenBLAS results differ by more than 200 orders of magnitude cannot be explained by accuracy problems. (A minimal stand-alone way to probe this is sketched at the end of this comment.)

Another argument against an accuracy problem is that if I set the number of threads to 1 at runtime, I obtain the correct result (independent of the threading capabilities of OpenBLAS):

$ OMP_THREAD_LIMIT=1 OMP_NUM_THREADS=1 ./a.out 5120 5120 256 32
LAPACK   -- Time:    8.671    Gflops/s:   20.637
Tiled    -- Time:    9.205    Gflops/s:   19.440
 
Max diff between R from tiled   and LAPACK is: 0.1E-15
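For reference, a minimal stand-alone probe of the thread-safety question - this is not my tile-QR code, just a hypothetical sketch against the C interface - looks something like the following: every OpenMP thread repeatedly calls dgemm on the same read-only inputs into its own output buffer and compares against a reference computed before the parallel region. With a deterministic, thread-safe BLAS the difference should always be exactly zero.

#include <cblas.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#define N 256

int main(void)
{
    double *A    = malloc((size_t)N * N * sizeof(double));
    double *B    = malloc((size_t)N * N * sizeof(double));
    double *Cref = malloc((size_t)N * N * sizeof(double));

    srand(7);
    for (size_t i = 0; i < (size_t)N * N; i++) {
        A[i] = (double)rand() / RAND_MAX;
        B[i] = (double)rand() / RAND_MAX;
    }

    /* reference result, computed before any OpenMP threads exist */
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                N, N, N, 1.0, A, N, B, N, 0.0, Cref, N);

    int failures = 0;
    #pragma omp parallel reduction(+: failures)
    {
        double *C = malloc((size_t)N * N * sizeof(double));
        for (int run = 0; run < 100; run++) {
            cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                        N, N, N, 1.0, A, N, B, N, 0.0, C, N);
            double maxdiff = 0.0;
            for (size_t i = 0; i < (size_t)N * N; i++)
                maxdiff = fmax(maxdiff, fabs(C[i] - Cref[i]));
            if (maxdiff != 0.0) failures++;    /* result differs from serial run */
        }
        free(C);
    }

    printf("runs with a result different from the serial reference: %d\n", failures);
    free(A); free(B); free(Cref);
    return failures != 0;
}

Linked against the single-threaded libopenblas and compiled with -fopenmp, a non-zero count here would point at a thread-safety problem rather than an accuracy problem.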

@martin-frbg
Collaborator

Possibly not a race this time but side effects of the allegedly broken assembly kernels for PPC (issue #1078)

@grisuthedragon
Contributor Author

@martin-frbg I think that is the only plausible explanation at the moment, because on x86-64 (Sandy Bridge, Haswell) I have never observed such behavior.

I am not able to write the assembly code myself. The only thing I can do is perform tests on the POWER platforms (Power 8 little endian / Power 7 big endian).

@grisuthedragon
Contributor Author

In order to see which optimized routine the errors are coming from, I had the following idea: I replaced KERNEL.POWER8 by the most generic version, which relies only on routines written in pure C: KERNEL.POWER8
Then I compiled OpenBLAS using:

make USE_OPENMP=1 CC=gcc-5 CXX=g++-5 FC=gfortran-5 NO_LAPACK=1

which unfortunately leads to some errors in the BLAS tests (for double, complex, and complex*16 the results are similar):

cat test/SBLAT3.SUMM 
 TESTS OF THE REAL             LEVEL 3 BLAS

 THE FOLLOWING PARAMETER VALUES WILL BE USED:
   FOR N                   0     1     2     3     7    31
   FOR ALPHA             0.0   1.0   0.7
   FOR BETA              0.0   1.0   1.3

 ROUTINES PASS COMPUTATIONAL TESTS IF TEST RATIO IS LESS THAN   16.00

 RELATIVE MACHINE PRECISION IS TAKEN TO BE  1.2E-07

 SGEMM  PASSED THE TESTS OF ERROR-EXITS

 SGEMM  PASSED THE COMPUTATIONAL TESTS ( 17496 CALLS)

 SSYMM  PASSED THE TESTS OF ERROR-EXITS

 ******* FATAL ERROR - COMPUTED RESULT IS LESS THAN HALF ACCURATE *******
           EXPECTED RESULT   COMPUTED RESULT
       1      0.262339         -0.849231E-01
      THESE ARE THE RESULTS FOR COLUMN   1
 ******* SSYMM  FAILED ON CALL NUMBER:
    382: SSYMM ('R','U',  1,  7, 1.0, A,  8, B,  2, 0.0, C,  2)    .

 STRMM  PASSED THE TESTS OF ERROR-EXITS

 ******* FATAL ERROR - COMPUTED RESULT IS LESS THAN HALF ACCURATE *******
           EXPECTED RESULT   COMPUTED RESULT
       1      0.485095          0.186814    
      THESE ARE THE RESULTS FOR COLUMN   2
 ******* STRMM  FAILED ON CALL NUMBER:
    758: STRMM ('R','U','N','U',  1,  7, 1.0, A,  8, B,  2)        .

 STRSM  PASSED THE TESTS OF ERROR-EXITS

 ******* FATAL ERROR - COMPUTED RESULT IS LESS THAN HALF ACCURATE *******
           EXPECTED RESULT   COMPUTED RESULT
       1      0.186813         -0.166775E-01
      THESE ARE THE RESULTS FOR COLUMN   1
 ******* STRSM  FAILED ON CALL NUMBER:
    764: STRSM ('R','U','T','U',  1,  7, 1.0, A,  8, B,  2)        .

 SSYRK  PASSED THE TESTS OF ERROR-EXITS

 SSYRK  PASSED THE COMPUTATIONAL TESTS (  1944 CALLS)

 SSYR2K PASSED THE TESTS OF ERROR-EXITS

 SSYR2K PASSED THE COMPUTATIONAL TESTS (  1944 CALLS)

 END OF TESTS
 

Level 1 and Level 2 operations work without any error. If we get this working correctly, then I can plug the optimized routines back into KERNEL.POWER8 one by one and we will see which one breaks at least my application.

@martin-frbg
Collaborator

The stock KERNEL.POWER8 file has a number of commented-out entries for xSYMV functions near the end - maybe it would make sense to enable these (first?) to see if switching back to their generic implementations helps.

@grisuthedragon
Contributor Author

If I only enable the generic implementations of the xSYMV routines, all BLAS tests pass, but for my application this does not change anything.

@martin-frbg
Collaborator

Getting the BLAS tests to pass would appear to be an important first step in any case. Alan Modra has fixed the inline assembly contained in the local .c implementations, so the likely candidates for further tests are the implementations in the various power8 .S files - xGEMM, xTRSM, xTRMM - or do I take your reply to mean that switching everything back to the non-optimized defaults does not make your code run correctly either?

@grisuthedragon
Contributor Author

grisuthedragon commented May 18, 2017

I think having a generic OpenBLAS that does not rely on any platform-specific optimizations, i.e. is written in pure (ANSI/ISO) C, would be preferable for finding errors. Regardless of the problem I currently have with the POWER architecture, this would allow us to run OpenBLAS on arbitrary platforms that only provide a C/Fortran compiler. On the one hand we could check whether the general OpenBLAS concept works and search for errors in that part; on the other hand, once we know that the general concept works, we could enable the optimized kernels step by step to figure out which one causes the problems. I think that is a more systematic way to find these complicated errors.

If there are no other race conditions in OpenBLAS, then my code should run correctly with the unoptimized routines. As I said, it works on x86-64 with OpenBLAS/MKL/RefBLAS/ATLAS and on ppc64le with RefBLAS/ESSL, so the bug most likely lies inside the ppc64-specific routines.

@brada4
Contributor

brada4 commented May 18, 2017

By the way, you could exclude just the three routines the tests show as defective, not the others...

I ran your sample on Sandy Bridge and Broadwell 10-core CPUs; for me LAPACK is a tiny bit faster than the tiled routine, but it slows down a bit as the thread count is increased past some threshold (faster until 3-4 threads, then slightly down) - can you check for the same effect once you get the accuracy under control?

@grisuthedragon
Contributor Author

@brada4
On my workstation (Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz, 4 cores, gcc 5.4.1) I get:

OMP_NUM_THREADS=4 ./a.out 5120 5120  256 32
LAPACK   -- Time:    5.515    Gflops/s:   32.447
Tiled    -- Time:    3.421    Gflops/s:   52.311
 
Max diff between R from tiled   and LAPACK is: 0.2E-16

@brada4
Contributor

brada4 commented May 19, 2017

I suspect this is in one way or another similar to the performance issue in #1051. Try perf record + perf report; I see sched_yield + schedule taking a larger percentage of CPU time the more CPUs you have (it still looks almost optimal at 4 cores, but gets worse after that).

@grisuthedragon
Contributor Author

grisuthedragon commented May 19, 2017

@brada4 Performance is not the topic of this issue. In order to obtain good performance with the tile-QR approach one has to change the data layout of the matrices, and the code presented above was only meant to reproduce the issue. Once I get correct results on the 20-core POWER server, I will take a look at the performance.

@martin-frbg
Collaborator

So could you give an update on the actual issue please - do you still see your test failing on ppc with everything replaced by generic functions, or did you not get around to switching out the individual xGEMM etc. implementations yet? (And I assume you are still building with the USE_SIMPLE_THREADED_LEVEL3=1 option to work around the original issue, though I did not see this mentioned in your recent messages?)

@grisuthedragon
Contributor Author

I have started a new, clean issue as #1191.
