Race Condition in Multithreaded OpenBLAS on IBM OpenPower 8 #1071
I wonder if valgrind/helgrind is available for power8, or if you might have access to some comparable thread debugger ? I made an amateurish attempt to fix the races I came upon in #1052 but may well have missed something ifdef'd for non-x86 platforms (conversely you may even want to try a version preceding 87c7d10 of Jan 8 in case I managed to break something for you) |
Helgrind is available and I have set up a smaller example (solving for a 256x256 matrix) for this purpose; otherwise it takes days, and I noticed that even for this small problem I get wrong results in <10% of the cases. I compiled the current OpenBLAS again with debug symbols enabled and 20 OpenMP threads and get the following helgrind output: https://gist.github.com/grisuthedragon/43a36d248454fd02c52ea18aa1b2614f (it is too long to be posted here, hence the gist link). The error log also shows some entries from the HDF5 library, but the results are also wrong in examples where the data is not read via HDF5. PS: I also tried 87c7d10, but this does not change anything. Since I observe the same errors in PGI's OpenBLAS (released Nov 16, 2016), the change in 87c7d10 does not influence this behavior. I also tested on a 16-core x86-64 Haswell Xeon, and there everything works fine, at least as far as I can tell after 1000 runs of the code. |
Thanks for the helgrind.log - at first glance it looks as if at least part of the problem was in libgomp, but I now see from the helgrind manual that it simply cannot keep track of omp's threading primitives unless gcc including libgomp was specially compiled with the --disable-linux-futex option. Looking further, according to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55561 another option seems to be to compile everything with -fsanitize=thread to let gcc itself do the checking, but this again would also need a libgomp specifically recompiled with --disable-linux-futex (this bugzilla entry was last edited around the time of gcc 4.9 it seems, so it is unclear if your 5.4 contains any improvements over that stage). I am not sure how to proceed. |
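(For reference, the libgomp rebuild the helgrind manual asks for would look roughly like the following; the source path, install prefix and language list are assumptions, not taken from this thread.)

```
# Hypothetical gcc/libgomp rebuild so that helgrind can track OpenMP primitives:
../gcc-5.4.0/configure --prefix=$HOME/gcc-5.4-futexless \
    --disable-linux-futex --enable-languages=c,c++,fortran
make -j16 && make install
```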
IFF the reduced testcase runs quickly enough you could try commenting the DGEMVNKERNEL and/or DGEMMKERNEL line(s) in KERNEL.POWER8 to see if the issue crept in with more recent optimizations. |
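(Illustration of the kind of edit meant here; the exact kernel file names in kernel/power/KERNEL.POWER8 may differ from those shown.)

```
# Commenting out an optimized kernel line makes the build fall back to the
# default implementation for that routine:
#DGEMMKERNEL  = dgemm_kernel_16x4_power8.S
#DGEMVNKERNEL = dgemv_n.c
```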
helgrind points the finger at interface/lapack/laswp.c. The rest of the output looks like a heap of false positives, as in the gcc bugzilla.... |
@martin-frbg Unfortunately, the thread sanitizer is only available on x86_64 as far as I know. But I recompiled gcc with --disable-linux-futex and will do a helgrind run again. Commenting out the specialized kernels is on my to-do list as well. In general, if I remember correctly, I ran into a similar error on a Power 7 (big endian) two years ago, but there I stopped using OpenBLAS on that machine and switched to ATLAS. |
Turning on --disable-linux-futex slows down the computations enormously and makes helgrind run about 10x slower. But running the original large problem without helgrind gives the correct results (100 runs). Then I tried the following, this time with the compiler built with linux-futex enabled: I compiled OpenBLAS without LAPACK support and use Netlib LAPACK, to reduce the number of possible error sources a bit. This still does not yield correct results, but I can conclude that the dlaswp function seems to be ok. On top of this setup I took a look at how the GEMM call is done and noticed that the old threading model can still be turned on. So I compiled OpenBLAS with
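```
# (The exact build line is not preserved in this copy of the thread; given the
#  flags discussed here and below, presumably something along these lines.)
make clean
make USE_OPENMP=1 NO_LAPACK=1 USE_SIMPLE_THREADED_LEVEL3=1
```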
Now, I get the correct results as well (checked for 1000 runs of the benchmark). From this point of view the problem must be somewhere inside the new GEMM threading in combination with OpenMP. I will not say GNU OpenMP, because the PGI version does the same strange things. PS: The OpenBLAS version delivered with the 2016.10 community edition of the PGI compiler suite is 0.2.18. |
Thanks for the extensive testing. Now you seem to have pushed the problem back into generic code when I was hoping that it was something specific to the POWER platform ? But if it is indeed the gemm threading then hopefully the existing gemm checks and benchmarks are sufficient to expose the problem. (Not sure if I am able to tackle this problem though...) |
Hmm. A quick check on Haswell, running helgrind on "benchmark/dgemm.goto 100 100 1" indeed yields complaints about a race between two instances of inner_thread without any locks held. (One at line 433 of level3_thread.c called from blas_thread_server, the other at line 494 of the same file, called from dgemm_thread_nn via exec_blas). And they go away when USE_SIMPLE_THREADED_LEVEL3 is set. Food for thought at least... |
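(A rough sketch of how to repeat that check; the build options and the benchmark invocation below are assumptions based on the standard OpenBLAS tree, not copied from this thread.)

```
make USE_OPENMP=1 DEBUG=1          # threaded build with debug symbols
make -C benchmark dgemm.goto       # build the dgemm benchmark
OMP_NUM_THREADS=4 valgrind --tool=helgrind ./benchmark/dgemm.goto 100 100 1
```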
@martin-frbg the other martin's helgrind output looked like laswp reads past what it is expected to. |
@brada4 Not sure we can trust the original log where helgrind had no idea what OMP was doing behind its back. Also grisuthedragon has apparently narrowed it down to GEMM already by removing the OpenBLAS dlaswp from the equation, and a thread data race demonstrably occurs in level3_thread.c even on x86_64 when the more elaborate multithreading code is used. (It is possible that the likelihood of the race occurring depends on the implementation of YIELDING for the platform - at least I remember from #990 that doing away with sched_yield() on x86_64 seemed to increase the likelihood of lockups - hopefully now fixed through #995) |
So as I promised, the helgrind output of the small example with --disable-linux-futex in the compiler but Power8 optimizations enabled: https://gist.github.com/grisuthedragon/87df1ae18702fb30d1ee546c08085362 As in the small Haswell experiment of @martin-frbg, the inner_thread routine is involved. Due to the disabled futex, one run takes up to 6 hours -.- |
I looked through your log - it looks like all false positives, i.e. at best a tiny optimisation might be possible by getting the threads for the next step in the (in->kernel->out) processing path to start on warm data, but it does not explain the numerical mayhem. Let's try another thing - all 0s, all 1s, checkerboard (all combinations of those passed to dgemm). |
Wouldn't an error in stride calculation lead to rather obvious and unavoidable errors on any platform running more than a single thread, in contrast to thread data races where "merely" the likelihood of bad access patterns increases with the number of threads ? |
could be base 0 vs base 1 numbering etc.... You never know. |
So guys, I started looking into the gemm scheduler but I do not get how this thing really works. Is there any paper/preprint/whatever publication describing the "new" scheduling? I only found the papers of Goto and Robert van de Geijn about the optimizations of the kernels. |
I looked already but did not find any, nor was there any explanation to be found in the various old versions of libgoto I had archived. The gotoblas mailing list archive is gone from the utexas website and is not available at archive.org either. On arxiv.org I did come across a very recent whitepaper by Smith & van de Geijn (labeled FLAME working note 83) that touches on how Goto's algorithm tries to optimize cache use but it does not address the topic of parallelization at all. |
It does not look complex by any means: |
After a few weeks I ran into problems with this issue again. This time I have a piece of code which reproduces the error. The attached code below computes the QR decomposition of a matrix using the Tile-QR approach and is parallelized using OpenMP 4. Details about the algorithm can be found here. System Details:
For the first experiment I compiled the current OpenBLAS using
and compiled the attached code by
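```
# (The actual compile command is not preserved here; a plausible invocation,
#  assuming gfortran and the attached tileqr.f90, would be:)
gfortran -O2 -fopenmp tileqr.f90 -o tileqr -L/path/to/OpenBLAS -lopenblas
```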
As result for a test run I get:
which obviously did not return the correct result. Using a single threaded OpenBLAS compiled with
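```
# (Flags elided above; a single-threaded OpenBLAS is typically built with:)
make USE_THREAD=0
```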
I get
which is wrong as well. Running the same code with RefBLAS and RefLAPACK, the maximum difference is on the order of 10^(-16). The same holds true if I use IBM ESSL as the BLAS library together with RefLAPACK. From this point of view it seems that the OpenBLAS functions are not thread safe, even in single-threaded operation mode. tileqr.f90.txt |
It is an accuracy problem, not thread safety. |
@brada4 On the one hand it works on x86-64, and there with different OpenBLAS and MKL versions, so this cannot be an accuracy problem. On the other hand, I am computing a QR decomposition, which is built on top of orthogonal/unitary transformations, so from the numerical point of view the differences between LAPACK and the new implementation cannot be explained. Furthermore, that the single- and the multithreaded OpenBLAS results differ by more than 200 orders of magnitude cannot be explained by accuracy problems. Another argument against an accuracy problem is that if I set the number of threads to 1 at runtime I obtain the correct result (independent of the threading capabilities of OpenBLAS):
|
Possibly not a race this time but side effects of the allegedly broken assembly kernels for PPC (issue #1078) |
@martin-frbg I think that could be the only explainable reason at the moment, because on x86-64 (Sandybridge, Haswell) I never observed such behavior. I am not able to write the assembly stuff. The only thing I can do is perform some tests on the POWER platforms (Power 8 little endian / Power 7 big endian). |
In order to see from which optimized routine the errors are coming, I have the following idea. I replaced the
which unfortunately leads to some errors in the BLAS checking: (for double, complex and complex*16 the results are similar)
Level 1 and Level 2 operations are working without any error. If we get this working correctly, then I can plug each optimized routine back into KERNEL.POWER8 and we can see which one breaks at least my application. |
The stock KERNEL.POWER8 file has a number of commented-out entries for xSYMV functions near the end - maybe it would make sense to enable these (first?) to see if switching back to their generic implementations helps. |
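(Hedged illustration of what such entries look like; the exact lines in the stock KERNEL.POWER8 may differ.)

```
# Enabling entries of this form points the routines at the generic C sources:
SSYMV_U_KERNEL = ../generic/symv_U.c
SSYMV_L_KERNEL = ../generic/symv_L.c
DSYMV_U_KERNEL = ../generic/symv_U.c
DSYMV_L_KERNEL = ../generic/symv_L.c
```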
If I only enable the xSYMV routines from the generic implementation, all BLAS tests pass, but for my application this does not change anything. |
Getting the BLAS tests to pass would appear to be an important first step in any case. Alan Modra has fixed the inline assembly contained in local .c implementations, so the likely candidates for further tests are the implementations in the various power8 .S files - xGEMM, xTRSM, xTRMM - or do I take your reply to mean that switching everything back to the non-optimized defaults does not make your code run either ? |
I think having a generic OpenBLAS, which does not rely on any platform-specific optimizations, i.e. is written in pure (ANSI/ISO) C, would be preferable for finding errors. Regardless of the problem I have with the POWER architecture at the moment, this would allow us to run OpenBLAS on an arbitrary platform that only provides a C/Fortran compiler. On the one hand we could check whether the general OpenBLAS concept works and search for errors in this part; on the other hand, once we know that the general concept works, we could enable the optimized kernels step by step to figure out which one causes the problems. I think that is a more systematic way to find these complicated errors. If there are no other race conditions in OpenBLAS, then my code should run correctly with the unoptimized routines. As I said, on x86-64 with OpenBLAS/MKL/RefBLAS/ATLAS and on ppc64le with RefBLAS/ESSL it works, so most likely the bug is inside the ppc64-specific routines. |
You could exclude just the three routines the tests show as defective, not the others, btw... I ran your sample on Sandy Bridge and Broadwell 10-core CPUs; for me LAPACK is a tiny bit faster than the tiled routine, but it slows down a bit as the thread number is increased past some threshold (like faster until 3-4 threads, then slightly down) - can you check for the same effect once you manage to get the accuracy under control? |
@brada4
|
I suspect this is in one way or another similar to #1051 regarding the performance issue. Try perf record + perf report; I see sched_yield + schedule taking a larger percentage of CPU time the more CPUs you have (still looks almost optimal at 4 cores, but gets worse after that). |
@brada4 Performance is not the topic of this issue. But in order to obtain good performance using the Tile-QR approach one has to change the data layout of the matrices, and the code presented above was only for reproducing the issue. If I get correct results on the 20-core POWER server, then I will take a look at the performance. |
So could you update about the actual issue please - do you still see your test failing on ppc with everything replaced by generic functions, or did you not get around to switching out individual xGEMM etc implementations yet ? (And I assume you are still building with the |
I started a new clear issue as #1191. |
After the discussion of compiling the HPL on the Power 8 platform I tried several of my codes with OpenBLAS from the development branch on an IBM OpenPower 8.
The machine is an IBM Model 8335-GTB (2x10 Cores, 8-way SMT, 160 virtual cores) with CentOS7.3, gcc 5.4, glibc 2.17 and Kernel 4.8.
I compiled the current OpenBLAS development ( ab2033f ) version using
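```
# (The exact flags are not preserved in this copy of the issue; presumably an
#  OpenMP-threaded build along these lines:)
make USE_OPENMP=1
```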
and linked my code (Fortran, without any OpenMP parallelized parts). The code solves a Sylvester equation AXB+CXD=F, (A,B,C,D,F,X are 1024x1024 matrices) using some algorithm and it relies on the following BLAS operations (figured out using the profiling feature of OpenBLAS):
dlaswp (see below in the posts)
The code solves the equation by computing X 25 times and checks the forward error. When I execute the code with
OMP_NUM_THREADS=20
set to avoid over-subscription of the cores, more than half of the performed computations are wrong (ferr := || X_true - X_computed || / || X_true ||):
If I run the same code using
OMP_NUM_THREADS=1
or the reference BLAS implementation with OMP_NUM_THREADS unset (to ensure that really nothing in my code depends on threading), I obtain a forward error ferr of approx. 10^-12 for all runs of the benchmark. I already checked what happens if I restrict the number of threads to 20 at compile time (make ... NUM_THREADS=20) and this still yields wrong results, although more of the runs are correct. Disabling threading completely in OpenBLAS gives the correct results as well. From these observations I concluded that something goes wrong with the threading and there is a race condition.
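(In shell terms the comparison above is roughly the following; the binary name is made up for illustration, only the OMP_NUM_THREADS settings come from the description above.)

```
OMP_NUM_THREADS=20 ./sylvester_bench   # more than half of the 25 solves return a wrong X
OMP_NUM_THREADS=1  ./sylvester_bench   # ferr ~ 1e-12 for every run
```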
Interestingly, if I use the PGI for OpenPower compiler suite, which delivers a BLAS implementation based on OpenBLAS (I think a slightly modified one so that it can be compiled with the PGI compiler on the ppc64le architecture), the same error appears. This means the bug is not in the GNU OpenMP implementation, because PGI uses its own separate one.
Unfortunately, I do not have a minimal-not-working example for this bug yet, because the code mentioned above is part of current research.