Incorrect results on s390x (-march=zEC12 -mtune=z13, ZARCH_GENERIC) #1743

Closed
jamesjer opened this issue Aug 28, 2018 · 10 comments · Fixed by #1745

@jamesjer

I maintain the fflas-ffpack package for the Fedora Linux distribution. There is currently a push in Fedora to migrate from atlas and the reference blas implementation to openblas. However, the fflas-ffpack test suite failed on s390x when built with openblas. The issue is tracked here: https://bugzilla.redhat.com/show_bug.cgi?id=1619074.

I found that the openblas test suite itself reported multiple failures when built on s390x, but did not return a nonzero exit code; the issue was therefore overlooked as the openblas build did not fail. The openblas test failures can be seen here: https://kojipkgs.fedoraproject.org//packages/openblas/0.3.2/3.fc29/data/logs/s390x/build.log. Here is the first error in the logs:

OPENBLAS_NUM_THREADS=1 OMP_NUM_THREADS=1 ./sblat3 < ./sblat3.dat
 TESTS OF THE REAL             LEVEL 3 BLAS
 THE FOLLOWING PARAMETER VALUES WILL BE USED:
   FOR N                   0     1     2     3     7    31
   FOR ALPHA             0.0   1.0   0.7
   FOR BETA              0.0   1.0   1.3
 ROUTINES PASS COMPUTATIONAL TESTS IF TEST RATIO IS LESS THAN   16.00
 RELATIVE MACHINE PRECISION IS TAKEN TO BE  1.2E-07
 SGEMM  PASSED THE TESTS OF ERROR-EXITS
 SGEMM  PASSED THE COMPUTATIONAL TESTS ( 17496 CALLS)
 SSYMM  PASSED THE TESTS OF ERROR-EXITS
 SSYMM  PASSED THE COMPUTATIONAL TESTS (  1296 CALLS)
 STRMM  PASSED THE TESTS OF ERROR-EXITS
 ******* FATAL ERROR - COMPUTED RESULT IS LESS THAN HALF ACCURATE *******
           EXPECTED RESULT   COMPUTED RESULT
       1      0.186813          0.373626    
 ******* STRMM  FAILED ON CALL NUMBER:
    506: STRMM ('L','U','N','U',  1,  1, 1.0, A,  2, B,  2)        .
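Since the tester prints failures like the one above without changing its exit status, a build script could scan the output itself. The following is a minimal sketch, not part of OpenBLAS or the Fedora packaging; the function names `scan_blas_log` and `exit_status` are hypothetical, and the row pattern assumes real-valued result rows formatted as in the excerpt above.

```python
import re

# Marker text copied from the tester output; one such line per failed call.
FAILURE_MARKER = "FATAL ERROR"

# A result row looks like: "       1      0.186813          0.373626"
NUM = r"[-+]?\d*\.\d+(?:[eE][-+]?\d+)?"
RESULT_ROW = re.compile(r"^\s*\d+\s+(" + NUM + r")\s+(" + NUM + r")\s*$")


def scan_blas_log(lines):
    """Return (failure_count, ratios) for an iterable of log lines.

    ratios holds computed/expected for every result row whose
    expected value is nonzero.
    """
    failures = 0
    ratios = []
    for line in lines:
        if FAILURE_MARKER in line:
            failures += 1
        match = RESULT_ROW.match(line)
        if match:
            expected, computed = (float(g) for g in match.groups())
            if expected != 0.0:
                ratios.append(computed / expected)
    return failures, ratios


def exit_status(lines):
    """Exit status a wrapper script could use: 1 if any failure was logged."""
    failures, _ = scan_blas_log(lines)
    return 1 if failures else 0


sample = [
    " STRMM  PASSED THE TESTS OF ERROR-EXITS",
    " ******* FATAL ERROR - COMPUTED RESULT IS LESS THAN HALF ACCURATE *******",
    "           EXPECTED RESULT   COMPUTED RESULT",
    "       1      0.186813          0.373626    ",
    " ******* STRMM  FAILED ON CALL NUMBER:",
]
failures, ratios = scan_blas_log(sample)
```

A wrapper could pass the captured tester output to `exit_status` and hand the result to `sys.exit`, so that a failed BLAS test also fails the package build.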

There are over 100 such failures. I have only looked at the first dozen or so, but in each case the computed result appears to be exactly double the expected result. Fedora packages built for s390x are built with gcc -march=zEC12 -mtune=z13. The openblas package selects TARGET=ZARCH_GENERIC.
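To make the failing call concrete, here is a minimal pure-Python sketch of the reference semantics of TRMM for SIDE='L', UPLO='U', TRANSA='N', that is B := ALPHA * A * B with A upper triangular. It is not OpenBLAS code: the function name is hypothetical and the stored diagonal value 0.5 is made up for illustration. With M = N = 1, ALPHA = 1.0 and DIAG='U', the diagonal of A counts as 1, so the call should leave B unchanged.

```python
def trmm_left_upper_notrans(alpha, a, b, unit_diag):
    """Reference B := alpha * A * B for upper-triangular A.

    a and b are lists of rows. When unit_diag is True the stored
    diagonal of A is never read and is treated as 1 (DIAG='U').
    """
    m = len(b)
    n = len(b[0]) if m else 0
    out = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            # Diagonal term: implicit 1 for a unit-diagonal matrix.
            diag = 1.0 if unit_diag else a[i][i]
            acc = diag * b[i][j]
            # Strictly upper part of row i.
            for k in range(i + 1, m):
                acc += a[i][k] * b[k][j]
            out[i][j] = alpha * acc
    return out


# Shape of the failing call: M = N = 1, ALPHA = 1.0, DIAG = 'U'.
# The stored diagonal (hypothetical 0.5) must be ignored.
b_in = [[0.186813]]
b_out = trmm_left_upper_notrans(1.0, [[0.5]], b_in, unit_diag=True)
```

Under these assumptions `b_out` equals `b_in`, so a computed 0.373626 for an expected 0.186813 means the kernel produced twice the value the reference semantics require.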

@brada4
Contributor

brada4 commented Aug 28, 2018

Could you please attach the linked build log before it disappears?

This looks like the same TLS problem being addressed in #1742.

@martin-frbg
Collaborator

I am not yet convinced that the test failures are caused by the new thread-local storage allocator, but could you try the current "develop" branch, where I just merged the aforementioned PR #1742 (which reverts to the old allocation code unless OpenBLAS is built with -DUSE_TLS)?
And do I understand correctly that this is an emulator, not actual zarch hardware?

@sharkcz
Contributor

sharkcz commented Aug 28, 2018

It's from real hardware (z13), but zEC12 gives the same error messages. It looks like a "feature" of the generic kernel. When the z13 kernel is used, there are no such errors (https://koji.fedoraproject.org/koji/buildinfo?buildID=1133326 built the z13 kernel by mistake). Fedora needs to stick to the generic kernel because we support running the distro on zEC12 or newer hardware.

@martin-frbg
Collaborator

Thanks - as the generic kernel is pure C it may be possible to reproduce this on more mundane hardware.

@martin-frbg
Collaborator

No issue building on x86_64 with KERNEL.ZARCH_GENERIC in place of its usual KERNEL.generic file, so it is unfortunately not that simple. Did earlier Fedora builds all use z13, or is this a recent failure with the generic kernel (which would indeed implicate the new memory.c as the most likely recent change)?

@sharkcz
Contributor

sharkcz commented Aug 28, 2018

AFAIK this is a long-standing issue.

@martin-frbg
Collaborator

xTRMM failing was once seen as fallout from #1419 (0.3.0), but that was reverted by #1564 in mid-May, well before 0.3.2. I see nothing suspicious in the build log.

@susilehtola
Contributor

susilehtola commented Aug 28, 2018

So the issue has indeed been around for a while: this is a 0.2.20 build log from January, when the builders were still z13.

https://kojipkgs.fedoraproject.org//packages/openblas/0.2.20/4.fc28/data/logs/s390x/build.log

@martin-frbg
Collaborator

Interesting, thanks. As far as I can tell, much the same setup is used for GEMM/TRMM on ARMV8, but with USE_TRMM=1 defined in kernel/Makefile.L3. (This is also set when CORE is Z13; it might be worthwhile to add it for ifeq($(CORE), ZARCH_GENERIC) as well.)

@sharkcz
Contributor

sharkcz commented Aug 28, 2018

With this patch:

diff --git a/kernel/Makefile.L3 b/kernel/Makefile.L3
index b37e536e..81ee93c1 100644
--- a/kernel/Makefile.L3
+++ b/kernel/Makefile.L3
@@ -20,6 +20,10 @@ ifeq ($(ARCH), arm64)
 USE_TRMM = 1
 endif
 
+ifeq ($(ARCH), zarch)
+USE_TRMM = 1
+endif
+
 ifeq ($(TARGET), LOONGSON3B)
 USE_TRMM = 1
 endif
@@ -44,10 +48,6 @@ ifeq ($(CORE), POWER8)
 USE_TRMM = 1
 endif
 
-ifeq ($(CORE), Z13)
-USE_TRMM = 1
-endif
-
 
 
 

I no longer see those "half accurate" errors.
