
18 "FATAL ERROR" messages compiling v0.3.7 in a Dockerfile #2244

Closed
1fish2 opened this issue Aug 31, 2019 · 21 comments
Labels
Bug in other software (Compiler, Virtual Machine, etc. bug affecting OpenBLAS)

Comments

@1fish2 commented Aug 31, 2019

Here's the first part of my Dockerfile:

FROM python:2.7.16

RUN apt-get update \
    && apt-get install -y swig gfortran llvm cmake ncurses-dev \
        libreadline7 libreadline-dev nano

RUN (mkdir -p openblas && cd openblas \
    && curl -SL https://github.com/xianyi/OpenBLAS/archive/v0.3.5.tar.gz | tar -xz \
    && cd OpenBLAS* \
    && make FC=gfortran \
    && make PREFIX=/usr/local install) \
    && rm -r openblas

Assuming you have Docker Desktop installed, run

docker build .

Everything's great with openblas v0.3.5.

With v0.3.6 or v0.3.7, the tests print 18 "FATAL ERROR" messages. Excerpts:

...

 DSYMM  PASSED THE TESTS OF ERROR-EXITS

 ******* FATAL ERROR - COMPUTED RESULT IS LESS THAN HALF ACCURATE *******
           EXPECTED RESULT   COMPUTED RESULT
       1      0.710259         -0.384470
      THESE ARE THE RESULTS FOR COLUMN   1
 ******* DSYMM  FAILED ON CALL NUMBER:
    418: DSYMM ('R','U',  1, 31, 1.0, A, 32, B,  2, 0.0, C,  2)    .

 DTRMM  PASSED THE TESTS OF ERROR-EXITS

 ******* FATAL ERROR - COMPUTED RESULT IS LESS THAN HALF ACCURATE *******
           EXPECTED RESULT   COMPUTED RESULT
       1      0.306693         -0.776846E-03
      THESE ARE THE RESULTS FOR COLUMN   1
 ******* DTRMM  FAILED ON CALL NUMBER:
    830: DTRMM ('R','U','N','U',  1, 31, 1.0, A, 32, B,  2)        .

 DTRSM  PASSED THE TESTS OF ERROR-EXITS

 ******* FATAL ERROR - COMPUTED RESULT IS LESS THAN HALF ACCURATE *******
           EXPECTED RESULT   COMPUTED RESULT
       1      0.106893          0.201045
       2      0.336663          0.879989
       3      0.266733          0.664376
       4      0.226773          0.699590E-01
       5      0.256743          0.274558
       6     -0.129870E-01       1.25768
       7      0.416583           1.74596
       8     -0.452547         -0.881632
       9     -0.929071E-01     -0.207647
      10      0.136863          0.141120
      11      0.669331E-01     -0.210364
      12     -0.302697         -0.481668
      13      0.269730E-01     -0.332602
      14     -0.212787          0.336922
      15      0.216783          0.479765
      16      0.346653           1.56915
      17      0.176823         -0.592299E-01
      18     -0.292707          0.383843E-01
      19     -0.132867          0.404008E-03
      20      0.496503          0.797692
      21     -0.172827          0.674353E-01
      22     -0.142857         -0.205278
      23     -0.412587         -0.540019
      24      0.146853          0.278119
      25     -0.229770E-01     -0.229770E-01
      26     -0.492507         -0.492507
      27     -0.262737         -0.262737
      28     -0.332667         -0.332667
      29     -0.372627         -0.372627
      30     -0.342657         -0.342657
      31      0.386613          0.386613
 ******* DTRSM  FAILED ON CALL NUMBER:
   2234: DTRSM ('L','U','N','U', 31,  1, 1.0, A, 32, B, 32)        .

...

 ZGEMM  PASSED THE COMPUTATIONAL TESTS ( 17496 CALLS)

 ******* FATAL ERROR - COMPUTED RESULT IS LESS THAN HALF ACCURATE *******
                       EXPECTED RESULT                    COMPUTED RESULT
       1  (   0.103566    ,  -0.748033E-01)  (  -0.104710E-01,   0.766885E-01)
      THESE ARE THE RESULTS FOR COLUMN   1
 ******* ZHEMM  FAILED ON CALL NUMBER:
    382: ZHEMM ('R','U',  1,  7,( 1.0, 0.0), A,  8, B,  2,( 0.0, 0.0), C,  2)    .

 ******* FATAL ERROR - COMPUTED RESULT IS LESS THAN HALF ACCURATE *******
                       EXPECTED RESULT                    COMPUTED RESULT
       1  (   0.373806    ,  -0.821975E-01)  (  -0.135625    ,  -0.115469E-01)
      THESE ARE THE RESULTS FOR COLUMN   1
 ******* ZSYMM  FAILED ON CALL NUMBER:
    382: ZSYMM ('R','U',  1,  7,( 1.0, 0.0), A,  8, B,  2,( 0.0, 0.0), C,  2)    .

 ******* FATAL ERROR - COMPUTED RESULT IS LESS THAN HALF ACCURATE *******
                       EXPECTED RESULT                    COMPUTED RESULT
       1  (   0.642930    ,   0.160609E-01)  (   0.301080    ,  -0.203213E-01)
      THESE ARE THE RESULTS FOR COLUMN   2
 ******* ZTRMM  FAILED ON CALL NUMBER:
    758: ZTRMM ('R','U','N','U',  1,  7,( 1.0, 0.0), A,  8, B,  2)               .

 ******* FATAL ERROR - COMPUTED RESULT IS LESS THAN HALF ACCURATE *******
                       EXPECTED RESULT                    COMPUTED RESULT
       1  (   0.186813    ,  -0.156843    )  (   0.282561E-01,  -0.104668    )
      THESE ARE THE RESULTS FOR COLUMN   1
 ******* ZTRSM  FAILED ON CALL NUMBER:
    764: ZTRSM ('R','U','T','U',  1,  7,( 1.0, 0.0), A,  8, B,  2)               .

...

 ******* FATAL ERROR - COMPUTED RESULT IS LESS THAN HALF ACCURATE *******
           EXPECTED RESULT   COMPUTED RESULT
       1      0.710259         -0.384470
      THESE ARE THE RESULTS FOR COLUMN   1
 ******* DSYMM  FAILED ON CALL NUMBER:
    418: DSYMM ('R','U',  1, 31, 1.0, A, 32, B,  2, 0.0, C,  2)    .

 DTRMM  PASSED THE TESTS OF ERROR-EXITS

 ******* FATAL ERROR - COMPUTED RESULT IS LESS THAN HALF ACCURATE *******
           EXPECTED RESULT   COMPUTED RESULT
       1      0.306693          0.209034
      THESE ARE THE RESULTS FOR COLUMN   1
 ******* DTRMM  FAILED ON CALL NUMBER:
    830: DTRMM ('R','U','N','U',  1, 31, 1.0, A, 32, B,  2)        .

 DTRSM  PASSED THE TESTS OF ERROR-EXITS

 ******* FATAL ERROR - COMPUTED RESULT IS LESS THAN HALF ACCURATE *******
           EXPECTED RESULT   COMPUTED RESULT
       1      0.106893          0.201045
       2      0.336663          0.879989
       3      0.266733          0.664376
       4      0.226773          0.699590E-01
       5      0.256743          0.274558
       6     -0.129870E-01       1.25768
       7      0.416583           1.74596
       8     -0.452547         -0.881632
       9     -0.929071E-01     -0.207647
      10      0.136863          0.141120
      11      0.669331E-01     -0.210364
      12     -0.302697         -0.481668
      13      0.269730E-01     -0.332602
      14     -0.212787          0.336922
      15      0.216783          0.479765
      16      0.346653           1.56915
      17      0.176823         -0.592299E-01
      18     -0.292707          0.383843E-01
      19     -0.132867          0.404008E-03
      20      0.496503          0.797692
      21     -0.172827          0.674353E-01
      22     -0.142857         -0.205278
      23     -0.412587         -0.540019
      24      0.146853          0.278119
      25     -0.229770E-01     -0.229770E-01
      26     -0.492507         -0.492507
      27     -0.262737         -0.262737
      28     -0.332667         -0.332667
      29     -0.372627         -0.372627
      30     -0.342657         -0.342657
      31      0.386613          0.386613
 ******* DTRSM  FAILED ON CALL NUMBER:
   2234: DTRSM ('L','U','N','U', 31,  1, 1.0, A, 32, B, 32)        .

...

 ******* FATAL ERROR - COMPUTED RESULT IS LESS THAN HALF ACCURATE *******
                       EXPECTED RESULT                    COMPUTED RESULT
       1  (    1.04432    ,   0.519497    )  (   0.880628    ,  -0.222493    )
      THESE ARE THE RESULTS FOR COLUMN   1
 ******* ZHEMM  FAILED ON CALL NUMBER:
    418: ZHEMM ('R','U',  1, 31,( 1.0, 0.0), A, 32, B,  2,( 0.0, 0.0), C,  2)    .

 ******* FATAL ERROR - COMPUTED RESULT IS LESS THAN HALF ACCURATE *******
                       EXPECTED RESULT                    COMPUTED RESULT
       1  (   0.238014    ,   0.879398    )  (   0.100946    ,    1.01861    )
      THESE ARE THE RESULTS FOR COLUMN   1
 ******* ZSYMM  FAILED ON CALL NUMBER:
    418: ZSYMM ('R','U',  1, 31,( 1.0, 0.0), A, 32, B,  2,( 0.0, 0.0), C,  2)    .

 ******* FATAL ERROR - COMPUTED RESULT IS LESS THAN HALF ACCURATE *******
                       EXPECTED RESULT                    COMPUTED RESULT
       1  (   0.642930    ,   0.160609E-01)  (   0.301080    ,  -0.203213E-01)
      THESE ARE THE RESULTS FOR COLUMN   2
 ******* ZTRMM  FAILED ON CALL NUMBER:
    758: ZTRMM ('R','U','N','U',  1,  7,( 1.0, 0.0), A,  8, B,  2)               .

 ******* FATAL ERROR - COMPUTED RESULT IS LESS THAN HALF ACCURATE *******
                       EXPECTED RESULT                    COMPUTED RESULT
       1  (   0.186813    ,  -0.156843    )  (   0.282561E-01,  -0.104668    )
      THESE ARE THE RESULTS FOR COLUMN   1
 ******* ZTRSM  FAILED ON CALL NUMBER:
    764: ZTRSM ('R','U','T','U',  1,  7,( 1.0, 0.0), A,  8, B,  2)               .

...

 cblas_dtrmm  PASSED THE TESTS OF ERROR-EXITS

 ******* FATAL ERROR - COMPUTED RESULT IS LESS THAN HALF ACCURATE *******
           EXPECTED RESULT   COMPUTED RESULT
       1     -0.108872          0.269345
       2     -0.395857         -0.166861
       3     -0.268112          0.103559
       4     -0.447473          0.405788E-02
       5      0.948133E-01     -0.593812E-03
       6      0.373093E-01     -0.610229E-01
       7     -0.585578E-01      0.258616
       8      0.251652         -0.961536E-01
       9      0.256743          0.256743
 ******* cblas_dtrmm  FAILED ON CALL NUMBER:
 ******* cblas_dtrmm  FAILED ON CALL NUMBER:

 ******* FATAL ERROR - TESTS ABANDONED *******
OPENBLAS_NUM_THREADS=2 ./xccblat3 < cin3
 TESTS OF THE COMPLEX          LEVEL 3 BLAS

...

 cblas_zhemm  PASSED THE TESTS OF ERROR-EXITS

 ******* FATAL ERROR - COMPUTED RESULT IS LESS THAN HALF ACCURATE *******
                       EXPECTED RESULT                    COMPUTED RESULT
       1  (    1.13123    ,  -0.239697    )  (  -0.704381    ,  -0.298337    )
      THESE ARE THE RESULTS FOR COLUMN   1
 ******* cblas_zhemm  FAILED ON CALL NUMBER:
    490: cblas_zhemm ( CblasColMajor,    CblasRight,    CblasUpper,
            1, 35, ( 1.0, 0.0), A, 36, B,  2, ( 0.0, 0.0), C,  2).
 ******* cblas_zhemm  FAILED ON CALL NUMBER:
    289: cblas_zhemm ( CblasRowMajor,     CblasLeft,    CblasUpper,
            1,  1, ( 0.0, 0.0), A,  2, B,  2, ( 0.0, 0.0), C,  2).

 ******* FATAL ERROR - TESTS ABANDONED *******
make[1]: Leaving directory '/openblas/OpenBLAS-0.3.7/ctest'

...
@martin-frbg (Collaborator)

What hardware and operating system is your docker running on?

@1fish2 (Author) commented Aug 31, 2019

I'm running this on macOS Mojave 10.14.6 on a 2018 MacBook Pro.

Docker is supposed to isolate programs from all of that, yes? Or are there holes in that abstraction?

@1fish2 (Author) commented Aug 31, 2019

Inside the container, more /etc/os-release reports

PRETTY_NAME="Debian GNU/Linux 10 (buster)"
NAME="Debian GNU/Linux"
VERSION_ID="10"
VERSION="10 (buster)"
VERSION_CODENAME=buster
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"

@martin-frbg (Collaborator)

We have another open issue (#2194) where a docker container under OSX behaves differently from both a native Linux installation and a (probably identical) docker container under Linux. (Though in that case, 0.3.5 worked in the OSX case as well.)

@1fish2 (Author) commented Aug 31, 2019

Oh, man, another leaky abstraction to contend with!

It must be really hard maintaining OpenBLAS across hardware, OS, environment, and compiler variations. Props to the team!

@martin-frbg (Collaborator)

Unfortunately I do not have a Mac to try and bisect this in case there was really some change in 0.3.6 responsible for this effect. The two OSX builds in our Travis CI setup do pass though...

@brada4 (Contributor) commented Aug 31, 2019

Try setting the build variables NO_AVX2=1 and, if that is not enough, NO_AVX=1.

It could be that the virtualisation layer between linux and osx breaks one of those instruction sets (also check /proc/cpuinfo inside the container if possible).

EDIT: less mature virtualizers sometimes fail to mask off advanced ISA bits in the virtual machine's CPUID. For example, the early Windows Subsystem for Linux had no AVX(1) support even while it appeared in CPUID....
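As a quick check of what the virtualized CPUID advertises, /proc/cpuinfo can be inspected from inside the container. Here is a minimal Python 3 sketch (the helper name `cpu_flags` is just for illustration; the flag names are the standard Linux ones):

```python
# Show which AVX-family feature bits the (possibly virtualized) CPU
# advertises via CPUID, as exposed in /proc/cpuinfo on Linux.
def cpu_flags(path="/proc/cpuinfo"):
    with open(path) as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()  # no flags line found (e.g. non-x86)

flags = cpu_flags()
for isa in ("avx", "avx2", "avx512f"):
    print(isa, "advertised" if isa in flags else "NOT advertised")
```

If `avx2` is advertised here but the hypervisor does not actually preserve AVX2 register state, the OpenBLAS runtime detection would pick AVX2 kernels and silently compute garbage.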

@martin-frbg (Collaborator)

@brada4 not sure how that would explain 0.3.5 still working, though?

@brada4 (Contributor) commented Aug 31, 2019

It would be nice to see the compiler warnings, and to check gfortran's side for ABI breakage.... That is partly addressed in 0.3.7.
My rough idea is: if AVX state is not saved, then with power connected all cores are likely always up, so a faulty context switch and back may still preserve the registers; but with power disconnected, or the battery below half, cores would get parked and unparked, causing more AVX state corruption.
Running other AVX software at the same time, like playing video in the browser or using some native math package, would also bring the failure closer.

@1fish2 (Author) commented Aug 31, 2019

/proc/cpuinfo inside the container running on the Mac reports this six times (once per processor):

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 158
model name	: Intel(R) Core(TM) i9-8950HK CPU @ 2.90GHz
stepping	: 10
cpu MHz		: 2900.000
cache size	: 12288 KB
physical id	: 0
siblings	: 1
core id		: 0
cpu cores	: 1
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 22
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht pbe syscall nx pdpe1gb lm constant_tsc rep_good nopl xtopology nonstop_tsc pni pclmulqdq dtes64 ds_cpl ssse3 sdbg fma cx16 xtpr pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch kaiser fsgsbase bmi1 hle avx2 bmi2 erms rtm xsaveopt arat
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds
bogomips	: 5808.00
clflush size	: 64
cache_alignment	: 64
address sizes	: 39 bits physical, 48 bits virtual
power management:

BTW Docker is set to allot 6 of 12 hyperthreaded "CPUs" to a container.

@brada4 (Contributor) commented Aug 31, 2019

Looks like NO_AVX2=1 will be needed; at least I could not find any sign of AVX2 support in xhyve:
https://github.com/machyve/xhyve/blob/1f1dbe3059904f885e4ab2b3328f4bb350ea5c37/src/vmm/vmm_host.c

@1fish2 (Author) commented Aug 31, 2019

Experimental confirmation:

  • Compiling v0.3.7 from source on Mac outside of Docker via make FC=gfortran prints no fatal errors.

  • Building the Dockerfile (as above except with v0.3.7.tar.gz) on Linux on Google Cloud Build via

     gcloud builds submit --timeout=2h --tag gcr.io/$(gcloud config get-value core/project)/issue-2244
    

    prints no fatal errors.

Thinking about @brada4's hypothesis and Docker allocating half of the hardware CPU threads, does Docker tie down a fixed group of cores or could there be faulty context switches between cores? Just brainstorming hypotheses here beyond my expertise.

@brada4 (Contributor) commented Aug 31, 2019

Could you check (by passing NO_AVX2=1 to the OSX docker build) whether the xhyve virtualisation layer fails to preserve the processor's AVX2 registers even though the virtualised CPUID claims support for them?
i.e. wherever you have `make', add the parameter:

    && make NO_AVX2=1 FC=gfortran \
    && make NO_AVX2=1 PREFIX=/usr/local install) \

Docker allocating half of the hardware CPU threads

No, that happens at a lower level, in the kernel; OpenBLAS is a userspace library.

switches between cores?

AVX2 is a newer instruction set introduced with Haswell CPUs, an improvement over the original AVX introduced with Sandy Bridge; it does not mean 2 cores with AVX.

@1fish2 (Author) commented Aug 31, 2019

The test result is: With v0.3.7.tar.gz and the make NO_AVX2=1 substitution, docker build . completes with no fatal errors!

(Nor any other errors that I can see. There are build warnings.)

@brada4 (Contributor) commented Aug 31, 2019

Since we have no Mac here, could you please report a bug to xhyve? Essentially everything compiled with -march=native, and most video codecs and most math apps, will fail the same way.
To serve OpenBLAS it is sufficient to mask out any CPUID bits whose effect on CPU registers exceeds original AVX, an extension they do seem to support.
A proper, stable, complete and safe implementation is at their discretion.
Please link this issue so we can add their fixed release number to our documents/FAQs.

@brada4 (Contributor) commented Aug 31, 2019

There are build warnings.

A small amount is expected; you probably see them in the same places in the CI logs.

@1fish2 (Author) commented Aug 31, 2019

I'm happy to do that, but I'm not sure how to write a clear Issue. I'll at-reference you for more details, or alternatively you could file the Issue and at-reference me.

Q. Do you know why v0.3.6 and v0.3.7 have this symptom but v0.3.5 doesn't?
Q. Is make NO_AVX2=1 a good workaround or are we better off staying on v0.3.5 for now?

@brada4 (Contributor) commented Aug 31, 2019

Just link my comment #2244 (comment). It has enough detail.

A1: It was just a wild guess, suspecting the virtualisation layer on first encounter.

A2: Yes, if from inside the docker container you can tell apart xhyve, a hypothetical later xhyve fixed to support AVX2, and Linux docker (which has had AVX2 support since inception); otherwise, applying the option everywhere loses 20-30% performance on real AVX2 CPUs, i.e. most of those made in the last 5-some years.
0.3.5 does not detect your CPU correctly, so it falls back to very basic SSE3 as in the first x86_64 CPUs, forgoing even more of the performance gained from 15 years of newer instructions.
In short: the degraded versions will work, but may not meet performance expectations.

@brada4 (Contributor) commented Aug 31, 2019

Thanks, I added the missing detail.

@martin-frbg added the "Bug in other software (Compiler, Virtual Machine, etc. bug affecting OpenBLAS)" label on Sep 5, 2019
1fish2 added a commit to CovertLab/wcEcoli that referenced this issue May 18, 2020
In case the bug fixes matter, update OpenBLAS to 0.3.9 inside the Docker container and in the create-pyenv instructions.

I retested the problem with AVX2 instructions in Docker Desktop for Mac (OpenMathLib/OpenBLAS#2244) with the latest OpenBLAS and Docker Desktop and filed the Issue in the Docker repo this time, docker/for-mac#4576

On macOS outside of Docker, `brew install openblas` now installs 0.3.9 . We no longer need to compile it from source.

Use Python 3.8.3 in the test workflows.
@1fish2 (Author) commented Sep 22, 2020

After a year there are no comments or fixes for xhyve Issue 171 or docker-for-mac Issue 4576.

Unexpectedly, building (make && make install) OpenBLAS 0.3.10 with NO_AVX2=1 is only needed when building in Docker-for-Mac. That is, building OpenBLAS with NO_AVX2=0 in Docker-for-Linux produces a Docker image that runs fine in Docker-for-Mac (contrary to my interpretation of the previous posts, above), and my attempts to run the OpenBLAS self-tests didn't print FATAL ERROR - COMPUTED RESULT IS LESS THAN HALF ACCURATE, although I did not figure out how to run as many self-tests as make does when building it.

Curiouser, OpenBLAS 0.3.10 built with and without NO_AVX2=1 produces different results in a 15-minute computation that calculates parameters for our bio cell simulation. This happens whether we build in Docker-for-Linux, on Mac outside of Docker, or (IIRC) on Linux outside of Docker. The OpenBLAS embedded in NumPy and SciPy matches the NO_AVX2=1 results within Docker, matches the NO_AVX2=0 case on Mac outside of Docker, and gets completely different results on Linux outside of Docker. I have not tried to localize these differences from the large computation down to a minimal test case.

We run with OPENBLAS_NUM_THREADS=1 to avoid some small result differences and significant slowdowns. We do have a minimal test case for those result differences -- a dot product of two large vectors. My hypothesis is those result differences come from dividing the data into different size blocks, yielding different floating point roundoffs.
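That blocking hypothesis can be illustrated outside OpenBLAS with a short NumPy sketch. The block sizes below are arbitrary choices for illustration, not what any BLAS kernel actually uses:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)
y = rng.standard_normal(1_000_000)

def blocked_dot(x, y, block):
    """Sum per-block partial dot products in order, as a blocked kernel might."""
    return sum(float(np.dot(x[i:i + block], y[i:i + block]))
               for i in range(0, len(x), block))

a = blocked_dot(x, y, 1024)
b = blocked_dot(x, y, 4096)
# Different blockings accumulate roundoff in a different order, so the two
# results typically agree only to roughly 1e-10 absolute, not bit-for-bit.
print(abs(a - b))
```

The same effect applies when a threaded BLAS splits the vectors across cores, which is consistent with pinning OPENBLAS_NUM_THREADS=1 making those small differences go away.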

Summary: We can install OpenBLAS

  • v0.3.9 (the version embedded in NumPy and SciPy), or v0.3.10
  • with NO_AVX2=1, or without NO_AVX2=1, or embedded in NumPy and SciPy
  • inside or outside of Docker
  • along with NumPy 1.19.1 or 1.19.2
  • then run on Mac or Linux.

Any suggestions on how to install OpenBLAS in a way that will get consistent results cross-platform, rather than at least 7 equivalence classes of results? The installation instructions needn't be the same outside Docker vs. in the Dockerfile.
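One knob worth knowing about here (not discussed in this thread, so treat this as an untested suggestion): OpenBLAS can be pinned to a single fixed kernel target at build time with TARGET= and DYNAMIC_ARCH=0, which removes runtime CPU detection as a variable at the cost of performance on newer CPUs. In the Dockerfile that would look like:

    && make TARGET=NEHALEM DYNAMIC_ARCH=0 FC=gfortran \
    && make TARGET=NEHALEM DYNAMIC_ARCH=0 PREFIX=/usr/local install) \

Combined with OPENBLAS_NUM_THREADS=1 this fixes both the kernel selection and the threading, though it still would not guarantee bit-identical results against the OpenBLAS builds embedded in NumPy/SciPy wheels.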

@martin-frbg (Collaborator)

Added a warning to the FAQ section in the wiki, as there has been no activity on the xhyve issue tracker for the past 3 years.
