Inverse tests are failing for MAGMA builds #539

Closed
mewall opened this issue Sep 2, 2021 · 4 comments · Fixed by #540

Comments

@mewall
Collaborator

mewall commented Sep 2, 2021

Describe the bug
Inverse tests fail when BML is built against MAGMA using the system modules on Summit and Spock.

To Reproduce
Steps to reproduce the behavior:

Log in to Summit

Modify scripts/build_olcf_summit_gnu_mgma_openblas.sh to load the revised gcc module:
module load gcc/9.1.0

Build against MAGMA:
bash scripts/build_olcf_summit_gnu_mgma_openblas.sh

Load the modules:
module load cmake
module load cuda
module load gcc/9.1.0
module load netlib-lapack
module load openblas
module load magma

Allocate a debug node:
bsub -q debug -P csc304 -W 00:30 -nnodes 1 -Is $SHELL

Run an inverse test:
(base) bash-4.4$ jsrun -n1 -a1 -g1 -c7 build/tests/C-tests/bml-test -n inverse -t dense -p single_real
inverse
N = 13
CUDA Hook Library: Failed to find symbol mem_find_dreg_entries, build/tests/C-tests/bml-test: undefined symbol: __PAMI_Invalidate_region
magma_queue_create
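
The module loads from the steps above could be wrapped in a small helper so that the build and the test run see the same environment. This is a minimal sketch, not a script from the BML repository: the names `BML_SUMMIT_MODULES` and `load_bml_modules` are hypothetical, and it assumes the `module` command (Environment Modules or Lmod) is available in the current shell.

```shell
# Hypothetical helper; the module list is copied from the steps above,
# in the same order.
BML_SUMMIT_MODULES="cmake cuda gcc/9.1.0 netlib-lapack openblas magma"

load_bml_modules() {
    # Fail early when the module command is not available (e.g. off-cluster).
    if ! command -v module >/dev/null 2>&1; then
        echo "module command not found" >&2
        return 1
    fi
    # Load each module in order and stop at the first failure.
    for m in $BML_SUMMIT_MODULES; do
        module load "$m" || { echo "failed to load $m" >&2; return 1; }
    done
}
```

Source the file and call `load_bml_modules` in the current shell (before building, and again inside the interactive job) rather than running it as a subprocess, because `module load` has to change the calling shell's environment.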

Expected behavior
The test should pass.

Screenshots
N/A

Desktop (please complete the following information):
Fails on OLCF Summit and Spock.

@nicolasbock
Collaborator

I found a potentially relevant bug report at olcf/olcf-user-docs#78. Could you check whether that solves it?

@jeanlucf22
Collaborator

Thanks @nicolasbock, that seems to fix the issue on Summit. I will push a PR with a new build script for Summit.

@mewall
Collaborator Author

mewall commented Sep 3, 2021

Here is the error message on Spock:

AA^{-1}:
48125112740076505563008870422958571520.000 0.045 0.091 0.128 -0.050 -0.054 -0.117 0.004 0.007 -0.076 -0.016 -0.058 0.056
2228057713808967300907821900194906112.000 1.020 0.040 0.056 -0.022 -0.024 -0.051 0.002 0.003 -0.033 -0.007 -0.026 0.025
831977549091652779792945609428697088.000 0.007 1.015 0.021 -0.008 -0.009 -0.019 0.001 0.001 -0.012 -0.003 -0.010 0.009
2057876571466277674928940617400582144.000 0.018 0.037 1.052 -0.020 -0.022 -0.047 0.002 0.003 -0.031 -0.006 -0.024 0.023
5385257365250968479796922994439225344.000 0.048 0.097 0.136 0.946 -0.057 -0.124 0.004 0.007 -0.081 -0.017 -0.062 0.060
4431442827073042515916963091727777792.000 0.039 0.080 0.112 -0.044 0.953 -0.102 0.004 0.006 -0.066 -0.014 -0.051 0.049
5423184837384496989375003605992472576.000 0.048 0.098 0.137 -0.054 -0.058 0.875 0.004 0.007 -0.081 -0.017 -0.062 0.060
2983457520106945427992740385058193408.000 0.026 0.054 0.075 -0.030 -0.032 -0.069 1.002 0.004 -0.045 -0.009 -0.034 0.033
2042563193759195635230185255591739392.000 0.018 0.037 0.052 -0.020 -0.022 -0.047 0.002 1.003 -0.031 -0.006 -0.023 0.023
1362710292632419067556864294956040192.000 0.012 0.025 0.034 -0.014 -0.014 -0.031 0.001 0.002 0.980 -0.004 -0.016 0.015
3654926655532787571990587711957237760.000 0.032 0.066 0.092 -0.036 -0.039 -0.084 0.003 0.005 -0.055 0.988 -0.042 0.041
842216521446021217197509668306419712.000 0.007 0.015 0.021 -0.008 -0.009 -0.019 0.001 0.001 -0.013 -0.003 0.990 0.009
5150532208034408268554085197064175616.000 0.046 0.093 0.130 -0.051 -0.055 -0.118 0.004 0.007 -0.077 -0.016 -0.059 1.057
[/autofs/nccs-svm1_home1/mewall/packages/bml/tests/C-tests/inverse_matrix_typed.c:49] Error in matrix inverse; ssum(A A_inverse) = inf
Obtained 8 stack frames.
/autofs/nccs-svm1_home1/mewall/packages/bml/tests/C-tests/inverse_matrix_typed.c:50
/autofs/nccs-svm1_home1/mewall/packages/bml/tests/C-tests/inverse_matrix.c:16
/autofs/nccs-svm1_home1/mewall/packages/bml/tests/C-tests/bml_test.c:327
??:0
/home/abuild/rpmbuild/BUILD/glibc-2.26/csu/../sysdeps/x86_64/start.S:122
srun: error: spock16: task 0: Exited with exit code 1
srun: launch/slurm: _step_signal: Terminating StepId=283434.0
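
To put a number on how far a dump like the one above is from the identity, the matrix rows can be piped through a small filter. This is a rough sketch using awk, not part of the BML test suite: the function name `max_identity_deviation` is hypothetical, and it assumes the input contains only the numeric rows (without the `AA^{-1}:` header line).

```shell
# Hypothetical helper: read a whitespace-separated square matrix on stdin
# and print the largest |M[i][j] - I[i][j]|, where I is the identity.
max_identity_deviation() {
    awk '
        BEGIN { max = 0 }
        NF > 0 {
            row++
            for (col = 1; col <= NF; col++) {
                # Diagonal entries should be 1, all others 0.
                expected = (col == row) ? 1 : 0
                dev = $col - expected
                if (dev < 0) dev = -dev
                if (dev > max) max = dev
            }
        }
        END { printf "%.3e\n", max }
    '
}
```

For example, `max_identity_deviation < rows.txt`, where `rows.txt` is a hypothetical file holding the thirteen rows above; the oversized first column would dominate the result.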

@nicolasbock
Collaborator

@mewall, I created a new issue to track this problem.
