ellpack threshold tests sometimes fail for cuSPARSE build #623
Comments
I wonder whether we have a race condition in the function void TYPED_FUNC( which is used in that test.
The loop is parallelized over i, so each row should be independent, with no coupling across rows in the algorithm, right? What is the threshold supposed to be in the test, and what should the results be?
Correct. What about the variable 'll'? It looks like it is shared by default, which may lead to a problem. In the test code, we set the diagonal of a matrix to 1, but it looks like an extra element besides the diagonal is also set to 1.
So maybe defining ll inside the parallel region will fix the issue? I will be offline for the next ~1 hour.
@mewall Could you try to remove the declaration 'int ll = 0' at line 187 of src/C-interface/ellpack/bml_setters_ellpack_typed.c and add an 'int' in front of 'll = 0' at line 194? That would make 'll' private and should fix the issue.
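To illustrate the suggestion above, here is a minimal sketch of a diagonal setter for an ELLPACK-style matrix in which the "found" flag ll is declared inside the parallel loop body, so each thread gets its own copy. This is not the bml source: the function name set_diagonal_ellpack, the argument list, and the array layout (nnz, index, value, with M slots per row) are assumptions made for illustration only.

```c
/* Hypothetical ELLPACK layout (an assumption, not the bml data structure):
 * row i holds nnz[i] entries stored at index[i*M + j] and value[i*M + j]
 * for j in [0, nnz[i]). M is the maximum number of entries per row. */
void set_diagonal_ellpack(int N, int M, int *nnz, int *index,
                          double *value, const double *diag)
{
    /* Rows are independent, so the loop over i can be parallel.
     * Without OpenMP enabled the pragma is ignored and the loop runs serially. */
#pragma omp parallel for
    for (int i = 0; i < N; i++)
    {
        /* Declared inside the loop body: private to each thread.
         * If ll were declared before the pragma it would be shared by
         * default, and one thread could read a flag set by another. */
        int ll = 0;

        /* Look for an existing diagonal entry in row i. */
        for (int j = 0; j < nnz[i]; j++)
        {
            if (index[i * M + j] == i)
            {
                value[i * M + j] = diag[i];
                ll = 1;
                break;
            }
        }

        /* No diagonal entry found: append one if the row has room. */
        if (!ll && nnz[i] < M)
        {
            index[i * M + nnz[i]] = i;
            value[i * M + nnz[i]] = diag[i];
            nnz[i]++;
        }
    }
}
```

With a shared flag, a row that already has a diagonal entry could see ll reset by another thread and append a duplicate, or a row could skip its append entirely. Whether that is exactly what produced the extra element in the test output is not confirmed here.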
@mewall Can we close this issue?
Probably addressed by #631
Describe the bug
Here are the failures. Unfortunately, they cannot be reproduced; they were only seen once.
514 - C-threshold-ellpack-single_real (Failed)
515 - C-threshold-ellpack-double_real (Failed)
516 - C-threshold-ellpack-single_complex (Failed)
517 - C-threshold-ellpack-double_complex (Failed)
This looks like a different problem. The Scaled matrix and Thresholded matrix appear to have the wrong values at [1,0]:
Scaled matrix
1.000 0.009 0.001 0.000 0.009 0.006 0.002 0.002 0.004 0.009 0.008 0.004 0.006
1.000 1.000 0.007 0.001 0.005 0.008 0.003 0.010 0.006 0.009 0.007 0.006 0.000
0.007 0.008 1.000 0.009 0.008 0.010 0.007 0.009 0.010 0.007 0.005 0.002 0.008
[DEBUG] [/projects/icapt/mewall/copa/bml/src/C-interface/bml_allocate.c:581] identity matrix of size 13
Thresholded matrix
1.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
1.000 1.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
0.000 0.000 1.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
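A small helper can make this kind of stray entry easier to spot than reading the printed matrix by eye. The sketch below scans a dense, row-major copy of the matrix and reports the first off-diagonal element whose magnitude exceeds a threshold. The function name first_offdiag_above and the dense row-major layout are assumptions for illustration; they are not part of bml or its test suite.

```c
/* Scan a dense row-major matrix a (rows x cols) for the first off-diagonal
 * element with |a[i][j]| > thr. On success, store its position in *row and
 * *col and return 1; return 0 if no such element exists.
 * Hypothetical debugging helper, not a bml function. */
int first_offdiag_above(const double *a, int rows, int cols, double thr,
                        int *row, int *col)
{
    for (int i = 0; i < rows; i++)
    {
        for (int j = 0; j < cols; j++)
        {
            if (i == j)
            {
                continue;       /* diagonal entries are expected to be large */
            }
            double v = a[i * cols + j];
            double mag = (v < 0.0) ? -v : v;   /* absolute value without math.h */
            if (mag > thr)
            {
                *row = i;
                *col = j;
                return 1;
            }
        }
    }
    return 0;
}
```

Applied to the leading 3x3 block of the Scaled matrix printed above, with a cutoff of 0.5 (an arbitrary value chosen for this example, not the threshold the test uses), the helper would report row 1, column 0.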
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Tests should (and usually do) pass
Desktop (please complete the following information):
LANL darwin shared-gpu-ampere A100 node
gcc 11.0.0, CUDA 11.4
Additional context
Code touches bml_allocate_ellpack_typed.c