ellpack threshold tests sometimes fail for cuSPARSE build #623

Closed
mewall opened this issue Jun 15, 2022 · 8 comments
@mewall
Collaborator

mewall commented Jun 15, 2022

Describe the bug
Here are the failures. Unfortunately they cannot be reproduced; they were only seen once.

514 - C-threshold-ellpack-single_real (Failed)
515 - C-threshold-ellpack-double_real (Failed)
516 - C-threshold-ellpack-single_complex (Failed)
517 - C-threshold-ellpack-double_complex (Failed)
This looks like a different problem. The Scaled matrix and Thresholded matrix appear to have the wrong values at [1,0]:

Scaled matrix
1.000 0.009 0.001 0.000 0.009 0.006 0.002 0.002 0.004 0.009 0.008 0.004 0.006
1.000 1.000 0.007 0.001 0.005 0.008 0.003 0.010 0.006 0.009 0.007 0.006 0.000
0.007 0.008 1.000 0.009 0.008 0.010 0.007 0.009 0.010 0.007 0.005 0.002 0.008

[DEBUG] [/projects/icapt/mewall/copa/bml/src/C-interface/bml_allocate.c:581] identity matrix of size 13
Thresholded matrix
1.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
1.000 1.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
0.000 0.000 1.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
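To make the symptom above checkable in a test, a small helper can count off-diagonal entries that survive thresholding. This is only a sketch: the function name `count_offdiag_above`, the dense row-major layout, and the threshold value in the usage note are assumptions for illustration and are not part of bml.

```c
#include <stddef.h>

/* Hypothetical helper (not a bml function): count off-diagonal entries of a
 * dense, row-major rows x cols block whose magnitude exceeds `threshold`. */
int count_offdiag_above(const double *a, int rows, int cols, double threshold)
{
    int count = 0;
    for (int i = 0; i < rows; i++)
    {
        for (int j = 0; j < cols; j++)
        {
            double v = a[(size_t) i * cols + j];
            /* Skip the diagonal; compare the absolute value without math.h. */
            if (i != j && (v > threshold || v < -threshold))
            {
                count++;
            }
        }
    }
    return count;
}
```

For the three rows of the thresholded matrix printed above (3 x 13), with an assumed threshold of 0.5, this returns 1 because of the stray entry at [1,0]; a correct result would return 0.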

To Reproduce
Steps to reproduce the behavior:

  1. Go to '...'
  2. Click on '....'
  3. Scroll down to '....'
  4. See error

Expected behavior
Tests should (and usually do) pass.

Desktop (please complete the following information):
LANL darwin shared-gpu-ampere A100 node
gcc 11.0.0, CUDA 11.4

Additional context
Code touches bml_allocate_ellpack_typed.c

@jeanlucf22
Collaborator

I wonder whether we have a race condition in the function

    void TYPED_FUNC(
        bml_set_diagonal_ellpack) (
        bml_matrix_ellpack_t * A,
        void *_diagonal,
        double threshold)

which is used in that test.
In src/C-interface/ellpack/bml_setters_ellpack_typed.c, line 207, we may need an "atomic" directive.
@jmohdyusof would you agree?

@jmohdyusof
Collaborator

The loop is parallelized over i, so each row should be independent, and the algorithm has no coupling across rows, does it? What threshold is the test supposed to use, and what should the results be?

@jeanlucf22
Collaborator

Correct. But what about the variable 'll'? It looks like it is shared by default, which could cause a problem.

In the test code, we set the diagonal of a matrix to 1. But it looks like an extra element is set to 1, besides the diagonal.

@jmohdyusof
Collaborator

jmohdyusof commented Jun 20, 2022

So maybe defining ll inside the parallel region will fix the issue? I will be offline for the next ~1 hour.
I think the default behavior for shared variables should be firstprivate, but perhaps not for scalars, only arrays.
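One way to take the guesswork out of the defaults is to spell out the data-sharing attributes. The sketch below assumes a host-side `parallel for`, where a variable declared outside the construct is shared unless a clause says otherwise; `default(none)` makes the compiler reject any variable whose sharing is not stated. The function name, the flat `i * m + j` storage layout, and the parameter names are hypothetical and not taken from bml.

```c
/* Hypothetical example (not bml code): count rows that have no stored
 * diagonal entry. Row i keeps nnz[i] column indices at col[i*m .. i*m+nnz[i]-1]. */
int count_rows_missing_diagonal(int n, int m, const int *nnz, const int *col)
{
    int ll = 0;       /* flag declared outside the loop, as in the discussion */
    int missing = 0;

    /* private(ll) gives every thread its own copy of the flag; without it,
     * ll would be shared and threads could overwrite each other's value. */
#pragma omp parallel for default(none) shared(n, m, nnz, col) \
    private(ll) reduction(+:missing)
    for (int i = 0; i < n; i++)
    {
        ll = 0;       /* private copies start uninitialized, so reset here */
        for (int j = 0; j < nnz[i]; j++)
        {
            if (col[i * m + j] == i)
            {
                ll = 1;
            }
        }
        if (ll == 0)
        {
            missing++;
        }
    }
    return missing;
}
```

Built without OpenMP support, the pragma is ignored and the loop runs serially with the same result.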

@jeanlucf22
Collaborator

@mewall Could you try removing the declaration 'int ll = 0' at line 187 of src/C-interface/ellpack/bml_setters_ellpack_typed.c and adding 'int' in front of 'll = 0' at line 194? That would make 'll' private to each loop iteration and should fix the issue.
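The suggested change can be illustrated with a simplified sketch. This is not the bml source: the function name `set_diagonal_sketch`, the helper `above_threshold`, the flat `i * m + j` storage layout, and the capacity check are assumptions made to keep the example self-contained. The point is only that the flag is declared inside the loop body, so each iteration gets its own copy.

```c
/* Hypothetical helper: nonzero when |x| exceeds the threshold t. */
static int above_threshold(double x, double t)
{
    return (x > t) || (x < -t);
}

/* Hypothetical, simplified ELLPACK-style layout: row i keeps nnz[i] entries
 * in col[i*m + j] and val[i*m + j], with room for at most m entries per row. */
void set_diagonal_sketch(int n, int m, int *nnz, int *col, double *val,
                         const double *diag, double threshold)
{
#pragma omp parallel for
    for (int i = 0; i < n; i++)
    {
        int ll = 0;   /* declared inside the loop body: private per iteration */

        /* Overwrite the diagonal entry if the row already stores one. */
        for (int j = 0; j < nnz[i]; j++)
        {
            if (col[i * m + j] == i)
            {
                val[i * m + j] = above_threshold(diag[i], threshold) ? diag[i] : 0.0;
                ll = 1;
            }
        }

        /* Otherwise append it, provided it is large enough and the row has room. */
        if (ll == 0 && above_threshold(diag[i], threshold) && nnz[i] < m)
        {
            col[i * m + nnz[i]] = i;
            val[i * m + nnz[i]] = diag[i];
            nnz[i]++;
        }
    }
}
```

Each iteration writes only to its own row of `nnz`, `col`, and `val`, so once the flag is private no atomic directive is needed in this sketch.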

@jeanlucf22
Collaborator

@mewall Can we close this issue?

@mewall
Collaborator Author

mewall commented Jul 21, 2022

Probably addressed by #631

@mewall closed this as completed Jul 21, 2022
@mewall
Collaborator Author

mewall commented Oct 11, 2022 via email
