-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Kaldi's NAN assert on Graviton3 #4750
Comments
nothing that would be already known,I think. If similar characteristics as #189 - SVE kernels for DOT were added in 0.3.22 and last modified in 0.3.26, but that would not explain NeoverseN1&2. (In fact the only difference in kernel inventory between the two is in the NRM2 kernels - dznrm2_thunderx2t99.c has seen repeat changes in 0.3.24 to avoid the generation of NANs when the input data contained +/-Infinity (though I believe most of the "critical" work in that regard happened before 0.3.21) |
You could try changing the NRM2 kernels in KERNEL.NEOVERSEN2 and KERNEL.ARMV8SVE to
as used in KERNEL.NEOVERSEN1 to see if my guess is correct... in particular the CNRM2KERNEL and ZNRM2KERNEL lines |
Thanks, Martin. I ran some tests overnight. It seems v0.3.22 and v0.3.25 also work without the NaN issue. Whereas v0.3.26 has the NaN issue. I'll run the test you suggested today and report on it. |
Just patching the KERNEL.* files and rebuilding should be enough, right? For some reason that doesn't seem to be fixing the NaN. Still trying for a small case, but it's being a bit hard to construct it. |
Yes, this should be sufficient. (But maybe the issue is not in NRM2 - although that is the only relevant difference between the Neoverse N1 and N2). Or we are chasing an uninitialized variable or register, and the NAN comes and goes depending on previous content :( |
I built openblas on the host directly (i was cross compiling before), and it seems the DDOT test is failing, which might explain the kaldi error (looking at where it happens in the code, it's definitely in DOT).
I had built it as follows: This is with gcc 13.2 on a nixos machine. |
Oh, ok. The SVE kernel for DDOT was indeed modified in 0.3.26. Strange how this did not come up in CI. |
Our CI still uses gcc 11.4 and runs on V1 (c7g instance), but the ddot kernel is pure inline assembly... |
I'm using r7g, but looking at the aws docs it doesn't seem it should be different than c7g. (?) |
I tested a build with GCC 11.4, and get the same failure on that r7g host. Could this perhaps be a case of misrecognized CPU at runtime with DYNAMIC_ARCH on?
|
Misrecognized cpu should end up in NeoverseN1, or worst case the "generic ARMV8" target, so unlikely. (And so far I cannot reproduce your DDOT error with gcc12) |
i see. ok, perhaps a silly question since it's in pure assembly, but Could it be related to my use of |
that is about the only option I have not tried yet, if true it would suggest a miscompilation of the ddot test rather than the actual BLAS kernel or library. (As in the NO_FORTRAN case, f2c-generated C files would be substituted for the Fortran sources - but these are limited to the imported Reference-LAPACK and the test/ctest folders that stem from the Reference BLAS). |
Tried running it on a c7g instance too. DDOT failed there too. I'm not sure why 0.3.25 would work with NO_FORTRAN though. |
Hmm... DDOT works when I leave the target name as NeoverseV1, but it fails when I compile for TARGET=NEOVERSEN2 on exactly the same hardware. Both targets use the same dot.c kernel that employs dot_kernel_sve.c (the inline assembly) when HAVE_SVE is defined (which it is for both targets). About the only difference that I can see is in the |
Git bisect tells me that this is the first bad commit: 7a4fef4 |
Thanks - sort-of makes sense as it is about the only recent change, and making something simpler&faster is always a bit dangerous... still do not understand how this leads to failure on one platform and not another closely related... |
Oh good,with the added clobbers the test passes now. Need to double-check that I did not change anything else... |
Thank you!! I'll test with that commit and build the whole application including kaldi to see what happens. |
Built kaldi (and the rest of the app) and the bug appears to be fixed with this change. I tried running it NEOVERSE{N,V}{1,2} as the OPENBLAS_CORETYPE and the app was stable. Thank you so much for such a quick fix! I'll look forward to the next release. |
Well, thank you for your help,and for confirming that the fix actually works |
I recently ran into this issue with Kaldi's NAN assert check failing after I updated openblas: https://github.com/kaldi-asr/kaldi/blob/67548a31c45f93d8b25ee553c5969d6a6d5d9408/src/feat/mel-computations.cc#L239-L242
CPU causing the issue: AWS Graviton3 (r7g instance)
Seems to be similar to #189
Version known not affected: 0.3.21
Version known to be affected: 0.3.27
With 0.3.27, on an r7g instance, setting the coretype has these results:
I think there are two things I should do here:
I'll try to get these and report here, but just in case there might be a usual suspect, I thought I'll report this first.
The text was updated successfully, but these errors were encountered: