Regression in dgeqrf_ QR factorization since 0.3.21 - Output contains NaN values #5006

Closed
robert-hardwick opened this issue Dec 5, 2024 · 5 comments · Fixed by #5007

Comments

@robert-hardwick

This was discovered through a PyTorch unit test failure on aarch64 (link will be attached).

Code to reproduce the problem is attached

reproducer.zip

See attached code for more details

    dgeqrf_(&m, &n, input_buffer, &lda, tau_buffer, workspace, &lwork, &info);

    for (int i = 0; i < TAU_STRIDE; i++) {
        if (std::isnan(tau_buffer[i]))
            throw std::invalid_argument("TAU contains NaN");
    }

Under libopenblas/openblas 0.3.20 the tau output contains no NaN values, but in 0.3.21 and later it does. This appears to coincide with the implementation being changed from Fortran to C.

@martin-frbg
Collaborator

The implementation as such did not change; 0.3.21 only introduced an optional C translation of the LAPACK code that can be used when no Fortran compiler is available. If you see it "only" in CI, it suggests that your setup lost its Fortran compiler in about the same timeframe, and that you wouldn't be able to compile 0.3.20 there now.
I cannot reproduce this on x86_64 with the current develop branch, will try aarch64 shortly.

@martin-frbg
Collaborator

Also not reproducible on NeoverseN1 with either NeoverseN1 or ARMV8 target (and NOFORTRAN=1 of course). If anything, this could implicate one of the newer SVE kernels (N1 not having SVE), but 0.3.21 did not have any additions or changes there.

@robert-hardwick
Author

Ah, apologies about that, I missed that crucial piece of information. This issue appeared on a Neoverse-V1 machine. I will check N1 on my side, but I don't expect an issue there, since this came from a PyTorch unit test failure on V1.

I agree with you, this is suggestive of an SVE related issue.

@martin-frbg
Collaborator

Reproduced now on my phone :) I have a hunch that it could be the DNRM2 kernel (which isn't even SVE at the moment, just a different "big server" implementation), will see in a moment

@martin-frbg
Collaborator

DNRM2 it is indeed, I had already retired this particular implementation of NRM2 on the Apple M "Vortex" targets earlier. Will create a PR later today.
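
For context, here is a minimal sketch of why a misbehaving NRM2 kernel would surface as NaN in the tau output of a QR factorization. It assumes the usual Householder construction, as in reference LAPACK's DLARFG, where the reflector scalar is derived from the 2-norm of the sub-column. The names `reference_nrm2`, `householder_tau`, and `demo` are hypothetical, the rescaling loop for tiny values is omitted, and none of this is OpenBLAS's actual kernel code.

```cpp
#include <cassert>
#include <cmath>
#include <cstdio>
#include <limits>
#include <vector>

// Hypothetical stand-in for a well-behaved NRM2: the classic scaled
// sum of squares, which avoids overflow/underflow in the running sum.
double reference_nrm2(const std::vector<double>& x) {
    double scale = 0.0;
    double ssq = 1.0;
    for (double v : x) {
        if (v == 0.0) continue;
        const double a = std::fabs(v);
        if (scale < a) {
            const double r = scale / a;
            ssq = 1.0 + ssq * r * r;
            scale = a;
        } else {
            const double r = a / scale;
            ssq += r * r;
        }
    }
    return scale * std::sqrt(ssq);
}

// Simplified Householder scalar (assumption: modeled on DLARFG, without
// its safe-minimum rescaling). alpha is the leading element, xnorm is the
// 2-norm of the remaining elements of the column.
double householder_tau(double alpha, double xnorm) {
    if (xnorm == 0.0) return 0.0;  // nothing to annihilate, H is the identity
    const double beta = -std::copysign(std::hypot(alpha, xnorm), alpha);
    return (beta - alpha) / beta;  // NaN in xnorm propagates straight into tau
}

// Compare tau for a healthy norm against tau for a NaN norm.
void demo() {
    const double alpha = 3.0;
    const std::vector<double> x = {4.0};
    const double good = householder_tau(alpha, reference_nrm2(x));
    const double bad =
        householder_tau(alpha, std::numeric_limits<double>::quiet_NaN());
    std::printf("tau with healthy norm: %g\n", good);
    std::printf("tau with NaN norm:     %g\n", bad);
    assert(!std::isnan(good));
    assert(std::isnan(bad));
}
```

Calling `demo()` from a `main` shows the point: the input matrix can be perfectly finite, yet a single NaN returned by the norm routine is enough to make tau NaN, which is what the reproducer's `std::isnan(tau_buffer[i])` check detects.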
