make: add --device-debug to NVCC debug flags #7542

Merged

JohannesGaessler
Collaborator

This PR adds the --device-debug flag to the NVCC compile flags if LLAMA_DEBUG is set. The flag adds device debugging information and also turns off device code optimization, so compilation is faster.

@ggerganov
Owner

This improves the compile time, but it generates the following warnings:

nvcc warning : '--device-debug (-G)' overrides '--generate-line-info (-lineinfo)'
ptxas warning : Conflicting options --device-debug and --generate-line-info specified, ignoring --generate-line-info option

Not sure what the implications of overriding the -lineinfo option are.

@JohannesGaessler
Collaborator Author

-lineinfo essentially attaches a copy of the CUDA source code and allows you to map the binary code to specific source code lines when profiling the code with NVIDIA Nsight Compute. --device-debug additionally disables optimizations, but this makes the information from profiling inaccurate. So I would interpret the warning in that context: the two options do similar things but are intended for different purposes. For debug builds I think disabling optimizations is the correct thing to do; when I profile the code I never use LLAMA_DEBUG anyway but just manually add -lineinfo to the NVCC flags.

@slaren
Collaborator

slaren commented May 26, 2024

This needs to be optional. The only case where this helps is when looking for bugs in the kernels that are slow to compile; otherwise these files do not need to be recompiled when making changes to other parts of the code. However, this makes the CUDA backend so slow that debug builds become unusable for quick iteration in other areas.

| GPU | Model | Test | t/s master | t/s cuda-device-debug | Speedup |
| --- | --- | --- | ---: | ---: | ---: |
| RTX 3090 Ti | llama 7B Q4_0 | pp512 | 4081.50 | 618.90 | 0.15 |
| RTX 3090 Ti | llama 7B Q4_0 | tg128 | 150.52 | 6.28 | 0.04 |
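
For reference, the Speedup column is consistent with dividing the throughput on the cuda-device-debug branch by the throughput on master. This definition is inferred from the numbers rather than stated in the comment:

$$
\text{speedup} = \frac{\text{t/s}_{\text{cuda-device-debug}}}{\text{t/s}_{\text{master}}}, \qquad
\frac{618.90}{4081.50} \approx 0.15 \;\;(\text{pp512}), \qquad
\frac{6.28}{150.52} \approx 0.04 \;\;(\text{tg128})
$$

Inverted, that is roughly a 6.6x slowdown for pp512 and a 24x slowdown for tg128.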

@JohannesGaessler
Collaborator Author

JohannesGaessler commented May 26, 2024

I thought I had measured a much smaller performance regression when I tested it, but I must have done something wrong because upon renewed testing I'm getting very similar results. I added a new flag, LLAMA_CUDA_DEBUG, instead. There is still a warning if one compiles with both LLAMA_DEBUG and LLAMA_CUDA_DEBUG, but I think it's inconsequential.
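
A minimal sketch of how an opt-in flag like this could be wired into a Makefile. It is not copied from the merged diff: the variable name `MK_NVCCFLAGS` and the `show-nvccflags` target are assumptions for illustration, and the `-lineinfo` line under `LLAMA_DEBUG` is inferred from the warnings quoted earlier in the thread.

```makefile
# Sketch only. MK_NVCCFLAGS and show-nvccflags are hypothetical names.
MK_NVCCFLAGS :=

# Regular debug build: keep device optimizations on, add line info only
# (assumed from the nvcc warning about -lineinfo being overridden).
ifdef LLAMA_DEBUG
    MK_NVCCFLAGS += -lineinfo
endif

# Opt-in device debugging: adds device debug info and disables device
# code optimization, so kernels compile faster but run much slower.
ifdef LLAMA_CUDA_DEBUG
    MK_NVCCFLAGS += --device-debug
endif

# Helper target to print the resulting flags (recipe line starts with a tab).
.PHONY: show-nvccflags
show-nvccflags:
	@echo "NVCC flags: $(MK_NVCCFLAGS)"
```

With this layout, `make show-nvccflags LLAMA_DEBUG=1` would list only `-lineinfo`, while adding `LLAMA_CUDA_DEBUG=1` would also list `--device-debug`. Setting both is what leads nvcc to warn that `-G` overrides `-lineinfo`.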

@mofosyne added the label "Review Complexity : Low" (trivial changes to code that most beginner devs, or those who want a break, can tackle, e.g. a UI fix) May 27, 2024
@JohannesGaessler merged commit 10b1e45 into ggerganov:master May 27, 2024
64 of 65 checks passed
4 participants