Caffe crashes with multiple GPUs in machine #441
Try to use the latest dev branch.
Caffe itself has no trouble with multi-GPU machines. Our group has run it extensively on machines with multiple K40s, GTX 770s, and more in multiple processes at once. I suspect you may have a power issue. 1000W should suffice.
Thank you for your replies. @kloudkl I did a clean build from the dev branch, restarted the computer, and ran it. I get this error message: @shelhamer We have 1350W PSUs in the machines, so I don't think the cards are short on power.
@shelhamer: Hi Evan, when you run Caffe on a machine with multiple K40s, do you see "ghost memory" being allocated on GPU 0? In the picture above, GPU 0 shows 0% utilization because I don't use it, but its power usage is above idle (as opposed to GPU 2).
Some initializations were done in CUDA before the device_id was set, and that caused ghost memory to be allocated on device_id = 0. With PR #521 we tried to solve that problem, so please try an updated version of the dev branch.
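For reference, here is a minimal sketch of the ordering problem described above (the device id 2 is just a hypothetical target, not taken from this thread): any CUDA runtime call issued before cudaSetDevice() implicitly creates a context on device 0, and that context is what shows up as the "ghost" allocation in nvidia-smi.

// Sketch: selecting the device before any other CUDA runtime call avoids an
// implicit context (and "ghost" memory) on device 0.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  int device_id = 2;  // hypothetical target GPU, not from this thread

  // Wrong order (commented out): this allocation would initialize a context
  // on device 0 before the device is chosen, leaving ghost memory there.
  //   float* q; cudaMalloc(&q, 1 << 20); cudaSetDevice(device_id);

  // Right order: pick the device first, then touch the runtime.
  cudaSetDevice(device_id);
  float* p = NULL;
  cudaMalloc(&p, 1 << 20);

  size_t free_mem = 0, total_mem = 0;
  cudaMemGetInfo(&free_mem, &total_mem);
  printf("device %d: %zu MB free of %zu MB\n",
         device_id, free_mem >> 20, total_mem >> 20);
  cudaFree(p);
  return 0;
}

Compiled with nvcc and run while watching nvidia-smi, the wrong-order variant should leave a small extra context on GPU 0, while the right-order variant should only touch the selected device.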
@sguada Thank you. I'll make a clean build of the dev branch and retry.
Closing since it could not be replicated and there are multi-GPU machine installations of Caffe that work fine. Comment if you still have this issue.
I'm actually having this same exact issue on a recent (Jan 5) dev branch. Many of the details match up with onyxcube's description. I'm running Caffe with MPI on a machine with multiple GPUs, and Caffe will occasionally crash with an unspecified launch failure after 20-100k iterations. I get the problem when running with mpirun -np 1, so it does not require multiple GPUs running to reproduce, although it seems to occur more frequently when all GPUs are running. I also have a 1300W PSU for three 780 Tis and a K40. I removed all of my custom kernels in favor of CPU implementations and am only using BLAS GPU kernels, but I'm still getting the issue. I've caught it in a debugger a few times and the error is thrown in caffe_gpu_copy, but the caller of the error-producing caffe_gpu_copy call changes each time. To my understanding, this shouldn't be possible when all the kernels do error checking after returning, unless the bug is something more subtle. @onyxcube perhaps the problems have to do with the 780 Ti having issues. I've never had the issue when running on my K40.
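A note on why the failure can surface in caffe_gpu_copy even though the faulty kernel is elsewhere: kernel launches are asynchronous, so an execution-time fault is usually reported by whichever later call happens to synchronize with the device. A small sketch with a deliberately broken, hypothetical kernel (not Caffe code) illustrating the effect:

// Sketch: a kernel launch returns immediately; an execution fault inside it is
// reported by whichever later call synchronizes with the GPU.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void faulty_kernel(float* out) {
  // Deliberately invalid write, far outside the 256-float allocation below.
  out[threadIdx.x + (1 << 28)] = 1.0f;
}

int main() {
  float* d = NULL;
  cudaMalloc(&d, 256 * sizeof(float));

  faulty_kernel<<<1, 256>>>(d);
  // Only launch-configuration problems show up here; the kernel has likely
  // not finished (or even started), so this usually prints "no error".
  printf("after launch: %s\n", cudaGetErrorString(cudaGetLastError()));

  // The failure surfaces at the next synchronizing call, e.g. an innocent
  // device-to-host copy far away from the kernel that actually crashed
  // (typically an unspecified launch failure or illegal-address error).
  float host[256];
  cudaError_t err = cudaMemcpy(host, d, sizeof(host), cudaMemcpyDeviceToHost);
  printf("after memcpy: %s\n", cudaGetErrorString(err));

  cudaFree(d);
  return 0;
}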
Update: I've confirmed that the problem is localized to a single one of my three 780 Ti GPUs.
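One way to narrow an intermittent failure down to a particular board is to soak each device in isolation with a trivial kernel and see which one errors out first. A minimal sketch along those lines (not a substitute for a dedicated GPU memory/stress tester):

// Sketch: soak each GPU in isolation with a trivial kernel and report the
// first CUDA error seen, to narrow a flaky failure down to one board.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void touch(float* data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] = data[i] * 0.5f + 1.0f;
}

int main() {
  int device_count = 0;
  cudaGetDeviceCount(&device_count);
  const int n = 1 << 20;

  for (int dev = 0; dev < device_count; ++dev) {
    cudaSetDevice(dev);
    float* d = NULL;
    cudaMalloc(&d, n * sizeof(float));

    cudaError_t err = cudaSuccess;
    for (int iter = 0; iter < 100000 && err == cudaSuccess; ++iter) {
      touch<<<(n + 255) / 256, 256>>>(d, n);
      err = cudaDeviceSynchronize();  // surfaces execution errors per iteration
    }
    printf("device %d: %s\n", dev, cudaGetErrorString(err));

    cudaFree(d);
    cudaDeviceReset();  // drop this device's context before moving on
  }
  return 0;
}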
Our configuration is:
2 GTX 780 Ti
CUDA 5.5
NVIDIA driver 319.82
Additional notes:
This is the situation:
With only a single GPU in the machine (GPU0), Caffe works fine and trains without crashing. When we put the second GPU in the machine, training only works if we use GPU1 (the second GPU), but crashes if we train with GPU0. The training fails at random points and times (sometimes within 1,000 iterations, other times not until 100,000+ iterations) with errors such as the following:
F0514 10:39:14.850065 3012 im2col.cu:49] Cuda kernel failed. Error: unspecified launch failure
*** Check failure stack trace: ***
@ 0x7f41020b0b7d google::LogMessage::Fail()
@ 0x7f41020b2c7f google::LogMessage::SendToLog()
@ 0x7f41020b076c google::LogMessage::Flush()
@ 0x7f41020b351d google::LogMessageFatal::~LogMessageFatal()
@ 0x49caee caffe::im2col_gpu<>()
@ 0x47bdee caffe::ConvolutionLayer<>::Forward_gpu()
@ 0x428ffa caffe::Net<>::ForwardPrefilled()
@ 0x434341 caffe::Solver<>::Solve()
@ 0x412992 EntryPoint()
@ 0x7f4101a6ae9a start_thread
@ 0x7f4100f853fd (unknown)
I0514 18:16:10.356845 5600 solver.cpp:145] [thread0] Iteration 35460, loss = 3.62997
F0514 18:16:22.306464 5600 math_functions.cpp:67] Check failed: (cublasSgemm_v2(Caffe::cublas_handle(thread_id), cuTransB, cuTransA, N, M, K, &alpha, B, ldb, A, lda, &beta, C, N)) == CUBLAS_STATUS_SUCCESS (13 vs. 0)
*** Check failure stack trace: ***
@ 0x7fe0c0e4bb7d google::LogMessage::Fail()
@ 0x7fe0c0e4dc7f google::LogMessage::SendToLog()
@ 0x7fe0c0e4b76c google::LogMessage::Flush()
@ 0x7fe0c0e4e51d google::LogMessageFatal::~LogMessageFatal()
@ 0x46038b caffe::caffe_gpu_gemm<>()
@ 0x47db0a caffe::ConvolutionLayer<>::Backward_gpu()
@ 0x429c06 caffe::Net<>::Backward()
@ 0x434349 caffe::Solver<>::Solve()
@ 0x412992 EntryPoint()
@ 0x7fe0c0805e9a start_thread
@ 0x7fe0bfd203fd (unknown)
F0515 15:20:59.607784 4099 syncedmem.cpp:42] Check failed: (cudaMemcpy(cpu_ptr_, gpu_ptr_, size_, cudaMemcpyDeviceToHost)) == cudaSuccess (4 vs. 0)
*** Check failure stack trace: ***
@ 0x7f8fe0515b7d google::LogMessage::Fail()
@ 0x7f8fe0517c7f google::LogMessage::SendToLog()
@ 0x7f8fe051576c google::LogMessage::Flush()
@ 0x7f8fe051851d google::LogMessageFatal::~LogMessageFatal()
@ 0x45dc1a caffe::SyncedMemory::cpu_data()
@ 0x4594b2 caffe::Blob<>::cpu_diff()
@ 0x431af6 caffe::SGDSolver<>::UpdateDeltaParam()
@ 0x43436d caffe::Solver<>::Solve()
@ 0x412992 EntryPoint()
@ 0x7f8fdfecfe9a start_thread
@ 0x7f8fdf3ea3fd (unknown)
Browsing through the Caffe issues, I've noticed similar experiences in the past (maybe not with multiple GPUs), such as:
Cuda kernel crash #39
Crash after iteration 1620 #58
Cuda kernel failed #285
However, none of these issues have a concrete solution to the problem.
#39: Following this thread, we increased the fan speeds on both GPUs since we believed it to be an overheating issue. Both cards run at 70C or lower, but it still crashes. For reference, a single GPU running at 85C is fine. (See the temperature-logging sketch after this list.)
#58: Some unspecified GPU configuration solved the problem?
#285: Problem miraculously disappeared after reinstalling...
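Regarding the overheating hypothesis in the #39 note above, one way to log board temperatures while training runs is to poll NVML, the library nvidia-smi is built on. A hedged sketch (link with -lnvidia-ml; the 70C threshold is only the value quoted above, not an official limit):

// Sketch: poll NVML once a second and log each board's temperature while
// training runs, to rule overheating in or out.
#include <cstdio>
#include <unistd.h>
#include <nvml.h>

int main() {
  if (nvmlInit() != NVML_SUCCESS) return 1;
  unsigned int count = 0;
  nvmlDeviceGetCount(&count);

  for (int sample = 0; sample < 600; ++sample) {  // ten minutes of 1 Hz samples
    for (unsigned int i = 0; i < count; ++i) {
      nvmlDevice_t dev;
      unsigned int temp = 0;
      if (nvmlDeviceGetHandleByIndex(i, &dev) == NVML_SUCCESS &&
          nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &temp) ==
              NVML_SUCCESS) {
        // 70C is just the value mentioned above, not an official limit.
        printf("GPU %u: %u C%s\n", i, temp, temp > 70 ? " (above 70C)" : "");
      }
    }
    sleep(1);
  }
  nvmlShutdown();
  return 0;
}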
Can someone who has encountered and solved this issue please provide some insight into what the GPU configuration problem was?
Thanks in advance.