Caffe crashes with multiple GPUs in machine #441
Try to use the latest dev branch.
Caffe itself has no trouble with multi-GPU machines. Our group has run it extensively on machines with multiple K40s, GTX 770s, and more in multiple processes at once. I suspect you may have a power issue. 1000W should suffice.
Thank you for your replies. @kloudkl I did a clean build from the dev branch, restarted the computer, and ran it. I get this error message: @shelhamer We have 1350W PSUs in the machines, so I don't think the cards are short on power.
@shelhamer: Hi Evan, when you run Caffe on a machine with multiple K40s, do you see "ghost memory" being allocated on GPU 0? In the picture above, GPU 0 shows 0% utilization because I don't use it, but its power usage is above idle (as opposed to GPU 2).
Some initializations were done in CUDA before the device_id was set, and that caused ghost memory to be allocated on device_id = 0. With PR #521 we tried to solve that problem, so please try an updated version of the dev branch.
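For reference, here is a minimal sketch of the ordering problem described above (the device id 2 is just a hypothetical target, not taken from this thread): any CUDA runtime call issued before cudaSetDevice() implicitly creates a context on device 0, and that context is what shows up as the "ghost" allocation in nvidia-smi.

// Sketch: selecting the device before any other CUDA runtime call avoids an
// implicit context (and "ghost" memory) on device 0.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  int device_id = 2;  // hypothetical target GPU, not from this thread

  // Wrong order (commented out): this allocation would initialize a context
  // on device 0 before the device is chosen, leaving ghost memory there.
  //   float* q; cudaMalloc(&q, 1 << 20); cudaSetDevice(device_id);

  // Right order: pick the device first, then touch the runtime.
  cudaSetDevice(device_id);
  float* p = NULL;
  cudaMalloc(&p, 1 << 20);

  size_t free_mem = 0, total_mem = 0;
  cudaMemGetInfo(&free_mem, &total_mem);
  printf("device %d: %zu MB free of %zu MB\n",
         device_id, free_mem >> 20, total_mem >> 20);
  cudaFree(p);
  return 0;
}

Compiled with nvcc and run while watching nvidia-smi, the wrong-order variant should leave a small extra context on GPU 0, while the right-order variant should only touch the selected device.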
@sguada Thank you. I'll make a clean build of the dev branch and retry.
Closing since it could not be replicated and there are multi-GPU machine installations of Caffe that work fine. Comment if you still have this issue.
I'm actually having this same exact issue on a recent (Jan 5) dev branch. Many of the details match up with onyxcube's description. I'm running Caffe with MPI on a machine with multiple GPUs, and Caffe will occasionally crash with an unspecified launch failure after 20-100k iterations. I get the problem when running with mpirun -np 1, so it does not require multiple GPUs running to reproduce, although it seems to occur more frequently when all GPUs are running. I also have a 1300W PSU for three 780 Tis and a K40. I removed all of my custom kernels in favor of CPU implementations and am only using BLAS GPU kernels, but I'm still getting the issue. I've caught it in a debugger a few times and the error is thrown in caffe_gpu_copy, but the caller of the error-producing caffe_gpu_copy call changes each time. To my understanding, this shouldn't be possible when all the kernels do error checking after returning, unless the bug is something more subtle. @onyxcube perhaps the problems have to do with the 780 Ti having issues. I've never had the issue when running on my K40.
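A note on why the failure can surface in caffe_gpu_copy even though the faulty kernel is elsewhere: kernel launches are asynchronous, so an execution-time fault is usually reported by whichever later call happens to synchronize with the device. A small sketch with a deliberately broken, hypothetical kernel (not Caffe code) illustrating the effect:

// Sketch: a kernel launch returns immediately; an execution fault inside it is
// reported by whichever later call synchronizes with the GPU.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void faulty_kernel(float* out) {
  // Deliberately invalid write, far outside the 256-float allocation below.
  out[threadIdx.x + (1 << 28)] = 1.0f;
}

int main() {
  float* d = NULL;
  cudaMalloc(&d, 256 * sizeof(float));

  faulty_kernel<<<1, 256>>>(d);
  // Only launch-configuration problems show up here; the kernel has likely
  // not finished (or even started), so this usually prints "no error".
  printf("after launch: %s\n", cudaGetErrorString(cudaGetLastError()));

  // The failure surfaces at the next synchronizing call, e.g. an innocent
  // device-to-host copy far away from the kernel that actually crashed
  // (typically an unspecified launch failure or illegal-address error).
  float host[256];
  cudaError_t err = cudaMemcpy(host, d, sizeof(host), cudaMemcpyDeviceToHost);
  printf("after memcpy: %s\n", cudaGetErrorString(err));

  cudaFree(d);
  return 0;
}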
Update: I've confirmed that the problem is localized to a single one of my three 780 Ti GPUs.
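One way to narrow an intermittent failure down to a particular board is to soak each device in isolation with a trivial kernel and see which one errors out first. A minimal sketch along those lines (not a substitute for a dedicated GPU memory/stress tester):

// Sketch: soak each GPU in isolation with a trivial kernel and report the
// first CUDA error seen, to narrow a flaky failure down to one board.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void touch(float* data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] = data[i] * 0.5f + 1.0f;
}

int main() {
  int device_count = 0;
  cudaGetDeviceCount(&device_count);
  const int n = 1 << 20;

  for (int dev = 0; dev < device_count; ++dev) {
    cudaSetDevice(dev);
    float* d = NULL;
    cudaMalloc(&d, n * sizeof(float));

    cudaError_t err = cudaSuccess;
    for (int iter = 0; iter < 100000 && err == cudaSuccess; ++iter) {
      touch<<<(n + 255) / 256, 256>>>(d, n);
      err = cudaDeviceSynchronize();  // surfaces execution errors per iteration
    }
    printf("device %d: %s\n", dev, cudaGetErrorString(err));

    cudaFree(d);
    cudaDeviceReset();  // drop this device's context before moving on
  }
  return 0;
}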
Our configuration is:
2 GTX 780 Ti
CUDA 5.5
NVIDIA driver 319.82
Additional notes:
This is the situation:
With only a single GPU in the machine (GPU0), Caffe works fine and trains without crashing. When we put the second GPU in the machine, training only works if we use GPU1 (the second GPU), but crashes if we train with GPU0. The training fails at random points and times (sometimes within 1,000 iterations, other times not until 100,000+ iterations) with errors such as the following:
F0514 10:39:14.850065 3012 im2col.cu:49] Cuda kernel failed. Error: unspecified launch failure
*** Check failure stack trace: ***
@ 0x7f41020b0b7d google::LogMessage::Fail()
@ 0x7f41020b2c7f google::LogMessage::SendToLog()
@ 0x7f41020b076c google::LogMessage::Flush()
@ 0x7f41020b351d google::LogMessageFatal::~LogMessageFatal()
@ 0x49caee caffe::im2col_gpu<>()
@ 0x47bdee caffe::ConvolutionLayer<>::Forward_gpu()
@ 0x428ffa caffe::Net<>::ForwardPrefilled()
@ 0x434341 caffe::Solver<>::Solve()
@ 0x412992 EntryPoint()
@ 0x7f4101a6ae9a start_thread
@ 0x7f4100f853fd (unknown)
I0514 18:16:10.356845 5600 solver.cpp:145] [thread0] Iteration 35460, loss = 3.62997
F0514 18:16:22.306464 5600 math_functions.cpp:67] Check failed: (cublasSgemm_v2(Caffe::cublas_handle(thread_id), cuTransB, cuTransA, N, M, K, &alpha, B, ldb, A, lda, &beta, C, N)) == CUBLAS_STATUS_SUCCESS (13 vs. 0)
*** Check failure stack trace: ***
@ 0x7fe0c0e4bb7d google::LogMessage::Fail()
@ 0x7fe0c0e4dc7f google::LogMessage::SendToLog()
@ 0x7fe0c0e4b76c google::LogMessage::Flush()
@ 0x7fe0c0e4e51d google::LogMessageFatal::~LogMessageFatal()
@ 0x46038b caffe::caffe_gpu_gemm<>()
@ 0x47db0a caffe::ConvolutionLayer<>::Backward_gpu()
@ 0x429c06 caffe::Net<>::Backward()
@ 0x434349 caffe::Solver<>::Solve()
@ 0x412992 EntryPoint()
@ 0x7fe0c0805e9a start_thread
@ 0x7fe0bfd203fd (unknown)
F0515 15:20:59.607784 4099 syncedmem.cpp:42] Check failed: (cudaMemcpy(cpu_ptr_, gpu_ptr_, size_, cudaMemcpyDeviceToHost)) == cudaSuccess (4 vs. 0)
*** Check failure stack trace: ***
@ 0x7f8fe0515b7d google::LogMessage::Fail()
@ 0x7f8fe0517c7f google::LogMessage::SendToLog()
@ 0x7f8fe051576c google::LogMessage::Flush()
@ 0x7f8fe051851d google::LogMessageFatal::~LogMessageFatal()
@ 0x45dc1a caffe::SyncedMemory::cpu_data()
@ 0x4594b2 caffe::Blob<>::cpu_diff()
@ 0x431af6 caffe::SGDSolver<>::UpdateDeltaParam()
@ 0x43436d caffe::Solver<>::Solve()
@ 0x412992 EntryPoint()
@ 0x7f8fdfecfe9a start_thread
@ 0x7f8fdf3ea3fd (unknown)
Browsing through the Caffe issues, I've noticed similar experiences in the past (maybe not with multiple GPUs), such as:
Cuda kernel crash #39
Crash after iteration 1620 #58
Cuda kernel failed #285
However, none of these issues have a concrete solution to the problem.
#39: Following this thread, we increased the fan speeds on both GPUs since we believed it to be an overheating issue. Both cards run at 70C or lower, but it still crashes. For reference, a single GPU running at 85C is fine. (See the temperature-logging sketch after this list.)
#58: Some unspecified GPU configuration solved the problem?
#285: Problem miraculously disappeared after reinstalling...
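Regarding the overheating hypothesis in the #39 note above, one way to log board temperatures while training runs is to poll NVML, the library nvidia-smi is built on. A hedged sketch (link with -lnvidia-ml; the 70C threshold is only the value quoted above, not an official limit):

// Sketch: poll NVML once a second and log each board's temperature while
// training runs, to rule overheating in or out.
#include <cstdio>
#include <unistd.h>
#include <nvml.h>

int main() {
  if (nvmlInit() != NVML_SUCCESS) return 1;
  unsigned int count = 0;
  nvmlDeviceGetCount(&count);

  for (int sample = 0; sample < 600; ++sample) {  // ten minutes of 1 Hz samples
    for (unsigned int i = 0; i < count; ++i) {
      nvmlDevice_t dev;
      unsigned int temp = 0;
      if (nvmlDeviceGetHandleByIndex(i, &dev) == NVML_SUCCESS &&
          nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &temp) ==
              NVML_SUCCESS) {
        // 70C is just the value mentioned above, not an official limit.
        printf("GPU %u: %u C%s\n", i, temp, temp > 70 ? " (above 70C)" : "");
      }
    }
    sleep(1);
  }
  nvmlShutdown();
  return 0;
}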
Can someone who has encountered and solved this issue please provide some insight into what the GPU configuration problem was?
Thanks in advance.