Error with cudaFreeHost(ptr) in syncedmem.hpp:30 #3053

Closed
LiberiFatali opened this issue Sep 10, 2015 · 16 comments · Fixed by #3073

Comments

@LiberiFatali

In my case, when the caffe model finishes predicting an image, the following error appears:

F0910 15:09:52.590445 21913 syncedmem.hpp:30] Check failed: error == cudaSuccess (11 vs. 0) invalid argument
*** Check failure stack trace: ***
Aborted

It comes from CUDA_CHECK(cudaFreeHost(ptr)); in syncedmem.hpp.
Everything else works well. Any ideas for fixing this? I am using the latest code on master.
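
For reference, CUDA_CHECK is roughly the following glog-based wrapper (a simplified sketch along the lines of Caffe's device_alternate.hpp, not the verbatim source); it is where the "Check failed: error == cudaSuccess (11 vs. 0) invalid argument" line comes from:

#include <cuda_runtime.h>
#include <glog/logging.h>

// Simplified sketch of Caffe's CUDA_CHECK wrapper. CHECK_EQ is a glog
// assertion: when the CUDA call returns anything other than cudaSuccess,
// it aborts and prints "Check failed: error == cudaSuccess (<code> vs. 0)"
// followed by the CUDA error string.
#define CUDA_CHECK(condition) \
  do { \
    cudaError_t error = (condition); \
    CHECK_EQ(error, cudaSuccess) << " " << cudaGetErrorString(error); \
  } while (0)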

I'm using Ubuntu 14.04, NVIDIA driver 352.41, CUDA 7.5, and cuDNN v2.

@hberntsen

I get the same error in Python 2.7:

import caffe
#standard bvlc_alexnet
net = caffe.Net('deploy.prototxt', 'bvlc_alexnet.caffemodel', caffe.TEST)
caffe.set_mode_gpu()
exit

On Ubuntu 14.04, NVIDIA driver 346.82, CUDA 7. The error is encountered after the exit command.

@beniz

beniz commented Sep 12, 2015

I've seen this before, as well as a crash whenever a net is deleted after a predict. This is not guaranteed to help you immediately, but you may want to roll your Caffe tree back to one of these two commits and try your code again:

Hopefully, one of the versions above will work for you, thus helping to pinpoint the issue.

EDIT: d2f0457, which activates pinned memory, is crashing some of my setups at Net destruction, and this looks similar to the problem reported in this issue. Reverting the commit clears the bug.

@ronghanghu
Member

I think we have a bug here. I'll try to look into this today.

@ronghanghu
Member

@hberntsen @beniz I just took a look today. Can you reproduce the same error if you run caffe.set_mode_gpu() before creating your net? That is,

import caffe
#standard bvlc_alexnet
caffe.set_mode_gpu()
net = caffe.Net('deploy.prototxt', 'bvlc_alexnet.caffemodel', caffe.TEST)
exit

Right now the mode is not (but actually should be) an attribute of the net that is set during creation (just like phase). In CPU mode host memory is allocated with malloc, while in GPU mode it is allocated with cudaMallocHost (introduced in #2903). So if you run caffe.set_mode_gpu() after creating a net, Caffe allocates host memory with malloc (since the mode is still CPU during net construction) but then tries to free it with cudaFreeHost instead of free when the net is destroyed, resulting in this error.
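
For illustration, the host allocation helpers added in #2903 look roughly like the sketch below (simplified, not the verbatim source; Caffe::mode() and CUDA_CHECK come from Caffe's own headers). The crash is the mismatch between the mode seen at allocation time and the mode seen at free time:

#include <cstdlib>
#include <cuda_runtime.h>

// Simplified sketch of the mode-dependent host allocation from #2903.
// If Caffe is in CPU mode when a blob is allocated but in GPU mode when
// it is destroyed, a pointer obtained from malloc() gets handed to
// cudaFreeHost(), which fails with "invalid argument".
inline void CaffeMallocHost(void** ptr, size_t size) {
#ifndef CPU_ONLY
  if (Caffe::mode() == Caffe::GPU) {        // mode at *allocation* time
    CUDA_CHECK(cudaMallocHost(ptr, size));  // pinned host memory
    return;
  }
#endif
  *ptr = malloc(size);                      // plain pageable memory
}

inline void CaffeFreeHost(void* ptr) {
#ifndef CPU_ONLY
  if (Caffe::mode() == Caffe::GPU) {        // mode at *free* time may differ
    CUDA_CHECK(cudaFreeHost(ptr));          // invalid for malloc'd pointers
    return;
  }
#endif
  free(ptr);
}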

@ronghanghu
Member

One solution is to always use cudaMallocHost to allocate host memory unless using a CPU_ONLY build. @shelhamer do you agree? Some docs regarding this function:

The cudaMallocHost operation under the hood is doing something like a malloc plus additional OS functions to "pin" each page associated with the allocation (making cudaMemcpy faster). These additional OS operations take extra time, as compared to just doing a malloc. And note that as the size of the allocation increases, the registration ("pinning") cost will generally increase as well.
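
A minimal sketch of that first option (always pin host memory unless built CPU_ONLY); this only illustrates the proposal above, not necessarily the change that eventually landed in #3073:

// Sketch of the proposed short-term fix: ignore the runtime mode so that
// allocation and deallocation can never disagree. Illustration only; the
// merged fix may differ.
inline void CaffeMallocHost(void** ptr, size_t size) {
#ifdef CPU_ONLY
  *ptr = malloc(size);
#else
  CUDA_CHECK(cudaMallocHost(ptr, size));  // always pinned in a GPU build
#endif
}

inline void CaffeFreeHost(void* ptr) {
#ifdef CPU_ONLY
  free(ptr);
#else
  CUDA_CHECK(cudaFreeHost(ptr));
#endif
}

The trade-off, per the quoted docs, is the extra pinning cost at allocation time, and (as noted further below) cudaMallocHost appears to require a usable CUDA device even when the net only ever runs in CPU mode.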

Alternatively, we could make mode an attribute of the net, but that involves more invasive changes, at the cost of an interface change and the risk of introducing new bugs.

@beniz

beniz commented Sep 12, 2015

@ronghanghu I believe this is indeed a good catch, thanks for the quick reaction! Though I cannot test-run immediately, my code appears to be calling set_mode_gpu after Net creation.

Alternatively, we could make mode an attribute of the net, but that involves more invasive changes, at the cost of an interface change and the risk of introducing new bugs.

If this gives the ability to have multiple nets in memory, some using the CPU and some using the GPU, then as a user of the caffe lib I would rate it as a very good feature to preserve (since, as far as I understand, this was working prior to enforcing the use of `cudaMallocHost`).

@ronghanghu
Member

If this gives the ability to have multiple nets in memory, some using the CPU and some using the GPU, then as a user of the caffe lib I would rate it as a very good feature to preserve (since, as far as I understand, this was working prior to enforcing the use of `cudaMallocHost`).

Eventually we would like to give mode and device to nets and layers, in the spirit of #1500. As a short-term fix, though, always using cudaMallocHost to allocate CPU memory, regardless of the net's mode, avoids this crash. cudaMallocHost seems to assume at least one GPU is present (I don't know why); we seem to be running into the same issue as mentioned in 46a431a.

However, although mode/device are currently not members of Net and Layer and are thus changeable at runtime, it is better to set them in advance and not change them during the lifetime of a net.

@hberntsen

@ronghanghu In that case the error does not appear. So setting the mode to GPU before loading the net circumvents the error.

@LiberiFatali
Author

I use caffe.Classifier to construct the net, not caffe.Net. So when I use

self.net = caffe.Classifier(MODEL_FILE, PRETRAINED) 
caffe.set_mode_gpu()

or

caffe.set_mode_gpu()
self.net = caffe.Classifier(MODEL_FILE, PRETRAINED)

this error is still there.

I will try the commits suggested above. Thanks.

@LiberiFatali
Author

I also tested on GPUs with and without a connected monitor, to see if the problem is freeing in-use GPU memory. Still got the error.

@ronghanghu
Member

@LiberiFatali I could not reproduce your error (here is the code snippet I used).

import caffe
caffe.set_mode_gpu()

model   = './caffe/models/bvlc_reference_caffenet/deploy.prototxt'
weights = './caffe/models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel'
net = caffe.Classifier(model, weights)

impath = './caffe/examples/images/cat.jpg'
im = caffe.io.load_image(impath)
probs = net.predict([im])

It doesn't produce any error on a BVLC machine. Can you try out this code snippet on your machine?

@LiberiFatali
Author

@ronghanghu The code above works well on my machine. Setting GPU mode before creating the net solves it now.

@eldar

eldar commented Sep 23, 2015

I have exactly the same issue when cleaning up network instances in Matlab with a call to caffe.reset_all(). Placing caffe.set_mode_gpu(); before loading the model also solved it.

@cheer37

cheer37 commented Mar 4, 2016

I put set_mode before net creation, but I still get this error.
How can I solve this problem?
Thanks.

@ucasqcz

ucasqcz commented May 4, 2016

@cheer37, have you fixed the problem? I did not use the latest PR of caffe, and I get the same error:
Check failed: error == cudaSuccess (29 vs. 0) driver shutting down
How did you solve this problem?
Thanks.

@zhxjlbs

zhxjlbs commented Dec 19, 2016

@ucasqcz have you fixed the problem? I get the same error:
Check failed: error == cudaSuccess (29 vs. 0) driver shutting down
How did you solve this problem?
