Resume training from solverstate? #77

linamede opened this issue Feb 10, 2016 · 10 comments

Comments

@linamede

What other parameters do I have to set when training with
./experiments/scripts/faster_rcnn_end2end.sh 0 ZF
in order to resume training from a snapshot?
In Caffe, I know that the command
./build/tools/caffe train --solver=models/bvlc_reference_caffenet/solver.prototxt --snapshot=models/bvlc_reference_caffenet/caffenet_train_iter_10000.solverstate
is used for this purpose. However, the directory py-faster-rcnn/output/faster_rcnn_end2end/voc_2007_trainval/ contains no *.solverstate file, only *.caffemodel files.

Thank you

@ericromanenghi

I have the same issue. Were you able to solve it?

@ericromanenghi

If you look at the end of the issue rbgirshick/fast-rcnn#35, the solution seems to be to change the value of snapshot in the solver.prototxt from 0 to 1000, or to whatever interval you like.

I tried this and was able to obtain the solver state. The only problem is that the solver state is saved in the folder from which you run Caffe, not in the output folder.
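
As a rough illustration of the change described above, here is a minimal Python sketch that rewrites the snapshot interval in the text of a solver.prototxt. The helper name `set_snapshot_interval` and the example solver contents are hypothetical and only serve the illustration; editing the file by hand works just as well.

```python
import re


def set_snapshot_interval(solver_text, interval):
    """Return solver.prototxt text with `snapshot: <interval>` set.

    Hypothetical helper for illustration; it only touches the `snapshot`
    field and leaves `snapshot_prefix` and every other line unchanged.
    """
    # Matches "snapshot: <digits>" at the start of a line, but not
    # "snapshot_prefix: ..." (the underscore breaks the match).
    pattern = re.compile(r"^(\s*snapshot\s*:\s*)\d+", flags=re.MULTILINE)
    if pattern.search(solver_text):
        return pattern.sub(lambda m: m.group(1) + str(interval),
                           solver_text, count=1)
    # No snapshot field present: append one.
    return solver_text.rstrip("\n") + "\nsnapshot: {}\n".format(interval)


# Illustrative solver contents (assumed values, not copied from the repo).
example = (
    'base_lr: 0.001\n'
    'snapshot: 0\n'
    'snapshot_prefix: "zf_faster_rcnn"\n'
)
updated = set_snapshot_interval(example, 1000)
print(updated)
```

With a non-zero interval, Caffe itself writes a .solverstate file next to each .caffemodel it snapshots, which is why the file shows up in the working directory rather than in the output folder.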

I hope this helps you.

@ericromanenghi

I think you can close the issue.

@ericromanenghi

Hi! You must omit the --weights parameter when you want to resume from a solver
state.

If you look at lib/fast_rcnn/train.py, you will see that the __init__ function
of the SolverWrapper class contains something like this:

    if pretrained_model is not None:
        # Weights take priority: the net is initialized from the .caffemodel.
        print('Loading pretrained model '
              'weights from {:s}'.format(pretrained_model))
        self.solver.net.copy_from(pretrained_model)
    elif previous_state is not None:
        # Reached only when no weights file is given: restore the solver state.
        print('Restoring state from {:s}'.format(previous_state))
        self.solver.restore(previous_state)

So if you pass a weights parameter, you will never reach the elif branch,
and that branch is where training is resumed from a later iteration.
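
The precedence described above can be reproduced without Caffe. The following is a minimal sketch, assuming the same if/elif ordering as the snippet; the function name `choose_init` and its return strings are hypothetical stand-ins for the `copy_from` and `restore` calls.

```python
def choose_init(pretrained_model=None, previous_state=None):
    """Mimic the if/elif ordering from the snippet (hypothetical helper).

    Returns a label instead of calling into Caffe, so the branch that
    would run is easy to inspect.
    """
    if pretrained_model is not None:
        # Corresponds to self.solver.net.copy_from(pretrained_model).
        return "copy_weights"
    elif previous_state is not None:
        # Corresponds to self.solver.restore(previous_state).
        return "restore_state"
    return "from_scratch"


# Passing both arguments: the weights branch wins, the state is ignored.
both = choose_init("vgg16_faster_rcnn.caffemodel", "xxx_iter_2000.solverstate")
# Passing only the solver state: the restore branch runs.
state_only = choose_init(previous_state="xxx_iter_2000.solverstate")
print(both, state_only)
```

This is why a command that passes both --weights and --snapshot starts counting from iteration 0 again.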

So you should start training with something like this (I think you can also
call it with fewer parameters, but I cannot test that right now; if you run
into trouble, experiment a little or write to me again):

./tools/train_net.py \
  --gpu 0 \
  --solver my_solver.prototxt \
  --snapshot xxx_iter_2000.solverstate \
  --iters 10000 \
  --imdb imagenet_2015_train \
  --cfg experiments/cfgs/faster_rcnn_end2end.yml

I hope this helps.

God bless you =)

Eric

2016-04-13 1:47 GMT-03:00 daf11865 notifications@github.com:

@eternautaCAT https://github.com/eternautaCAT

I have a question. Say my total iters = 10000, snapshot = 2000, step size
= 4000, gamma = 0.1.
Now I cancel my training at iters = 2100, so I get
xxx_iter_2000.solverstate.

I restart training with the command:
./tools/train_net.py \
  --gpu 0 \
  --solver my_solver.prototxt \
  --snapshot xxx_iter_2000.solverstate \
  --weights vgg16_faster_rcnn.caffemodel \
  --iters 10000 \
  --imdb imagenet_2015_train \
  --cfg experiments/cfgs/faster_rcnn_end2end.yml

But my iters start at 0, 20, 40, ... instead of 2000, 2020, 2040, ...
So my question is: will the base lr be multiplied by 0.1 at iters = 2000 or
at 4000? The snapshot has already been trained for 2000 iters, so when the
new training reaches 2000 iters, the total should be 4000.

thank you
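
To make the arithmetic in this question concrete, here is a small sketch of Caffe's "step" learning rate policy, which computes base_lr * gamma ^ floor(iter / stepsize) from the solver's own iteration counter. It assumes the solver uses lr_policy "step" (suggested by the step size and gamma mentioned above); the base_lr of 0.001 is a made-up value, and both function names are hypothetical.

```python
def step_lr(base_lr, gamma, stepsize, iteration):
    """Learning rate under the "step" policy at a given solver iteration."""
    return base_lr * gamma ** (iteration // stepsize)


def iters_until_first_drop(start_iter, stepsize):
    """Iterations to run from start_iter before the next lr drop."""
    return stepsize - (start_iter % stepsize)


BASE_LR = 0.001   # assumed value, not given in the thread
GAMMA = 0.1
STEPSIZE = 4000

# Counter restarted at 0 (only weights loaded): the drop comes after
# 4000 new iterations, i.e. 6000 iterations of training in total.
restarted = iters_until_first_drop(0, STEPSIZE)

# Counter restored to 2000 from the solverstate: the drop comes after
# 2000 more iterations, i.e. at iteration 4000 as originally planned.
resumed = iters_until_first_drop(2000, STEPSIZE)

print(restarted, resumed)
print(step_lr(BASE_LR, GAMMA, STEPSIZE, 3999), step_lr(BASE_LR, GAMMA, STEPSIZE, 4000))
```

So the schedule only lines up with the original plan when the iteration counter itself is restored, which requires the solver state rather than the weights alone.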



@daf11865

@eternautaCAT
Yes! I figured it out by myself after I posted that, so I deleted the question.
Thank you all the same for your kind reply. Training now resumes correctly. Thanks!

@ajeetksingh

ajeetksingh commented Apr 21, 2016

I changed everything as described by @eternautaCAT.
But when I run the script, I get an error asking for the selective search boxes.
In Faster R-CNN, however, proposals are generated on the fly and are not stored.
I may have made a silly mistake, but I am not able to pinpoint it.
You can find the log below:

Appending horizontally-flipped training examples...
coco_2014_train gt roidb loaded from /home/exx/ajeet/py-faster-rcnn/data/cache/coco_2014_train_gt_roidb.pkl
Loading selective_search boxes
1 / 82783
Traceback (most recent call last):
File "./tools/train_net.py", line 107, in <module>
imdb, roidb = combined_roidb(args.imdb_name)
File "./tools/train_net.py", line 72, in combined_roidb
roidbs = [get_roidb(s) for s in imdb_names.split('+')]
File "./tools/train_net.py", line 69, in get_roidb
roidb = get_training_roidb(imdb)
File "/home/exx/ajeet/py-faster-rcnn/tools/../lib/fast_rcnn/train.py", line 121, in get_training_roidb
imdb.append_flipped_images()
File "/home/exx/ajeet/py-faster-rcnn/tools/../lib/datasets/imdb.py", line 106, in append_flipped_images
boxes = self.roidb[i]['boxes'].copy()
File "/home/exx/ajeet/py-faster-rcnn/tools/../lib/datasets/imdb.py", line 67, in roidb
self._roidb = self.roidb_handler()
File "/home/exx/ajeet/py-faster-rcnn/tools/../lib/datasets/coco.py", line 124, in selective_search_roidb
return self._roidb_from_proposals('selective_search')
File "/home/exx/ajeet/py-faster-rcnn/tools/../lib/datasets/coco.py", line 150, in _roidb_from_proposals
method_roidb = self._load_proposals(method, gt_roidb)
File "/home/exx/ajeet/py-faster-rcnn/tools/../lib/datasets/coco.py", line 189, in _load_proposals
raw_data = sio.loadmat(box_file)['boxes']
File "/usr/local/lib/python2.7/dist-packages/scipy/io/matlab/mio.py", line 134, in loadmat
MR = mat_reader_factory(file_name, appendmat, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/scipy/io/matlab/mio.py", line 57, in mat_reader_factory
byte_stream = _open_file(file_name, appendmat)
File "/usr/local/lib/python2.7/dist-packages/scipy/io/matlab/mio.py", line 23, in _open_file
return open(file_like, 'rb')
IOError: [Errno 2] No such file or directory: '/home/exx/ajeet/py-faster-rcnn/data/coco_proposals/selective_search/mat/COCO_train2014/COCO_train2014_0000002/COCO_train2014_000000262145.mat'

@ericromanenghi

How are you running the training? Could you show me the command?

@SaiAdityaG

Is it possible to resume training with the 'alternate training' method as well?

@ericromanenghi

I haven't tried it with alternate training, but it should work, because you only have to change lib/fast_rcnn/train.py (and maybe train_net.py, if the option that takes the solver state is not there; I changed my code, so I don't remember), and those files are the same for end2end and alternate training.

@pitLog

pitLog commented Jul 20, 2016

Hi,
I can't find this part of your code ("elif previous_state is not None: ...") in train.py on GitHub.
Also, when I pass the --snapshot option, I get the error message "unrecognized arguments".

Is this option available in every version?
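
An "unrecognized arguments" error is what argparse reports when a script has not declared the flag, which fits the earlier remark that the code had been changed locally. Below is a minimal, self-contained sketch of how such a flag could be declared. It is not the repository's train_net.py; the `dest` names simply mirror the variable names used in the snippet quoted earlier in the thread and are assumptions.

```python
import argparse


def parse_args(argv=None):
    """Parse a reduced, hypothetical set of training arguments."""
    parser = argparse.ArgumentParser(description="Train a network (sketch)")
    parser.add_argument("--solver", dest="solver", default=None, type=str,
                        help="solver prototxt")
    parser.add_argument("--weights", dest="pretrained_model", default=None,
                        type=str, help="initialize from a .caffemodel")
    # The extra option: without this declaration, argparse rejects
    # --snapshot with "unrecognized arguments".
    parser.add_argument("--snapshot", dest="previous_state", default=None,
                        type=str, help="resume from a .solverstate")
    return parser.parse_args(argv)


args = parse_args(["--solver", "my_solver.prototxt",
                   "--snapshot", "xxx_iter_2000.solverstate"])
print(args.previous_state, args.pretrained_model)
```

The parsed `previous_state` value would still have to be passed through to the code that calls `self.solver.restore(...)`, which is the change to lib/fast_rcnn/train.py mentioned above.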

6 participants