Resume training from solverstate? #77

linamede · 2016-02-10T09:46:13Z

What other parameters do I have to set when training using
./experiments/scripts/faster_rcnn_end2end.sh 0 ZF
in order to resume training from a snapshot?
In Caffe I know that the command
./build/tools/caffe train --solver=models/bvlc_reference_caffenet/solver.prototxt --snapshot=models/bvlc_reference_caffenet/caffenet_train_iter_10000.solverstate
is used for this purpose. However, in the directory py-faster-rcnn/output/faster_rcnn_end2end/voc_2007_trainval/ there are no *.solverstate files, only *.caffemodel files.

Thank you

ericromanenghi · 2016-02-29T17:29:19Z

I have the same issue. Were you able to solve it?

ericromanenghi · 2016-02-29T21:07:52Z

If you look at the end of issue rbgirshick/fast-rcnn#35, the solution seems to be to change the value of snapshot in solver.prototxt from 0 to 1000, or whatever interval you like.

I tried this and was able to obtain the solver state. The only problem is that the solver state is saved in the folder from which you run Caffe, not in the output folder.
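
A minimal sketch of that edit, assuming the solver file is plain `key: value` text and that Caffe writes its own snapshots to the path given by `snapshot_prefix` (relative paths resolve from the working directory). The helper `set_solver_field` and the path `/path/to/output` are made up for illustration:

```python
import re

def set_solver_field(solver_text, key, value):
    """Return solver_text with `key: value` set, replacing the line if present."""
    line = "{}: {}".format(key, value)
    # Match only the exact key, so "snapshot" does not hit "snapshot_prefix"
    pattern = re.compile(r"^[ \t]*{}[ \t]*:.*$".format(re.escape(key)), re.MULTILINE)
    if pattern.search(solver_text):
        return pattern.sub(lambda m: line, solver_text, count=1)
    # Key not present: append it on its own line
    return solver_text.rstrip("\n") + "\n" + line + "\n"

# Example solver contents (hypothetical values)
example = 'base_lr: 0.001\nsnapshot: 0\nsnapshot_prefix: "zf_faster_rcnn"\n'
updated = set_solver_field(example, "snapshot", 1000)
# An absolute prefix should keep the .solverstate out of the launch directory
updated = set_solver_field(updated, "snapshot_prefix", '"/path/to/output/zf_faster_rcnn"')
print(updated)
```

To apply it to a real file, read the solver.prototxt into a string, pass it through the helper, and write the result back.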

I hope this helps.

ericromanenghi · 2016-03-15T12:12:38Z

I think you can close the issue.

ericromanenghi · 2016-04-13T11:58:06Z

Hi! You must omit the --weights parameter when you want to start from a solver
state.

If you look at lib/fast_rcnn/train.py, you will see something like this in the
__init__ function of the SolverWrapper class:

    if pretrained_model is not None:
        # Weights only: the solver's iteration counter starts from 0
        print('Loading pretrained model '
              'weights from {:s}'.format(pretrained_model))
        self.solver.net.copy_from(pretrained_model)
    elif previous_state is not None:
        # Full solver state: weights plus iteration count and solver history
        print('Restoring State from {:s}'.format(previous_state))
        self.solver.restore(previous_state)

So if you pass a --weights parameter, you will never reach the elif branch,
which is the part that restarts training from a later iteration.
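
The branch order can be checked in isolation. This is only a stand-in that mirrors the if/elif above without needing Caffe; `choose_initialization` and its return labels are made up for illustration:

```python
def choose_initialization(pretrained_model=None, previous_state=None):
    """Mirror the if/elif order quoted above and report which branch would run."""
    if pretrained_model is not None:
        return "copy_weights"      # stands in for solver.net.copy_from(...)
    elif previous_state is not None:
        return "restore_state"     # stands in for solver.restore(...)
    return "no_initialization"

# Passing both: the weights branch wins and the solver state is ignored
print(choose_initialization("vgg16_faster_rcnn.caffemodel",
                            "xxx_iter_2000.solverstate"))   # copy_weights

# Passing only the solver state: the restore branch runs
print(choose_initialization(previous_state="xxx_iter_2000.solverstate"))  # restore_state
```

This matches the symptom of the iteration counter starting at 0 when both --weights and --snapshot are given.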

So you should start training with something like this (I think you can also
call it with fewer parameters, but I can't test that right now; if you have any
trouble, experiment a little or write to me again):

./tools/train_net.py \
    --gpu 0 \
    --solver my_solver.prototxt \
    --snapshot xxx_iter_2000.solverstate \
    --iters 10000 \
    --imdb imagenet_2015_train \
    --cfg experiments/cfgs/faster_rcnn_end2end.yml

Hope this helps.

God bless you =)

Eric

2016-04-13 1:47 GMT-03:00, daf11865 wrote:

@eternautaCAT

I have a question. Say my total iters = 10000, snapshot = 2000, step size
= 4000, and gamma = 0.1.
I cancelled my training at iters = 2100, so I got
xxx_iter_2000.solverstate.

I restarted training with the command:
./tools/train_net.py \
    --gpu 0 \
    --solver my_solver.prototxt \
    --snapshot xxx_iter_2000.solverstate \
    --weights vgg16_faster_rcnn.caffemodel \
    --iters 10000 \
    --imdb imagenet_2015_train \
    --cfg experiments/cfgs/faster_rcnn_end2end.yml

But my iterations start at 0, 20, 40, ... rather than 2000, 2020, 2040, ...
So my question is: will the base lr be multiplied by 0.1 at iters = 2000 or at 4000?
The snapshot has already been trained for 2000 iters, so when the new run
reaches 2000 iters, the total should be 4000.

Thank you
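
The learning-rate question can be worked through numerically. This sketch assumes the solver uses Caffe's "step" policy, where lr = base_lr * gamma ^ floor(iter / stepsize); the base_lr of 0.001 is a placeholder, since the question does not state it, and both function names are made up for illustration:

```python
def step_lr(base_lr, gamma, stepsize, iteration):
    """Caffe 'step' policy: base_lr * gamma ** floor(iteration / stepsize)."""
    return base_lr * gamma ** (iteration // stepsize)

def iterations_until_next_drop(start_iter, stepsize):
    """How many further iterations run before the next learning-rate drop."""
    return stepsize - (start_iter % stepsize)

STEPSIZE = 4000
GAMMA = 0.1
BASE_LR = 0.001  # placeholder value, not given in the question

# Counter restarted at 0 (only weights were loaded)
print(iterations_until_next_drop(0, STEPSIZE))
# Counter restored to 2000 (solver state was loaded)
print(iterations_until_next_drop(2000, STEPSIZE))
# Learning rate just before and at the first step boundary
print(step_lr(BASE_LR, GAMMA, STEPSIZE, 3999), step_lr(BASE_LR, GAMMA, STEPSIZE, 4000))
```

Under these assumptions, a run whose counter restarts at 0 sees the first drop 4000 iterations into the new run, while a run restored at iteration 2000 sees it 2000 iterations later.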


daf11865 · 2016-04-13T13:26:53Z

@eternautaCAT
Yes! I figured it out by myself after I posted that, so I deleted the question.
Thank you anyway for your kind reply; training now resumes correctly. Thanks!

ajeetksingh · 2016-04-21T17:55:55Z

I changed everything as described by @eternautaCAT.
But when I run the script, I get an error asking for the selective search boxes.
In Faster R-CNN, however, we generate proposals on the fly; we don't store them.
I may have made some silly mistake, but I can't pinpoint it.
The log is below:

Appending horizontally-flipped training examples...
coco_2014_train gt roidb loaded from /home/exx/ajeet/py-faster-rcnn/data/cache/coco_2014_train_gt_roidb.pkl
Loading selective_search boxes
1 / 82783
Traceback (most recent call last):
File "./tools/train_net.py", line 107, in <module>
imdb, roidb = combined_roidb(args.imdb_name)
File "./tools/train_net.py", line 72, in combined_roidb
roidbs = [get_roidb(s) for s in imdb_names.split('+')]
File "./tools/train_net.py", line 69, in get_roidb
roidb = get_training_roidb(imdb)
File "/home/exx/ajeet/py-faster-rcnn/tools/../lib/fast_rcnn/train.py", line 121, in get_training_roidb
imdb.append_flipped_images()
File "/home/exx/ajeet/py-faster-rcnn/tools/../lib/datasets/imdb.py", line 106, in append_flipped_images
boxes = self.roidb[i]['boxes'].copy()
File "/home/exx/ajeet/py-faster-rcnn/tools/../lib/datasets/imdb.py", line 67, in roidb
self._roidb = self.roidb_handler()
File "/home/exx/ajeet/py-faster-rcnn/tools/../lib/datasets/coco.py", line 124, in selective_search_roidb
return self._roidb_from_proposals('selective_search')
File "/home/exx/ajeet/py-faster-rcnn/tools/../lib/datasets/coco.py", line 150, in _roidb_from_proposals
method_roidb = self._load_proposals(method, gt_roidb)
File "/home/exx/ajeet/py-faster-rcnn/tools/../lib/datasets/coco.py", line 189, in _load_proposals
raw_data = sio.loadmat(box_file)['boxes']
File "/usr/local/lib/python2.7/dist-packages/scipy/io/matlab/mio.py", line 134, in loadmat
MR = mat_reader_factory(file_name, appendmat, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/scipy/io/matlab/mio.py", line 57, in mat_reader_factory
byte_stream = _open_file(file_name, appendmat)
File "/usr/local/lib/python2.7/dist-packages/scipy/io/matlab/mio.py", line 23, in _open_file
return open(file_like, 'rb')
IOError: [Errno 2] No such file or directory: '/home/exx/ajeet/py-faster-rcnn/data/coco_proposals/selective_search/mat/COCO_train2014/COCO_train2014_0000002/COCO_train2014_000000262145.mat'

ericromanenghi · 2016-04-22T12:13:29Z

How are you running the training? Could you show me the command?

SaiAdityaG · 2016-05-24T08:17:05Z

Is it also possible to resume training with the 'alternate training' method?

ericromanenghi · 2016-05-24T11:58:23Z

I haven't tried it with alternate training, but it should work, because you only have to change lib/fast_rcnn/train.py (and maybe train_net.py if the option to take the solver state is not there; I changed my code, so I don't remember), and those files are the same for end2end and alternate.
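
A rough sketch of what such a change to the argument parsing might look like. The `--snapshot` option and the `previous_state` name are local additions for illustration, not something the stock script is confirmed to have, and the other options are reduced to the two that matter here:

```python
import argparse

def build_parser():
    """Argument parser with a hypothetical --snapshot option added."""
    parser = argparse.ArgumentParser(description="Train a network (sketch)")
    parser.add_argument("--solver", dest="solver", default=None, type=str,
                        help="solver prototxt")
    parser.add_argument("--weights", dest="pretrained_model", default=None, type=str,
                        help="initialize with pretrained model weights")
    # Hypothetical addition: path to a .solverstate file to resume from
    parser.add_argument("--snapshot", dest="previous_state", default=None, type=str,
                        help="resume from this solver state")
    return parser

args = build_parser().parse_args(
    ["--solver", "my_solver.prototxt", "--snapshot", "xxx_iter_2000.solverstate"])
print(args.previous_state)     # xxx_iter_2000.solverstate
print(args.pretrained_model)   # None, because --weights was omitted
```

The parsed `previous_state` would then have to be passed through to the SolverWrapper, which is the part that lives in lib/fast_rcnn/train.py.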

pitLog · 2016-07-20T08:15:26Z

Hi,
I can't find this part of your code, "elif previous_state is not None: ...", in train.py on GitHub.
Also, when I pass the --snapshot option, I get the error message "unrecognized arguments".

Is this option still available?
