
How to use new snapshotting? #35

Open
xksteven opened this issue Jul 10, 2015 · 16 comments

Comments

@xksteven

fast-rcnn doesn't take --snapshot as an argument, so I'm not sure how to use a snapshot.

I'm asking because models/VGG16/solver.prototxt says:
"We disable standard caffe solver snapshotting and implement our own snapshot"

Thanks

@WilsonWangTHU

It's in lib/fast_rcnn/config.py
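
For anyone else looking, the snapshot-related keys there look roughly like this (quoted from memory of the upstream lib/fast_rcnn/config.py, so double-check your checkout):

    # Iterations between snapshots
    __C.TRAIN.SNAPSHOT_ITERS = 10000

    # Optional infix inserted into the snapshot filename
    __C.TRAIN.SNAPSHOT_INFIX = ''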

@xksteven
Author

xksteven commented Aug 6, 2015

In that file I can change the interval between snapshots and the snapshot infix, but there's nothing about using a snapshot during training.

Would I just change the snapshot number in solver.prototxt to reference the current snapshot?

@WilsonWangTHU

@xksteven I guess you would like to run validation during training?
I am not sure whether that's supported by the current fast-rcnn version, as all the forward passes are driven from the Python side, and I don't think there is a testing function during training for now.
I am afraid you might need to modify the code yourself.

@xksteven
Author

xksteven commented Aug 7, 2015

@WilsonWangTHU
You know how in Caffe you can pass the snapshot option, e.g. -snapshot=model_iter_xxx.solverstate, to restart training from that point? Normally in Caffe the solverstate and the caffemodel (saved as model_iter_xxx.caffemodel) are both in the same directory, but with fast-rcnn I only see the caffemodel saved in output/default/imdb_trainval. I'd like to be able to restart training using the weights stored there.

I'm running it on a cluster with a time limit, and it kills my process at certain intervals. I just want to be able to restart training from that snapshot.
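
For reference, resuming in plain Caffe via pycaffe looks roughly like this (a minimal sketch; the .solverstate filename is a hypothetical example):

    import caffe

    caffe.set_mode_gpu()
    solver = caffe.SGDSolver('solver.prototxt')
    # restore() reloads both the net weights and the solver state
    # (iteration count, momentum history) from a .solverstate file
    solver.restore('model_iter_10000.solverstate')
    solver.solve()  # continue training from where it stopped

fast-rcnn only calls net.save() from its own Python snapshot code, so no .solverstate is ever written, which is why this doesn't work out of the box.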

@kyuusaku

I have the same problem.

@kyuusaku

How do I restart training from a snapshot? Can anyone provide some tips? Thanks.

@IdiosyncraticDragon

@kyuusaku @xksteven I have run into the same problem. Have you found an effective solution? Thanks

@lynetcha

Make the following modifications and you will be able to use the --snapshot argument

In tools/train_net.py

    def parse_args():
        """
        Parse input arguments
        """
        parser = argparse.ArgumentParser(description='Train a Fast R-CNN network')
        parser.add_argument('--gpu', dest='gpu_id',
                            help='GPU device id to use [0]',
                            default=0, type=int)
        parser.add_argument('--solver', dest='solver',
                            help='solver prototxt',
                            default=None, type=str)
        parser.add_argument('--iters', dest='max_iters',
                            help='number of iterations to train',
                            default=40000, type=int)
        parser.add_argument('--weights', dest='pretrained_model',
                            help='initialize with pretrained model weights',
                            default=None, type=str)
        parser.add_argument('--snapshot', dest='previous_state',
                            help='initialize with previous state',
                            default=None, type=str) 
        parser.add_argument('--cfg', dest='cfg_file',
                            help='optional config file',
                            default=None, type=str)
        parser.add_argument('--imdb', dest='imdb_name',
                            help='dataset to train on',
                            default='voc_2007_trainval', type=str)
        parser.add_argument('--rand', dest='randomize',
                            help='randomize (do not use a fixed seed)',
                            action='store_true')
        parser.add_argument('--set', dest='set_cfgs',
                            help='set config keys', default=None,
                            nargs=argparse.REMAINDER)

        if len(sys.argv) == 1:
            parser.print_help()
            sys.exit(1)

        args = parser.parse_args()
        return args

In lib/fast_rcnn/train.py

    class SolverWrapper(object):
        """A simple wrapper around Caffe's solver.
        This wrapper gives us control over the snapshotting process, which we
        use to unnormalize the learned bounding-box regression weights.
        """

        def __init__(self, solver_prototxt, roidb, output_dir,
                     pretrained_model=None, previous_state=None):
            """Initialize the SolverWrapper."""
            self.output_dir = output_dir

            print 'Computing bounding-box regression targets...'
            self.bbox_means, self.bbox_stds = \
                    rdl_roidb.add_bbox_regression_targets(roidb)
            print 'done'

            self.solver = caffe.SGDSolver(solver_prototxt)
            if pretrained_model is not None:
                print ('Loading pretrained model '
                       'weights from {:s}').format(pretrained_model)
                self.solver.net.copy_from(pretrained_model)
            elif previous_state is not None:
                print ('Restoring solver state '
                       'from {:s}').format(previous_state)
                self.solver.restore(previous_state)

            self.solver_param = caffe_pb2.SolverParameter()
            with open(solver_prototxt, 'rt') as f:
                pb2.text_format.Merge(f.read(), self.solver_param)

            self.solver.net.layers[0].set_roidb(roidb)

    ...

    def train_net(solver_prototxt, roidb, output_dir,
                  pretrained_model=None, max_iters=40000, previous_state=None):
        """Train a Fast R-CNN network."""
        sw = SolverWrapper(solver_prototxt, roidb, output_dir,
                           pretrained_model=pretrained_model,
                           previous_state=previous_state)

        print 'Solving...'
        sw.train_model(max_iters)
        print 'done solving'

@chrert

chrert commented Jan 19, 2016

Thanks for the code, but how do you save the solverstate during Fast R-CNN training? It looks like the method Solver::SnapshotSolverState isn't exposed to pycaffe...

@lynetcha

Did you change "snapshot: 0" to e.g. "snapshot: 10000" in your solver.prototxt? That makes Caffe save the solver state every 10000 iterations.
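
In the solver prototxt that looks something like this (the prefix is just an example):

    # models/VGG16/solver.prototxt
    snapshot: 10000
    snapshot_prefix: "vgg16_fast_rcnn"

Note that `snapshot` is an interval, so Caffe writes a .caffemodel/.solverstate pair every 10000 iterations.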

@chrert

chrert commented Jan 21, 2016

Ah, thanks! Didn't think of that...

@smichalowski

@lynetcha, one more modification:

In tools/train_net.py

    output_dir = get_output_dir(imdb)
    print 'Output will be saved to `{:s}`'.format(output_dir)

    train_net(args.solver, roidb, output_dir,
              pretrained_model=args.pretrained_model,
              max_iters=args.max_iters, previous_state=args.previous_state)

Also remember to omit the --weights param when resuming.
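
With all the changes above, resuming would look something like this (paths are hypothetical examples):

    ./tools/train_net.py --gpu 0 \
        --solver models/VGG16/solver.prototxt \
        --imdb voc_2007_trainval \
        --snapshot output/default/voc_2007_trainval/vgg16_fast_rcnn_iter_10000.solverstate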

@twmht

twmht commented Aug 27, 2016

hi @po0ya

What if I don't save the extra file with the last-layer weights? Would mAP be bad after retraining?

@po0ya

po0ya commented Aug 29, 2016

Hello @twmht

Basically it'll mess up the whole network if you want to continue training. The network is trained to regress bounding boxes with zero-mean, unit-variance targets. For convenience at test time, the weights and biases of the last layer are scaled by the stds and shifted by the means when a snapshot is saved; otherwise the predictions would have to be scaled and shifted manually. But those scaled weights are not the ones that were learned by backprop, so retraining from them would be meaningless for the network.
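
For context, this is roughly the scale/shift that fast-rcnn's SolverWrapper.snapshot() applies before writing the .caffemodel (a paraphrased sketch of lib/fast_rcnn/train.py, not the verbatim source):

    # Unnormalize for test-time convenience:
    #   W_saved = W_learned * stds   (broadcast row-wise)
    #   b_saved = b_learned * stds + means
    net.params['bbox_pred'][0].data[...] = \
        net.params['bbox_pred'][0].data * self.bbox_stds[:, np.newaxis]
    net.params['bbox_pred'][1].data[...] = \
        net.params['bbox_pred'][1].data * self.bbox_stds + self.bbox_means
    # ... net.save(filename) runs here, then the original (normalized)
    # weights are copied back so training continues unaffected.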

EDIT: Add these few lines to the end of the SolverWrapper constructor (__init__) to invert that transform after restoring a snapshot:

    # Invert the snapshot-time unnormalization so training resumes
    # with the normalized regression weights the solver expects.
    net = self.solver.net
    found = False
    for k in net.params.keys():
        if 'bbox_pred' in k:
            bbox_pred = k
            found = True
    if found:
        print('[#] Renormalizing the final layers back')
        # skip the first 4 outputs (background class)
        net.params[bbox_pred][0].data[4:, :] = \
            (net.params[bbox_pred][0].data[4:, :] *
             1.0 / self.bbox_stds[4:, np.newaxis])
        net.params[bbox_pred][1].data[4:] = \
            (net.params[bbox_pred][1].data - self.bbox_means)[4:] * \
            1.0 / self.bbox_stds[4:]
    else:
        print('Warning: layer "bbox_pred" not found')

zhangjiangqige added a commit to zhangjiangqige/py-R-FCN-multiGPU that referenced this issue May 5, 2017
…tate file (--snapshot /a/b/c.solverstate) (rbgirshick/fast-rcnn#35)

solver.cpp is modified according to the master branch of caffe; it seems that Microsoft made some changes that prevented restoring multiple solvers
zhangjiangqige added a commit to zhangjiangqige/py-R-FCN-multiGPU that referenced this issue Sep 29, 2017
…tate file (--snapshot /a/b/c.solverstate) (rbgirshick/fast-rcnn#35)

solver.cpp is modified according to the master branch of caffe; it seems that Microsoft made some changes that prevented restoring multiple solvers
@ds2268

ds2268 commented Nov 17, 2017

@po0ya but aren't the weights (*.caffemodel) saved by the default Caffe solver already normalized? They were never unnormalized, because that caffemodel was not saved through the provided snapshot function. So the produced *.solverstate is linked to a *.caffemodel that was not produced by the Faster R-CNN snapshot function. With the resume functionality you get two versions of the caffemodel: the one written by the default solver snapshot, and the one written by the Faster R-CNN snapshot function, whose weights are unnormalized before saving. So I guess the renormalization is not needed.

@misssprite

misssprite commented May 23, 2018

The net params in SolverWrapper's snapshot function are first unnormalized, saved, and then restored to the normalized version. So which version ends up in a snapshot depends on when Caffe's own snapshot is called.

I didn't dig into the Caffe code, but I think disabling snapshotting in solver.prototxt and manually calling solver.snapshot() gives better control over exactly which version gets snapshotted.

Actually, I looked into the log and found that the Caffe snapshot is called before the snapshot in SolverWrapper. Diffing the param files shows that the Caffe snapshot indeed saves a different (normalized) version than SolverWrapper. Manually invoking solver.snapshot() produced an identical (normalized) .caffemodel.

So with the Caffe snapshot we can resume from the .solverstate safely, without unnormalizing the parameters. But this produces two versions of '.caffemodel's; it's up to you which version of the parameters to snapshot.
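
A minimal sketch of that approach (assuming a pycaffe build that exposes Solver.snapshot(), as recent BVLC Caffe does):

    import caffe

    caffe.set_mode_gpu()
    solver = caffe.SGDSolver('models/VGG16/solver.prototxt')  # with snapshot: 0
    for _ in range(4):
        solver.step(10000)  # train 10k iterations at a time
        # writes a matching .caffemodel/.solverstate pair holding the
        # current (normalized) parameters, so resuming stays consistent
        solver.snapshot()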
