Some questions about training #6
Comments
Hi, here are the answers:
Question about 1: so you randomly choose the first frame and then choose the other two frames based on the index of the first frame, right? For each epoch, did you use all images of each video, or only 3 frames of each video? I found it would be too slow if I used all images to train. Question about 3: in YouTube-VOS, the video "b938d79dff" has only four frames for training, which is fewer than maximum_skip. How did you deal with this case?
Did you use all images when you trained on COCO? The COCO dataset is too large...
Have you applied random affine transforms to those samples in main training? @seoungwugoh
@aaaaannnnn11111 We use the entire COCO training set.
Thanks a lot! @seoungwugoh
@seoungwugoh Some images in COCO have more than 90 mask objects. Did you use a threshold to limit the number of objects in one image?
@aaaaannnnn11111 Yes, we randomly select 3 objects if an image/video contains more than 3 objects.
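The object cap described in the reply above could be sketched as follows. This is a minimal illustration, not the authors' code: the function name `pick_objects` and the representation of objects as integer ids are assumptions made for the example.

```python
import random


def pick_objects(object_ids, max_objects=3, rng=random):
    """Keep at most max_objects ids, chosen uniformly at random.

    Hypothetical helper: the thread only states that 3 objects are
    randomly selected when an image/video contains more than 3.
    """
    ids = sorted(set(object_ids))
    if len(ids) <= max_objects:
        return ids
    return sorted(rng.sample(ids, max_objects))


# A COCO image with 90 annotated objects is reduced to 3 of them.
print(pick_objects(range(1, 91)))
```

Masks of the objects that are not picked would presumably be treated as background, but the thread does not say so explicitly.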
You said maximum_skip was increased by 5 every 20 epochs during main training. I am wondering whether training for 20 epochs is enough for each maximum_skip. Does it mean that for each maximum_skip we need to train until it converges and then change to the next maximum_skip, or not? @seoungwugoh
@siyueyu It is just an empirically chosen hyper-parameter, not thoroughly validated. You may train the model until convergence for each stage of the training curriculum.
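For readers who want the fixed schedule rather than train-until-convergence, the "increase by 5 every 20 epochs" rule can be written as a small function. The starting value (`start=0`) and the upper bound (`cap=25`) are assumptions for this sketch; the thread only confirms the step of 5 and the 20-epoch interval.

```python
def maximum_skip_for_epoch(epoch, start=0, step=5, epochs_per_stage=20, cap=25):
    """Return maximum_skip for a 0-based epoch index.

    step and epochs_per_stage follow the thread; start and cap are
    assumed values and should be adjusted to your own setup.
    """
    stage = epoch // epochs_per_stage
    return min(cap, start + step * stage)


for epoch in (0, 19, 20, 60, 100):
    print(epoch, maximum_skip_for_epoch(epoch))
```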
Hi, I have a question about how to add YouTube-VOS when training the DAVIS model.
Hi, I have a question about training on the main datasets such as DAVIS/YouTube-VOS. When you sample 3 frames from a video, do you resize them to (384, 384)? If you do, do you keep the aspect ratio between width and height?
@gewenbin292 We use ConcatDataset, and DAVIS is weighted 5 times larger than YouTube-VOS. @npmhung First, random resizing [384, shorter_side_original_size] and random cropping [384, 384] are performed. Then, an affine transform is performed with the following parameter range.
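A rough sketch of the two steps in this reply, using only the Python standard library. It reads "[384, shorter_side_original_size]" as "new shorter side drawn between 384 and the original shorter side", and it reads "weighted 5 times" as "DAVIS entries repeated 5 times before concatenation". Both readings, the function names, and the handling of images smaller than 384px are assumptions, not confirmed details of the original training code.

```python
import random

CROP = 384  # crop size quoted in the thread


def random_resize_then_crop(width, height, rng=random):
    """Return (new_width, new_height, crop_box) for one sample.

    crop_box is (left, top, right, bottom) inside the resized image.
    """
    shorter = min(width, height)
    # Assumption: images whose shorter side is below 384 are scaled up to 384.
    new_shorter = rng.randint(CROP, shorter) if shorter > CROP else CROP
    scale = new_shorter / shorter
    # Resize both sides by the same factor to keep the aspect ratio.
    new_w = max(CROP, round(width * scale))
    new_h = max(CROP, round(height * scale))
    left = rng.randint(0, new_w - CROP)
    top = rng.randint(0, new_h - CROP)
    return new_w, new_h, (left, top, left + CROP, top + CROP)


def weighted_index(num_davis, num_youtube, davis_weight=5):
    """List (dataset_name, video_index) pairs with DAVIS repeated davis_weight times."""
    davis = [("davis", i) for i in range(num_davis)] * davis_weight
    youtube = [("youtube", i) for i in range(num_youtube)]
    return davis + youtube


print(random_resize_then_crop(854, 480))
print(len(weighted_index(60, 100)))
```

With PyTorch, the same weighting idea could be expressed by passing a list that contains the DAVIS dataset object five times, plus the YouTube-VOS dataset, to `torch.utils.data.ConcatDataset`.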
@aaaaannnnn11111 @siyueyu @gewenbin292 It seems that you wrote some training code. I think many people (including me!) would be interested in that. Could you please share it, maybe by creating a fork? (Sorry for posting this here, but it seems none of you provides an email address on GitHub.)
@seoungwugoh Sorry to say that I can't. I failed to completely reproduce your work.
@siyueyu Can you maybe share your attempt/code anyway? It could be a good starting point for others/me to try to reproduce the results even if it does not fully work yet.
@seoungwugoh I have some questions about training. The paper says that you used 4 GPUs and a batch size of 4. Do you process the four images at the same time and then update the key and value according to the time sequence?
@seoungwugoh Hi, did you test the pre-trained model (trained on COCO) on the DAVIS-2017 validation set? How was the result?
@pvoigtlaender Sorry to tell you that I haven't really tried reproducing the code. I only considered the idea of two-stage training. In my attempt at two-stage training, I found that it was easy to overfit in the fine-tuning stage. So, I think some parameters matter, but I don't yet know which parameters are more important.
@fangziyi0904 We used the DataParallel functionality in PyTorch. So, the gradient is computed based on the 4 samples in the batch. Backpropagation is done after all the frames are processed. @npmhung During training, in the case of DAVIS, I start with 480p. For YouTube-VOS, I start with the original resolution (mostly 720p). Then, they are resized and cropped to 384x384 as mentioned above. After some data augmentation, training is done on 384x384 patches. Testing is done at 480p resolution. In the case of YouTube-VOS testing, we resize the frames to a height of 480px, keeping the aspect ratio. @aaaaannnnn11111 Please see Table 4 in our paper.
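The test-time resize for YouTube-VOS mentioned above (480px height, aspect ratio kept) amounts to the following size computation. The function name and the rounding of the width are assumptions; the thread does not say how fractional widths are handled.

```python
def resize_to_height(width, height, target_height=480):
    """Return the (width, height) that gives target_height while keeping aspect ratio."""
    scale = target_height / height
    return round(width * scale), target_height


# A 720p YouTube-VOS frame.
print(resize_to_height(1280, 720))
```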
Hi @seoungwugoh, you said you performed random resizing to
Hi, thank you for sharing the code. I still have a question about training samples. I wonder how many samples (3-frame triplets) you use for each video. From your previous information, I understand that for each video you only sample 3 frames. So in the case of main training on DAVIS17, for example, where the training set contains 60 videos, are only 60 x 3 = 180 frames used for training in each epoch? Will too many training samples from the training set result in severe overfitting?
@seoungwugoh Hi, thanks for sharing the code! I have some questions about how to simulate training videos using image datasets.
Hi, just a silly question. When pre-training the model on the image datasets, how do you choose the validation set?
Hi all, sorry for the late reply. Here are my answers: @chenz97 Sorry for the misunderstanding. The notation [a, b] means "from a to b". We first choose the new size of the shorter side between 384px and the original shorter-side size, then resize the longer side accordingly, keeping the aspect ratio. @lyxok1 We randomly choose 3 frames from each video at every iteration. All the frames in the video dataset can be used. @OasisYang We used all the datasets listed in the paper. We found the affine transform is okay for pre-training purposes. @npmhung We did not care much about validation during pre-training. We simply fit the model as much as possible, as our model will be fine-tuned afterward.
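One way to draw 3 temporally ordered frames per video at each iteration is sketched below. Two points are assumptions rather than confirmed behavior: `maximum_skip` is taken as the number of frames skipped between neighboring sampled frames, and for videos too short for the requested skip the gap is simply clamped. The function name `sample_triplet` is hypothetical.

```python
import random


def sample_triplet(num_frames, maximum_skip, rng=random):
    """Return three increasing frame indices from a video with num_frames frames.

    Assumption: at most maximum_skip frames are skipped between neighboring
    sampled frames, so neighboring indices differ by at most maximum_skip + 1.
    """
    if num_frames < 3:
        raise ValueError("need at least 3 frames")
    # Clamp the gap so that short videos still yield three distinct frames.
    max_gap = min(maximum_skip + 1, (num_frames - 1) // 2)
    gap1 = rng.randint(1, max_gap)
    gap2 = rng.randint(1, max_gap)
    first = rng.randint(0, num_frames - 1 - gap1 - gap2)
    return first, first + gap1, first + gap1 + gap2


print(sample_triplet(num_frames=4, maximum_skip=25))   # short video: gaps clamped to 1
print(sample_triplet(num_frames=100, maximum_skip=25))
```

Because a fresh triplet is drawn at every iteration, every frame of a video can appear in training over time even though only 3 frames are used per sample.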
For maximum_skip on YouTube-VOS: if maximum_skip is 25, will you then skip from 00000.jpg to 00025.jpg or to 00125.jpg (since there's only one image for every 5 frames)? Also, do you use any other kinds of augmentation? Thanks for your help!
Hi @seoungwugoh, I got it, thanks for your reply!
@pvoigtlaender In addition to the augmentations you mentioned, we also use a color shift that multiplies the RGB values by a random value in [0.97, 1.03]. That's all.
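The color shift could look roughly like this. The reply does not say whether one factor is drawn per channel or a single factor is shared by all three channels; this sketch assumes one factor per channel, shared by every pixel in the frame, and represents a frame as a plain list of (r, g, b) tuples for simplicity.

```python
import random


def color_shift(pixels, low=0.97, high=1.03, rng=random):
    """Multiply each RGB channel by a random factor in [low, high].

    pixels is a list of (r, g, b) integer tuples in 0..255.
    Assumption: one factor per channel, shared across the whole frame.
    """
    factors = [rng.uniform(low, high) for _ in range(3)]
    return [
        tuple(min(255, max(0, round(c * f))) for c, f in zip(px, factors))
        for px in pixels
    ]


print(color_shift([(0, 0, 0), (100, 100, 100), (255, 255, 255)]))
```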
In the fine-tuning phase, how do you define "an epoch"? And did you freeze any part of the model while fine-tuning?
Hello, I have one question about pre-training and training.
@npmhung @fangziyi0904
Thank you for your reply~ I have another question. I see you apply the random affine transformation in both the main-training and pre-training stages, and I wonder how much these augmentations affect the final segmentation results. E.g., if I do not apply the transformation you mentioned above in main training, how much will the results drop? I also wonder whether, in main training, all frames in the 3-frame tuple are transformed with the same parameters or each frame is transformed independently. Thank you~
@lyxok1 The random affine transforms are essential for pre-training to synthesize frames from an image. I'm not sure about the accuracy without data augmentation because I have never trained without it. Yes, for main training, the same parameters are applied to all the sampled frames.
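The difference between the two stages can be made concrete with a small parameter sampler: one draw shared by all frames (main training, per the reply above) versus one draw per frame (a plausible reading of how pre-training synthesizes different frames from a single image). The names `sample_affine` and `params_for_clip` are hypothetical, and the ranges in the usage lines are placeholder values, not the ranges used by the authors.

```python
import random


def sample_affine(ranges, rng=random):
    """Draw one value per affine parameter from its (low, high) range."""
    return {name: rng.uniform(low, high) for name, (low, high) in ranges.items()}


def params_for_clip(ranges, num_frames=3, shared=True, rng=random):
    """Return one parameter dict per frame.

    shared=True  -> a single draw copied to every frame (main training).
    shared=False -> an independent draw per frame (assumed for pre-training).
    """
    if shared:
        params = sample_affine(ranges, rng)
        return [dict(params) for _ in range(num_frames)]
    return [sample_affine(ranges, rng) for _ in range(num_frames)]


# Placeholder ranges for illustration only.
example_ranges = {"rotation_deg": (-10.0, 10.0), "scale": (0.9, 1.1)}
print(params_for_clip(example_ranges, shared=True))
print(params_for_clip(example_ranges, shared=False))
```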
Hi, I've only done the main training on DAVIS2017 without pre-training. When I test the model on the DAVIS2017 validation set, I find that if there is only one object in the video, the results can be great, but once there are multiple objects in the video, only one object tends to be segmented, while the J mean of the remaining objects is sometimes even close to 0. Does it mean the model is overfitting?
@pixelsmaker If the model works fine with one object, it should be okay for multiple objects. I think there may be some bugs when combining the probabilities. Try to segment the objects one by one and then look at the results. If you don't see reasonable results when you do this, then, of course, the model is poorly trained. I recommend using YouTube-VOS for training.
@seoungwugoh
Hi @seoungwugoh, I am also trying to reproduce your work. I have used COCO, MSRA, ECSSD and VOC in pre-training, and followed your instructions about the affine and crop parameters (currently no color shift). However, my reproduced model only achieves around 50 JF-mean on DAVIS 2017, which is far from the result in Table 4 (about 60). After checking the output, I found that only the first few frames are segmented well. The later frames are quite bad. This seems reasonable, as in pre-training the model only needs to process a 3-frame training example. So when you test the pre-trained model, is the setting the same as the test for main training? Memorizing every 5 frames and going through all the frames in DAVIS? And is there any additional procedure before testing a pre-train-only model? Thanks~
Hi, does "maximum_skip" represent the interval between neighboring sampled frames, or the interval between the first sampled frame and the last sampled frame?
Thanks for answering my previous question, but I still have many questions...
How did you choose the first frame of the 3 temporally ordered frames?
After how many epochs do you increase maximum_skip?
What is maximum_skip when the dataset is YouTube-VOS?
Thanks a lot!