Unable to train a working model #155

Closed
blaine141 opened this issue Jan 8, 2021 · 24 comments

Comments

@blaine141
Contributor

I am trying to train a working model with DOPE and am struggling. I tried running train.py on the FAT dataset to train the soup object and am unable to produce any model that matches the performance of the provided pre-trained weights.

Here is the image I have been using for my tests

net_soup

To start, here are the results from the soup_60.pth weights. These are the output belief maps; it detected the soup correctly in this sample.

soup_60 beliefs

The first thing I tried when training my own soup model was to isolate all of the samples containing the soup. That came to 8321 samples, and I only used the images from the left camera. I split the data into train and test sets with a 90/10 random split. After training for 60 epochs, these were the average test losses per epoch.
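
For anyone reproducing this, a 90/10 random split like the one described can be done with torch.utils.data.random_split. This is only a sketch; the Dataset passed in is whatever loader train.py builds over the filtered soup samples, which is not shown here.

import torch
from torch.utils.data import Dataset, random_split

def split_dataset(full_dataset: Dataset, train_frac: float = 0.9, seed: int = 0):
    """Reproducible random train/test split of an arbitrary Dataset."""
    n_total = len(full_dataset)
    n_train = int(train_frac * n_total)          # e.g. 8321 -> 7488 train / 833 test
    generator = torch.Generator().manual_seed(seed)
    train_set, test_set = random_split(
        full_dataset, [n_train, n_total - n_train], generator=generator)
    return train_set, test_set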

soup_without_negatives_chart

It was unable to detect the object in the image, and here are the belief maps.

soup_without_negatives

It appeared to be training, but based on the wiki I would expect it to be able to detect objects after 60 epochs.

The other thing I tried was using both samples with the soup and samples without it, to see if results improved. I used an equal number of positives and negatives. As you would expect, the loss was cut in half, but it did not train any better.

soup_with_negatives_chart

soup_with_negatives

From the loss graph it seems the model was not done learning, but the wiki suggested 30 epochs. I can train for longer, but I am already dealing with training times of over a day. How did you train your soup model, and why are my results so much worse? I want to do a proof-of-concept training run before moving to a custom object. Let me know any suggestions you may have.

@blaine141
Contributor Author

Now that I think about it more, would it be helpful to include both the left and right images? Would that just be duplicate data, or would it contribute something?

@TontonTremblay
Collaborator

I would say you do not have enough training data. Have you tried generating DR data using NDDS? We have been working on a new tool that you could also use to generate training data: https://github.com/owl-project/ViSII.

@blaine141
Contributor Author

Thanks, I'll look into it. I also just noticed that in train.py there is an inconsistency in the default learning rate.

parser.add_argument('--lr', 
    type=float, 
    default=0.0001, 
    help='learning rate, default=0.001')

Should the default be 0.001 or 0.0001? Would this cause it not to learn within 30 epochs, as I am seeing? I'm asking because I am running a training session that I do not want to interrupt.
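
For what it's worth, passing --lr on the command line sidesteps the ambiguous default. The flags other than --lr in this example are assumptions based on the options defined in train.py and may differ:

python train.py --data path/to/fat_soup --object soup --lr 0.0001 --epochs 60 --outf soup_run/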

@TontonTremblay
Collaborator

That is a good question; I do not remember the details of the learning rate. Since it uses Adam, it should adapt fairly quickly to something more appropriate.

@TontonTremblay
Collaborator

What sort of results are you getting on the training data as well? To me, the results you shared look like you did not let it train for long enough. The pre-trained weights were trained on 4 P100s for 24 h; that is 60 epochs with a batch size of 128 on a dataset of 200k images.

@blaine141
Contributor Author

blaine141 commented Jan 8, 2021 via email

@hoangminh1104

> What sort of results are you getting on the training data as well? To me, the results you shared look like you did not let it train for long enough. The pre-trained weights were trained on 4 P100s for 24 h; that is 60 epochs with a batch size of 128 on a dataset of 200k images.

Hi, can I ask what the composition of the 200k images is? Is it 50% DR and 50% FAT, or is the split different? Also, did you use all the depth images as well? I'm new, so pardon me if my question is silly.

@blaine141
Contributor Author

I'm not completely sure, but in the FAT dataset I was only able to find about 8k samples containing the soup, so if you want 200k you will have to generate almost all of them yourself. In terms of how to get good results, I'm training on the FAT data for 500 epochs to see when it actually produces good results with only the 8k samples. Since @TontonTremblay mentioned he did 60 epochs with 200k samples, that made me think I needed more batches for it to train.

I asked in another issue and the depth images are not used.

@blaine141
Contributor Author

Back to being unable to train a working model. As expected, training for longer did nothing. I did, however, encounter a troubling issue: the model I trained is unable to detect even the training samples. I took the checkpoint from 60 epochs and ran it against a training sample. Here are the results.

My model:
train_sample

From these belief maps, DOPE was unable to detect any object. Compare this to the pre-trained model, which did detect the soup correctly.

Pretrained:
pretrained

Here is another example of my model running on another training sample. The belief maps look correct but the algorithm does not detect any object.

My model:
my_train_2

Why does my model not detect the object? Why does the pre-trained model, whose belief maps look similar, detect it? What is the difference, and what needs to change for my model to detect correctly on training samples? I'm thinking this has to be some bug in the code; the model should be able to detect its own training samples, even with a small dataset.

@blaine141
Contributor Author

blaine141 commented Mar 5, 2021

I was able to train a working model and the system is working great! I'm very happy with the performance. There were a few things that I had to change along the way from what I was initially doing. Here is what I was doing wrong for anyone else who may be experiencing issues.

  • I had a small batch size due to limited GPU memory. I had to implement sub-batching (gradient accumulation) to get the effective batch size up to the recommended 128, which made the loss converge much faster. See my train.py, and the sketch after this list.
  • I implemented AMP (automatic mixed precision) to speed training up by about 2x, speed detection up by about 3x, and reduce the memory footprint. It did not improve accuracy, but I thought I should mention it.
  • I generated 50k samples to train on, rather than the small set given in this repo. I used NVISII.
  • I changed the normalization parameters in detector.py to match those used during training. I didn't test whether this actually made a difference; see the normalization note below.
  • I made sure my training samples were easy to detect. Results were bad at one point when my object was visually confusing, colorless, and had a large, mostly empty bounding box. I instead trained on this object with a limited range of orientations.
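
As a rough illustration of the first two points, here is a minimal sketch of sub-batching (gradient accumulation) combined with AMP in PyTorch. It is not the actual train.py; model, train_loader, and criterion stand in for DOPE's network, data loader, and belief/affinity-map loss.

import torch
from torch.cuda.amp import GradScaler, autocast

def train_one_epoch(model, train_loader, criterion, optimizer,
                    device="cuda", effective_batch=128, loader_batch=8):
    """One epoch with gradient accumulation (sub-batching) and mixed precision."""
    accum_steps = effective_batch // loader_batch   # e.g. 128 / 8 = 16 sub-batches
    scaler = GradScaler()
    model.train()
    optimizer.zero_grad()
    for step, (images, targets) in enumerate(train_loader):
        images, targets = images.to(device), targets.to(device)
        with autocast():                             # forward pass and loss in mixed precision
            loss = criterion(model(images), targets) / accum_steps  # average over sub-batches
        scaler.scale(loss).backward()                # accumulate scaled gradients
        if (step + 1) % accum_steps == 0:            # optimizer step once per effective batch
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()

Dividing the loss by accum_steps keeps the accumulated gradient comparable to a single large batch of 128.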

Here is a link to my modified train.py. If anyone wants an example dataset to train on, reach out to me and I will share mine. Feel free to ask me any questions as well.
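
On the normalization point above: the mean/std values below are placeholders rather than the repo's actual numbers; the point is simply that detector.py must apply the same normalization that the training transform applies.

import torchvision.transforms as transforms

MEAN = (0.5, 0.5, 0.5)   # placeholder values; use whatever train.py actually uses
STD = (0.5, 0.5, 0.5)

shared_preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(MEAN, STD),
])
# Reuse shared_preprocess in both the training Dataset and the detector's
# image preprocessing so the two stay consistent.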

Here's a video of DOPE working in our simulator if anyone is interested.
DOPE

@fatooo

fatooo commented May 20, 2021

@blaine141 hi can you please share your dataset with me?

@blaine141
Contributor Author

Yeah just shoot an email to blaine141@gmail.com and I will respond with a link

@TontonTremblay
Collaborator

Very cool results @blaine141. Could you share some of the renders from nvisii? Thank you for the update. I can confirm that the normalization will have an impact if the parameters do not match; we noticed that recently in a project (https://arxiv.org/pdf/2011.07748v3.pdf). Also, I am sorry I did not answer your Jan 10 message, but it looks like you have made great progress. I will refer to your post in the readme for people to look at when training on a single GPU. Thank you for sharing.

@blaine141
Contributor Author

If you want to look at some samples, here is my dataset: https://buckeyemailosu-my.sharepoint.com/:u:/g/personal/miller_8545_buckeyemail_osu_edu/EaK9JhScTaRDhgnCTsC6yoEBzPHJ2gZWK_z4PFYqzcYSlA?e=qjuUAb. It is very specific to our situation, but it can show you what works.

@TontonTremblay
Collaborator

Could you share a few renders (I am not sure I want to download the full dataset) so I can see how you approached rendering your object to train DOPE? I did not have time to provide more extensive ways of generating synthetic data scenes in nvisii (I have a few internally for this paper: https://www.dropbox.com/s/xmdo7k6dxvqv52b/visii_sdg_iclr21_workshop.pdf?dl=0), so I am intrigued. Did you share your rendering script online? Also, how was your experience with nvisii (hopefully better than training DOPE on a single GPU)?

@blaine141
Contributor Author

Here is one of our samples. We are trying to train the model to work in underwater environments for RoboSub.

cutie250

We randomized the pose of the model, wrapped a random background image over the dome, and used the dome space behind the camera as a light source with random color and brightness. We also added a lot of data augmentation in DOPE to try to improve generalization.

The script used to generate these is gen.py.
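
For readers who have not used NVISII, the sketch below shows the general shape of that kind of randomization loop. It is not gen.py: the calls are from the public nvisii Python API as I recall it and should be checked against the nvisii documentation, and the mesh, background list, camera settings, and ranges are all placeholders.

import os
import random
import nvisii

nvisii.initialize(headless=True)
nvisii.enable_denoiser()
os.makedirs("output", exist_ok=True)

# Placeholder camera; a real script would match the intended camera intrinsics.
camera = nvisii.entity.create(
    name="camera",
    transform=nvisii.transform.create("camera"),
    camera=nvisii.camera.create_from_fov(name="camera", field_of_view=0.785, aspect=640 / 480),
)
nvisii.set_camera_entity(camera)

# Placeholder object; a real script would load the target model's mesh instead.
obj = nvisii.entity.create(
    name="obj",
    mesh=nvisii.mesh.create_teapotahedron("obj_mesh"),
    transform=nvisii.transform.create("obj"),
    material=nvisii.material.create("obj"),
)

backgrounds = ["backgrounds/000.jpg", "backgrounds/001.jpg"]   # placeholder list

for i in range(50_000):
    # Random background wrapped over the dome; the dome also acts as the light,
    # so varying its intensity randomizes the lighting as described above.
    tex = nvisii.texture.create_from_file(f"bg_{i}", random.choice(backgrounds))
    nvisii.set_dome_light_texture(tex)
    nvisii.set_dome_light_intensity(random.uniform(0.5, 3.0))

    # Random object pose in front of the camera.
    obj.get_transform().set_position(nvisii.vec3(
        random.uniform(-0.5, 0.5), random.uniform(-0.5, 0.5), random.uniform(0.8, 2.0)))
    obj.get_transform().set_rotation(nvisii.angleAxis(
        random.uniform(0.0, 6.283), nvisii.vec3(0.0, 0.0, 1.0)))

    nvisii.render_to_file(width=640, height=480, samples_per_pixel=64,
                          file_path=f"output/{i:05d}.png")
    nvisii.texture.remove(f"bg_{i}")   # free the texture before the next sample

nvisii.deinitialize()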

@TontonTremblay
Collaborator

This looks very good, thank you for sharing :P Good work.

@an99990

an99990 commented Sep 16, 2021

Hi,
Thank you @blaine141 for the dataset and the train.py.
I have my own images and would like to train the model on them.
What are we supposed to have in the JSON file for each image?

thank you

@ArghyaChatterjee

@blaine141 do you have a command line for running your custom training file? I tried these three variants (I saved your file as train_efficiently.py), but none of them is working.

python3 -m torch.distributed.launch --nproc_per_node=1 train_efficiently.py --epochs 20 --outf tmp/ --data ../nvisii_data_gen/output/dataset/

or

python3 -m torch.distributed.launch --nproc_per_node=1 train_efficiently.py --network dope --epochs 20 --outf tmp/ --data ../nvisii_data_gen/output/dataset/

or

python3 -m torch.distributed.launch --nproc_per_node=1 train_efficiently.py --network dope --epochs 20 --batchsize 10 --outf tmp/ --data ../nvisii_data_gen/output/dataset/

For all the commands, it says:

start: 21:05:55.081980
usage: train_efficiently.py [-h] [--data DATA] [--datatest DATATEST] [--object OBJECT] [--workers WORKERS] [--batchsize BATCHSIZE] [--subbatchsize SUBBATCHSIZE] [--imagesize IMAGESIZE] [--lr LR]
                            [--noise NOISE] [--net NET] [--namefile NAMEFILE] [--manualseed MANUALSEED] [--epochs EPOCHS] [--loginterval LOGINTERVAL] [--gpuids GPUIDS [GPUIDS ...]] [--outf OUTF]
                            [--sigma SIGMA] [--save] [--pretrained PRETRAINED] [--nbupdates NBUPDATES] [--datasize DATASIZE] [-n N] [-g GPUS] [-nr NR] [--option OPTION]
train_efficiently.py: error: unrecognized arguments: --local_rank=0 --network dope
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/arghya/.local/lib/python3.8/site-packages/torch/distributed/launch.py", line 261, in <module>
    main()
  File "/home/arghya/.local/lib/python3.8/site-packages/torch/distributed/launch.py", line 256, in main
    raise subprocess.CalledProcessError(returncode=process.returncode,
subprocess.CalledProcessError: Command '['/usr/bin/python3', '-u', 'train_efficiently.py', '--local_rank=0', '--network', 'dope', '--epochs', '20', '--batchsize', '10', '--outf', 'tmp/', '--data', '../nvisii_data_gen/output/dataset/']' returned non-zero exit status 2.
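
A likely reading of that error, for anyone hitting it: the usage message shows that train_efficiently.py defines no --local_rank or --network option, so argparse exits with status 2 when torch.distributed.launch injects --local_rank=0 (and --network dope simply is not one of this script's flags). A minimal, unverified sketch of how a script can tolerate the injected flag:

import argparse

# Sketch only, not a confirmed patch to train_efficiently.py: accept the
# --local_rank flag that torch.distributed.launch injects, and keep any other
# unknown flags from aborting the run.
parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0,
                    help='rank injected by torch.distributed.launch')
args, unknown = parser.parse_known_args()

Dropping --network dope from the command, and (if the script supports it) running it directly without the torch.distributed.launch wrapper, may also avoid the issue.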

@blaine141
Contributor Author

blaine141 commented Apr 8, 2023 via email

@ArghyaChatterjee

Nope, still having issues. Can you take a quick look?

@ArghyaChatterjee

@blaine141 also, it looks like you have done some good work on this. I am trying to make it work with ROS 2. Do you know of any ROS 2 implementation of this repo? It seems the Isaac ROS pose estimation package is more focused on pose estimation than on the detection itself. I want something similar to this repo but converted to ROS 2, i.e. a detection instance ID together with the pose of the object published as a ROS 2 message.

@ArghyaChatterjee

@blaine141 Also, there is another problem I am running into:

The training is taking too long. I have a 40k-sample annotated dataset of the ironrod object created using NViSII. With 64 GB of RAM and a single NVIDIA RTX 3060 6 GB GPU, it took around 6 hours to run 2 epochs of training. Reaching 60 epochs of training for that single object will take quite a long time. Can that time be reduced?

I am using this command for training inside the train2 folder.

python3 -m torch.distributed.launch --nproc_per_node=1 train.py --network dope --epochs 2 --batchsize 10 --outf tmp/ --data ../nvisii_data_gen/output/output_example/
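
One generic thing to check, independent of this repo: if the GPU sits idle between batches, the input pipeline may be the bottleneck. A hedged sketch, where dataset stands in for the training Dataset and the parameter values are arbitrary:

from torch.utils.data import DataLoader, Dataset

def make_loader(dataset: Dataset, batch_size: int = 10, workers: int = 8) -> DataLoader:
    """Parallel data loading with pinned memory to keep the GPU fed."""
    return DataLoader(dataset, batch_size=batch_size, shuffle=True,
                      num_workers=workers, pin_memory=True,
                      persistent_workers=True)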

@blaine141
Contributor Author

blaine141 commented Apr 18, 2023 via email
