Unable to train a working model #155

Closed
blaine141 opened this issue Jan 8, 2021 · 24 comments

Comments

@blaine141
Contributor

I am trying to train a working model with DOPE and am struggling. I tried running train.py on the FAT dataset to train the soup object and am unable to produce any model that matches the performance of the provided pre-trained weights.

Here is the image I have been using for my tests

net_soup

To start, here are the results from the soup_60.pth weights. These are the output belief maps; it detected the soup correctly in this sample.

soup_60 beliefs

The first thing I tried when training my own soup model was to isolate all of the samples containing the soup. That came to 8321 samples, and I only used the images from the left camera. I split the data into train and test sets with a 90/10 random split. After training for 60 epochs, these were the average test losses per epoch.
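
For anyone reproducing this, a 90/10 random split like the one described can be done with torch.utils.data.random_split. This is only a sketch; the Dataset passed in is whatever loader train.py builds over the filtered soup samples, which is not shown here.

import torch
from torch.utils.data import Dataset, random_split

def split_dataset(full_dataset: Dataset, train_frac: float = 0.9, seed: int = 0):
    """Reproducible random train/test split of an arbitrary Dataset."""
    n_total = len(full_dataset)
    n_train = int(train_frac * n_total)          # e.g. 8321 -> 7488 train / 833 test
    generator = torch.Generator().manual_seed(seed)
    train_set, test_set = random_split(
        full_dataset, [n_train, n_total - n_train], generator=generator)
    return train_set, test_set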

soup_without_negatives_chart

It was unable to detect the object in the image, and here are the belief maps.

soup_without_negatives

It appeared to be training, but based on the wiki I would expect it to be able to detect objects after 60 epochs.

The other thing I tried was using both samples with the soup and samples without it, to see if results improved. I used an equal number of positives and negatives. As you would expect, the loss was cut in half, but it did not train any better.

soup_with_negatives_chart

soup_with_negatives

From the loss graph it seems the model was not done learning, but the wiki suggested 30 epochs. I can train for longer, but I am already dealing with training times of over a day. How did you train your soup model, and why are my results so much worse? I want to do a proof-of-concept training run before moving to a custom object. Let me know any suggestions you may have.

@blaine141
Contributor Author

Now that I think about it more, would it be helpful to include both the left and right images? Would that just be duplicate data, or would it contribute something?

@TontonTremblay
Collaborator

I would say you do not have enough training data. Have you tried generating DR data using NDDS? We have been working on a new tool that you could also use to generate training data: https://github.com/owl-project/ViSII.

@blaine141
Contributor Author

Thanks, I'll look into it. I also just noticed that in train.py there is an inconsistency in the default learning rate.

parser.add_argument('--lr', 
    type=float, 
    default=0.0001, 
    help='learning rate, default=0.001')

Should the default be 0.001 or 0.0001? Would this cause it not to learn within 30 epochs, as I am seeing? I'm asking because I am running a training session that I do not want to interrupt.
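
For what it's worth, passing --lr on the command line sidesteps the ambiguous default. The flags other than --lr in this example are assumptions based on the options defined in train.py and may differ:

python train.py --data path/to/fat_soup --object soup --lr 0.0001 --epochs 60 --outf soup_run/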

@TontonTremblay
Collaborator

That is a good question; I do not remember the details of the learning rate. Since it uses Adam, it should adapt fairly quickly to something more appropriate.

@TontonTremblay
Collaborator

What sort of results are you getting on the training data as well? To me, the results you shared look like you did not let it train for long enough. The pre-trained weights were trained on 4 P100s for 24 h; that is 60 epochs with a batch size of 128 on a dataset of 200k images.

@blaine141
Contributor Author

blaine141 commented Jan 8, 2021 via email

@hoangminh1104

> What sort of results are you getting on the training data as well? To me, the results you shared look like you did not let it train for long enough. The pre-trained weights were trained on 4 P100s for 24 h; that is 60 epochs with a batch size of 128 on a dataset of 200k images.

Hi, can I ask what the composition of the 200k images is? Is it 50% DR and 50% FAT, or is the split different? Also, did you use all the depth images as well? I'm new, so pardon me if my question is silly.

@blaine141
Contributor Author

I'm not completely sure, but in the FAT dataset I was only able to find about 8k samples containing the soup, so if you want 200k you will have to generate almost all of them yourself. In terms of how to get good results, I'm training on the FAT data for 500 epochs to see when it actually produces good results with only the 8k samples. Since @TontonTremblay mentioned he did 60 epochs with 200k samples, that made me think I needed more batches for it to train.

I asked in another issue and the depth images are not used.

@blaine141
Contributor Author

Back to being unable to train a working model. As expected, training for longer did nothing. I did, however, encounter a troubling issue: the model I trained is unable to detect even the training samples. I took the checkpoint from 60 epochs and ran it against a training sample. Here are the results.

My model:
train_sample

From these belief maps, DOPE was unable to detect any object. Compare this to the pre-trained model, which did detect the soup correctly.

Pretrained:
pretrained

Here is another example of my model running on another training sample. The belief maps look correct but the algorithm does not detect any object.

My model:
my_train_2

Why does my model not detect the object? Why does the pre-trained model, whose belief maps look similar, detect it? What is the difference, and what needs to change for my model to detect correctly on training samples? I'm thinking this has to be some bug in the code; the model should be able to detect its own training samples, even with a small dataset.

@blaine141
Contributor Author

blaine141 commented Mar 5, 2021

I was able to train a working model and the system is working great! I'm very happy with the performance. There were a few things that I had to change along the way from what I was initially doing. Here is what I was doing wrong for anyone else who may be experiencing issues.

  • I had a small batch size due to limited GPU memory. I had to implement sub-batching (gradient accumulation) to get the effective batch size up to the recommended 128, which made the loss converge much faster. See my train.py, and the sketch after this list.
  • I implemented AMP (automatic mixed precision) to speed training up by about 2x, speed detection up by about 3x, and reduce the memory footprint. It did not improve accuracy, but I thought I should mention it.
  • I generated 50k samples to train on, rather than the small set given in this repo. I used NVISII.
  • I changed the normalization parameters in detector.py to match those used during training. I didn't test whether this actually made a difference; see the normalization note below.
  • I made sure my training samples were easy to detect. Results were bad at one point when my object was visually confusing, colorless, and had a large, mostly empty bounding box. I instead trained on this object with a limited range of orientations.
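
As a rough illustration of the first two points, here is a minimal sketch of sub-batching (gradient accumulation) combined with AMP in PyTorch. It is not the actual train.py; model, train_loader, and criterion stand in for DOPE's network, data loader, and belief/affinity-map loss.

import torch
from torch.cuda.amp import GradScaler, autocast

def train_one_epoch(model, train_loader, criterion, optimizer,
                    device="cuda", effective_batch=128, loader_batch=8):
    """One epoch with gradient accumulation (sub-batching) and mixed precision."""
    accum_steps = effective_batch // loader_batch   # e.g. 128 / 8 = 16 sub-batches
    scaler = GradScaler()
    model.train()
    optimizer.zero_grad()
    for step, (images, targets) in enumerate(train_loader):
        images, targets = images.to(device), targets.to(device)
        with autocast():                             # forward pass and loss in mixed precision
            loss = criterion(model(images), targets) / accum_steps  # average over sub-batches
        scaler.scale(loss).backward()                # accumulate scaled gradients
        if (step + 1) % accum_steps == 0:            # optimizer step once per effective batch
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()

Dividing the loss by accum_steps keeps the accumulated gradient comparable to a single large batch of 128.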

Here is a link to my modified train.py. If anyone wants an example dataset to train on, reach out to me and I will share mine. Feel free to ask me any questions as well.
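
On the normalization point above: the mean/std values below are placeholders rather than the repo's actual numbers; the point is simply that detector.py must apply the same normalization that the training transform applies.

import torchvision.transforms as transforms

MEAN = (0.5, 0.5, 0.5)   # placeholder values; use whatever train.py actually uses
STD = (0.5, 0.5, 0.5)

shared_preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(MEAN, STD),
])
# Reuse shared_preprocess in both the training Dataset and the detector's
# image preprocessing so the two stay consistent.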

Here's a video of DOPE working in our simulator if anyone is interested.
DOPE

@fatooo

fatooo commented May 20, 2021

@blaine141 hi can you please share your dataset with me?

@blaine141
Contributor Author

Yeah just shoot an email to blaine141@gmail.com and I will respond with a link

@TontonTremblay
Collaborator

Very cool results @blaine141. Could you share some of the renders from nvisii? Thank you for the update. I can confirm that the normalization will have an impact if the parameters do not match; we noticed that recently in a project (https://arxiv.org/pdf/2011.07748v3.pdf). Also, I am sorry I did not answer your Jan 10 message, but it looks like you have made great progress. I will refer to your post in the readme for people to look at when training on a single GPU. Thank you for sharing.

@blaine141
Contributor Author

If you want to look at some samples, here is my dataset: https://buckeyemailosu-my.sharepoint.com/:u:/g/personal/miller_8545_buckeyemail_osu_edu/EaK9JhScTaRDhgnCTsC6yoEBzPHJ2gZWK_z4PFYqzcYSlA?e=qjuUAb. It is very specific to our situation, but it can show you what works.

@TontonTremblay
Collaborator

Could you share a few renders (I am not sure I want to download the full dataset) so I can see how you approached rendering your object to train DOPE? I did not have time to provide more extensive ways of generating synthetic data scenes in nvisii (I have a few internally for this paper: https://www.dropbox.com/s/xmdo7k6dxvqv52b/visii_sdg_iclr21_workshop.pdf?dl=0), so I am intrigued. Did you share your rendering script online? Also, how was your experience with nvisii (hopefully better than training DOPE on a single GPU)?

@blaine141
Contributor Author

Here is one of our samples. We are trying to train the model to work in underwater environments for RoboSub.

cutie250

We randomized the pose of the model, wrapped a random background image over the dome, and used the dome space behind the camera as a light source with random color and brightness. We also added a lot of data augmentation in DOPE to try to improve generalization.

The script used to generate these is gen.py.
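
For readers who have not used NVISII, the sketch below shows the general shape of that kind of randomization loop. It is not gen.py: the calls are from the public nvisii Python API as I recall it and should be checked against the nvisii documentation, and the mesh, background list, camera settings, and ranges are all placeholders.

import os
import random
import nvisii

nvisii.initialize(headless=True)
nvisii.enable_denoiser()
os.makedirs("output", exist_ok=True)

# Placeholder camera; a real script would match the intended camera intrinsics.
camera = nvisii.entity.create(
    name="camera",
    transform=nvisii.transform.create("camera"),
    camera=nvisii.camera.create_from_fov(name="camera", field_of_view=0.785, aspect=640 / 480),
)
nvisii.set_camera_entity(camera)

# Placeholder object; a real script would load the target model's mesh instead.
obj = nvisii.entity.create(
    name="obj",
    mesh=nvisii.mesh.create_teapotahedron("obj_mesh"),
    transform=nvisii.transform.create("obj"),
    material=nvisii.material.create("obj"),
)

backgrounds = ["backgrounds/000.jpg", "backgrounds/001.jpg"]   # placeholder list

for i in range(50_000):
    # Random background wrapped over the dome; the dome also acts as the light,
    # so varying its intensity randomizes the lighting as described above.
    tex = nvisii.texture.create_from_file(f"bg_{i}", random.choice(backgrounds))
    nvisii.set_dome_light_texture(tex)
    nvisii.set_dome_light_intensity(random.uniform(0.5, 3.0))

    # Random object pose in front of the camera.
    obj.get_transform().set_position(nvisii.vec3(
        random.uniform(-0.5, 0.5), random.uniform(-0.5, 0.5), random.uniform(0.8, 2.0)))
    obj.get_transform().set_rotation(nvisii.angleAxis(
        random.uniform(0.0, 6.283), nvisii.vec3(0.0, 0.0, 1.0)))

    nvisii.render_to_file(width=640, height=480, samples_per_pixel=64,
                          file_path=f"output/{i:05d}.png")
    nvisii.texture.remove(f"bg_{i}")   # free the texture before the next sample

nvisii.deinitialize()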

@TontonTremblay
Collaborator

This looks very good, thank you for sharing :P Good work.

@an99990

an99990 commented Sep 16, 2021

Hi,
Thank you @blaine141 for the dataset and the train.py.
I have my own images and would like to train the model on them.
What are we supposed to have in the JSON file for each image?

thank you

@ArghyaChatterjee

@blaine141 do you have a command line for running your custom training file? I tried these three variants (I saved your file as train_efficiently.py), but none of them is working.

python3 -m torch.distributed.launch --nproc_per_node=1 train_efficiently.py --epochs 20 --outf tmp/ --data ../nvisii_data_gen/output/dataset/

or

python3 -m torch.distributed.launch --nproc_per_node=1 train_efficiently.py --network dope --epochs 20 --outf tmp/ --data ../nvisii_data_gen/output/dataset/

or

python3 -m torch.distributed.launch --nproc_per_node=1 train_efficiently.py --network dope --epochs 20 --batchsize 10 --outf tmp/ --data ../nvisii_data_gen/output/dataset/

For all the commands, it says:

start: 21:05:55.081980
usage: train_efficiently.py [-h] [--data DATA] [--datatest DATATEST] [--object OBJECT] [--workers WORKERS] [--batchsize BATCHSIZE] [--subbatchsize SUBBATCHSIZE] [--imagesize IMAGESIZE] [--lr LR]
                            [--noise NOISE] [--net NET] [--namefile NAMEFILE] [--manualseed MANUALSEED] [--epochs EPOCHS] [--loginterval LOGINTERVAL] [--gpuids GPUIDS [GPUIDS ...]] [--outf OUTF]
                            [--sigma SIGMA] [--save] [--pretrained PRETRAINED] [--nbupdates NBUPDATES] [--datasize DATASIZE] [-n N] [-g GPUS] [-nr NR] [--option OPTION]
train_efficiently.py: error: unrecognized arguments: --local_rank=0 --network dope
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/arghya/.local/lib/python3.8/site-packages/torch/distributed/launch.py", line 261, in <module>
    main()
  File "/home/arghya/.local/lib/python3.8/site-packages/torch/distributed/launch.py", line 256, in main
    raise subprocess.CalledProcessError(returncode=process.returncode,
subprocess.CalledProcessError: Command '['/usr/bin/python3', '-u', 'train_efficiently.py', '--local_rank=0', '--network', 'dope', '--epochs', '20', '--batchsize', '10', '--outf', 'tmp/', '--data', '../nvisii_data_gen/output/dataset/']' returned non-zero exit status 2.
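
A likely reading of that error, for anyone hitting it: the usage message shows that train_efficiently.py defines no --local_rank or --network option, so argparse exits with status 2 when torch.distributed.launch injects --local_rank=0 (and --network dope simply is not one of this script's flags). A minimal, unverified sketch of how a script can tolerate the injected flag:

import argparse

# Sketch only, not a confirmed patch to train_efficiently.py: accept the
# --local_rank flag that torch.distributed.launch injects, and keep any other
# unknown flags from aborting the run.
parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0,
                    help='rank injected by torch.distributed.launch')
args, unknown = parser.parse_known_args()

Dropping --network dope from the command, and (if the script supports it) running it directly without the torch.distributed.launch wrapper, may also avoid the issue.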

@blaine141
Contributor Author

blaine141 commented Apr 8, 2023 via email

@ArghyaChatterjee

Nope, still having issues. Can you take a quick look?

@ArghyaChatterjee

@blaine141 also, it looks like you have done some good work on this. I am trying to make it work with ROS 2. Do you know of any ROS 2 implementation of this repo? It seems the Isaac ROS pose estimation package is more focused on pose estimation than on the detection itself. I want something similar to this repo but converted to ROS 2, i.e. a detection instance ID together with the pose of the object published as a ROS 2 message.

@ArghyaChatterjee

@blaine141 Also, there is another problem I am running into:

The training is taking too long. I have a 40k-sample annotated dataset of the ironrod object created using NViSII. With 64 GB of RAM and a single NVIDIA RTX 3060 6 GB GPU, it took around 6 hours to run 2 epochs of training. Reaching 60 epochs of training for that single object will take quite a long time. Can that time be reduced?

I am using this command for training inside the train2 folder.

python3 -m torch.distributed.launch --nproc_per_node=1 train.py --network dope --epochs 2 --batchsize 10 --outf tmp/ --data ../nvisii_data_gen/output/output_example/
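
One generic thing to check, independent of this repo: if the GPU sits idle between batches, the input pipeline may be the bottleneck. A hedged sketch, where dataset stands in for the training Dataset and the parameter values are arbitrary:

from torch.utils.data import DataLoader, Dataset

def make_loader(dataset: Dataset, batch_size: int = 10, workers: int = 8) -> DataLoader:
    """Parallel data loading with pinned memory to keep the GPU fed."""
    return DataLoader(dataset, batch_size=batch_size, shuffle=True,
                      num_workers=workers, pin_memory=True,
                      persistent_workers=True)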

@blaine141
Contributor Author

blaine141 commented Apr 18, 2023 via email
