Unable to train a working model #155
Comments
Now that I think about it more, would it be helpful to include both the left and right image? Would that just be duplicate data or contribute something? |
I would say you do not have enough training data. Have you tried generating DR data using NDDS? We have also been working on a new tool you could use to generate training data: https://github.com/owl-project/ViSII. |
Thanks, I'll look into it. I also just noticed that in train.py there is an inconsistency in the default learning rate.
Should the default be 0.001 or 0.0001? Would this cause it to not learn in 30 epochs like I am seeing? Asking because I am running a training session that I do not want to interrupt. |
This is a good question; I do not remember the details of the learning rate. Since it uses Adam, it should adapt fairly quickly to something more appropriate. |
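For context, a mismatch like the one being described usually comes from a flag default and its help text (or a hard-coded value) disagreeing. The snippet below only illustrates that pattern and is not the actual contents of train.py; the flag name and values are placeholders.

import argparse
import torch

parser = argparse.ArgumentParser()
# The help text advertises one default while the code uses another -- the kind
# of inconsistency being asked about (both values here are illustrative).
parser.add_argument("--lr", type=float, default=0.0001,
                    help="learning rate, default 0.001")
opt = parser.parse_args()

model = torch.nn.Linear(10, 2)  # stand-in for the DOPE network
# Adam adapts per-parameter step sizes, but a starting lr that is too low can
# still slow convergence noticeably, as discussed above.
optimizer = torch.optim.Adam(model.parameters(), lr=opt.lr)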
What sort of results are you getting on the training data as well? To me, the results you shared look like you did not let it train for long enough. The pre-trained weights were trained on 4 P100 GPUs for 24 hours; that is 60 epochs with a batch size of 128 on a dataset of 200k images. |
Oh! I thought the pre-trained weights were only trained on the FAT dataset. That explains a lot. If that's the case, I need more data or more training time.
For the learning rate change, I am running a test now. I know Adam is supposed to adapt, but I've been burned by Adam before when my learning rate was too low; training took forever.
One other thing: I have a GPU with the Turing architecture (RTX 2070), so I implemented PyTorch's automatic mixed-precision feature. It reduced the training time by about 40%, and because of the reduced memory usage it allowed me to increase the batch size from 16 to 24. Would other people be interested in that feature?
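A minimal sketch of the PyTorch automatic mixed-precision pattern described above, assuming a standard training loop; the model and loss are placeholders, not DOPE's actual code.

import torch
from torch.cuda.amp import autocast, GradScaler

device = torch.device("cuda")
model = torch.nn.Linear(512, 18).to(device)   # stand-in for the DOPE network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scaler = GradScaler()                         # rescales the loss to avoid fp16 underflow

def train_step(batch, target):
    optimizer.zero_grad()
    with autocast():                          # forward pass runs in float16 where safe
        output = model(batch)
        loss = torch.nn.functional.mse_loss(output, target)
    scaler.scale(loss).backward()             # backward on the scaled loss
    scaler.step(optimizer)                    # unscales gradients, then steps the optimizer
    scaler.update()
    return loss.item()

The memory saved by half-precision activations is what makes the larger batch size (16 to 24 here) possible.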
|
Hi, can I ask what the composition of the 200k images is? Is it 50% DR and 50% FAT, or is the split different? Also, did you use all the depth images as well? I'm new, so pardon me if my question is silly. |
I'm not completely sure, but in the FAT dataset I was only able to find 8k samples containing the soup, so if you want 200k you will have to generate almost all of them yourself. As for how to get good results, I'm training on the FAT data for 500 epochs to see when it actually gets good results with only the 8k samples. Because @TontonTremblay mentioned he did 60 epochs with 200k samples, I figured I needed to run more batches for it to train. I asked in another issue, and the depth images are not used. |
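A sketch of how the kind of filtering described above might be done. It assumes the FAT layout of one *.left.json annotation per left-camera frame with an "objects" list containing a "class" field; the class substring is a guess, so adjust it to the actual soup class name.

import glob
import json
import os

def find_soup_samples(fat_root, class_substring="tomato_soup"):
    # Collect left-camera frames whose annotation lists the soup can.
    samples = []
    for ann_path in glob.glob(os.path.join(fat_root, "**", "*.left.json"), recursive=True):
        with open(ann_path) as f:
            ann = json.load(f)
        classes = [obj.get("class", "") for obj in ann.get("objects", [])]
        if any(class_substring in c for c in classes):
            samples.append(ann_path.replace(".json", ".jpg"))
    return samples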
Back to being unable to train a working model. As expected, training for longer did nothing. I did, however, encounter a troubling issue: the model I trained is unable to detect the training samples. I took the checkpoint from 60 epochs and ran it against a training sample. Here are the results. From these belief maps, DOPE was unable to detect any object. Compare this to the pre-trained model, which did detect the soup correctly. Here is another example of my model running on another training sample. The belief maps look correct, but the algorithm does not detect any object. Why does my model not detect the object? Why does the pre-trained model, which produces similar-looking maps, detect it? What is the difference, and what needs to change for my model to detect objects in its own training samples? I'm thinking this has to be a bug in the code; the model should be able to detect its training samples, even if I have a small dataset. |
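One possible explanation, not confirmed against the DOPE source here, is that the belief-map peaks are present but fall below the detection confidence threshold, so the peak-extraction stage discards them. Below is a generic sketch of that kind of thresholded peak extraction; the threshold and window values are hypothetical.

import numpy as np
from scipy import ndimage

def extract_peaks(belief_map, threshold=0.1, window=5):
    # Local maxima above `threshold`; if a trained model produces correctly
    # placed but low-amplitude peaks, a high threshold reports no detections
    # even though the maps "look right".
    local_max = belief_map == ndimage.maximum_filter(belief_map, size=window)
    peaks = np.argwhere(local_max & (belief_map > threshold))
    return [(int(r), int(c), float(belief_map[r, c])) for r, c in peaks]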
I was able to train a working model and the system is working great! I'm very happy with the performance. There were a few things that I had to change along the way from what I was initially doing. Here is what I was doing wrong for anyone else who may be experiencing issues.
Here is a link to my modified train.py. If anyone wants an example dataset to train off of, reach out to me and I will share mine. Feel free to ask me any questions as well. Here's a video of DOPE working in our simulator if anyone is interested. |
@blaine141 hi can you please share your dataset with me? |
Yeah, just shoot an email to blaine141@gmail.com and I will respond with a link. |
Very cool results @blaine141, could you share some of the renders in nvisii? Thank you for the update. I can confirm that the normalization will have an impact if the statistics do not match (we noticed that recently in a project: https://arxiv.org/pdf/2011.07748v3.pdf). Also, I am sorry I did not answer your Jan 10 message, but it looks like you made some great progress. I will refer people to your post in the readme when they are training on a single GPU. Thank you for sharing. |
If you want to look at some samples, here is my dataset: https://buckeyemailosu-my.sharepoint.com/:u:/g/personal/miller_8545_buckeyemail_osu_edu/EaK9JhScTaRDhgnCTsC6yoEBzPHJ2gZWK_z4PFYqzcYSlA?e=qjuUAb. It is very specific to our situation, but it can show you what works. |
Could you share a few renders (I am not sure I want to download the full dataset) to see how you approached rendering your object to train DOPE? I did not have time to provide more extensive ways of generating synthetic data scenes in nvisii (I have a few internally for this paper: https://www.dropbox.com/s/xmdo7k6dxvqv52b/visii_sdg_iclr21_workshop.pdf?dl=0), so I am intrigued. Did you share your rendering script online? Also, how was your experience with nvisii (hopefully better than training DOPE on a single GPU)? |
Here is one of our samples. We are trying to train for underwater environments for RoboSub. We randomized the pose of the model, wrapped a random background image over the dome, and used the dome space behind the camera as lighting with random color and brightness. I also added a lot of data augmentation in DOPE to try to improve generalization. The script used to generate these is gen.py. |
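Purely as an illustration of the kind of photometric augmentation mentioned, here is a torchvision-based sketch; the parameters are placeholders and this is not the actual gen.py or DOPE augmentation code.

from torchvision import transforms

# Heavier photometric augmentation to help the model generalize from
# synthetic renders to real footage (values are illustrative).
augment = transforms.Compose([
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1),
    transforms.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),
    transforms.RandomGrayscale(p=0.05),
    transforms.ToTensor(),                  # applied to a PIL image per sample
])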
This looks very good, thank you for sharing :P Good work. |
Hi, thank you |
@blaine141 do you have a command line in order to run your custom training file? I tried these 3 (using your file as train_efficiently.py), but none of them is working:
python3 -m torch.distributed.launch --nproc_per_node=1 train_efficiently.py --epochs 20 --outf tmp/ --data ../nvisii_data_gen/output/dataset/
or
python3 -m torch.distributed.launch --nproc_per_node=1 train_efficiently.py --network dope --epochs 20 --outf tmp/ --data ../nvisii_data_gen/output/dataset/
or
python3 -m torch.distributed.launch --nproc_per_node=1 train_efficiently.py --network dope --epochs 20 --batchsize 10 --outf tmp/ --data ../nvisii_data_gen/output/dataset/
For all the commands, it says:
start: 21:05:55.081980
usage: train_efficiently.py [-h] [--data DATA] [--datatest DATATEST] [--object OBJECT] [--workers WORKERS] [--batchsize BATCHSIZE] [--subbatchsize SUBBATCHSIZE] [--imagesize IMAGESIZE] [--lr LR]
                            [--noise NOISE] [--net NET] [--namefile NAMEFILE] [--manualseed MANUALSEED] [--epochs EPOCHS] [--loginterval LOGINTERVAL] [--gpuids GPUIDS [GPUIDS ...]] [--outf OUTF]
                            [--sigma SIGMA] [--save] [--pretrained PRETRAINED] [--nbupdates NBUPDATES] [--datasize DATASIZE] [-n N] [-g GPUS] [-nr NR] [--option OPTION]
train_efficiently.py: error: unrecognized arguments: --local_rank=0 --network dope
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/arghya/.local/lib/python3.8/site-packages/torch/distributed/launch.py", line 261, in <module>
    main()
  File "/home/arghya/.local/lib/python3.8/site-packages/torch/distributed/launch.py", line 256, in main
    raise subprocess.CalledProcessError(returncode=process.returncode,
subprocess.CalledProcessError: Command '['/usr/bin/python3', '-u', 'train_efficiently.py', '--local_rank=0', '--network', 'dope', '--epochs', '20', '--batchsize', '10', '--outf', 'tmp/', '--data', '../nvisii_data_gen/output/dataset/']' returned non-zero exit status 2.
|
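For reference, the usage output above shows the script expects --net rather than --network and does not declare --local_rank, which torch.distributed.launch injects by default. Below is a hedged sketch of one way a script's argument parsing could accept the launcher's argument; it is not the actual fix in train_efficiently.py.

import argparse

parser = argparse.ArgumentParser()
# torch.distributed.launch passes --local_rank=<rank> unless --use_env is set,
# so the script must declare it even if it never uses the value directly.
parser.add_argument("--local_rank", type=int, default=0)
parser.add_argument("--net", default="dope")   # the usage text lists --net, not --network
# ... the remaining train_efficiently.py arguments would go here ...
opt, unknown = parser.parse_known_args()       # parse_known_args tolerates extra flags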
I think it's been too long since I looked at this; the repo has changed a fair bit. If you still don't have it figured out, let me know and I can help.
|
Nope, still having issues. Can you look into it quickly? |
@blaine141 also, it looks like you did some good work on this. I am trying to make it work with ROS 2. Do you know of any ROS 2 implementation of this? Isaac ROS pose estimation seems more focused on pose estimation than on detection itself. I want something similar to this repo but converted to ROS 2, e.g. a detection instance ID with the pose of the object published as a ROS 2 message. |
@blaine141 Also, there is another problem I am having: the training time is too long. I have a 40k annotated dataset of ironrod created using NViSII. With 64 GB of RAM and a single NVIDIA RTX 3060 6GB GPU, it took around 6 hours to run 2 epochs of training, so getting to 60 epochs for that single object will take quite a long time. Can we minimize that time? I am using this script for training inside the train2 folder:
python3 -m torch.distributed.launch --nproc_per_node=1 train.py --network dope --epochs 2 --batchsize 10 --outf tmp/ --data ../nvisii_data_gen/output/output_example/
|
You can enable mixed precision and increase your batch size to roughly halve the training time. There are lots of ways to do it; here is one:
https://pytorch.org/blog/accelerating-training-on-nvidia-gpus-with-pytorch-automatic-mixed-precision/
I am unaware of any work to upgrade this repo to ROS 2. You could try to do it yourself; it shouldn't be too hard, and you should probably get familiar with the codebase anyway in case you want to make optimizations in the future.
|
I am trying to train any working model with DOPE and am struggling. I tried running train.py on the FAT dataset to train the soup object and am unable to produce any model that matches the performance of the provided pre-trained weights.
Here is the image I have been using for my tests.
To start, here are the results from the soup_60.pth file. Here are the output beliefs. It detected the soup properly in this sample.
The first thing I tried when training my own soup model was to isolate all of the samples with the soup in them. This came down to 8321 samples, and I only used the images from the left camera. I split the data into train and test sets with a 90/10 random split. After training for 60 epochs, these were the average test losses per epoch.
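A minimal sketch of a 90/10 random split like the one described, using torch.utils.data; the dataset object here is a placeholder for the 8321 soup samples.

import torch
from torch.utils.data import TensorDataset, random_split

full_dataset = TensorDataset(torch.zeros(8321, 1))    # stand-in for the soup samples
n_train = int(0.9 * len(full_dataset))
n_test = len(full_dataset) - n_train
generator = torch.Generator().manual_seed(42)         # reproducible split
train_set, test_set = random_split(full_dataset, [n_train, n_test], generator=generator)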
It was unable to detect the object in the image, and here are the belief maps.
It appeared to be training, but based on the wiki I would expect it to be able to detect objects after 60 epochs.
The other thing I tried was using both samples with soup and without soup to see if results improved. I had an equal number of positives and negatives. As you would expect, the loss was cut in half, but it did not train any better.
From the loss graph it seems the model was not done learning, but the wiki suggested 30 epochs. I can train for longer, but I am already dealing with training times of over a day. How did you train your soup model, and why are my results so much worse? I want to do a proof-of-concept training run before moving to a custom object. Let me know any suggestions you may have.