Update dataset_prepare.md to fix download path for NYU dataset
Add a simple warmup strategy for sigloss as discussed in Issue #20

Enhance DPT and fix bugs reported in Issue #23

Fix typos in docs and add several introductions
zhyever committed Jun 5, 2022
1 parent 6c8fb13 commit 02c1966
Showing 18 changed files with 5,020 additions and 3,066 deletions.
19 changes: 16 additions & 3 deletions README.md
@@ -69,8 +69,21 @@ This repo benefits from awesome works of [mmsegmentation](https://github.com/ope
[BTS](https://github.com/cleinc/bts). Please also consider citing them.

## Cite
If you find this toolbox helpful for your projects or research, please consider citing one of our works listed below. I may write a technical report based on this toolbox in the future, discussing training details for supervised monocular depth estimation.

```bibtex
@article{li2022binsformer,
title={BinsFormer: Revisiting Adaptive Bins for Monocular Depth Estimation},
author={Li, Zhenyu and Wang, Xuyang and Liu, Xianming and Jiang, Junjun},
journal={arXiv preprint arXiv:2204.00987},
year={2022}
}
@article{li2022depthformer,
title={DepthFormer: Exploiting Long-Range Correlation and Local Information for Accurate Monocular Depth Estimation},
author={Li, Zhenyu and Chen, Zehui and Liu, Xianming and Jiang, Junjun},
journal={arXiv preprint arXiv:2203.14211},
year={2022}
}
@article{li2021simipu,
title={SimIPU: Simple 2D Image and 3D Point Cloud Unsupervised Pre-Training for Spatial-Aware Visual Representations},
author={Li, Zhenyu and Chen, Zehui and Li, Ang and Fang, Liangji and Jiang, Qinhong and Liu, Xianming and Jiang, Junjun and Zhou, Bolei and Zhao, Hang},
@@ -80,9 +93,9 @@ This repo benefits from awesome works of [mmsegmentation](https://github.com/ope
```

## Changelog
- **Apr 16**: Finish most of docs and provide all trained parameters. Release codes about BTS, Adabins, DPT, SimIPU, and DepthFormer. Support KITTI, NYU-v2, SUN RGB-D(eval), and CityScapes.
- **Jun. 5, 2022**: Add support for custom dataset training. Add a warmup interface for sigloss to help convergence as discussed in Issue [#20](https://github.com/zhyever/Monocular-Depth-Estimation-Toolbox/issues/20). Enhance the DPT support and fix bugs in provided pre-trained models as reported in Issue [#23](https://github.com/zhyever/Monocular-Depth-Estimation-Toolbox/issues/23).
- **Apr. 16, 2022**: Finish most of the docs and provide all pre-trained parameters. Release code for BTS, Adabins, DPT, SimIPU, and DepthFormer. Support KITTI, NYU-v2, SUN RGB-D (eval), and CityScapes.

## TODO
- Some comments in the code are stale and need to be rewritten.
- I will release codes of BinsFormer soon.
- I will release the code of BinsFormer soon (delayed).
- I would like to include self-supervised depth estimation methods, such as MonoDepth2.
13 changes: 1 addition & 12 deletions configs/_base_/datasets/kitti_benchmark.py
@@ -7,6 +7,7 @@
train_pipeline = [
dict(type='LoadImageFromFile'),
dict(type='DepthLoadAnnotations'),
dict(type='LoadKITTICamIntrinsic'),
dict(type='KBCrop', depth=True),
dict(type='RandomRotate', prob=0.5, degree=2.5),
dict(type='RandomFlip', prob=0.5),
@@ -71,18 +72,6 @@
eigen_crop=False,
min_depth=1e-3,
max_depth=88),
# test=dict(
# type=dataset_type,
# data_root=data_root,
# img_dir='input',
# ann_dir='gt_depth',
# depth_scale=256,
# split='benchmark_val.txt',
# pipeline=test_pipeline,
# garg_crop=True,
# eigen_crop=False,
# min_depth=1e-3,
# max_depth=88)
test=dict(
type=dataset_type,
data_root=data_root,
6 changes: 2 additions & 4 deletions configs/adabins/adabins_efnetb5ap_nyu_24e.py
@@ -23,11 +23,9 @@
weight_decay=0.1,
paramwise_cfg=dict(
custom_keys={
'decode_head': dict(lr_mult=10), # 10 lr
# 'adaptive_bins_layer': dict(lr_mult=10), # 10 lr
# 'decoder': dict(lr_mult=10), # 10 lr
# 'conv_out': dict(lr_mult=10), # 10 lr
'decode_head': dict(lr_mult=10), # x10 lr
}))

# learning policy
lr_config = dict(
policy='OneCycle',
5 changes: 5 additions & 0 deletions configs/depthformer/README.md
@@ -26,6 +26,11 @@ This paper aims to address the problem of supervised monocular depth estimation.

## Results and models

*As discussed in Issue [#20](https://github.com/zhyever/Monocular-Depth-Estimation-Toolbox/issues/20), the loss may not converge as expected when directly regressing the depth value. We therefore add a simple warmup strategy to the sigloss function. Consider setting `warm_up=True` for the sigloss, as in `loss_decode=dict(type='SigLoss', valid_mask=True, loss_weight=1.0, warm_up=True)`, or setting `scale_up=True` for the decode_head to predict depth in a sigmoid manner.*

*The results here are obtained following the default setting (i.e., direct regression).*
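*A minimal sketch of the two options (the field nesting below is assumed to follow this repo's mmseg-style configs; adjust it to your own model config):*

```python
# Option 1 (assumed nesting): warm up the sigloss for more stable early training.
model = dict(
    decode_head=dict(
        loss_decode=dict(
            type='SigLoss', valid_mask=True, loss_weight=1.0, warm_up=True)))

# Option 2 (assumed nesting): let the decode head predict depth in a sigmoid manner.
model = dict(
    decode_head=dict(
        scale_up=True,
        loss_decode=dict(type='SigLoss', valid_mask=True, loss_weight=1.0)))
```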


### KITTI

| Method | Backbone | Train Iters | Abs Rel (+flip) | RMSE (+flip) | Config | Download | GPUs |
Expand Down
8 changes: 5 additions & 3 deletions configs/dpt/README.md
@@ -46,18 +46,20 @@ This script converts the model from `PRETRAIN_PATH` and stores the converted model in

## Results and models

*This is a simple implementation. Only model structure is aligned with original paper. More experiments about training settings or loss functions are needed to be done.*
*This is a simple implementation. Only the model structure can be aligned with the original paper. More experiments on training settings and loss functions are needed.*

*We have achieved better results than those reported in our DepthFormer paper by applying more carefully designed training tricks.*

In our reproduction, we utilize the standard ImageNet pre-trained ViT-Base instead of the ADE20K pre-trained model used in the original paper, which makes for a fairer comparison with other monodepth methods. We find that when trained directly on a small dataset (like KITTI or NYU), the model tends to overfit and cannot achieve satisfying results.
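*For illustration only, a hypothetical sketch of pointing the backbone at ImageNet pre-trained weights in an mmseg-style config; the checkpoint path and backbone type below are placeholders, not files or names shipped with this repo:*

```python
# Hypothetical sketch: initialize DPT's ViT-Base backbone from ImageNet
# pre-trained weights instead of an ADE20K-fine-tuned checkpoint.
model = dict(
    pretrained='path/to/imagenet_vit-b16.pth',  # placeholder checkpoint path
    backbone=dict(type='VisionTransformer'))    # backbone type assumed
```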

### KITTI

| Method | Backbone | Train Epoch | Abs Rel (+flip) | RMSE (+flip) | Config | Download | GPUs |
| ------ | :--------: | :----: | :--------------: | :------: | :------: | :--------: | :---:|
| DPT | ViT-Base | 24 | 0.072 | 2.676 | [config](https://github.com/zhyever/Monocular-Depth-Estimation-Toolbox/blob/main/configs/dpt/dpt_vit-b16_kitti.py) | [log](https://github.com/zhyever/Monocular-Depth-Estimation-Toolbox/blob/main/configs/dpt/resources/logs/dpt_vitb_kitti_50e.txt) \| [model](https://drive.google.com/file/d/1ZuFh7COIgPs4Aml3Rrld54A5eYmBHggP/view?usp=sharing) | 8 V100s |
| DPT | ViT-Base | 24 | 0.073 | 2.604 | [config](https://github.com/zhyever/Monocular-Depth-Estimation-Toolbox/blob/main/configs/dpt/dpt_vit-b16_kitti.py) | [log](https://github.com/zhyever/Monocular-Depth-Estimation-Toolbox/blob/main/configs/dpt/resources/logs/dpt_vitb_kitti_24e.txt) \| [model](https://drive.google.com/file/d/1ZuFh7COIgPs4Aml3Rrld54A5eYmBHggP/view?usp=sharing) | 8 V100s |

### NYU

| Method | Backbone | Train Epoch | Abs Rel (+flip) | RMSE (+flip) | Config | Download | GPUs |
| ------ | :--------: | :----: | :--------------: | :------: | :------: | :--------: | :---:|
| DPT | ViT-Base | 24 | 0.134 | 0.415 | [config](https://github.com/zhyever/Monocular-Depth-Estimation-Toolbox/blob/main/configs/dpt/dpt_vit-b16_nyu.py) | [log](https://github.com/zhyever/Monocular-Depth-Estimation-Toolbox/blob/main/configs/dpt/resources/logs/dpt_vitb_nyu_50e.txt) \| [model](https://drive.google.com/file/d/13lxVNf-B5qt1cOoxSWTkVf3HlJGE-olv/view?usp=sharing) | 8 V100s |
| DPT | ViT-Base | 24 | 0.135 | 0.413 | [config](https://github.com/zhyever/Monocular-Depth-Estimation-Toolbox/blob/main/configs/dpt/dpt_vit-b16_nyu.py) | [log](https://github.com/zhyever/Monocular-Depth-Estimation-Toolbox/blob/main/configs/dpt/resources/logs/dpt_vitb_nyu_24e.txt) \| [model](https://drive.google.com/file/d/13lxVNf-B5qt1cOoxSWTkVf3HlJGE-olv/view?usp=sharing) | 8 V100s |
75 changes: 64 additions & 11 deletions configs/dpt/dpt_vit-b16_kitti.py
@@ -8,7 +8,7 @@
min_depth=1e-3,
max_depth=80,
loss_decode=dict(
type='SigLoss', valid_mask=True, loss_weight=1.0),
type='SigLoss', valid_mask=True, loss_weight=1.0, warm_up=True),
)
)

@@ -30,16 +30,69 @@
}))

lr_config = dict(
_delete_=True,
policy='poly',
warmup='linear',
warmup_iters=1500,
warmup_ratio=1e-6,
power=1.0,
min_lr=0.0,
by_epoch=False)
policy='OneCycle',
max_lr=max_lr,
div_factor=25,
final_div_factor=100,
by_epoch=False,
)
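# OneCycle first ramps the lr up to max_lr and then anneals it to a much
# smaller final value (set by div_factor and final_div_factor); the
# momentum_config below cycles momentum inversely over the same schedule.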
momentum_config = dict(
policy='OneCycle'
)

evaluation = dict(interval=1)

img_norm_cfg = dict(
mean=[127.5, 127.5, 127.5], std=[127.5, 127.5, 127.5], to_rgb=True)
crop_size = (352, 704)
train_pipeline = [
dict(type='LoadImageFromFile'),
dict(type='DepthLoadAnnotations'),
dict(type='LoadKITTICamIntrinsic'),
dict(type='KBCrop', depth=True),
dict(type='RandomRotate', prob=0.5, degree=2.5),
dict(type='RandomFlip', prob=0.5),
dict(type='RandomCrop', crop_size=(352, 704)),
dict(type='ColorAug', prob=0.5, gamma_range=[0.9, 1.1], brightness_range=[0.9, 1.1], color_range=[0.9, 1.1]),
dict(type='Normalize', **img_norm_cfg),
dict(type='DefaultFormatBundle'),
dict(type='Collect',
keys=['img', 'depth_gt'],
meta_keys=('filename', 'ori_filename', 'ori_shape',
'img_shape', 'pad_shape', 'scale_factor',
'flip', 'flip_direction', 'img_norm_cfg',
'cam_intrinsic')),
]
test_pipeline = [
dict(type='LoadImageFromFile'),
dict(type='LoadKITTICamIntrinsic'),
dict(type='KBCrop', depth=False),
dict(
type='MultiScaleFlipAug',
img_scale=(1216, 352),
flip=True,
flip_direction='horizontal',
transforms=[
dict(type='RandomFlip', direction='horizontal'),
dict(type='Normalize', **img_norm_cfg),
dict(type='ImageToTensor', keys=['img']),
dict(type='Collect',
keys=['img'],
meta_keys=('filename', 'ori_filename', 'ori_shape',
'img_shape', 'pad_shape', 'scale_factor',
'flip', 'flip_direction', 'img_norm_cfg',
'cam_intrinsic')),
])
]

# By default, models are trained on 8 GPUs with 2 images per GPU
data = dict(samples_per_gpu=2, workers_per_gpu=2)
data = dict(
samples_per_gpu=2,
workers_per_gpu=2,
train=dict(
pipeline=train_pipeline,),
val=dict(
pipeline=test_pipeline),
test=dict(
pipeline=test_pipeline,))
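# Effective batch size: 8 GPUs x 2 images per GPU = 16.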

evaluation = dict(interval=1)
73 changes: 61 additions & 12 deletions configs/dpt/dpt_vit-b16_nyu.py
@@ -8,7 +8,7 @@
min_depth=1e-3,
max_depth=10,
loss_decode=dict(
type='SigLoss', valid_mask=True, loss_weight=1.0),
type='SigLoss', valid_mask=True, loss_weight=1.0, warm_up=True),
)
)

@@ -30,16 +30,65 @@
}))

lr_config = dict(
_delete_=True,
policy='poly',
warmup='linear',
warmup_iters=1500,
warmup_ratio=1e-6,
power=1.0,
min_lr=0.0,
by_epoch=False)

# By default, models are trained on 8 GPUs with 2 images per GPU
data = dict(samples_per_gpu=2, workers_per_gpu=2)
policy='OneCycle',
max_lr=max_lr,
div_factor=25,
final_div_factor=100,
by_epoch=False,
)
momentum_config = dict(
policy='OneCycle'
)

evaluation = dict(interval=1)

img_norm_cfg = dict(
mean=[127.5, 127.5, 127.5], std=[127.5, 127.5, 127.5], to_rgb=True)
crop_size = (416, 544)
train_pipeline = [
dict(type='LoadImageFromFile'),
dict(type='DepthLoadAnnotations'),
dict(type='NYUCrop', depth=True),
dict(type='RandomRotate', prob=0.5, degree=2.5),
dict(type='RandomFlip', prob=0.5),
dict(type='RandomCrop', crop_size=(416, 544)),
dict(type='ColorAug', prob=0.5, gamma_range=[0.9, 1.1], brightness_range=[0.75, 1.25], color_range=[0.9, 1.1]),
dict(type='Normalize', **img_norm_cfg),
dict(type='DefaultFormatBundle'),
dict(type='Collect',
keys=['img', 'depth_gt'],
meta_keys=('filename', 'ori_filename', 'ori_shape',
'img_shape', 'pad_shape', 'scale_factor',
'flip', 'flip_direction', 'img_norm_cfg',
'cam_intrinsic')),
]
test_pipeline = [
dict(type='LoadImageFromFile'),
dict(
type='MultiScaleFlipAug',
img_scale=(480, 640),
flip=True,
flip_direction='horizontal',
transforms=[
dict(type='RandomFlip', direction='horizontal'),
dict(type='Normalize', **img_norm_cfg),
dict(type='ImageToTensor', keys=['img']),
dict(type='Collect',
keys=['img'],
meta_keys=('filename', 'ori_filename', 'ori_shape',
'img_shape', 'pad_shape', 'scale_factor',
'flip', 'flip_direction', 'img_norm_cfg',
'cam_intrinsic')),
])
]

# By default, models are trained on 8 GPUs with 2 images per GPU
data = dict(
samples_per_gpu=2,
workers_per_gpu=2,
train=dict(
pipeline=train_pipeline),
val=dict(
pipeline=test_pipeline),
test=dict(
pipeline=test_pipeline))