Update dataset_prepare.md to fix download path for NYU dataset
Add a simple warmup strategy for sigloss as discussed in Issue #20

Enhance DPT and fix bugs reported in Issue #23

Fix typos in docs and add several introductions
zhyever committed Jun 5, 2022
1 parent 6c8fb13 commit 02c1966
Showing 18 changed files with 5,020 additions and 3,066 deletions.
19 changes: 16 additions & 3 deletions README.md
@@ -69,8 +69,21 @@ This repo benefits from awesome works of [mmsegmentation](https://github.com/ope
[BTS](https://github.com/cleinc/bts). Please also consider citing them.

## Cite
If you find this toolbox helpful for your projects or research, please consider citing one of our works listed below. I may write a technical report based on this toolbox in the future, discussing training details for supervised monocular depth estimation.

```bibtex
@article{li2022binsformer,
title={BinsFormer: Revisiting Adaptive Bins for Monocular Depth Estimation},
author={Li, Zhenyu and Wang, Xuyang and Liu, Xianming and Jiang, Junjun},
journal={arXiv preprint arXiv:2204.00987},
year={2022}
}
@article{li2022depthformer,
title={DepthFormer: Exploiting Long-Range Correlation and Local Information for Accurate Monocular Depth Estimation},
author={Li, Zhenyu and Chen, Zehui and Liu, Xianming and Jiang, Junjun},
journal={arXiv preprint arXiv:2203.14211},
year={2022}
}
@article{li2021simipu,
title={SimIPU: Simple 2D Image and 3D Point Cloud Unsupervised Pre-Training for Spatial-Aware Visual Representations},
author={Li, Zhenyu and Chen, Zehui and Li, Ang and Fang, Liangji and Jiang, Qinhong and Liu, Xianming and Jiang, Junjun and Zhou, Bolei and Zhao, Hang},
@@ -80,9 +93,9 @@ This repo benefits from awesome works of [mmsegmentation](https://github.com/ope
```

## Changelog
- **Apr 16**: Finish most of docs and provide all trained parameters. Release codes about BTS, Adabins, DPT, SimIPU, and DepthFormer. Support KITTI, NYU-v2, SUN RGB-D(eval), and CityScapes.
- **Jun. 5, 2022**: Add support for custom dataset training. Add a warmup interface for sigloss to help convergence as discussed in Issue [#20](https://github.com/zhyever/Monocular-Depth-Estimation-Toolbox/issues/20). Enhance the DPT support and fix bugs in provided pre-trained models as reported in Issue [#23](https://github.com/zhyever/Monocular-Depth-Estimation-Toolbox/issues/23).
- **Apr. 16, 2022**: Finish most of the docs and provide all pre-trained parameters. Release code for BTS, Adabins, DPT, SimIPU, and DepthFormer. Support KITTI, NYU-v2, SUN RGB-D (eval), and CityScapes.

## TODO
- Some comments in the code are stale and need to be rewritten.
- I will release codes of BinsFormer soon.
- I will release the code of BinsFormer soon (delayed).
- I would like to include self-supervised depth estimation methods, such as MonoDepth2.
13 changes: 1 addition & 12 deletions configs/_base_/datasets/kitti_benchmark.py
@@ -7,6 +7,7 @@
train_pipeline = [
dict(type='LoadImageFromFile'),
dict(type='DepthLoadAnnotations'),
dict(type='LoadKITTICamIntrinsic'),
dict(type='KBCrop', depth=True),
dict(type='RandomRotate', prob=0.5, degree=2.5),
dict(type='RandomFlip', prob=0.5),
@@ -71,18 +72,6 @@
eigen_crop=False,
min_depth=1e-3,
max_depth=88),
# test=dict(
# type=dataset_type,
# data_root=data_root,
# img_dir='input',
# ann_dir='gt_depth',
# depth_scale=256,
# split='benchmark_val.txt',
# pipeline=test_pipeline,
# garg_crop=True,
# eigen_crop=False,
# min_depth=1e-3,
# max_depth=88)
test=dict(
type=dataset_type,
data_root=data_root,
6 changes: 2 additions & 4 deletions configs/adabins/adabins_efnetb5ap_nyu_24e.py
@@ -23,11 +23,9 @@
weight_decay=0.1,
paramwise_cfg=dict(
custom_keys={
'decode_head': dict(lr_mult=10), # 10 lr
# 'adaptive_bins_layer': dict(lr_mult=10), # 10 lr
# 'decoder': dict(lr_mult=10), # 10 lr
# 'conv_out': dict(lr_mult=10), # 10 lr
'decode_head': dict(lr_mult=10), # x10 lr
}))

# learning policy
lr_config = dict(
policy='OneCycle',
5 changes: 5 additions & 0 deletions configs/depthformer/README.md
@@ -26,6 +26,11 @@ This paper aims to address the problem of supervised monocular depth estimation.

## Results and models

*As discussed in Issue [#20](https://github.com/zhyever/Monocular-Depth-Estimation-Toolbox/issues/20), the loss may not converge as expected when directly regressing the depth value. We therefore add a simple warmup strategy to the sigloss function. Consider setting `warm_up=True` for the sigloss, as in `loss_decode=dict(type='SigLoss', valid_mask=True, loss_weight=1.0, warm_up=True)`, or setting `scale_up=True` for the decode_head to predict depth in a sigmoid manner.*

*The results here are obtained following the default setting (i.e., direct regression).*
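*A minimal sketch of the two options (the field nesting below is assumed to follow this repo's mmseg-style configs; adjust it to your own model config):*

```python
# Option 1 (assumed nesting): warm up the sigloss for more stable early training.
model = dict(
    decode_head=dict(
        loss_decode=dict(
            type='SigLoss', valid_mask=True, loss_weight=1.0, warm_up=True)))

# Option 2 (assumed nesting): let the decode head predict depth in a sigmoid manner.
model = dict(
    decode_head=dict(
        scale_up=True,
        loss_decode=dict(type='SigLoss', valid_mask=True, loss_weight=1.0)))
```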


### KITTI

| Method | Backbone | Train Iters | Abs Rel (+flip) | RMSE (+flip) | Config | Download | GPUs |
Expand Down
8 changes: 5 additions & 3 deletions configs/dpt/README.md
@@ -46,18 +46,20 @@ This script converts the model from `PRETRAIN_PATH` and stores the converted model in

## Results and models

*This is a simple implementation. Only model structure is aligned with original paper. More experiments about training settings or loss functions are needed to be done.*
*This is a simple implementation. Only the model structure can be aligned with the original paper. More experiments on training settings and loss functions are needed.*

*We have achieved better results than those reported in our DepthFormer paper by applying more carefully designed training tricks.*

In our reproduction, we utilize the standard ImageNet pre-trained ViT-Base instead of the ADE20K pre-trained model used in the original paper, which makes for a fairer comparison with other monodepth methods. We find that when trained directly on a small dataset (like KITTI or NYU), the model tends to overfit and cannot achieve satisfying results.
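*For illustration only, a hypothetical sketch of pointing the backbone at ImageNet pre-trained weights in an mmseg-style config; the checkpoint path and backbone type below are placeholders, not files or names shipped with this repo:*

```python
# Hypothetical sketch: initialize DPT's ViT-Base backbone from ImageNet
# pre-trained weights instead of an ADE20K-fine-tuned checkpoint.
model = dict(
    pretrained='path/to/imagenet_vit-b16.pth',  # placeholder checkpoint path
    backbone=dict(type='VisionTransformer'))    # backbone type assumed
```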

### KITTI

| Method | Backbone | Train Epoch | Abs Rel (+flip) | RMSE (+flip) | Config | Download | GPUs |
| ------ | :--------: | :----: | :--------------: | :------: | :------: | :--------: | :---:|
| DPT | ViT-Base | 24 | 0.072 | 2.676 | [config](https://github.com/zhyever/Monocular-Depth-Estimation-Toolbox/blob/main/configs/dpt/dpt_vit-b16_kitti.py) | [log](https://github.com/zhyever/Monocular-Depth-Estimation-Toolbox/blob/main/configs/dpt/resources/logs/dpt_vitb_kitti_50e.txt) \| [model](https://drive.google.com/file/d/1ZuFh7COIgPs4Aml3Rrld54A5eYmBHggP/view?usp=sharing) | 8 V100s |
| DPT | ViT-Base | 24 | 0.073 | 2.604 | [config](https://github.com/zhyever/Monocular-Depth-Estimation-Toolbox/blob/main/configs/dpt/dpt_vit-b16_kitti.py) | [log](https://github.com/zhyever/Monocular-Depth-Estimation-Toolbox/blob/main/configs/dpt/resources/logs/dpt_vitb_kitti_24e.txt) \| [model](https://drive.google.com/file/d/1ZuFh7COIgPs4Aml3Rrld54A5eYmBHggP/view?usp=sharing) | 8 V100s |

### NYU

| Method | Backbone | Train Epoch | Abs Rel (+flip) | RMSE (+flip) | Config | Download | GPUs |
| ------ | :--------: | :----: | :--------------: | :------: | :------: | :--------: | :---:|
| DPT | ViT-Base | 24 | 0.134 | 0.415 | [config](https://github.com/zhyever/Monocular-Depth-Estimation-Toolbox/blob/main/configs/dpt/dpt_vit-b16_nyu.py) | [log](https://github.com/zhyever/Monocular-Depth-Estimation-Toolbox/blob/main/configs/dpt/resources/logs/dpt_vitb_nyu_50e.txt) \| [model](https://drive.google.com/file/d/13lxVNf-B5qt1cOoxSWTkVf3HlJGE-olv/view?usp=sharing) | 8 V100s |
| DPT | ViT-Base | 24 | 0.135 | 0.413 | [config](https://github.com/zhyever/Monocular-Depth-Estimation-Toolbox/blob/main/configs/dpt/dpt_vit-b16_nyu.py) | [log](https://github.com/zhyever/Monocular-Depth-Estimation-Toolbox/blob/main/configs/dpt/resources/logs/dpt_vitb_nyu_24e.txt) \| [model](https://drive.google.com/file/d/13lxVNf-B5qt1cOoxSWTkVf3HlJGE-olv/view?usp=sharing) | 8 V100s |
75 changes: 64 additions & 11 deletions configs/dpt/dpt_vit-b16_kitti.py
@@ -8,7 +8,7 @@
min_depth=1e-3,
max_depth=80,
loss_decode=dict(
type='SigLoss', valid_mask=True, loss_weight=1.0),
type='SigLoss', valid_mask=True, loss_weight=1.0, warm_up=True),
)
)

@@ -30,16 +30,69 @@
}))

lr_config = dict(
_delete_=True,
policy='poly',
warmup='linear',
warmup_iters=1500,
warmup_ratio=1e-6,
power=1.0,
min_lr=0.0,
by_epoch=False)
policy='OneCycle',
max_lr=max_lr,
div_factor=25,
final_div_factor=100,
by_epoch=False,
)
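# OneCycle first ramps the lr up to max_lr and then anneals it to a much
# smaller final value (set by div_factor and final_div_factor); the
# momentum_config below cycles momentum inversely over the same schedule.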
momentum_config = dict(
policy='OneCycle'
)

evaluation = dict(interval=1)

img_norm_cfg = dict(
mean=[127.5, 127.5, 127.5], std=[127.5, 127.5, 127.5], to_rgb=True)
crop_size = (352, 704)
train_pipeline = [
dict(type='LoadImageFromFile'),
dict(type='DepthLoadAnnotations'),
dict(type='LoadKITTICamIntrinsic'),
dict(type='KBCrop', depth=True),
dict(type='RandomRotate', prob=0.5, degree=2.5),
dict(type='RandomFlip', prob=0.5),
dict(type='RandomCrop', crop_size=(352, 704)),
dict(type='ColorAug', prob=0.5, gamma_range=[0.9, 1.1], brightness_range=[0.9, 1.1], color_range=[0.9, 1.1]),
dict(type='Normalize', **img_norm_cfg),
dict(type='DefaultFormatBundle'),
dict(type='Collect',
keys=['img', 'depth_gt'],
meta_keys=('filename', 'ori_filename', 'ori_shape',
'img_shape', 'pad_shape', 'scale_factor',
'flip', 'flip_direction', 'img_norm_cfg',
'cam_intrinsic')),
]
test_pipeline = [
dict(type='LoadImageFromFile'),
dict(type='LoadKITTICamIntrinsic'),
dict(type='KBCrop', depth=False),
dict(
type='MultiScaleFlipAug',
img_scale=(1216, 352),
flip=True,
flip_direction='horizontal',
transforms=[
dict(type='RandomFlip', direction='horizontal'),
dict(type='Normalize', **img_norm_cfg),
dict(type='ImageToTensor', keys=['img']),
dict(type='Collect',
keys=['img'],
meta_keys=('filename', 'ori_filename', 'ori_shape',
'img_shape', 'pad_shape', 'scale_factor',
'flip', 'flip_direction', 'img_norm_cfg',
'cam_intrinsic')),
])
]

# By default, models are trained on 8 GPUs with 2 images per GPU
data = dict(samples_per_gpu=2, workers_per_gpu=2)
data = dict(
samples_per_gpu=2,
workers_per_gpu=2,
train=dict(
pipeline=train_pipeline,),
val=dict(
pipeline=test_pipeline),
test=dict(
pipeline=test_pipeline,))
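# Effective batch size: 8 GPUs x 2 images per GPU = 16.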

evaluation = dict(interval=1)
73 changes: 61 additions & 12 deletions configs/dpt/dpt_vit-b16_nyu.py
@@ -8,7 +8,7 @@
min_depth=1e-3,
max_depth=10,
loss_decode=dict(
type='SigLoss', valid_mask=True, loss_weight=1.0),
type='SigLoss', valid_mask=True, loss_weight=1.0, warm_up=True),
)
)

@@ -30,16 +30,65 @@
}))

lr_config = dict(
_delete_=True,
policy='poly',
warmup='linear',
warmup_iters=1500,
warmup_ratio=1e-6,
power=1.0,
min_lr=0.0,
by_epoch=False)

# By default, models are trained on 8 GPUs with 2 images per GPU
data = dict(samples_per_gpu=2, workers_per_gpu=2)
policy='OneCycle',
max_lr=max_lr,
div_factor=25,
final_div_factor=100,
by_epoch=False,
)
momentum_config = dict(
policy='OneCycle'
)

evaluation = dict(interval=1)

img_norm_cfg = dict(
mean=[127.5, 127.5, 127.5], std=[127.5, 127.5, 127.5], to_rgb=True)
crop_size = (416, 544)
train_pipeline = [
dict(type='LoadImageFromFile'),
dict(type='DepthLoadAnnotations'),
dict(type='NYUCrop', depth=True),
dict(type='RandomRotate', prob=0.5, degree=2.5),
dict(type='RandomFlip', prob=0.5),
dict(type='RandomCrop', crop_size=(416, 544)),
dict(type='ColorAug', prob=0.5, gamma_range=[0.9, 1.1], brightness_range=[0.75, 1.25], color_range=[0.9, 1.1]),
dict(type='Normalize', **img_norm_cfg),
dict(type='DefaultFormatBundle'),
dict(type='Collect',
keys=['img', 'depth_gt'],
meta_keys=('filename', 'ori_filename', 'ori_shape',
'img_shape', 'pad_shape', 'scale_factor',
'flip', 'flip_direction', 'img_norm_cfg',
'cam_intrinsic')),
]
test_pipeline = [
dict(type='LoadImageFromFile'),
dict(
type='MultiScaleFlipAug',
img_scale=(480, 640),
flip=True,
flip_direction='horizontal',
transforms=[
dict(type='RandomFlip', direction='horizontal'),
dict(type='Normalize', **img_norm_cfg),
dict(type='ImageToTensor', keys=['img']),
dict(type='Collect',
keys=['img'],
meta_keys=('filename', 'ori_filename', 'ori_shape',
'img_shape', 'pad_shape', 'scale_factor',
'flip', 'flip_direction', 'img_norm_cfg',
'cam_intrinsic')),
])
]

# By default, models are trained on 8 GPUs with 2 images per GPU
data = dict(
samples_per_gpu=2,
workers_per_gpu=2,
train=dict(
pipeline=train_pipeline),
val=dict(
pipeline=test_pipeline),
test=dict(
pipeline=test_pipeline))