This is the official implementation for Multi3DRefer: Grounding Text Description to Multiple 3D Objects.
This repo contains CUDA implementations; please make sure your GPU has a compute capability of at least 3.0.
We report the maximum computing resource usage with a batch size of 4:
Resource | Training | Inference |
---|---|---|
GPU mem usage | 15.2 GB | 11.3 GB |
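The requirements above can be checked before installing anything else. The following is a minimal sketch, not part of this repo: the helper names `meets_requirements` and `check_local_gpu` are hypothetical, the thresholds are simply the numbers quoted above, and memory is converted from bytes to GiB, so treat the comparison as approximate.

```python
def meets_requirements(capability, total_mem_gb, required_mem_gb, min_capability=(3, 0)):
    """Return True if a GPU satisfies the compute-capability and memory thresholds."""
    return tuple(capability) >= tuple(min_capability) and total_mem_gb >= required_mem_gb


def check_local_gpu(required_mem_gb=15.2):
    """Query the first visible GPU through PyTorch, if PyTorch is installed.

    Returns None when PyTorch is missing, False when no CUDA device is visible,
    and otherwise the result of meets_requirements().
    """
    try:
        import torch  # imported lazily so the pure check above works without it
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return False
    props = torch.cuda.get_device_properties(0)
    total_mem_gb = props.total_memory / 1024**3  # bytes -> GiB
    return meets_requirements((props.major, props.minor), total_mem_gb, required_mem_gb)


if __name__ == "__main__":
    # 15.2 is the training figure reported above; pass 11.3 for inference only.
    print("GPU suitable for training:", check_local_gpu(required_mem_gb=15.2))
```

Since the reported figures are for a batch size of 4, a smaller batch size may fit on a GPU that fails this check.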
We recommend the use of miniconda to manage system dependencies.
# create and activate the conda environment
conda create -n m3drefclip python=3.10
conda activate m3drefclip
# install PyTorch 2.0.1
conda install pytorch==2.0.1 torchvision==0.15.2 pytorch-cuda=11.7 -c pytorch -c nvidia
# install PyTorch3D with dependencies
conda install -c fvcore -c iopath -c conda-forge fvcore iopath
conda install pytorch3d -c pytorch3d
# install MinkowskiEngine with dependencies
conda install -c anaconda openblas
pip install -U git+https://github.com/NVIDIA/MinkowskiEngine -v --no-deps \
--install-option="--blas_include_dirs=${CONDA_PREFIX}/include" --install-option="--blas=openblas"
# install Python libraries
pip install .
# install CUDA extensions
cd m3drefclip/common_ops
pip install .
Note: Setting up with pip (no conda) requires OpenBLAS to be pre-installed on your system.
# create and activate the virtual environment
virtualenv env
source env/bin/activate
# install PyTorch 2.0.1
pip install torch==2.0.1 torchvision==0.15.2
# install PyTorch3D
pip install pytorch3d
# install MinkowskiEngine
pip install MinkowskiEngine
# install Python libraries
pip install .
# install CUDA extensions
cd m3drefclip/common_ops
pip install .
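After either setup path, a quick sanity check can confirm that the main dependencies are importable. This is a minimal sketch that is not shipped with the repo; `missing_modules` is a hypothetical helper, and the top-level import names in `required` are assumptions about how the packages above expose themselves.

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that the import system cannot locate."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumed import names for PyTorch, torchvision, PyTorch3D and MinkowskiEngine.
    required = ["torch", "torchvision", "pytorch3d", "MinkowskiEngine"]
    missing = missing_modules(required)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules were found.")
```

The check only locates the modules; it does not verify that their CUDA extensions load correctly.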
Note: Both the ScanRefer and Nr3D datasets require the ScanNet v2 dataset. Please preprocess it first.
- Download the ScanNet v2 dataset (train/val/test); refer to ScanNet's instructions for more details. The raw dataset files should be organized as follows:
  m3drefclip # project root
  ├── dataset
  │   ├── scannetv2
  │   │   ├── scans
  │   │   │   ├── [scene_id]
  │   │   │   │   ├── [scene_id]_vh_clean_2.ply
  │   │   │   │   ├── [scene_id]_vh_clean_2.0.010000.segs.json
  │   │   │   │   ├── [scene_id].aggregation.json
  │   │   │   │   ├── [scene_id].txt
- Pre-process the data. This converts the original meshes and annotations to `.pth` data:
  python dataset/scannetv2/preprocess_all_data.py data=scannetv2 +workers={cpu_count}
- Pre-process the multi-view features from ENet. Please refer to the instructions in ScanRefer's repo, with one modification:
  - Comment out lines 51 to 56 in batch_load_scannet_data.py, since we follow D3Net's setting, which does not downsample points at this step.
  Then put the generated `enet_feats_maxpool.hdf5` (116 GB) under `m3drefclip/dataset/scannetv2`.
- Download the ScanRefer dataset (train/val). Also, download the test set. The raw dataset files should be organized as follows:
  m3drefclip # project root
  ├── dataset
  │   ├── scanrefer
  │   │   ├── metadata
  │   │   │   ├── ScanRefer_filtered_train.json
  │   │   │   ├── ScanRefer_filtered_val.json
  │   │   │   ├── ScanRefer_filtered_test.json
- Pre-process the data. "unique/multiple" labels will be added to the raw `.json` files for evaluation purposes:
  python dataset/scanrefer/add_evaluation_labels.py data=scanrefer
- Download the Nr3D dataset (train/test). The raw dataset files should be organized as follows:
  m3drefclip # project root
  ├── dataset
  │   ├── nr3d
  │   │   ├── metadata
  │   │   │   ├── nr3d_train.csv
  │   │   │   ├── nr3d_test.csv
- Pre-process the data. "easy/hard/view-dep/view-indep" labels will be added to the raw `.csv` files for evaluation purposes:
  python dataset/nr3d/add_evaluation_labels.py data=nr3d
- Download the Multi3DRefer dataset (train/val). The raw dataset files should be organized as follows:
  m3drefclip # project root
  ├── dataset
  │   ├── multi3drefer
  │   │   ├── metadata
  │   │   │   ├── multi3drefer_train.json
  │   │   │   ├── multi3drefer_val.json
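Before running the pre-processing scripts, it can help to confirm that the raw files match the layouts listed above. The following is a minimal sketch under that assumption; the helpers `missing_scene_files` and `missing_metadata_files` are hypothetical and not part of this repo, and the file names are copied from the directory trees above.

```python
from pathlib import Path

# Per-scene file suffixes from the ScanNet v2 layout above.
SCENE_SUFFIXES = (
    "_vh_clean_2.ply",
    "_vh_clean_2.0.010000.segs.json",
    ".aggregation.json",
    ".txt",
)

# Metadata files from the ScanRefer, Nr3D and Multi3DRefer layouts above.
METADATA_FILES = {
    "scanrefer": (
        "ScanRefer_filtered_train.json",
        "ScanRefer_filtered_val.json",
        "ScanRefer_filtered_test.json",
    ),
    "nr3d": ("nr3d_train.csv", "nr3d_test.csv"),
    "multi3drefer": ("multi3drefer_train.json", "multi3drefer_val.json"),
}


def missing_scene_files(dataset_root):
    """List expected per-scene ScanNet files that are absent under dataset_root."""
    scans_dir = Path(dataset_root) / "scannetv2" / "scans"
    if not scans_dir.is_dir():
        return [scans_dir]
    missing = []
    for scene_dir in sorted(p for p in scans_dir.iterdir() if p.is_dir()):
        for suffix in SCENE_SUFFIXES:
            candidate = scene_dir / f"{scene_dir.name}{suffix}"
            if not candidate.is_file():
                missing.append(candidate)
    return missing


def missing_metadata_files(dataset_root, dataset_name):
    """List expected metadata files that are absent for one language dataset."""
    metadata_dir = Path(dataset_root) / dataset_name / "metadata"
    return [
        metadata_dir / name
        for name in METADATA_FILES[dataset_name]
        if not (metadata_dir / name).is_file()
    ]


if __name__ == "__main__":
    root = Path("dataset")  # assumes the script is run from the project root
    problems = missing_scene_files(root)
    for name in METADATA_FILES:
        problems += missing_metadata_files(root, name)
    print(f"{len(problems)} expected path(s) missing")
    for path in problems:
        print("  ", path)
```

The check only covers the raw inputs; it does not validate the generated `.pth` files or the multi-view feature file.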
We pre-trained PointGroup implemented in MINSU3D on ScanNet v2 and use it as the detector. We use coordinates + colors + multi-view features as inputs.
- Download the pre-trained detector. The detector checkpoint file should be organized as follows:
  m3drefclip # project root
  ├── checkpoints
  │   ├── PointGroup_ScanNet.ckpt
Note: Configuration files are managed by Hydra; you can add or override any configuration attribute by passing it as a command-line argument.
# log in to WandB
wandb login
# train a model with the pre-trained detector, using predicted object proposals
python train.py data={scanrefer/nr3d/multi3drefer} experiment_name={any_string} +detector_path=checkpoints/PointGroup_ScanNet.ckpt
# train a model with the pre-trained detector, using GT object proposals
python train.py data={scanrefer/nr3d/multi3drefer} experiment_name={any_string} +detector_path=checkpoints/PointGroup_ScanNet.ckpt model.network.detector.use_gt_proposal=True
# train a model from a checkpoint; this restores all hyperparameters stored in the .ckpt file
python train.py data={scanrefer/nr3d/multi3drefer} experiment_name={checkpoint_experiment_name} ckpt_path={ckpt_file_path}
# test a model from a checkpoint and save its predictions
python test.py data={scanrefer/nr3d/multi3drefer} data.inference.split={train/val/test} ckpt_path={ckpt_file_path} pred_path={predictions_path}
# evaluate predictions
python evaluate.py data={scanrefer/nr3d/multi3drefer} pred_path={predictions_path} data.evaluation.split={train/val/test}
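The commands above mix two override forms: plain `key=value` (for example `model.network.detector.use_gt_proposal=True`) and `+key=value` (for example `+detector_path=...`). In Hydra, the plain form changes a value that already exists in the configuration, while the `+` prefix adds a key that is not there yet. The sketch below only illustrates that distinction on a nested dict; it is not Hydra's implementation, `apply_override` is a hypothetical helper, values are kept as strings, and config-group selection (which `data=...` appears to be) is not modelled.

```python
def apply_override(config, override):
    """Apply one `key=value` or `+key=value` override to a nested dict in place."""
    key, _, value = override.partition("=")
    add_new = key.startswith("+")
    if add_new:
        key = key[1:]
    parts = key.split(".")

    node = config
    for part in parts[:-1]:
        if add_new:
            node = node.setdefault(part, {})  # create intermediate levels when adding
        else:
            node = node[part]  # KeyError if an intermediate key is missing

    leaf = parts[-1]
    if add_new and leaf in node:
        raise KeyError(f"'{key}' already exists; drop the '+' to override it")
    if not add_new and leaf not in node:
        raise KeyError(f"'{key}' is not in the config; prefix it with '+' to add it")
    node[leaf] = value  # this sketch does no type conversion
    return config


if __name__ == "__main__":
    # A made-up config fragment for illustration only.
    cfg = {
        "experiment_name": "default",
        "model": {"network": {"detector": {"use_gt_proposal": "False"}}},
    }
    apply_override(cfg, "experiment_name=my_run")
    apply_override(cfg, "model.network.detector.use_gt_proposal=True")
    apply_override(cfg, "+detector_path=checkpoints/PointGroup_ScanNet.ckpt")
    print(cfg)
```

A misspelled key in the plain form raises an error here instead of silently adding a new attribute, which is why the two forms are worth keeping apart.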
Performance on ScanRefer:
Split | IoU | Unique | Multiple | Overall |
---|---|---|---|---|
Val | 0.25 | 85.3 | 43.8 | 51.9 |
Val | 0.5 | 77.2 | 36.8 | 44.7 |
Test | 0.25 | 79.8 | 46.9 | 54.3 |
Test | 0.5 | 70.9 | 38.1 | 45.5 |
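The IoU column in the table above gives the overlap threshold at which a predicted box counts as correct. As a rough illustration of how such a thresholded accuracy can be computed, here is a minimal sketch; it is not this repo's evaluation code. It assumes axis-aligned boxes given as `(min_xyz, max_xyz)` pairs, one prediction per ground-truth box, and that a prediction is a hit when its IoU is greater than or equal to the threshold. The function names are hypothetical.

```python
def box_iou_3d(box_a, box_b):
    """IoU of two axis-aligned 3D boxes, each given as (min_xyz, max_xyz)."""
    (a_min, a_max), (b_min, b_max) = box_a, box_b
    intersection = 1.0
    vol_a = 1.0
    vol_b = 1.0
    for lo_a, hi_a, lo_b, hi_b in zip(a_min, a_max, b_min, b_max):
        overlap = min(hi_a, hi_b) - max(lo_a, lo_b)
        if overlap <= 0:
            return 0.0  # the boxes do not overlap along this axis
        intersection *= overlap
        vol_a *= hi_a - lo_a
        vol_b *= hi_b - lo_b
    return intersection / (vol_a + vol_b - intersection)


def accuracy_at_iou(pred_boxes, gt_boxes, threshold):
    """Percentage of predictions whose IoU with the paired ground truth meets the threshold."""
    hits = sum(box_iou_3d(p, g) >= threshold for p, g in zip(pred_boxes, gt_boxes))
    return 100.0 * hits / len(gt_boxes)


if __name__ == "__main__":
    gt = [((0, 0, 0), (2, 2, 2))]
    pred = [((1, 0, 0), (3, 2, 2))]  # shifted by half the box width along x
    print("IoU:", box_iou_3d(pred[0], gt[0]))
    print("Acc@0.25:", accuracy_at_iou(pred, gt, 0.25))
    print("Acc@0.5:", accuracy_at_iou(pred, gt, 0.5))
```

In the usage example the two boxes share a third of their union, so the prediction counts at the 0.25 threshold but not at 0.5, which is why the 0.5 rows in the table are always the lower ones.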
Performance on Nr3D:
Split | Easy | Hard | View-dep | View-indep | Overall |
---|---|---|---|---|---|
Test | 55.6 | 43.4 | 42.3 | 52.9 | 49.4 |
Performance on Multi3DRefer:
Split | IoU | ZT w/ D | ZT w/o D | ST w/ D | ST w/o D | MT | Overall |
---|---|---|---|---|---|---|---|
Val | 0.25 | 39.4 | 81.8 | 34.6 | 53.5 | 43.6 | 42.8 |
Val | 0.5 | 39.4 | 81.8 | 30.6 | 47.8 | 37.9 | 38.4 |
Convert M3DRef-CLIP predictions to ScanRefer benchmark format:
python dataset/scanrefer/convert_output_to_benchmark_format.py data=scanrefer pred_path={predictions_path} +output_path={output_file_path}
Please refer to the ReferIt3D benchmark to report results.