GSOT3D: Towards Generic 3D Single Object Tracking in the Wild
Yifan Jiao, Yunhao Li, Junhua Ding, Qing Yang, Song Fu, Heng Fan
(
Figure: We present a novel benchmark, GSOT3D, that aims at facilitating development of generic 3D single object tracking (SOT) in the wild. Specifically, GSOT3D offers 620 sequences with 123K frames, and covers a wide selection of 54 object categories. Each sequence is offered with multiple modalities, including the point cloud (PC), RGB image, and depth. This allows GSOT3D to support various 3D tracking tasks, such as single-modal 3D SOT on PC and multi-modal 3D SOT on RGB-PC or RGB-D, and thus greatly broadens research directions for 3D object tracking.
- Multiple Modalities
- GSOT3D provides multiple modalities for each sequence, including point cloud (PC), RGB image, and depth, making it a versatile platform for various research directions in 3D single object tracking.
- Target Category Diversity
- GSOT3D covers a wide selection of 54 object categories, making it a diverse benchmark for 3D single object tracking.
- Larger-scale Benchmark
- GSOT3D comprises 620 sequences with 123K frames, which is the largest benchmark delicately designed for 3D single object tracking.
- 9-DoF (Degrees of Freedom) Box
- Different from Nuscenes and KITTI, GSOT3D provides 9-DoF bounding boxes for each object, which is more comprehensive and realistic for 3D single object tracking.
- High-quality and Dense Annotation
- For precise dense annotations, all the sequences in GSOT3D are manually labeled using 9DoF 3D bounding boxes with multiple rounds of inspection and refinement.
Figure: Illustration of category organization in GSOT3D and its distribution of sequence number in each classes.
Figure: Statistics on GSOT3D. (a): Distribution of sequence length. (b): Average number of points in each object category
Figure: To facilitate research on GSOT3D, we present a simple but effective generic 3D tracker, dubbed PROT3D, for class-agnostic 3D tracking on point clouds. The core of PROT3D is a progressive spatial-temporal architecture containing multiple stages. In each stage, target localization is performed by spatial-temporal matching with Transformer, and the result is applied to refine search region feature. The refined search region feature from one stage is forwarded to next stage for further improvements, and tracking result is generated after the final stage.
Table: Overall performance of eight state-of-the-art trackers and our PROT3D using mAO, mSR50 and mSR75. The best three results are highlighted in red, blue, and green fonts, respectively.
Figure: Attribute-based performance and comparison using mAO, mSR50 and mSR75.
Table: Comparison of GSOT3D with KITTI.
Figure: Visualization of ground truth and several tracking results on GSOT3D.
More experimental results with analysis can be found in the paper.
Figure: To collect multimodal data for GSOT3D, we build a mobile robotic platform based on Clearpath Husky A200. Multiple sensors, including a 64-beam LiDAR, an RGB camera and a depth camera, are deployed on the platform with careful calibration.
Table: Specific configuration of sensors and robot chassis of the mobile robotic platform.
We will release all download links and models soon. Stay tuned!
GSOT3D aims to facilitate research and applications of 3D single object tracking. It is developed and used for research purpose only.
If you find our GSOT3D useful, please consider giving it a star and citing it. Thanks!
@article{jiao2024gsot3d,
title={GSOT3D: Towards Generic 3D Single Object Tracking in the Wild},
author={Yifan, Jiao and Yunhao, Li and Junhua, Ding and Qing, Yang and Song, Fu and Heng, Fan and Libo, Zhang},
journal={arXiv preprint arXiv:2412.02129},
year={2024}
}