SurroundDepth: Entangling Surrounding Views for Self-Supervised Multi-Camera Depth Estimation
1Tsinghua University 2Tianjin University 3PhiGent Robotics
Demo of SurroundDepth. SurroundDepth incorporates information from multiple surrounding views to predict consistent depth maps across cameras.
Depth estimation from images serves as the fundamental step of 3D perception for autonomous driving and is an economical alternative to expensive depth sensors like LiDAR. Temporal photometric consistency enables self-supervised depth estimation without labels, further facilitating its application. However, most existing methods predict depth from each monocular image in isolation and ignore the correlations among the multiple surrounding cameras that are typically available on modern self-driving vehicles. In this paper, we propose SurroundDepth, a method that incorporates information from multiple surrounding views to predict depth maps across cameras.
Comparison between SurroundDepth and self-supervised monocular depth estimation methods. Conventional methods predict the depth and ego-motion of each view separately, ignoring the correlations between views. Our method shares information across cameras and jointly processes all surrounding views.
We utilize encoder-decoder networks to predict depths. To entangle the surrounding views, we propose a cross-view transformer (CVT) that fuses multi-camera features in a multi-scale fashion. Pretrained with sparse pseudo depths generated by two-frame Structure-from-Motion, the depth model can learn the absolute scale of the real world. By explicitly introducing extrinsic matrices into pose estimation, we can predict multi-view consistent ego-motions and boost the performance of scale-aware depth estimation.
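To illustrate the idea of fusing multi-camera features, here is a minimal NumPy sketch of joint attention over tokens from all views. It is an assumption-laden simplification: the actual CVT uses learned query/key/value projections, multi-head attention, and downsampling for efficiency at each scale, none of which are shown here. The function name `cross_view_attention` and the identity projections are illustrative only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_view_attention(feats, d_k=None):
    """Sketch of cross-view fusion (hypothetical, not the paper's exact CVT).

    feats: (n_cams, C, H, W) feature maps from all surrounding cameras.
    Tokens from every camera are flattened into one sequence, so each
    spatial location can attend to locations in every other view.
    """
    n, c, h, w = feats.shape
    tokens = feats.transpose(0, 2, 3, 1).reshape(n * h * w, c)  # (T, C)
    d_k = d_k or c
    # Identity projections stand in for learned Q/K/V weights in this sketch.
    q = k = v = tokens
    attn = softmax(q @ k.T / np.sqrt(d_k), axis=-1)             # (T, T)
    fused = attn @ v                                            # (T, C)
    out = fused.reshape(n, h, w, c).transpose(0, 3, 1, 2)       # back to maps
    return feats + out                                          # residual add
```

In the real model this fusion would be applied at several decoder scales, which is what the multi-scale fashion above refers to.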
We show that SurroundDepth outperforms existing self-supervised models by a large margin on both the DDAD and nuScenes datasets.
SurroundDepth is able to recover real-world scale by 1) utilizing the scale-aware pseudo depths and 2) estimating a universal ego-motion of the vehicle and transferring it to each view.
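The transfer of a single vehicle ego-motion to each camera can be written as a conjugation by the camera's extrinsic matrix. A minimal sketch, assuming 4x4 SE(3) matrices with `E_cam` mapping camera coordinates to vehicle coordinates (the function name and variable names are illustrative, not from the paper's code):

```python
import numpy as np

def camera_motion_from_ego(T_ego, E_cam):
    """Transfer the vehicle's ego-motion to one camera's frame.

    T_ego: 4x4 SE(3) motion of the vehicle body between adjacent frames.
    E_cam: 4x4 camera-to-vehicle extrinsic, fixed by calibration.
    The per-camera motion is the conjugation E_cam^{-1} @ T_ego @ E_cam,
    so all cameras share one consistent ego-motion estimate.
    """
    return np.linalg.inv(E_cam) @ T_ego @ E_cam

# Illustrative setup: vehicle drives 1 m forward; a side camera is yawed 90
# degrees and offset from the body origin.
yaw = np.pi / 2
E = np.eye(4)
E[:3, :3] = [[np.cos(yaw), -np.sin(yaw), 0.0],
             [np.sin(yaw),  np.cos(yaw), 0.0],
             [0.0,          0.0,         1.0]]
E[:3, 3] = [1.5, 0.0, 0.2]
T_ego = np.eye(4)
T_ego[0, 3] = 1.0
T_cam = camera_motion_from_ego(T_ego, E)
```

Because every camera's motion is derived from the same `T_ego`, the predicted ego-motions are multi-view consistent by construction, which is what makes the scale learned from pseudo depths transferable across views.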