SurroundDepth: Entangling Surrounding Views for Self-Supervised Multi-Camera Depth Estimation

Yi Wei1* Linqing Zhao2* Wenzhao Zheng1 Zheng Zhu3 Yongming Rao1

Guan Huang3 Jiwen Lu1 Jie Zhou1

1Tsinghua University 2Tianjin University 3PhiGent Robotics

[Paper (arXiv)] [Code (GitHub)] [Data (Tsinghua Cloud)]

Demo of our SurroundDepth, which incorporates information from multiple surrounding views to predict consistent depth maps across cameras.


Depth estimation from images serves as the fundamental step of 3D perception for autonomous driving and is an economical alternative to expensive depth sensors like LiDAR. Temporal photometric consistency enables self-supervised depth estimation without labels, further facilitating its application. However, most existing methods predict depth solely from each monocular image and ignore the correlations among the multiple surrounding cameras that are typically available on modern self-driving vehicles. In this paper, we propose SurroundDepth, a method that incorporates information from multiple surrounding views to predict depth maps across cameras.
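The self-supervision signal mentioned above comes from photometric reprojection: a pixel in the target frame is back-projected with its predicted depth, moved by the estimated camera motion, and re-projected into a source frame, where its appearance should match. The sketch below (a minimal NumPy illustration, not the paper's implementation; the function name and single-pixel interface are our own) shows that geometric warp:

```python
import numpy as np

def reproject(depth, u, v, K, T):
    """Warp pixel (u, v) of the target view into a source view.

    Back-project (u, v) with its predicted depth into a 3D point,
    apply the 4x4 relative camera motion T, and project back with
    the intrinsic matrix K. Photometric losses compare the image
    values at (u, v) and at the returned source-view location.
    """
    K_inv = np.linalg.inv(K)
    p = depth * (K_inv @ np.array([u, v, 1.0]))  # 3D point, target frame
    p_src = T[:3, :3] @ p + T[:3, 3]             # move into source frame
    uv = K @ p_src                               # project with intrinsics
    return uv[:2] / uv[2]                        # perspective divide
```

With an identity motion the pixel maps onto itself; a nonzero translation shifts it, and minimizing the photometric error of that shift is what supervises both the depth and the pose networks.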

Main Idea

Comparison between SurroundDepth and self-supervised monocular depth estimation methods. Conventional methods predict depths and ego-motions for each view separately, ignoring the correlations between views. Our method incorporates information across cameras and jointly processes all surrounding views.


We utilize encoder-decoder networks to predict depths. To entangle the surrounding views, we propose a cross-view transformer (CVT) that fuses multi-camera features in a multi-scale fashion. Pretrained with sparse pseudo depths generated by two-frame Structure-from-Motion (SfM), the depth model learns the absolute scale of the real world. By explicitly introducing extrinsic matrices into pose estimation, we predict multi-view consistent ego-motions and boost the performance of scale-aware depth estimation.
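The core of the CVT is attention computed jointly over tokens from all cameras, so each view can borrow evidence from its neighbors. The following is a minimal single-head sketch under our own simplifications (no multi-scale pyramid, no learned positional encodings, plain softmax attention with a residual connection; the function and weight names are assumptions, not the paper's API):

```python
import numpy as np

def cross_view_attention(feats, Wq, Wk, Wv):
    """Fuse features across surrounding cameras with dot-product attention.

    feats: (N_cam * HW, C) -- feature tokens from all views stacked into one
    sequence, so every camera's tokens can attend to every other camera's.
    Wq, Wk, Wv: (C, C) projection matrices for queries, keys, values.
    """
    Q, K, V = feats @ Wq, feats @ Wk, feats @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])            # scaled dot products
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)           # row-wise softmax
    return feats + attn @ V                            # residual fusion
```

Stacking all views into one token sequence is what distinguishes this from per-camera self-attention: the attention map spans camera boundaries, which is how overlapping fields of view can reinforce each other.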


  • We show that SurroundDepth outperforms existing self-supervised models by a large margin on both the DDAD and nuScenes datasets.

  • SurroundDepth is able to predict real-world scale by 1) utilizing scale-aware pseudo depths and 2) estimating a universal ego-motion of the vehicle and transferring it to each view.
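The second point, transferring one vehicle-level ego-motion to each camera, is a fixed similarity transform by the camera extrinsics. A minimal sketch, assuming the convention that the extrinsic matrix maps camera coordinates into the vehicle frame (the paper may use the inverse convention, in which case the roles of E and E^-1 swap):

```python
import numpy as np

def per_camera_motion(T_vehicle, extrinsic):
    """Transfer a universal vehicle ego-motion to one camera.

    T_vehicle: (4, 4) rigid motion of the vehicle between two frames.
    extrinsic: (4, 4) matrix E mapping camera-i coordinates to the
    vehicle frame (assumed convention). The camera's own motion is then
    the conjugation E^-1 @ T_vehicle @ E.
    """
    return np.linalg.inv(extrinsic) @ T_vehicle @ extrinsic
```

Because every camera's motion is derived from the same T_vehicle, the predicted ego-motions are multi-view consistent by construction, rather than estimated independently per view.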



Citation

@article{wei2022surrounddepth,
  title={SurroundDepth: Entangling Surrounding Views for Self-Supervised Multi-Camera Depth Estimation},
  author={Wei, Yi and Zhao, Linqing and Zheng, Wenzhao and Zhu, Zheng and Rao, Yongming and Huang, Guan and Lu, Jiwen and Zhou, Jie},
  journal={arXiv preprint arXiv:2204.03636},
  year={2022}
}