Research Paper:
3D Street Object Detection from Monocular Images Using Deep Learning and Depth Information
Wei Liu*,**,***, Tao Zhang*,**,***, Yun Ma*,**,***, and Longsheng Wei*,**,***
*School of Automation, China University of Geosciences
No.388 Lumo Road, Hongshan District, Wuhan, Hubei  430074, China
**Hubei Key Laboratory of Advanced Control and Intelligent Automation for Complex Systems
No.388 Lumo Road, Hongshan District, Wuhan, Hubei  430074, China
***Engineering Research Center of Intelligent Technology for Geo-Exploration, Ministry of Education
No.388 Lumo Road, Hongshan District, Wuhan, Hubei  430074, China
Corresponding author
In this study, we present a three-dimensional (3D) object detection algorithm for monocular images that uses an end-to-end network incorporating depth information. The network consists of three parts. The first part, the main body, is a basic object detection neural network that uses a region proposal network to generate two-dimensional (2D) region proposals for objects. The second part is a depth estimation branch that estimates the depth of the object pixels and computes the corresponding 3D point cloud. In the last part, the features obtained from the first two parts are concatenated and fed into fully connected layers, which output both the 2D and 3D detection results. Compared with several existing methods, the proposed approach achieves higher detection accuracy.
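The depth branch and fusion step described above can be illustrated with a minimal sketch. It assumes a standard pinhole camera with known intrinsics `fx`, `fy`, `cx`, `cy` and a per-pixel metric depth map; the function names, the point-cloud summary statistics, and the split of the head output into 4 values for the 2D box and 7 for the 3D box are hypothetical choices for illustration, not the authors' exact design.

```python
import numpy as np


def backproject_roi(depth, box, fx, fy, cx, cy):
    """Lift the pixels inside a 2D region proposal to camera-frame 3D points.

    depth: (H, W) array of per-pixel depth Z (assumed metric, e.g. metres).
    box:   (u_min, v_min, u_max, v_max) integer pixel bounds, max exclusive.
    fx, fy, cx, cy: pinhole intrinsics (assumed known from calibration).
    Returns an (N, 3) array of (X, Y, Z) points; pixels with depth <= 0 are skipped.
    """
    u_min, v_min, u_max, v_max = box
    # Pixel coordinate grids for the region: v indexes rows, u indexes columns.
    v, u = np.mgrid[v_min:v_max, u_min:u_max]
    z = depth[v_min:v_max, u_min:u_max]
    valid = z > 0
    u, v, z = u[valid], v[valid], z[valid]
    # Inverse pinhole projection: X = (u - cx) Z / fx, Y = (v - cy) Z / fy.
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1).astype(np.float64)


def fuse_and_predict(roi_feat, points, weight, bias):
    """Hypothetical fusion head: concatenate features, apply one linear layer.

    roi_feat: pooled image feature for the proposal (any shape, flattened here).
    points:   (N, 3) point cloud from backproject_roi, assumed non-empty.
    weight, bias: parameters of a single fully connected layer (assumed shapes
    (11, D + 6) and (11,), where D is the flattened roi_feat length).
    Returns (box_2d, box_3d) with 4 and 7 values, respectively (assumed split).
    """
    # Hypothetical point-cloud descriptor: centroid plus per-axis extent.
    extent = points.max(axis=0) - points.min(axis=0)
    stats = np.concatenate([points.mean(axis=0), extent])
    fused = np.concatenate([np.ravel(roi_feat), stats])
    out = weight @ fused + bias
    return out[:4], out[4:]
```

For example, a constant 10 m depth map and a 2 x 2 pixel box yield four points at Z = 10, and a fused vector of length 8 + 6 = 14 can be passed to an 11-output linear layer. In a trained network the single linear layer would be replaced by the stacked fully connected layers mentioned above, and the point-cloud descriptor by learned features.

```python
depth = np.full((4, 6), 10.0)
pts = backproject_roi(depth, (2, 1, 4, 3), fx=100.0, fy=100.0, cx=3.0, cy=2.0)
box_2d, box_3d = fuse_and_predict(np.ones(8), pts, np.zeros((11, 14)), np.arange(11.0))
```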
This article is published under a Creative Commons Attribution-NoDerivatives 4.0 International License.