Research Paper:
Monocular 3D Object Detection Based on Reparametrized Cross-Dimension Focusing
Ruikai Li*, Chao Wang**,†, and Guopeng Tan*
*Information and Electrical Engineering School, Hebei University of Engineering
19 Taiji Street, Congtai District, Handan, Hebei 056038, China
**Hebei Key Laboratory of Security & Protection Information Sensing and Processing
19 Taiji Street, Congtai District, Handan, Hebei 056038, China
†Corresponding author
Deploying monocular 3D object detection networks on the visual sensors of intelligent transportation assistance devices is a cost-effective and practical solution. Despite the progress of existing monocular 3D object detection methods, a notable accuracy gap remains relative to 3D detectors that use point cloud data from LiDAR (light detection and ranging) sensors, and these methods also incur relatively high computational costs. To address these issues, this paper proposes an improved monocular 3D object detection network that optimizes the overall model structure through structural reparameterization, effectively alleviating the computational burden on the deployment device. In addition, we focus on the differences between 2D and 3D features and propose a cross-dimension focusing method that raises the detector's performance ceiling in extracting 3D object features. On the KITTI benchmark, our framework achieves significantly better 3D object detection performance than competing methods.
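To make the first of these ideas concrete, the following is a minimal sketch (our illustration, not the authors' implementation; the channel count and random tensors are placeholders) of RepVGG-style structural reparameterization, in which a training-time block of parallel 3x3 and 1x1 convolutions is fused into a single equivalent 3x3 convolution for inference:

```python
# Minimal sketch of RepVGG-style structural reparameterization.
# Illustrative only: channel count and tensors are dummy values.
import torch
import torch.nn.functional as F

C = 8                                   # illustrative channel count
x = torch.randn(1, C, 32, 32)           # dummy feature map

w3 = torch.randn(C, C, 3, 3)            # 3x3 branch weights
w1 = torch.randn(C, C, 1, 1)            # 1x1 branch weights

# Training-time block: sum of the two parallel convolution branches.
y_multi = F.conv2d(x, w3, padding=1) + F.conv2d(x, w1, padding=0)

# Reparameterization: zero-pad the 1x1 kernel to 3x3 (value at the
# center) and add it to the 3x3 kernel, giving one fused kernel.
w_fused = w3 + F.pad(w1, [1, 1, 1, 1])  # pad last two dims: 1x1 -> 3x3

# Inference-time block: a single convolution with the fused kernel.
y_fused = F.conv2d(x, w_fused, padding=1)
print(torch.allclose(y_multi, y_fused, atol=1e-5))  # True
```

Because convolution is linear in its kernel, the fused single-branch block computes exactly the same function as the multi-branch one while performing only one convolution, which is the source of the inference-time savings this kind of reparameterization targets.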
This article is published under a Creative Commons Attribution-NoDerivatives 4.0 International License.