Lightweight Bilateral Network for Real-Time Semantic Segmentation

Pengtao Wang; Lihong Li; Feiyang Pan; Lin Wang

doi:10.20965/jaciii.2023.p0673


JACIII Vol.27 No.4 pp. 673-682 (2023)


Research Paper:


Pengtao Wang, Lihong Li†, Feiyang Pan, and Lin Wang

School of Information and Electrical Engineering, Hebei University of Engineering
No.19 Taiji Road, Handan, Hebei 056038, China

†Corresponding author

Received: February 11, 2023

Accepted: April 4, 2023

Published: July 20, 2023

Keywords: real-time semantic segmentation, depth separable convolution, attention mechanism

Abstract

This paper proposes a dual-branch semantic segmentation model based on depthwise separable convolution and an attention mechanism to meet the real-time and accuracy requirements of semantic segmentation. The approach addresses the poor segmentation quality and oversimplified feature fusion that arise from repeated downsampling in semantic segmentation. The network is divided into a spatial detail path and a semantic information path. The spatial detail path uses a smaller downsampling factor to preserve resolution and extract spatial information efficiently. The semantic information path is built from non-bottleneck residual units with dilated convolution and extracts semantic features. To aggregate features, a feature-guided fusion module assigns different weights to the outputs of the two paths and fuses them to obtain the final output. On the Cityscapes dataset, the proposed algorithm achieves a segmentation accuracy of 69.6% at a speed of 70 fps with only 0.76 M parameters, indicating some advantages over recent real-time semantic segmentation algorithms. With depthwise separable convolution and the attention mechanism, the method extracts features effectively and compensates for the accuracy lost to downsampling. The experiments indicate that the proposed fusion module outperforms other methods in fusing different features.
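
The small parameter count reported in the abstract is consistent with the general arithmetic of depthwise separable convolution. The following is a minimal sketch of that arithmetic only, not the authors' implementation: the function names and the layer widths (64 and 128 channels) are hypothetical, and bias terms are ignored.

```python
def standard_conv_params(in_ch: int, out_ch: int, kernel: int = 3) -> int:
    """Weights in a standard kernel x kernel convolution (bias omitted)."""
    return kernel * kernel * in_ch * out_ch


def depthwise_separable_params(in_ch: int, out_ch: int, kernel: int = 3) -> int:
    """Weights in a depthwise conv followed by a 1x1 pointwise conv (bias omitted)."""
    depthwise = kernel * kernel * in_ch  # one kernel x kernel filter per input channel
    pointwise = in_ch * out_ch           # 1x1 convolution that mixes channels
    return depthwise + pointwise


if __name__ == "__main__":
    # Hypothetical layer widths chosen for illustration; not taken from the paper.
    in_ch, out_ch = 64, 128
    std = standard_conv_params(in_ch, out_ch)
    sep = depthwise_separable_params(in_ch, out_ch)
    print(f"standard: {std}, separable: {sep}, ratio: {sep / std:.3f}")
```

For a 3x3 kernel the ratio is 1/out_ch + 1/9, so for these assumed widths the separable layer needs roughly 12% of the weights of the standard one.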

Cite this article as:

P. Wang, L. Li, F. Pan, and L. Wang, “Lightweight Bilateral Network for Real-Time Semantic Segmentation,” J. Adv. Comput. Intell. Intell. Inform., Vol.27 No.4, pp. 673-682, 2023.



This article is published under a Creative Commons Attribution-NoDerivatives 4.0 International License.

