Research Paper:
YOLO-CF: An Object Detection Model that Improves Feature Expression at Both Coarse-Grained and Fine-Grained Levels for Industrial Surface Image Defect Detection
Zhaowei Sun*1,†, Jintao Chen*2, Cong Lin*3, Xuebin Yue*4, Kuozhan Wang*4, and Lin Meng*5

*1Henan Huadong Industrial Control Technology Company Ltd.
No.5 Wutong West Street, High-New Technology Development Zone, Zhengzhou, Henan 450000, China
†Corresponding author
*2The First Affiliated Hospital of Henan University of Chinese Medicine
No.19 Renmin Road, Jinshui, Zhengzhou, Henan 450046, China
*3Hanwei Electronics Group Corporation
No.169 Xuesong Road, National High-Tech Zone, Zhengzhou, Henan 450001, China
*4School of Automation and Electrical Engineering, Zhongyuan University of Technology
No.41 Zhongyuan Middle Road, Zhongyuan, Zhengzhou, Henan 450007, China
*5College of Science and Engineering, Ritsumeikan University
1-1-1 Nojihigashi, Kusatsu, Shiga 525-8577, Japan
In industrial surface defect detection, a model must express features well at both coarse-grained and fine-grained levels. Accordingly, this paper proposes YOLO Coarse and Fine (YOLO-CF), an object detection model that strengthens feature expression by combining new feature fusion strategies with an improved network architecture. YOLO-CF incorporates Res2Net and Res2Net block modules, which improve the expression of fine-grained features without increasing computational complexity. The model also introduces a multi-scale feature fusion module that integrates coarse-grained and fine-grained information by combining top-down and bottom-up pathways. This design expands the model's perceptual range and improves its generalization, making YOLO-CF well suited to detecting diverse defects in complex industrial images. In addition, a hybrid downsampling module combines max pooling, average pooling, and convolution with a stride of 2 to provide richer feature representations. On the GC10-DET dataset, YOLO-CF achieved a mean average precision (mAP) of 59.09%, surpassing the second-ranked RetinaNet by 3.41 percentage points. On the public PCB, crack, NEU-DET, and Dish-20 datasets, YOLO-CF achieved mAPs of 97.38%, 84.72%, 73.35%, and 99.54%, respectively, at IoU = 0.5. These results indicate that extracting features at both coarse-grained and fine-grained levels improves YOLO-CF's ability to detect objects of various sizes in complex scenes. The code is available at http://www.ihpc.se.ritsumei.ac.jp/obidataset.html.
YOLO-CF dual-level feature framework
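The abstract credits the Res2Net modules with better fine-grained features. As a rough illustration only, the sketch below reproduces the hierarchical residual-like connections described in the Res2Net paper: the channels are split into s groups, and y1 = x1, y2 = K2(x2), yi = Ki(xi + yi-1) for i > 2. It assumes 1-D signals in place of channel groups and a 3-tap filter in place of each 3x3 convolution, and it omits the surrounding 1x1 convolutions and residual shortcut. The function names are hypothetical; this is not the YOLO-CF implementation.

```python
from typing import Callable, List, Optional

Signal = List[float]
Op = Callable[[Signal], Signal]


def conv1d_same(x: Signal, kernel: Signal) -> Signal:
    """Odd-length 1-D filter with zero padding (cross-correlation, as in CNN layers).

    The output has the same length as the input.
    """
    half = len(kernel) // 2
    out: Signal = []
    for i in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(x):
                acc += w * x[j]
        out.append(acc)
    return out


def res2net_hierarchy(groups: List[Signal], ops: List[Optional[Op]]) -> List[Signal]:
    """Hierarchical residual-like connections inside one Res2Net-style block.

    y1 = x1, y2 = K2(x2), yi = Ki(xi + y(i-1)) for i > 2.
    ops[0] is ignored because the first split is passed through unchanged.
    """
    outputs: List[Signal] = []
    for i, x in enumerate(groups):
        if i == 0:
            y = list(x)  # identity branch keeps the finest-scale view
        elif i == 1:
            y = ops[i](x)  # first filtered branch
        else:
            # Later branches also receive the previous branch's output,
            # so their effective receptive field keeps growing.
            y = ops[i]([a + b for a, b in zip(x, outputs[-1])])
        outputs.append(y)
    return outputs


def support_width(y: Signal) -> int:
    """Width of the non-zero region, used here as a proxy for receptive field."""
    nz = [i for i, v in enumerate(y) if v != 0.0]
    return nz[-1] - nz[0] + 1 if nz else 0


if __name__ == "__main__":
    n = 15
    impulse = [0.0] * n
    impulse[n // 2] = 1.0

    def box3(s: Signal) -> Signal:
        # Hypothetical stand-in for a 3x3 convolution.
        return conv1d_same(s, [1.0, 1.0, 1.0])

    ys = res2net_hierarchy([impulse] * 4, [None, box3, box3, box3])
    print([support_width(y) for y in ys])  # [1, 3, 5, 7]
```

Feeding the same impulse to four groups shows the response widening from 1 to 7 positions across the splits, which is how a single block of this kind mixes several scales at roughly the cost of one filter per group.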
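The multi-scale fusion module is described only as combining top-down and bottom-up pathways. A minimal sketch of that idea, in the style of PANet-like path aggregation, is shown below on 1-D "feature maps" at three resolutions. Nearest-neighbour upsampling, pairwise-average downsampling, and element-wise addition are assumed stand-ins (YOLO-style necks typically use learned convolutions and channel concatenation), and the function names are hypothetical.

```python
from typing import List

Signal = List[float]


def upsample2(x: Signal) -> Signal:
    """Nearest-neighbour upsampling by a factor of 2."""
    return [v for v in x for _ in range(2)]


def downsample2(x: Signal) -> Signal:
    """Stride-2 pairwise average (stand-in for a learned stride-2 convolution)."""
    return [(x[i] + x[i + 1]) / 2.0 for i in range(0, len(x) - 1, 2)]


def add(a: Signal, b: Signal) -> Signal:
    """Element-wise fusion; the two maps must share a resolution."""
    if len(a) != len(b):
        raise ValueError("fused maps must have the same length")
    return [u + v for u, v in zip(a, b)]


def fuse_pyramid(levels: List[Signal]) -> List[Signal]:
    """Top-down then bottom-up fusion.

    levels[0] is the finest map; each following level is half as long.
    """
    # Top-down: push coarse, context-rich information into the finer levels.
    td: List[Signal] = [[] for _ in levels]
    td[-1] = list(levels[-1])
    for i in range(len(levels) - 2, -1, -1):
        td[i] = add(levels[i], upsample2(td[i + 1]))
    # Bottom-up: push fine, position-accurate detail back into the coarser levels.
    bu: List[Signal] = [[] for _ in levels]
    bu[0] = td[0]
    for i in range(1, len(levels)):
        bu[i] = add(td[i], downsample2(bu[i - 1]))
    return bu


if __name__ == "__main__":
    p3, p4, p5 = [0.0] * 8, [0.0] * 4, [1.0, 2.0]
    print(fuse_pyramid([p3, p4, p5]))
    # [[1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0], [2.0, 2.0, 4.0, 4.0], [3.0, 6.0]]
```

Every output level keeps its input resolution, so each detection head still sees its own scale, but now carries information from both coarser and finer levels.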
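The hybrid downsampling module is described as combining max pooling, average pooling, and a stride-2 convolution. The sketch below runs those three branches on a single-channel map stored as nested lists and stacks the results as separate channels. Stacking by concatenation, the 2x2 pooling windows, the 3x3 kernel with padding 1, and the fixed kernel supplied by the caller are all assumptions. In a trained network the convolution weights are learned, and the stacked channels would usually be mixed by a further convolution.

```python
from typing import List

Map = List[List[float]]  # single-channel H x W map; H and W assumed even


def max_pool2(x: Map) -> Map:
    """2x2 max pooling with stride 2."""
    return [[max(x[i][j], x[i][j + 1], x[i + 1][j], x[i + 1][j + 1])
             for j in range(0, len(x[0]), 2)]
            for i in range(0, len(x), 2)]


def avg_pool2(x: Map) -> Map:
    """2x2 average pooling with stride 2."""
    return [[(x[i][j] + x[i][j + 1] + x[i + 1][j] + x[i + 1][j + 1]) / 4.0
             for j in range(0, len(x[0]), 2)]
            for i in range(0, len(x), 2)]


def conv3x3_s2(x: Map, k: Map) -> Map:
    """3x3 cross-correlation with stride 2 and zero padding 1 (output H/2 x W/2)."""
    h, w = len(x), len(x[0])
    out: Map = []
    for i in range(0, h, 2):
        row: List[float] = []
        for j in range(0, w, 2):
            acc = 0.0
            for di in range(3):
                for dj in range(3):
                    r, c = i + di - 1, j + dj - 1
                    if 0 <= r < h and 0 <= c < w:
                        acc += k[di][dj] * x[r][c]
            row.append(acc)
        out.append(row)
    return out


def hybrid_downsample(x: Map, k: Map) -> List[Map]:
    """Run the three branches and stack them as channels (stacking is an assumption)."""
    return [max_pool2(x), avg_pool2(x), conv3x3_s2(x, k)]


if __name__ == "__main__":
    x = [[float(4 * r + c + 1) for c in range(4)] for r in range(4)]
    identity = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
    for name, m in zip(("max", "avg", "conv"), hybrid_downsample(x, identity)):
        print(name, m)
    # max [[6.0, 8.0], [14.0, 16.0]]
    # avg [[3.5, 5.5], [11.5, 13.5]]
    # conv [[1.0, 3.0], [9.0, 11.0]]
```

Max pooling keeps the strongest response in each window, average pooling keeps the local mean, and the convolution branch can learn its own weighting, so the three channels summarize the same region in complementary ways at the same halved resolution.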
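The mAP values above are reported at IoU = 0.5. For readers unfamiliar with the metric, the sketch below shows one common way to compute per-class average precision at that threshold and then average it over classes. Score-ordered greedy matching with duplicates counted as false positives, and all-point interpolation as in PASCAL VOC 2010 and later, are assumptions: the abstract does not state which AP protocol was used, and COCO-style 101-point interpolation gives slightly different numbers. The function names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) with x2 > x1 and y2 > y1
Detection = Tuple[str, float, Box]  # (image_id, confidence, box)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(dets: Sequence[Detection], gts: Dict[str, List[Box]],
                      thr: float = 0.5) -> float:
    """AP for a single class over a set of images."""
    n_gt = sum(len(boxes) for boxes in gts.values())
    if n_gt == 0:
        return 0.0
    matched = {img: [False] * len(boxes) for img, boxes in gts.items()}
    flags: List[int] = []  # 1 = true positive, 0 = false positive, in score order
    for img, _, box in sorted(dets, key=lambda d: d[1], reverse=True):
        best, best_j = 0.0, -1
        for j, g in enumerate(gts.get(img, [])):
            o = iou(box, g)
            if o > best:
                best, best_j = o, j
        if best_j >= 0 and best >= thr and not matched[img][best_j]:
            matched[img][best_j] = True
            flags.append(1)
        else:
            flags.append(0)  # no overlap above thr, or a duplicate hit
    precisions: List[float] = []
    recalls: List[float] = []
    tp = 0
    for k, f in enumerate(flags, start=1):
        tp += f
        precisions.append(tp / k)
        recalls.append(tp / n_gt)
    # All-point interpolation: precision at recall r is the best precision at any recall >= r.
    ap, prev_r = 0.0, 0.0
    for i, r in enumerate(recalls):
        ap += (r - prev_r) * max(precisions[i:])
        prev_r = r
    return ap


def mean_ap(per_class_ap: Sequence[float]) -> float:
    """mAP: unweighted mean of the per-class AP values."""
    return sum(per_class_ap) / len(per_class_ap) if per_class_ap else 0.0


if __name__ == "__main__":
    gts = {"img0": [(0.0, 0.0, 10.0, 10.0), (20.0, 20.0, 30.0, 30.0)]}
    dets = [
        ("img0", 0.9, (0.0, 0.0, 10.0, 10.0)),    # exact hit -> TP
        ("img0", 0.8, (50.0, 50.0, 60.0, 60.0)),  # no overlap -> FP
        ("img0", 0.7, (21.0, 21.0, 30.0, 30.0)),  # IoU 0.81 -> TP
    ]
    print(round(average_precision(dets, gts), 4))  # 0.8333
```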
This article is published under a Creative Commons Attribution-NoDerivatives 4.0 International License.