Two-Stage Recognition Framework Based on YOLO and Siamese Networks for Crack Detection in Cherry Tomatoes

Zhaohui Tan; Masanori Sato

doi:10.20965/jrm.2026.p0938


JRM Vol.38 No.3 pp. 938-952 (2026)


Paper:



Nagasaki Institute of Applied Science
536 Abamachi, Nagasaki, Nagasaki 851-0193, Japan

Received: September 30, 2025
Accepted: April 7, 2026
Published: June 20, 2026

Keywords: smart agriculture, object detection, image classification, deep learning, two-stage recognition framework

Abstract

We propose a deep learning-based two-stage recognition system for fruit-level crack classification in cherry tomatoes, targeting harvesting and sorting in real-world cultivation environments where leaves and stems are present. Cherry tomato cracking varies substantially in appearance, from clearly split fruits to subtle white linear cracks around the calyx region. As a result, crack-region-based or bounding-box-driven detection methods are highly susceptible to external noise, such as occlusion by leaves and stems and variations in illumination, which can strongly impair their generalization in field conditions. The same diversity of crack appearances also makes it difficult to collect annotated datasets that are sufficiently large and stable for robust training. To alleviate this data scarcity, synthetic data generation was used to support model pre-training, and crack recognition in real-world environments was formulated as a two-stage framework comprising fruit detection followed by fruit-level crack classification. In the first stage, cherry tomatoes are detected using a You Only Look Once (YOLO)-based object detector. In the second stage, each detected fruit instance is classified as cracked or non-cracked through image-level classification using a Siamese network. On images from real cultivation environments, the proposed method successfully detected red cherry tomatoes and achieved a crack classification accuracy of approximately 88% for them, demonstrating its effectiveness for fruit-level crack differentiation under practical cultivation conditions.
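The control flow of the two-stage formulation can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: `toy_detector` and `toy_embed` are hypothetical stand-ins for a trained YOLO detector and one branch of a trained Siamese network, and the nearest-reference decision rule is one common way to use Siamese embeddings for classification that the abstract does not confirm.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]   # (x_min, y_min, x_max, y_max) in pixels
Image = List[List[float]]         # grayscale image as rows of pixel values


@dataclass
class FruitResult:
    box: Box
    cracked: bool
    dist_cracked: float
    dist_intact: float


def crop(image: Image, box: Box) -> Image:
    """Cut the detected fruit region out of the full image."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def classify_fruits(
    image: Image,
    detect: Callable[[Image], List[Box]],      # stage 1: fruit detector
    embed: Callable[[Image], Sequence[float]], # stage 2: embedding branch
    cracked_ref: Sequence[float],
    intact_ref: Sequence[float],
) -> List[FruitResult]:
    """Detect fruits, then label each crop by its nearest reference embedding."""
    results = []
    for box in detect(image):
        emb = embed(crop(image, box))
        d_cracked = euclidean(emb, cracked_ref)
        d_intact = euclidean(emb, intact_ref)
        results.append(FruitResult(box, d_cracked < d_intact, d_cracked, d_intact))
    return results


# --- Hypothetical stand-ins so the sketch runs without trained models ---
def toy_detector(image: Image) -> List[Box]:
    # Assumption: a real detector would return predicted fruit boxes.
    return [(0, 0, 2, 2), (2, 2, 4, 4)]


def toy_embed(patch: Image) -> List[float]:
    # Assumption: a real Siamese branch would return a learned feature vector.
    pixels = [p for row in patch for p in row]
    return [sum(pixels) / len(pixels), max(pixels) - min(pixels)]


toy_image: Image = [
    [0.5, 0.5, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.1, 0.9],
    [0.0, 0.0, 0.9, 0.1],
]

results = classify_fruits(
    toy_image, toy_detector, toy_embed,
    cracked_ref=[0.5, 1.0],  # illustrative reference embeddings
    intact_ref=[0.5, 0.0],
)
for r in results:
    print(r.box, "cracked" if r.cracked else "non-cracked")
```

Because the classifier only ever sees the cropped fruit, leaves, stems, and background outside the detected box do not reach the second stage, which is the motivation the abstract gives for separating detection from crack classification.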

Crack detection via YOLO-Siamese network


Cite this article as:

Z. Tan and M. Sato, “Two-Stage Recognition Framework Based on YOLO and Siamese Networks for Crack Detection in Cherry Tomatoes,” J. Robot. Mechatron., Vol.38 No.3, pp. 938-952, 2026.



This article is published under a Creative Commons Attribution-NoDerivatives 4.0 International License.
