JRM Vol.38 No.3 pp. 938-952
(2026)

Paper:

Two-Stage Recognition Framework Based on YOLO and Siamese Networks for Crack Detection in Cherry Tomatoes

Zhaohui Tan and Masanori Sato

Nagasaki Institute of Applied Science
536 Abamachi, Nagasaki, Nagasaki 851-0193, Japan

Received: September 30, 2025
Accepted: April 7, 2026
Published: June 20, 2026
Keywords: smart agriculture, object detection, image classification, deep learning, two-stage recognition framework
Abstract

We propose a deep learning-based two-stage recognition system for fruit-level crack classification in cherry tomatoes, targeting harvesting and sorting scenarios in real-world cultivation environments where leaves and stems are present. Cherry tomato cracking exhibits substantial visual variability, ranging from clearly split fruits to subtle white linear cracks around the calyx region. As a result, crack-region-based or bounding-box-driven detection methods are highly susceptible to external noise, such as occlusion by leaves and stems and variations in illumination, which can strongly impair their generalization performance under field conditions. The wide diversity of crack appearances also makes it difficult to collect sufficiently large and stable annotated datasets for robust training. To alleviate this data scarcity, synthetic data generation was used to support model pre-training. Crack recognition in real-world environments was formulated as a two-stage framework comprising fruit detection followed by fruit-level crack classification. In the first stage, cherry tomatoes are detected using a You Only Look Once (YOLO)-based object detector. In the second stage, each detected fruit instance is classified as cracked or non-cracked through image-level classification using a Siamese network. On images captured in real-world environments, the proposed method successfully detected red cherry tomatoes and achieved a crack classification accuracy of approximately 88% for them, demonstrating its effectiveness for fruit-level crack differentiation under practical cultivation conditions.
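
The two-stage flow described in the abstract (detect fruits, then classify each detected fruit) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_detect` stands in for the YOLO-based detector, `toy_embed` stands in for the Siamese network's embedding branch, and the nearest-reference decision rule in `classify_crop` is an assumed way of turning embedding distances into a cracked or non-cracked label. All names and the toy image are hypothetical.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]   # (x_min, y_min, x_max, y_max) in pixels
Image = List[List[float]]         # toy grayscale image: rows of pixel values
Embedding = Sequence[float]


@dataclass
class FruitResult:
    box: Box
    label: str  # "cracked" or "non-cracked"


def crop(image: Image, box: Box) -> Image:
    """Cut the detected fruit region out of the full image."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]


def euclidean(a: Embedding, b: Embedding) -> float:
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def classify_crop(emb: Embedding,
                  cracked_refs: List[Embedding],
                  intact_refs: List[Embedding]) -> str:
    """Assumed Siamese-style rule: pick the class of the nearest reference."""
    d_cracked = min(euclidean(emb, r) for r in cracked_refs)
    d_intact = min(euclidean(emb, r) for r in intact_refs)
    return "cracked" if d_cracked < d_intact else "non-cracked"


def two_stage(image: Image,
              detect: Callable[[Image], List[Box]],
              embed: Callable[[Image], Embedding],
              cracked_refs: List[Embedding],
              intact_refs: List[Embedding]) -> List[FruitResult]:
    """Stage 1: detect fruits. Stage 2: classify each detected fruit crop."""
    results = []
    for box in detect(image):
        emb = embed(crop(image, box))
        results.append(FruitResult(box, classify_crop(emb, cracked_refs, intact_refs)))
    return results


# --- Hypothetical stand-ins for the trained models -------------------------
def toy_detect(image: Image) -> List[Box]:
    # A real system would run a YOLO-based detector here.
    return [(0, 0, 4, 4), (4, 0, 8, 4)]


def toy_embed(patch: Image) -> Embedding:
    # A real system would run the Siamese embedding branch here.
    pixels = [p for row in patch for p in row]
    return (sum(pixels) / len(pixels), max(pixels) - min(pixels))


# Toy image: the left fruit is uniform, the right fruit has a bright streak.
image = [[0.5] * 8 for _ in range(4)]
image[1][4:8] = [1.0] * 4

results = two_stage(image, toy_detect, toy_embed,
                    cracked_refs=[(0.6, 0.5)], intact_refs=[(0.5, 0.0)])
for r in results:
    print(r.box, r.label)
```

Keeping the detector and the embedding function as plain callables means either stage can be swapped or retrained independently, which is the practical appeal of separating detection from fruit-level classification.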

Crack detection via YOLO-Siamese network

Cite this article as:
Z. Tan and M. Sato, “Two-Stage Recognition Framework Based on YOLO and Siamese Networks for Crack Detection in Cherry Tomatoes,” J. Robot. Mechatron., Vol.38 No.3, pp. 938-952, 2026.

