
JACIII Vol.30 No.2 pp. 543-557
(2026)

Research Paper:

DroneDetect: Multiscale Feature Fusion and Attention-Driven Architecture for UAV Object Detection

Huiyao Zhang*,**

*School of IoT Engineering, Wuxi Taihu University
No.68 Qianrong Road, Binhu District, Wuxi, Jiangsu 214064, China

**Provincial Key (Construction) Laboratory of Intelligent Internet of Things Technology and Applications in Universities
No.68 Qianrong Road, Binhu District, Wuxi, Jiangsu 214064, China

Received: May 7, 2025
Accepted: November 18, 2025
Published: March 20, 2026

Keywords: aerial imagery, multiscale feature pyramid, spatial features, lightweight detection head, detection performance optimization
Abstract

Aerial object detection continues to face significant challenges such as complex scene compositions, highly variable object sizes and densities, and diverse imaging perspectives. This study presents DroneDetect, a series of progressively enhanced models specifically designed to address these challenges in aerial detection tasks. To address the complexity of aerial scenes and enable effective semantic information extraction, we propose an efficient up-convolution block with a multi-branch auxiliary feature pyramid network that enhances multiscale feature fusion. Building on this foundation, we address the critical need for precise spatial localization by introducing a Cross-Stage Receptive Field Attention module that integrates CSPNet with an improved receptive field attention convolution, enabling dynamic spatial attention mechanisms to capture fine-grained positional information. To ensure practical deployment efficiency while maintaining detection accuracy, we develop a lightweight shared detail-enhanced convolution detection head that optimizes parameter utilization and reduces computational overhead. Extensive experiments on multiple aerial datasets demonstrate the effectiveness of the proposed approach. On VisDrone, DroneDetect-Enhanced achieved an AP of 24.05% on the validation set, a 2.39% improvement over the baseline YOLOv8s. The cross-validation results further validate the generalizability of our model, with performance gains of 3.9% on UAVDT, 3.0% on CARPK, and 1.5% on DIOR. Notably, DroneDetect-Enhanced maintains comparable or reduced computational complexity while using fewer parameters than the baseline. Comprehensive ablation studies and comparative analyses with state-of-the-art methods confirm that our approach effectively balances accuracy and efficiency for real-world aerial object detection applications. The code is available at https://github.com/aerialCV/DroneDetect.
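The receptive-field attention idea underlying the Cross-Stage Receptive Field Attention module can be illustrated with a small sketch: for each spatial position, the module scores the positions in its local receptive field and re-weights them with a softmax, so that informative neighbors contribute more to the output feature. The NumPy sketch below is a simplified toy version under stated assumptions, not the paper's implementation: it uses a fixed channel-mean score instead of learned attention projections, omits the CSPNet cross-stage wrapping, and the function name `receptive_field_attention` is illustrative only.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def receptive_field_attention(feat, k=3):
    """Toy receptive-field spatial attention on a (C, H, W) feature map.

    For each spatial position, gather its k*k neighborhood, score each
    neighbor by its channel-mean activation, softmax the scores over the
    neighborhood, and return the attention-weighted average. A learned
    variant would replace the channel-mean score with trained projections.
    """
    C, H, W = feat.shape
    pad = k // 2
    padded = np.pad(feat, ((0, 0), (pad, pad), (pad, pad)), mode="edge")
    # (C, H, W, k*k): all k*k shifted views of the padded map,
    # i.e., every position's receptive field laid out along the last axis.
    patches = np.stack(
        [padded[:, i:i + H, j:j + W] for i in range(k) for j in range(k)],
        axis=-1,
    )
    scores = patches.mean(axis=0)            # (H, W, k*k) channel-mean per neighbor
    weights = softmax(scores, axis=-1)       # attention over each receptive field
    return (patches * weights).sum(axis=-1)  # (C, H, W) re-weighted features
```

Because the weights in each receptive field sum to one, a constant feature map passes through unchanged, which is a quick sanity check that the attention is a convex re-weighting rather than a rescaling.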

Cite this article as:
H. Zhang, “DroneDetect: Multiscale Feature Fusion and Attention-Driven Architecture for UAV Object Detection,” J. Adv. Comput. Intell. Intell. Inform., Vol.30 No.2, pp. 543-557, 2026.

