Research Paper:
DroneDetect: Multiscale Feature Fusion and Attention-Driven Architecture for UAV Object Detection
Huiyao Zhang*,**
*School of IoT Engineering, Wuxi Taihu University
No.68 Qianrong Road, Binhu District, Wuxi, Jiangsu 214064, China
**Provincial Key (Construction) Laboratory of Intelligent Internet of Things Technology and Applications in Universities
No.68 Qianrong Road, Binhu District, Wuxi, Jiangsu 214064, China
Aerial object detection continues to face significant challenges such as complex scene compositions, highly variable object sizes and densities, and diverse imaging perspectives. This study presents DroneDetect, a series of progressively enhanced models specifically designed to address these challenges in aerial detection tasks. To handle the complexity of aerial scenes and enable effective semantic information extraction, we propose an efficient up-convolution block with a multi-branch auxiliary feature pyramid network that enhances multiscale feature fusion. Building on this foundation, we address the critical need for precise spatial localization by introducing a Cross-Stage Receptive Field Attention module that integrates CSPNet with an improved receptive field attention convolution, enabling dynamic spatial attention mechanisms to capture fine-grained positional information. To ensure practical deployment efficiency while maintaining detection accuracy, we developed a lightweight shared detail-enhanced convolution detection head that optimizes parameter utilization and reduces computational overhead. Extensive experiments on multiple aerial datasets demonstrate the effectiveness of the proposed approach. On the VisDrone validation set, DroneDetect-Enhanced achieved an AP of 24.05%, a significant improvement of 2.39% over the baseline YOLOv8s. Cross-validation results further confirm the generalizability of our model, with performance gains of 3.9% on UAVDT, 3.0% on CARPK, and 1.5% on DIOR. Notably, DroneDetect-Enhanced maintains comparable or reduced computational complexity while using fewer parameters than the baseline. Comprehensive ablation studies and comparative analyses with state-of-the-art methods confirm that our approach effectively balances accuracy and efficiency for real-world aerial object detection applications. The code is available at https://github.com/aerialCV/DroneDetect.
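The core idea behind receptive field attention, as described above, is that each output position weights the pixels inside its own k×k receptive field individually, rather than applying one kernel shared across all positions. The following is a minimal pure-Python sketch of that idea on a single-channel feature map; the function name, the softmax-based scoring rule, and the zero-padding are illustrative assumptions, not the paper's actual implementation.

```python
# Hedged sketch of receptive-field spatial attention: for every output
# position, score each pixel in its k x k window, normalize the scores
# with a softmax (so they sum to 1), and emit the attention-weighted sum.
# Unlike a standard convolution, the weights differ per position because
# they are computed from the window's own contents.
import math

def rfa_pool(feat, k=3):
    """feat: 2D list (H x W) of floats. Returns an H x W pooled map."""
    h, w = len(feat), len(feat[0])
    r = k // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            # Gather the k x k receptive field, zero-padding the borders.
            patch = []
            for di in range(-r, r + 1):
                for dj in range(-r, r + 1):
                    y, x = i + di, j + dj
                    patch.append(feat[y][x] if 0 <= y < h and 0 <= x < w else 0.0)
            # Softmax over the window: larger activations get larger weights.
            m = max(patch)
            exps = [math.exp(v - m) for v in patch]
            s = sum(exps)
            out[i][j] = sum(v * e / s for v, e in zip(patch, exps))
    return out

# Toy feature map with a strong central activation.
fmap = [[0.0, 1.0, 0.0],
        [1.0, 4.0, 1.0],
        [0.0, 1.0, 0.0]]
pooled = rfa_pool(fmap)
```

Because the softmax concentrates weight on the strongest activations in each window, the pooled center value stays close to the peak (here, between 3 and 4) instead of being averaged down, which is the behavior that helps localize small, densely packed objects.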
This article is published under a Creative Commons Attribution-NoDerivatives 4.0 International License.