
JACIII Vol.29 No.6 pp. 1249-1261
doi: 10.20965/jaciii.2025.p1249
(2025)

Research Paper:

Pedestrian Re-Recognition Based on Spatiotemporal Transformer Skeleton Contrastive Learning and Feature Optimization

Yanru Jia*1, Yuanyuan Zhang*2,*3,*4,†, and Yilun Gao*2,*3,*4

*1School of Big Data and Artificial Intelligence, Xinyang University
7th New Avenue West, Xinyang, Henan 464000, China

*2School of Automation, China University of Geosciences
No.388 Lumo Road, Hongshan District, Wuhan, Hubei 430074, China

*3Hubei Key Laboratory of Advanced Control and Intelligent Automation for Complex Systems
No.388 Lumo Road, Hongshan District, Wuhan, Hubei 430074, China

*4Engineering Research Center of Intelligent Technology for Geo-Exploration, Ministry of Education
No.388 Lumo Road, Hongshan District, Wuhan, Hubei 430074, China

†Corresponding author

Received:
January 28, 2025
Accepted:
May 26, 2025
Published:
November 20, 2025
Keywords:
pedestrian re-identification, Transformer, contrastive learning, prompt learning, skeleton recognition
Abstract

Person re-identification is an important task in computer vision that aims to confirm identities across cameras by matching the same pedestrian under different views. When traditional image-based methods degrade under lighting changes, occlusion, and viewpoint variation, the advantages of skeleton data become apparent. Existing methods typically design skeleton descriptors from raw body joints or learn skeleton sequence representations, but they seldom model the relationships between different body components jointly, and rarely model skeleton information along both the temporal and spatial dimensions. Therefore, in this paper, we propose a universal skeleton contrastive learning method based on the spatiotemporal Transformer (Space-time Transformer, StFormer). The method first adopts a Space-time Attention (S-T Attention) mechanism and models relationships among spatiotemporal features by stacking multiple S-T Attention blocks. Second, to help the model extract important cues from the data features, a Feature Refinement Box (FR Box) is proposed. Finally, we propose a prompt learning mechanism (P-Study) that uses the spatiotemporal context of graph nodes to prompt skeleton graph reconstruction, helping the model capture more valuable patterns and graph semantics.
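The abstract's core architectural idea, stacking attention blocks that alternate between the spatial axis (joints within a frame) and the temporal axis (the same joint across frames), can be sketched in a few lines. The sketch below is a hypothetical illustration, not the paper's implementation: it uses parameter-free scaled dot-product self-attention and omits the learned query/key/value projections, multi-head splitting, residual connections, and normalization that a real StFormer block would include.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(seq):
    """Parameter-free scaled dot-product self-attention over a
    list of feature vectors (each vector attends to all others)."""
    d = len(seq[0])
    out = []
    for q in seq:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in seq]
        w = softmax(scores)
        out.append([sum(wj * v[i] for wj, v in zip(w, seq))
                    for i in range(d)])
    return out

def st_attention_block(x):
    """One factorized space-time block over x[frame][joint][feature]:
    spatial attention across joints within each frame, then temporal
    attention across frames for each joint. Output shape equals input."""
    # Spatial pass: joints of the same frame attend to each other.
    x = [attention(frame) for frame in x]
    # Temporal pass: the same joint attends across all frames.
    t, j = len(x), len(x[0])
    for joint in range(j):
        col = attention([x[f][joint] for f in range(t)])
        for f in range(t):
            x[f][joint] = col[f]
    return x

# Toy skeleton sequence: 4 frames, 3 joints, 2-D features per joint.
seq = [[[float(f + jnt), float(f * jnt)] for jnt in range(3)]
       for f in range(4)]
out = seq
for _ in range(2):  # stack multiple S-T Attention blocks, as in the paper
    out = st_attention_block(out)
print(len(out), len(out[0]), len(out[0][0]))  # shape preserved: 4 3 2
```

Factorizing attention this way keeps the cost linear in frames × joints per axis rather than quadratic in their product, which is the usual motivation for space-time decompositions of this kind.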

Cite this article as:
Y. Jia, Y. Zhang, and Y. Gao, “Pedestrian Re-Recognition Based on Spatiotemporal Transformer Skeleton Contrastive Learning and Feature Optimization,” J. Adv. Comput. Intell. Intell. Inform., Vol.29 No.6, pp. 1249-1261, 2025.
