
JACIII Vol.29 No.6 pp. 1249-1261
doi: 10.20965/jaciii.2025.p1249
(2025)

Research Paper:

Pedestrian Re-Recognition Based on Spatiotemporal Transformer Skeleton Contrastive Learning and Feature Optimization

Yanru Jia*1, Yuanyuan Zhang*2,*3,*4,†, and Yilun Gao*2,*3,*4

*1School of Big Data and Artificial Intelligence, Xinyang University
7th New Avenue West, Xinyang, Henan 464000, China

*2School of Automation, China University of Geosciences
No.388 Lumo Road, Hongshan District, Wuhan, Hubei 430074, China

*3Hubei Key Laboratory of Advanced Control and Intelligent Automation for Complex Systems
No.388 Lumo Road, Hongshan District, Wuhan, Hubei 430074, China

*4Engineering Research Center of Intelligent Technology for Geo-Exploration, Ministry of Education
No.388 Lumo Road, Hongshan District, Wuhan, Hubei 430074, China

†Corresponding author

Received:
January 28, 2025
Accepted:
May 26, 2025
Published:
November 20, 2025
Keywords:
pedestrian re-identification, Transformer, contrastive learning, prompt learning, skeleton recognition
Abstract

Person re-identification is an important task in computer vision that aims to confirm identities across cameras by matching the same pedestrian under different views. When traditional image-based methods degrade under lighting changes, occlusion, and viewpoint variation, the advantages of skeleton data become apparent. Existing methods typically design skeleton descriptors from raw body joints or learn skeleton sequence representations, but they seldom model the relationships between different body components jointly, and rarely model skeleton information along both the temporal and spatial dimensions. Therefore, in this paper, we propose a universal skeleton contrastive learning method based on the spatiotemporal Transformer (Space-time Transformer, StFormer). The method first adopts a Space-time Attention (S-T Attention) mechanism and models relationships among spatiotemporal features by stacking multiple S-T Attention blocks. Second, to help the model extract important cues from the data features, a Feature Refinement Box (FR Box) is proposed. Finally, we propose a prompt learning mechanism (P-Study) that uses the spatiotemporal context of graph nodes to prompt skeleton graph reconstruction, helping the model capture more valuable patterns and graph semantics.
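The abstract's core architectural idea, stacking attention blocks that alternate between the spatial axis (joints within a frame) and the temporal axis (the same joint across frames), can be sketched in a few lines. The sketch below is a hypothetical illustration, not the paper's implementation: it uses parameter-free scaled dot-product self-attention and omits the learned query/key/value projections, multi-head splitting, residual connections, and normalization that a real StFormer block would include.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(seq):
    """Parameter-free scaled dot-product self-attention over a
    list of feature vectors (each vector attends to all others)."""
    d = len(seq[0])
    out = []
    for q in seq:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in seq]
        w = softmax(scores)
        out.append([sum(wj * v[i] for wj, v in zip(w, seq))
                    for i in range(d)])
    return out

def st_attention_block(x):
    """One factorized space-time block over x[frame][joint][feature]:
    spatial attention across joints within each frame, then temporal
    attention across frames for each joint. Output shape equals input."""
    # Spatial pass: joints of the same frame attend to each other.
    x = [attention(frame) for frame in x]
    # Temporal pass: the same joint attends across all frames.
    t, j = len(x), len(x[0])
    for joint in range(j):
        col = attention([x[f][joint] for f in range(t)])
        for f in range(t):
            x[f][joint] = col[f]
    return x

# Toy skeleton sequence: 4 frames, 3 joints, 2-D features per joint.
seq = [[[float(f + jnt), float(f * jnt)] for jnt in range(3)]
       for f in range(4)]
out = seq
for _ in range(2):  # stack multiple S-T Attention blocks, as in the paper
    out = st_attention_block(out)
print(len(out), len(out[0]), len(out[0][0]))  # shape preserved: 4 3 2
```

Factorizing attention this way keeps the cost linear in frames × joints per axis rather than quadratic in their product, which is the usual motivation for space-time decompositions of this kind.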

Cite this article as:
Y. Jia, Y. Zhang, and Y. Gao, “Pedestrian Re-Recognition Based on Spatiotemporal Transformer Skeleton Contrastive Learning and Feature Optimization,” J. Adv. Comput. Intell. Intell. Inform., Vol.29 No.6, pp. 1249-1261, 2025.
