Research Paper:
Perceptual Interaction System of Elderly Care Robot Based on Multimodal Large Language Models and Distributed Computing
Aihui Wang, Kuozhan Wang, Yan Wang, Xuebin Yue, Hengyi Li, and Yao Yao
School of Automation and Electrical Engineering, Zhongyuan University of Technology
No.41 Zhongyuan Road, Zhengzhou 450007, China
Corresponding author
Population aging is increasing the demand for elderly care robots. However, current robots have limited environmental perception and interaction capabilities, which makes it difficult for them to meet practical application needs. We propose a perceptual interaction system for elderly care robots that combines multimodal large language models with distributed computing. Under this distributed design, the vision-language models and the large language model are deployed on a server, while the robot performs environmental perception and kinematic solving. Allocating computational resources in this way enables complex human–robot interactions such as dialogue, visual question answering (VQA), and object retrieval assistance. Experimental results show that the VQA module achieves an accuracy of 53.16% on the COCO-QA dataset and 66.7% on the VQA-v2 dataset. The zero-shot detection (ZSD) module achieves a mean average precision (mAP) of 43.8% on the COCO val 2014 dataset. These models were deployed on the robotic system in a constrained, simulated living room setting. The average response time of the robotic interaction system was 1.346 seconds. We also collected feedback from 10 participating users to verify the feasibility of the robotic system in a home setting.
Elderly care robot system architecture
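The division of labor described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. All function names, the task labels, and the request format are hypothetical, and the server-side models are replaced by local stubs; in a real deployment these calls would cross the network to the server hosting the vision-language models and the large language model.

```python
import time
from statistics import mean

# Hypothetical stand-ins for the server-side models. The paper's actual
# VQA, ZSD, and LLM services are not reproduced here.
def server_vqa(image, question):
    return f"answer to '{question}'"

def server_zsd(image, label):
    return {"label": label, "box": (0, 0, 10, 10)}

def server_llm(text):
    return f"reply to '{text}'"

# Assumed mapping from interaction type to the server-side model that serves it.
SERVER_HANDLERS = {
    "dialogue": lambda req: server_llm(req["text"]),
    "vqa": lambda req: server_vqa(req["image"], req["text"]),
    "retrieve": lambda req: server_zsd(req["image"], req["text"]),
}

def robot_capture_image():
    """Robot-side perception stub: returns a placeholder camera frame."""
    return b"frame"

def handle_request(task, text):
    """Route one user request and return (result, elapsed_seconds)."""
    if task not in SERVER_HANDLERS:
        raise ValueError(f"unknown task: {task}")
    start = time.perf_counter()
    request = {"text": text}
    if task != "dialogue":
        # Only visual tasks need a frame from the robot's camera.
        request["image"] = robot_capture_image()
    result = SERVER_HANDLERS[task](request)
    return result, time.perf_counter() - start

def average_response_time(requests):
    """Mean elapsed time over a list of (task, text) pairs."""
    return mean(handle_request(task, text)[1] for task, text in requests)
```

Calling `handle_request("retrieve", "cup")` returns the stub detection together with the elapsed time, and `average_response_time` aggregates those timings in the same spirit as the average response time reported in the abstract. With local stubs the measured times are near zero and say nothing about the real system.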
This article is published under a Creative Commons Attribution-NoDerivatives 4.0 International License.