
JACIII Vol.29 No.5 pp. 1190-1202 (2025)
doi: 10.20965/jaciii.2025.p1190

Research Paper:

Verbal Representation of Object Collision Prediction Based on Physical Properties

Eri Kuroda and Ichiro Kobayashi

Ochanomizu University
2-1-1 Ohtsuka, Bunkyo-ku, Tokyo 112-8610, Japan

Received: March 8, 2025
Accepted: June 2, 2025
Published: September 20, 2025
Keywords: real-world recognition, prediction, physical common sense knowledge
Abstract

In recent years, prediction models for the real world have been widely proposed. However, most work on real-world recognition and prediction generates its results either as visual predictions, such as pixel-level changes, or as numerical changes in a physics simulator, and few models make predictions based on both visual and physical characteristics, as humans do. In this study, we therefore constructed a new prediction model that draws on both visual information and the physical characteristics of the environment by integrating the mechanism of variational temporal abstraction, which extracts change points in the observed environment from visual information, into PreCNet. Furthermore, to make the prediction results interpretable, we generated the inferred prediction content as sentences. We also verified whether the generated sentences could explain collision situations in as much detail as a human when given physical common-sense knowledge about the environment, such as the motion and mass of objects.
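The abstract outlines the pipeline at a high level: a PreCNet-style predictive-coding frame predictor, a variational-temporal-abstraction-style change-point detector over its latent states, and a verbalizer that turns a predicted event into a sentence. The following is a minimal, hypothetical sketch of that idea, not the authors' implementation; the module structure, dimensions, and the template-based verbalizer are illustrative assumptions.

# Hypothetical sketch of the pipeline described in the abstract (illustrative only).
import torch
import torch.nn as nn

class FramePredictor(nn.Module):
    """Stand-in for a PreCNet-like predictive-coding frame predictor."""
    def __init__(self, obs_dim: int = 64, hidden_dim: int = 128):
        super().__init__()
        self.rnn = nn.GRUCell(obs_dim, hidden_dim)
        self.decode = nn.Linear(hidden_dim, obs_dim)  # predicted next observation
        self.hidden_dim = hidden_dim

    def forward(self, obs: torch.Tensor, h: torch.Tensor):
        h = self.rnn(obs, h)          # update latent state from the current frame
        pred_next = self.decode(h)    # predict the next frame's features;
                                      # in predictive coding, the error against the
                                      # actual next frame would drive learning
        return pred_next, h

class BoundaryDetector(nn.Module):
    """Stand-in for a VTA-like change-point detector over latent states."""
    def __init__(self, hidden_dim: int = 128):
        super().__init__()
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Probability that the current step is a temporal boundary (e.g., a collision).
        return torch.sigmoid(self.score(h))

def verbalize(event_prob: float, obj_a: str, obj_b: str, threshold: float = 0.5) -> str:
    """Toy template-based verbalizer for a predicted collision (assumed, for illustration)."""
    if event_prob > threshold:
        return f"The {obj_a} is predicted to collide with the {obj_b}."
    return f"The {obj_a} and the {obj_b} are predicted to keep moving without colliding."

if __name__ == "__main__":
    predictor, detector = FramePredictor(), BoundaryDetector()
    h = torch.zeros(1, predictor.hidden_dim)
    frames = torch.randn(10, 1, 64)  # dummy sequence of per-frame visual features
    for obs in frames:
        pred_next, h = predictor(obs, h)
        p_boundary = detector(h).item()
    print(verbalize(p_boundary, "red sphere", "blue cube"))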

Overview of our research

Cite this article as:
E. Kuroda and I. Kobayashi, “Verbal Representation of Object Collision Prediction Based on Physical Properties,” J. Adv. Comput. Intell. Intell. Inform., Vol.29 No.5, pp. 1190-1202, 2025.
References
[1] D. Ha and J. Schmidhuber, “World models,” arXiv preprint, arXiv:1803.10122, 2018. https://doi.org/10.5281/zenodo.1207631
[2] Y. LeCun, “A path towards autonomous machine intelligence,” 2022.
[3] W. Lotter, G. Kreiman, and D. Cox, “Deep predictive coding networks for video prediction and unsupervised learning,” arXiv preprint, arXiv:1605.08104, 2016. https://doi.org/10.48550/arXiv.1605.08104
[4] Y. Wang, H. Wu, J. Zhang, Z. Gao, J. Wang, P. S. Yu, and M. Long, “PredRNN: A recurrent neural network for spatiotemporal predictive learning,” arXiv preprint, arXiv:2103.09504, 2021. https://doi.org/10.48550/arXiv.2103.09504
[5] Z. Straka, T. Svoboda, and M. Hoffmann, “PreCNet: Next-frame video prediction based on predictive coding,” IEEE Trans. Neural Netw. Learn. Syst., Vol.35, No.8, pp. 10353-10367, 2023. https://doi.org/10.1109/TNNLS.2023.3240857
[6] J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. L. Zitnick, and R. Girshick, “CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning,” arXiv preprint, arXiv:1612.06890, 2016. https://doi.org/10.48550/arXiv.1612.06890
[7] K. Yi, C. Gan, Y. Li, P. Kohli, J. Wu, A. Torralba, and J. B. Tenenbaum, “CLEVRER: CoLlision events for video REpresentation and reasoning,” arXiv preprint, arXiv:1910.01442, 2019. https://doi.org/10.48550/arXiv.1910.01442
[8] C. P. Burgess, L. Matthey, N. Watters, R. Kabra, I. Higgins, M. Botvinick, and A. Lerchner, “MONet: Unsupervised scene decomposition and representation,” arXiv preprint, arXiv:1901.11390, 2019. https://doi.org/10.48550/arXiv.1901.11390
[9] J. Mao, X. Yang, X. Zhang, N. D. Goodman, and J. Wu, “CLEVRER-humans: Describing physical and causal events the human way,” arXiv preprint, arXiv:2310.03635, 2023. https://doi.org/10.48550/arXiv.2310.03635
[10] R. Zellers, A. Holtzman, M. Peters, R. Mottaghi, A. Kembhavi, A. Farhadi, and Y. Choi, “PIGLeT: Language grounding through neuro-symbolic interaction in a 3D world,” Proc. of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th Int. Joint Conf. on Natural Language Processing (Vol.1: Long Papers), pp. 2040-2050, 2021.
[11] T. Kim, S. Ahn, and Y. Bengio, “Variational temporal abstraction,” arXiv preprint, arXiv:1910.00775, 2019. https://doi.org/10.48550/arXiv.1910.00775
[12] D. Bolya, C. Zhou, F. Xiao, and Y. J. Lee, “YOLACT: Real-time instance segmentation,” arXiv preprint, arXiv:1904.02689, 2019. https://doi.org/10.48550/arXiv.1904.02689
[13] T.-Y. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, C. L. Zitnick, and P. Dollár, “Microsoft COCO: Common objects in context,” arXiv preprint, arXiv:1405.0312, 2014. https://doi.org/10.48550/arXiv.1405.0312
[14] A. Mafla, S. Dey, A. F. Biten, L. Gomez, and D. Karatzas, “Multi-modal reasoning graph for scene-text based fine-grained image classification and retrieval,” 2021 IEEE Winter Conf. on Applications of Computer Vision (WACV), pp. 4023-4033, 2021.
[15] Y. Zhang, J. Gao, X. Yang, C. Liu, Y. Li, and C. Xu, “Find objects and focus on highlights: Mining object semantics for video highlight detection via graph neural networks,” Proc. Conf. AAAI Artif. Intell., Vol.34, No.07, pp. 12902-12909, 2020. https://doi.org/10.1609/aaai.v34i07.6988
[16] A. Grover and J. Leskovec, “node2vec: Scalable feature learning for networks,” arXiv preprint, arXiv:1607.00653, 2016. https://doi.org/10.48550/arXiv.1607.00653
[17] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” arXiv preprint, arXiv:1301.3781, 2013. https://doi.org/10.48550/arXiv.1301.3781
[18] A. Narayanan, M. Chandramohan, R. Venkatesan, L. Chen, Y. Liu, and S. Jaiswal, “graph2vec: Learning distributed representations of graphs,” arXiv preprint, arXiv:1707.05005, 2017. https://doi.org/10.48550/arXiv.1707.05005
[19] Q. V. Le and T. Mikolov, “Distributed representations of sentences and documents,” arXiv preprint, arXiv:1405.4053, 2014. https://doi.org/10.48550/arXiv.1405.4053
[20] E. Kuroda and I. Kobayashi, “Predictive inference model of the physical environment that emulates predictive coding,” A. Bifet, A. C. Lorena, R. P. Ribeiro, J. Gama, and P. H. Abreu (Eds.), “Discovery Science,” pp. 431-445, Springer, 2023. https://doi.org/10.1007/978-3-031-45275-8_29
[21] D. P. Kingma, T. Salimans, and M. Welling, “Variational dropout and the local reparameterization trick,” arXiv preprint, arXiv:1506.02557, 2015. https://doi.org/10.48550/arXiv.1506.02557
[22] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “BLEU: A method for automatic evaluation of machine translation,” Proc. of the 40th Annual Meeting on Association for Computational Linguistics (ACL’02), pp. 311-318, 2002. https://doi.org/10.3115/1073083.1073135
[23] A. Lavie and A. Agarwal, “METEOR: An automatic metric for MT evaluation with high levels of correlation with human judgments,” Proc. of the Second Workshop on Statistical Machine Translation, pp. 228-231, 2007.
[24] R. Vedantam, C. L. Zitnick, and D. Parikh, “CIDEr: Consensus-based image description evaluation,” arXiv preprint, arXiv:1411.5726, 2014. https://doi.org/10.48550/arXiv.1411.5726
[25] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” arXiv preprint, arXiv:1910.10683, 2019. https://doi.org/10.48550/arXiv.1910.10683
[26] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” arXiv preprint, arXiv:1706.03762, 2017. https://doi.org/10.48550/arXiv.1706.03762
[27] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint, arXiv:1711.05101, 2017. https://doi.org/10.48550/arXiv.1711.05101
[28] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi, “BERTScore: Evaluating text generation with BERT,” arXiv preprint, arXiv:1904.09675, 2019. https://doi.org/10.48550/arXiv.1904.09675
[29] T. Sellam, D. Das, and A. P. Parikh, “BLEURT: Learning robust metrics for text generation,” arXiv preprint, arXiv:2004.04696, 2020. https://doi.org/10.48550/arXiv.2004.04696
[30] C.-Y. Lin, “ROUGE: A package for automatic evaluation of summaries,” Proc. of the Workshop on Text Summarization Branches Out (WAS 2004), pp. 74-81, 2004.
[31] Y. Liu, D. Iter, Y. Xu, S. Wang, R. Xu, and C. Zhu, “G-eval: NLG evaluation using GPT-4 with better human alignment,” arXiv preprint, arXiv:2303.16634, 2023. https://doi.org/10.48550/arXiv.2303.16634
