
JACIII Vol.29 No.6, pp. 1417-1426, 2025
doi: 10.20965/jaciii.2025.p1417

Research Paper:

Interactive Image Caption Generation Reflecting User Intent from Trace Using a Diffusion Language Model

Satoko Hirano and Ichiro Kobayashi

Ochanomizu University
2-1-1 Otsuka, Bunkyo-ku, Tokyo 112-8610, Japan

Received: March 28, 2025
Accepted: July 16, 2025
Published: November 20, 2025
Keywords: deep learning, natural language processing, image captioning, diffusion model, interaction
Abstract

This study proposes an image captioning method that incorporates the user’s explanatory intent, as signaled by the user’s trace on the image, into the generated text. We extract areas of interest from dense sections of the trace, determine the order of explanations by tracking changes in the pen-tip coordinates, and assess the degree of interest in each area from the time spent on it. In addition, a diffusion language model is used to generate sentences in a non-autoregressive manner, which allows sentence length to be controlled from the temporal data of the trace. In an actual caption generation task, the proposed method achieved higher string similarity than conventional methods, including autoregressive models, and it captured the user’s intent from the trace and reflected it faithfully in the generated text.

Overview of the proposed method
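As a rough illustration of the trace-processing step summarized in the abstract, the sketch below shows one plausible way to segment a pen trace into areas of interest, order them along the pen-tip trajectory, and weight them by dwell time. This is not the authors’ implementation: the TracePoint structure, the grid-based density heuristic, and the thresholds (cell_size, min_points) are illustrative assumptions.

```python
# Minimal sketch of trace analysis: dense cells -> areas of interest,
# first-visit time -> explanation order, dwell time -> degree of interest.
# All names and thresholds are assumptions, not taken from the paper.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class TracePoint:
    x: float   # pen-tip x coordinate (pixels)
    y: float   # pen-tip y coordinate (pixels)
    t: float   # timestamp (seconds)

def areas_of_interest(trace, cell_size=64, min_points=10):
    """Bin trace points into a coarse grid, keep dense cells as areas of
    interest, order them by first visit, and score them by dwell time."""
    cells = defaultdict(list)
    for p in trace:
        cells[(int(p.x // cell_size), int(p.y // cell_size))].append(p)

    regions = []
    for cell, pts in cells.items():
        if len(pts) < min_points:          # sparse cells are not areas of interest
            continue
        pts.sort(key=lambda q: q.t)
        regions.append({
            "cell": cell,
            "first_visit": pts[0].t,           # drives the explanation order
            "dwell": pts[-1].t - pts[0].t,     # rough time spent on this area
        })

    regions.sort(key=lambda r: r["first_visit"])   # follow the pen-tip trajectory
    total_dwell = sum(r["dwell"] for r in regions) or 1.0
    for r in regions:
        r["interest"] = r["dwell"] / total_dwell   # proxy for degree of interest
    return regions
```

Under these assumptions, calling areas_of_interest on a list of TracePoint samples returns the regions in the order they should be described, each with a normalized interest score that could, for example, be mapped to the length of the corresponding sentence generated by the diffusion language model.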

Cite this article as:
S. Hirano and I. Kobayashi, “Interactive Image Caption Generation Reflecting User Intent from Trace Using a Diffusion Language Model,” J. Adv. Comput. Intell. Intell. Inform., Vol.29 No.6, pp. 1417-1426, 2025.
