
JACIII Vol.26 No.6 pp. 995-1003 (2022)
doi: 10.20965/jaciii.2022.p0995

Paper:

Multimodal Analogy-Based Image Retrieval by Improving Semantic Embeddings

Kosuke Ota, Keiichiro Shirai, Hidetoshi Miyao, and Minoru Maruyama

Department of Electrical and Computer Engineering, Shinshu University
4-17-1 Wakasato, Nagano 380-8553, Japan

Received: September 8, 2020
Accepted: July 16, 2022
Published: November 20, 2022
Keywords: machine vision and scene understanding, natural language processing, deep learning, siamese network
Abstract

In this work, we study the application of multimodal analogical reasoning to image retrieval. Multimodal analogy questions are given in the form of tuples of words and images, e.g., “cat”:“dog”::[an image of a cat sitting on a bench]:?, to search for an image of a dog sitting on a bench. Retrieving the desired images from such tuples can be seen as the task of finding images whose relation to the query image is close to the relation between the query words. One way to achieve this task is to build a common vector space that exhibits analogical regularities. To learn such an embedding, we propose a quadruple neural network called the multimodal siamese network. The network consists of recurrent neural networks and convolutional neural networks based on the siamese architecture. We also introduce an effective procedure to generate analogy examples from an image-caption dataset for training our network. In our experiments, we test our model on analogy-based image retrieval tasks. The results show that our method outperforms the previous work in a qualitative evaluation.

Image retrieval based on multimodal analogy using tuples of an image and words
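
To make the retrieval rule concrete, here is a minimal sketch (in Python/NumPy) of one way such a learned common embedding space can be queried for an analogy “a”:“b”::[image c]:? — offset the query image embedding by the word-pair difference and rank gallery images by cosine similarity. The function and variable names below are illustrative assumptions for this sketch, not the authors' implementation.

    import numpy as np

    def analogy_retrieve(word_a, word_b, img_c, gallery, k=5):
        """Answer the analogy word_a : word_b :: img_c : ? by nearest-neighbor search.

        word_a, word_b : 1-D word embeddings in the shared space (hypothetical encoder output)
        img_c          : 1-D embedding of the query image in the same space
        gallery        : (N, d) matrix of candidate image embeddings
        Returns the indices of the k most similar gallery images.
        """
        # Shift the image embedding by the word-pair offset,
        # e.g. "cat" -> "dog" applied to a cat-sitting-on-a-bench image.
        query = img_c + (word_b - word_a)
        query = query / (np.linalg.norm(query) + 1e-12)

        # Cosine similarity against L2-normalized gallery embeddings.
        gallery_n = gallery / (np.linalg.norm(gallery, axis=1, keepdims=True) + 1e-12)
        scores = gallery_n @ query
        return np.argsort(-scores)[:k]

For the running example in the abstract, word_a and word_b would be the embeddings of “cat” and “dog” and img_c the embedding of the cat-on-a-bench photo; if the space exhibits the desired analogical regularity, the top-ranked gallery images should depict a dog sitting on a bench.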

Cite this article as:
K. Ota, K. Shirai, H. Miyao, and M. Maruyama, “Multimodal Analogy-Based Image Retrieval by Improving Semantic Embeddings,” J. Adv. Comput. Intell. Intell. Inform., Vol.26 No.6, pp. 995-1003, 2022.
References
[1] R. Kiros, R. Salakhutdinov, and R. S. Zemel, “Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models,” arXiv:1411.2539, 2014.
[2] A. Karpathy and F.-F. Li, “Deep Visual-Semantic Alignments for Generating Image Descriptions,” Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pp. 3128-3137, 2015.
[3] H. Nam, J.-W. Ha, and J. Kim, “Dual Attention Networks for Multimodal Reasoning and Matching,” Proc. of the IEEE Conf. on CVPR, pp. 2156-2164, 2017.
[4] L. Zhou, Y. Zhou, J. J. Corso, R. Socher, and C. Xiong, “End-to-End Dense Video Captioning with Masked Transformer,” Proc. of the IEEE Conf. on CVPR, pp. 8739-8748, 2018.
[5] S. Herdade, A. Kappeler, K. Boakye, and J. Soares, “Image Captioning: Transforming Objects into Words,” Advances in Neural Information Processing Systems 32 (NeurIPS 2019), Article No.5963, 2019.
[6] M. Cornia, M. Stefanini, L. Baraldi, and R. Cucchiara, “Meshed-Memory Transformer for Image Captioning,” Proc. of the IEEE Conf. on CVPR, pp. 10578-10587, 2020.
[7] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, and T. Mikolov, “DeViSE: A Deep Visual-Semantic Embedding Model,” Advances in Neural Information Processing Systems 26 (NIPS 2013), Article No.1048, 2013.
[8] M. Norouzi, T. Mikolov, S. Bengio, Y. Singer, J. Shlens, A. Frome, G. Corrado, and J. Dean, “Zero-Shot Learning by Convex Combination of Semantic Embeddings,” arXiv:1312.5650, 2013.
[9] L. Wang, Y. Li, and S. Lazebnik, “Learning Deep Structure-Preserving Image-Text Embeddings,” Proc. of the IEEE Conf. on CVPR, pp. 5005-5013, 2016.
[10] Z. Niu, M. Zhou, L. Wang, X. Gao, and G. Hua, “Hierarchical Multimodal LSTM for Dense Visual-Semantic Embedding,” Proc. of the IEEE Int. Conf. on Computer Vision (ICCV), pp. 1881-1889, 2017.
[11] F. Faghri, D. J. Fleet, J. R. Kiros, and S. Fidler, “VSE++: Improved Visual-Semantic Embeddings,” arXiv:1707.05612v1, 2017.
[12] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N. Metaxas, “StackGAN: Text to Photo-Realistic Image Synthesis with Stacked Generative Adversarial Networks,” Proc. of the IEEE ICCV, pp. 5907-5915, 2017.
[13] T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan, X. Huang, and X. He, “AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks,” Proc. of the IEEE Conf. on CVPR, pp. 1316-1324, 2018.
[14] A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, and I. Sutskever, “Zero-Shot Text-to-Image Generation,” arXiv:2102.12092, 2021.
[15] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed Representations of Words and Phrases and Their Compositionality,” Advances in Neural Information Processing Systems 26 (NIPS 2013), Article No.1421, 2013.
[16] J. Pennington, R. Socher, and C. D. Manning, “GloVe: Global Vectors for Word Representation,” Proc. of the 2014 Conf. on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532-1543, 2014.
[17] S. J. Hwang, K. Grauman, and F. Sha, “Analogy-Preserving Semantic Embedding for Visual Object Categorization,” Proc. of the 30th Int. Conf. on Machine Learning (ICML), Vol.28, No.3, pp. 639-647, 2013.
[18] F. Sadeghi, C. L. Zitnick, and A. Farhadi, “Visalogy: Answering Visual Analogy Questions,” Advances in Neural Information Processing Systems 28 (NIPS 2015), Article No.1152, 2015.
[19] X. Wang, K. M. Kitani, and M. Hebert, “Contextual Visual Similarity,” arXiv:1612.02534, 2016.
[20] J. Peyre, J. Sivic, I. Laptev, and C. Schmid, “Detecting Unseen Visual Relations Using Analogies,” Proc. of the IEEE ICCV, pp. 1981-1990, 2019.
[21] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah, “Signature Verification Using a ‘Siamese’ Time Delay Neural Network,” Advances in Neural Information Processing Systems 6 (NIPS 1993), Article No.769, 1993.
[22] K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition,” arXiv:1409.1556, 2014.
[23] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling,” arXiv:1412.3555, 2014.
[24] T.-Y. Lin, M. Maire, S. J. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common Objects in Context,” Computer Vision – European Conf. on Computer Vision (ECCV) 2014, pp. 740-755, 2014.
[25] “GitHub – ryankiros/visual-semantic-embedding.” https://github.com/ryankiros/visual-semantic-embedding [accessed April 1, 2018]
[26] D. P. Kingma and J. L. Ba, “Adam: A Method for Stochastic Optimization,” Int. Conf. on Learning Representations (ICLR 2015), Poster Presentations, 2015.
[27] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is All You Need,” Advances in Neural Information Processing Systems 30 (NIPS 2017), Article No.3058, 2017.
[28] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” Proc. of the IEEE Conf. on CVPR, pp. 770-778, 2016.
[29] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale,” Proc. of ICLR 2021, 2021.
