JACIII Vol.26 No.6 pp. 995-1003
doi: 10.20965/jaciii.2022.p0995


Multimodal Analogy-Based Image Retrieval by Improving Semantic Embeddings

Kosuke Ota, Keiichiro Shirai, Hidetoshi Miyao, and Minoru Maruyama

Department of Electrical and Computer Engineering, Shinshu University
4-17-1 Wakasato, Nagano 380-8553, Japan

Received: September 8, 2020
Accepted: July 16, 2022
Published: November 20, 2022
Keywords: machine vision and scene understanding, natural language processing, deep learning, siamese network
Image retrieval by multimodal analogy, given a tuple of an image and words

In this work, we study the application of multimodal analogical reasoning to image retrieval. Multimodal analogy questions are given in the form of tuples of words and images, e.g., “cat”:“dog”::[an image of a cat sitting on a bench]:?, to search for an image of a dog sitting on a bench. Retrieving the desired images from such a tuple can be seen as the task of finding images whose relation to the query image is close to the relation between the query words. One way to achieve this is to build a common vector space that exhibits analogical regularities. To learn such an embedding, we propose a quadruple neural network called a multimodal siamese network. The network consists of recurrent neural networks and convolutional neural networks arranged in a siamese architecture. We also introduce an effective procedure for generating analogy examples from an image-caption dataset to train our network. In our experiments, we test our model on analogy-based image retrieval tasks. The results show that our method outperforms the previous work in qualitative evaluation.
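
As a rough illustration of retrieval in a common vector space with analogical regularities, the following minimal sketch ranks candidate images by how close they lie to the query image shifted by the word offset. It is not the authors' network or scoring function: the vector-offset rule, the cosine ranking, the function name `analogy_retrieve`, and the hand-made 3-dimensional embeddings are all assumptions chosen for illustration. In practice the vectors would come from trained text and image encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def analogy_retrieve(word_a, word_b, query_image, candidates, top_k=1):
    """Answer word_a : word_b :: query_image : ? by the vector-offset rule.

    Assumption: words and images already live in one shared embedding space.
    The target point is query_image + (word_b - word_a); candidates are
    ranked by cosine similarity to that target.
    """
    target = [q + (b - a) for q, a, b in zip(query_image, word_a, word_b)]
    ranked = sorted(
        candidates.items(),
        key=lambda item: cosine(target, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:top_k]]


# Hypothetical toy embeddings: axis 0 ~ "cat", axis 1 ~ "dog", axis 2 ~ "bench scene".
words = {"cat": [1.0, 0.0, 0.0], "dog": [0.0, 1.0, 0.0]}
query_image = [1.0, 0.0, 1.0]  # stands in for "a cat sitting on a bench"
candidates = {
    "dog_on_bench": [0.0, 1.0, 1.0],
    "cat_on_sofa": [1.0, 0.0, 0.2],
    "dog_in_park": [0.0, 1.0, 0.1],
}

result = analogy_retrieve(words["cat"], words["dog"], query_image, candidates)
print(result)
```

With these toy vectors the shifted query coincides with the embedding of the dog-on-bench image, so that candidate ranks first. Real embeddings only approximate such regularities, which is why the embedding is learned from analogy examples.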

Cite this article as:
K. Ota, K. Shirai, H. Miyao, and M. Maruyama, “Multimodal Analogy-Based Image Retrieval by Improving Semantic Embeddings,” J. Adv. Comput. Intell. Intell. Inform., Vol.26, No.6, pp. 995-1003, 2022.
