Joint Audio-Visual Tracking Based on Dynamically Weighted Linear Combination of Probability State Density
Masaru Tsuchida*,**, Takahito Kawanishi*, Hiroshi Murase*,***, and Shigeru Takagi*
*NTT Communication Science Laboratories, NTT Corporation, 3-1, Morinosato-Wakamiya, Atsugi 243-0198, Japan
**Currently, NTT-DATA Corporation, Kayabacho Tower Bldg., 1-21-2 Shinkawa, Chuo-ku, Tokyo 104-0033, Japan
***Currently, Graduate School of Information Science, Nagoya University, Furo-cho, Chigusa-ku, Nagoya 464-8603, Japan
This paper proposes a method for stable, continuous speaker tracking using visual and audio information, even when the input is interrupted by occlusion or by disturbances such as noise or varying illumination. In this method, the position of a speaker is expressed as a likelihood distribution obtained by integrating visual and audio information. First, the visual and audio information is integrated as a weighted linear combination of the probability density distributions estimated from the visual and audio observations. Each weight is a variable that varies in proportion to the maximum value of the probability density distribution obtained for the corresponding type of information. Next, a weighted linear combination of this integrated result and the past distribution is computed, and the result is taken as the likelihood distribution of the speaker's position. Changing the weights dynamically makes it possible to select or emphasize either type of information as needed and, accordingly, to track stably and continuously even when the speaker cannot be detected momentarily because of occlusion, voice interruption, or noise. We conducted a series of speaker-tracking experiments using a circular microphone array and an omni-directional camera. The experiments confirmed that speakers can be tracked stably and continuously in spite of occlusion or voice interruption.
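The two-stage combination described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the speaker position is discretized into a small number of direction bins and that each density is a plain list over those bins. The function names, the temporal mixing factor `alpha`, and the equal-weight fallback used when both densities are zero are hypothetical choices made for illustration. The abstract states only that the observation weights vary in proportion to the maximum of each density.

```python
def normalize(density):
    """Scale a non-negative list so it sums to 1 (uniform if it sums to 0)."""
    total = sum(density)
    if total <= 0.0:
        return [1.0 / len(density)] * len(density)
    return [value / total for value in density]


def dynamic_weights(p_visual, p_audio):
    """Weights proportional to the peak of each density, normalized to sum to 1."""
    peak_visual = max(p_visual)
    peak_audio = max(p_audio)
    total = peak_visual + peak_audio
    if total <= 0.0:
        # Assumption: with no evidence from either modality, weight them equally.
        return 0.5, 0.5
    return peak_visual / total, peak_audio / total


def fuse_observations(p_visual, p_audio):
    """Weighted linear combination of the visual and audio densities."""
    w_visual, w_audio = dynamic_weights(p_visual, p_audio)
    return [w_visual * v + w_audio * a for v, a in zip(p_visual, p_audio)]


def update_likelihood(previous, p_visual, p_audio, alpha=0.7):
    """Blend the fused observation with the past distribution.

    alpha is a hypothetical fixed mixing factor; the abstract does not
    specify how the temporal weight is chosen.
    """
    fused = fuse_observations(p_visual, p_audio)
    blended = [alpha * f + (1.0 - alpha) * q for f, q in zip(fused, previous)]
    return normalize(blended)


if __name__ == "__main__":
    previous = [0.25, 0.25, 0.25, 0.25]   # no prior knowledge of position
    p_visual = [0.0, 0.1, 0.8, 0.1]       # sharp visual peak in bin 2
    p_audio = [0.25, 0.25, 0.25, 0.25]    # uninformative audio (e.g. silence)
    print(dynamic_weights(p_visual, p_audio))
    print(update_likelihood(previous, p_visual, p_audio))
```

In this sketch a sharply peaked density receives the larger weight, so an uninformative modality contributes little to the fused result. When both observations vanish, the update falls back on the past distribution, which corresponds to the behavior the abstract describes for momentary occlusion or voice interruption.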