Audio-Visual Speaker Localization Using Graphical Models





Akash Kushal, Mandar Rahurkar, Li Fei-Fei, Jean Ponce, Thomas Huang
University of Illinois, Urbana-Champaign

Abstract

In this work we propose an approach that combines audio and video modalities for person tracking using graphical models. We demonstrate a principled and intuitive framework for combining these modalities to obtain robustness against occlusion and changes in appearance. We further exploit the temporal correlations that exist for a moving object between adjacent frames to handle cases where having both modalities might still not be enough, e.g., when the person being tracked is occluded and not speaking. The improvement in tracking results is shown at each step and compared with manually annotated ground truth.

[Figure 1. Experimental setup: a speaker at angle θ in front of two microphones, Mic1 and Mic2, with a camera between them.]

1 Introduction

Multi-modal information fusion is an important problem in multimedia. The challenge is to combine different modalities so that they have a synergistic effect. There has been substantial work on tracking moving objects using video, e.g., [2, 8, 7]. Multiple microphones have also been used to estimate the position of a speaker, e.g., [3]. Depending on the position of the person, the sound reaches one microphone before the other, and thus the signals received by the two microphones are displaced by some number of samples τ (see the sketch below). However, the problem of using these modalities together is relatively new. Garg et al. [6] address the speaker detection problem by combining multiple features, such as skin color and lip motion, using a probabilistic approach. The particle filtering approach of [2] was extended by Vermaak et al. [9] to include audio by modeling the cross-correlations between the signals received by a microphone array.
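Concretely, such a delay can be exposed by cross-correlating the two microphone signals. The following is a minimal illustrative sketch of that idea in NumPy; the function name, search range, and toy signals are our own, and this is the classical time-difference-of-arrival intuition rather than the inference procedure used in this paper, which treats τ as a latent variable in a generative model:

```python
import numpy as np

def estimate_delay(x1, x2, max_delay=20):
    """Return the shift d (in samples) that maximizes the circular
    cross-correlation between x1 and x2 shifted back by d."""
    delays = np.arange(-max_delay, max_delay + 1)
    scores = [np.dot(x1, np.roll(x2, -d)) for d in delays]
    return int(delays[np.argmax(scores)])

# Toy check: x2 is an attenuated, circularly delayed, noisy copy of x1.
rng = np.random.default_rng(0)
x1 = rng.standard_normal(4400)                     # one frame of audio (N = 4400)
x2 = 0.8 * np.roll(x1, 7) + 0.1 * rng.standard_normal(4400)
print(estimate_delay(x1, x2))                      # prints 7
```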
Our focus, however, is on using generative graphical models to solve this problem in a systematic manner. This paper builds on the work of Beal et al. [1] on audio-visual tracking and makes the following contributions. First, the video model proposed in [1] represents the foreground by a subset of pixels with constant appearance, translating together against a background of random noise throughout the video sequence. In the results shown in [1], the target person occupies a large portion of the image and the people in the background are moving around, which adds randomness to the background appearance; hence, the model of [1] successfully locks onto the target person. This is not the case in most scenes, where the background appearance remains more or less constant; there, the video model of [1] locks onto the background instead of the target person. We propose an alternative video model that explicitly accounts for the background appearance and also explicitly models the intermittent occlusion of the speaker in the video. Second, the model in [1] treats the data as a bag of frames and does not exploit the strong correlation among the positions of the moving person in consecutive frames. We model this correlation and propose a dynamic programming algorithm to determine the most likely path of the moving person in the video (sketched below).
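The dynamic program itself is developed later in the paper; purely to illustrate the idea, a Viterbi-style recursion over discretized positions could look like the sketch below, where the per-frame log-likelihood table and the linear motion penalty are assumptions of this example:

```python
import numpy as np

def most_likely_path(log_lik, motion_penalty=1.0):
    """Viterbi-style DP. log_lik is an (F, L) array of per-frame
    log-likelihoods over L discretized positions; consecutive frames
    pay an assumed smoothness cost proportional to the position change."""
    F, L = log_lik.shape
    pos = np.arange(L)
    trans = -motion_penalty * np.abs(pos[:, None] - pos[None, :])  # (L_prev, L)
    score = log_lik[0].copy()
    back = np.zeros((F, L), dtype=int)
    for f in range(1, F):
        cand = score[:, None] + trans       # score of each (predecessor, position) pair
        back[f] = np.argmax(cand, axis=0)   # best predecessor for each position
        score = cand[back[f], pos] + log_lik[f]
    path = [int(np.argmax(score))]          # backtrack from the best end state
    for f in range(F - 1, 0, -1):
        path.append(int(back[f, path[-1]]))
    return path[::-1]
```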
2 A Generative Model for Audio-Video Data

Our experimental setup (shown in Fig. 1) is similar to that of [1]. Two microphones with a sampling rate of 44 kHz are placed about 30 cm apart, and a camera in the center captures 120×160 video frames at a rate of 10 fps. Figure 2 shows the proposed audio-visual graphical model. This section describes the audio, video, and linkage parameters of the proposed model.

Audio: The variables x1 and x2 represent the audio signals observed at microphones 1 and 2, respectively. Both signals are partitioned into disjoint parts, one for each video frame. The number of samples N in each part is the ratio of the audio sampling rate to the video frame rate (in our case N = 4400). Af represents the true audio signal corresponding to frame f. The observation x1 at the left microphone is generated by adding zero-mean random Gaussian noise with precision matrix (inverse covariance matrix) ν to Af. The observed signal x2 at the right microphone is generated by shifting Af by a discrete sample delay τ, multiplying it by the relative attenuation ratio λ between the two microphones, and again adding zero-mean random Gaussian noise with precision matrix ν to the result. We use a circular shift in our implementation to simplify computation, and we model ν as a diagonal matrix with a constant diagonal.
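A minimal sketch of this generative process, assuming ν is the constant-diagonal precision described above (i.e., i.i.d. Gaussian noise with variance 1/ν) and using NumPy's np.roll for the circular shift; the parameter values are illustrative:

```python
import numpy as np

FS, FPS = 44000, 10          # 44 kHz audio, 10 fps video
N = FS // FPS                # samples per video frame: 4400

def generate_observations(A_f, tau, lam, nu, rng):
    """Sample the two microphone signals for one frame, given the true
    signal A_f (length N), delay tau, attenuation lam, and precision nu."""
    sigma = 1.0 / np.sqrt(nu)                        # noise std dev from precision
    x1 = A_f + sigma * rng.standard_normal(N)        # left mic: signal + noise
    x2 = lam * np.roll(A_f, tau) + sigma * rng.standard_normal(N)  # right mic
    return x1, x2

rng = np.random.default_rng(1)
A_f = rng.standard_normal(N)                         # stand-in for the true signal
x1, x2 = generate_observations(A_f, tau=7, lam=0.8, nu=100.0, rng=rng)
```

Inference then inverts this process: given x1 and x2, the posterior over τ concentrates on the delay (and λ on the attenuation) that best aligns the two observations.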