Yang et al. EURASIP Journal on Advances in Signal Processing 2012, 2012:47
http://asp.eurasipjournals.com/content/2012/1/47
RESEARCH Open Access
Low-dimensional representation of Gaussian mixture model supervector for language recognition
Jinchao Yang*, Xiang Zhang, Hongbin Suo, Li Lu, Jianping Zhang and Yonghong Yan
* Correspondence: superyoungking@163.com
Key Laboratory of Speech Acoustics and Content Understanding, Chinese Academy of Sciences, Beijing, P.R. China
Abstract
In this article, we propose a new feature for the framework of SVM-based language recognition, introducing the idea of total variability, originally used in speaker recognition, into language recognition. We regard the new feature as a low-dimensional representation of the Gaussian mixture model supervector. On this basis, we propose a multiple total variability (MTV) language recognition system built on the total variability (TV) language recognition system. Our experiments show that the total factor vector carries language-dependent information, and that the multiple total factor vector carries even more of it. Experimental results on the 2007 National Institute of Standards and Technology (NIST) Language Recognition Evaluation (LRE) databases show that MTV outperforms TV on the 30 s tasks, and that both the TV and MTV systems achieve performance comparable to state-of-the-art approaches. The best performance of our acoustic language recognition systems can be further improved by combining these two new systems.
Keywords: language recognition, total variability (TV), multiple total variability (MTV), support vector machine, linear discriminant analysis, locality preserving projection
© 2012 Yang et al; licensee Springer. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1 Introduction
The aim of language recognition is to determine the language spoken in a given segment of speech. It is generally believed that phonotactic and spectral features provide complementary cues to each other [1,2]. Phone recognizer followed by language models (PRLM) and parallel PRLM (PPRLM) approaches, which use phonotactic information, have shown very successful performance [2,3]. The acoustic method, which uses spectral features, has the advantage that it requires no specialized language knowledge and is computationally simple. This article focuses on the acoustic component of language recognition systems. The spectral features of speech are collected as independent vectors: they are extracted as shifted-delta-cepstral acoustic features and then modeled by a Gaussian mixture model (GMM); the results were reported in [4]. The approach was further improved by discriminative training, known as maximum mutual information (MMI) training. Several studies use the support vector machine (SVM) in language recognition to form the GMM-SVM system [5,6]. In language recognition evaluations, MMI and GMM-SVM are the primary acoustic systems.
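The shifted-delta-cepstral (SDC) features mentioned above stack several delta-cepstral blocks taken at fixed shifts ahead of the current frame. This excerpt does not state the configuration the authors used, so the sketch below assumes the common N-d-P-k = 7-1-3-7 setup; the function name and array layout are ours, for illustration only.

    import numpy as np

    def sdc(cepstra, N=7, d=1, P=3, k=7):
        # Shifted-delta-cepstra: for each frame t, stack k delta blocks,
        # block i being c(t + i*P + d) - c(t + i*P - d) over the first N
        # cepstral coefficients. 7-1-3-7 yields 49-dimensional vectors.
        T = cepstra.shape[0]
        c = cepstra[:, :N]
        pad = d + (k - 1) * P                 # context needed on each side
        cp = np.pad(c, ((pad, pad), (0, 0)), mode="edge")
        blocks = []
        for i in range(k):
            off = pad + i * P
            blocks.append(cp[off + d : off + d + T] - cp[off - d : off - d + T])
        return np.hstack(blocks)              # shape (T, N * k)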
Recently, the total variability approach has been proposed in speaker recognition [7,8]; it uses factor analysis to define a new low-dimensional space named the total variability space. In contrast to classical joint factor analysis (JFA), the speaker and the channel variability are contained simultaneously in this new space, and intersession compensation can be carried out in the low-dimensional space.

In fact, the total variability approach can be considered a classical application of probabilistic principal component analysis (PPCA) [9]. The factor analysis of the total variability approach obtains useful information by reducing the dimension of the space of GMM supervectors; that is, all utterances can in fact be well represented in a low-dimensional space. We believe useful language information can be obtained by a similar front-end process. Therefore, we try to introduce the idea of total variability to language recognition. We estimate the language total variability space by using the dataset shown in Section 5, and a given target language's entire set of utterances is regarded as belonging to different languages. Then, the total factor vector is extracted by projecting an utterance onto the language total variability space. As in speaker recognition, intersession compensation can also be performed well on the low-dimensional total factor vector. In our experiments, two intersession compensation techniques, linear discriminant analysis (LDA) [6] and locality preserving projection (LPP) [10-12], are used to improve the performance of language recognition.

In some previous studies [13,14], rich information was obtained by using multiple reference models, such as male and female gender-dependent models in speaker recognition. Generally, there are abundant data for each target language in language recognition, and the number of target languages is limited. Based on the TV language recognition system [12,15], we propose the MTV system, in which language-dependent GMMs are used instead of the universal background model (UBM) in the process of language total variability space estimation and total factor vector extraction. Our experiments show that the total factor vector (TV system) includes language-dependent information; what is more, the multiple total factor vector (MTV system) contains more language-dependent information.

This article is organized as follows: In Section 2, we give a brief review of total variability, support vector machines, and compensation of channel factors. In Section 3, we apply total variability to language recognition. In Section 4, the proposed language recognition system is presented in detail. Corpora and evaluation are given in Section 5. Section 6 gives the experimental results. Finally, we conclude in Section 7.

2 Background
2.1 Total variability in speaker recognition
In speaker recognition, unlike classical joint factor analysis (JFA), the total variability approach defines a new low-dimensional space, named the total variability space, which contains the speaker and the channel variability simultaneously. The total variability approach thus relaxes the independence assumption between the speaker and channel variability spaces made in JFA speaker recognition [16].

For a given utterance, the speaker- and channel-variability-dependent GMM supervector is denoted in Equation (1):

M = m_{ubm} + Tw    (1)

where m_{ubm} is the UBM supervector, T is the total variability space, and the members of the vector w are the total factors. We believe useful language information can be obtained by a similar front-end process; thus, we try to apply total variability in language recognition.
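To make Equation (1) concrete, the sketch below shows the standard point estimate of w from an utterance's Baum-Welch statistics, as used in the total variability literature the authors cite [7,8]; this excerpt does not spell out the formula, and the variable names and shapes are our own.

    import numpy as np

    def extract_total_factor(T_mat, Sigma, N, F, means):
        # Point estimate of w in M = m_ubm + T w (Equation (1)).
        # T_mat : (C*D, R) total variability matrix
        # Sigma : (C*D,) diagonal UBM covariance supervector
        # N     : (C,) zeroth-order Baum-Welch statistics per mixture
        # F     : (C, D) first-order Baum-Welch statistics per mixture
        # means : (C, D) UBM component means
        C, D = F.shape
        R = T_mat.shape[1]
        F_c = (F - N[:, None] * means).reshape(C * D)  # centre stats on UBM means
        N_sup = np.repeat(N, D)                        # expand N to supervector size
        TtSi = T_mat.T / Sigma                         # T' Sigma^{-1}
        L = np.eye(R) + (TtSi * N_sup) @ T_mat         # posterior precision
        return np.linalg.solve(L, TtSi @ F_c)          # w, an R-dimensional vector

The LDA or LPP intersession compensation mentioned in the introduction then reduces to multiplying w by a learned projection matrix before classification.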
2.2 Support vector machines
An SVM [17] is used as the classifier after our proposed front-end process in the language recognition system. An SVM is a two-class classifier constructed from sums of a kernel function K(·,·):

f(x) = \sum_{i=1}^{N} \alpha_i t_i K(x, x_i) + d    (2)

where N is the number of support vectors, t_i is the ideal output, and \alpha_i is the weight for the support vector x_i, with \alpha_i > 0 and \sum_{i=1}^{N} \alpha_i t_i = 0. The ideal outputs are either 1 or -1, depending on whether the corresponding support vector belongs to class 0 or class 1. For classification, a class decision is based on whether the value f(x) is above or below a threshold.
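Equation (2) can be evaluated directly once the support vectors, weights, and bias are known; a minimal sketch, with the kernel choice and names ours, for illustration:

    import numpy as np

    def svm_decision(x, svs, alphas, targets, d, kernel):
        # f(x) = sum_i alpha_i * t_i * K(x, x_i) + d   (Equation (2))
        return sum(a * t * kernel(x, sv)
                   for a, t, sv in zip(alphas, targets, svs)) + d

    # Example: a linear kernel and a threshold-at-zero class decision.
    linear = lambda u, v: float(np.dot(u, v))
    # label = 1 if svm_decision(x, svs, alphas, targets, d, linear) > 0 else 0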
2.3 Compensation of channel factors
Compensating for the variability caused by changes in speaker, channel, gender, and environment is key to the performance of automatic language recognition systems. In our proposed front-end process, an intersession compensation technique in the spectral feature domain is still adopted; it was proposed for speaker and language recognition in [18,19]. The adapted feature vector \hat{o}^{(i)}(t) is obtained by subtracting from the original observation feature a weighted sum of the intersession compensation offset values:

\hat{o}^{(i)}(t) = o^{(i)}(t) - \sum_{m} \gamma_m(t) U_m y^{(i)}    (3)

where \gamma_m(t) is the Gaussian posterior probability of Gaussian mixture m of the universal background model (UBM) for a given frame of an utterance. U_m and y^{(i)} describe the intersession compensation related to the mth Gaussian of the UBM: U_m is the intersession subspace and y^{(i)} is the channel factor vector. In our proposed language recognition system, we use the spectral features after compensation of channel factors.
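A vectorized sketch of Equation (3), with shapes and names of our own choosing; the U_m matrices and the channel factor vector y^{(i)} are assumed to have been estimated beforehand, as in [18,19]:

    import numpy as np

    def compensate_features(frames, posteriors, U, y):
        # Equation (3): o_hat(t) = o(t) - sum_m gamma_m(t) * U_m * y
        # frames     : (T, D) observation features o(t)
        # posteriors : (T, C) UBM posteriors gamma_m(t) per frame
        # U          : (C, D, R) per-mixture intersession subspaces U_m
        # y          : (R,) channel factor vector y^{(i)} of the utterance
        offsets = U @ y                       # (C, D): offset U_m y per mixture m
        return frames - posteriors @ offsets  # posterior-weighted subtraction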
3 Applying total variability in language recognition
There is only one difference between total variability space T estimation and eigenvoice space estimation in speaker recognition [8,20]. All the recordings of a speaker are considered to belong to the same person in the eigenvoice estimation. However, in the total variability space estimation, a given speaker's entire set of utterances is regarded as having been produced by different speakers. In the MTV system, the decomposition of Equation (1) is written with a language-dependent GMM in place of the UBM; for Mandarin, for example:

M_{mandarin} = m_{mandarin} + T_{mandarin} w_{mandarin}    (4)
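A schematic sketch of this MTV front-end, reusing extract_total_factor from the sketch in Section 2.1; the per-language model container and statistics function are hypothetical stand-ins for whatever the full system uses:

    def extract_mtv_vectors(utterance, lang_models, stats_fn):
        # One total factor vector per language-dependent GMM (Equation (4)),
        # e.g. w_mandarin from (T_mandarin, m_mandarin).
        # lang_models : {lang: (T_mat, Sigma, means)} trained per target language
        # stats_fn    : computes Baum-Welch stats (N, F) of the utterance
        #               against that language's GMM
        vectors = {}
        for lang, (T_mat, Sigma, means) in lang_models.items():
            N, F = stats_fn(utterance, lang)
            vectors[lang] = extract_total_factor(T_mat, Sigma, N, F, means)
        return vectors  # compensated (LDA/LPP) and then scored by the SVM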
