Investigations on linear transformations for speaker adaptation and normalization [Elektronische Ressource] / von Michael Pitz

rheinisch-westfalischen_technischen_hochschule_-rwth-_aachen - Michael Pitz

172 pages

English

Investigations on Linear Transformations
for Speaker Adaptation and
Normalization
Dissertation approved by the Faculty of Mathematics, Computer Science and
Natural Sciences of the Rheinisch-Westfälische Technische Hochschule Aachen
for the award of the academic degree of Doctor of Natural Sciences
by
Diplom-Physiker Michael Pitz
from
Aachen
Reviewers: Universitätsprofessor Dr.-Ing. Hermann Ney
Professor Dr. Christian Wellekens
Date of the oral examination: 14 March 2005
This dissertation is available online on the web pages of the university library.

Two things are necessary for our work: tireless perseverance and the
willingness to throw away something in which one has invested much
time and effort.
Albert Einstein

Acknowledgments
First I would like to thank my supervisor Prof. Dr.-Ing. Hermann Ney, head of the
Lehrstuhl für Informatik VI at the RWTH Aachen, for the opportunity to realize
this work as part of your team. You introduced me to the exciting field of pattern
recognition in general and speech recognition in particular. You allowed me great
latitude to pursue my ideas and followed them with great interest. I would also like
to thank you for the numerous interesting and enlightening discussions we had.
I am also grateful to my second supervisor Prof. Christian Wellekens, who is with
the Multimedia Communications Department of Institut Eurecom, France, for your
interest in my work, the in-depth reading of this thesis and the valuable comments.
Stephan Kanthak, you have been an enormous help in many computer problems
and difficult debugging sessions. I always admired your deep insight into computer
technology, Linux and C++. Besides that, we had many funny talks about the
world and his brother.
Ralf Schlüter, I am grateful for our discussions and numerous sessions at the
whiteboard, which gave me a deeper insight into speech recognition and helped to
solve a couple of problems.
Oliver Bender, Michael Motter, Stefan Koltermann, Mirko Kohns, Achim Sixtus
and Klaus Macherey, you kept the computers running and patiently dealt with all
my requests.
I always very much enjoyed the relaxing time at lunch and coffee breaks with the
“Geigeltruppe” Achim, Andras, Frank, Nicola, Ralf, Sirko, Sonja, and Stephan.
Thanks to all current and former colleagues of the Lehrstuhl für Informatik VI for
the motivating atmosphere, many interesting discussions and also much laughter.
I want to express a very special thank-you to my girlfriend Beate. You had an
important part in the success of this thesis. Without you, life would be less wonderful.
Last but not least, I would especially like to thank my parents. You have always
followed my path, encouraged me and supported me.
This work was partially funded by the European Commission under the Human
Language Technologies project CORETEX (IST-1999-11876), and by the DFG
(Deutsche Forschungsgemeinschaft) under contract NE 572/4-1 and NE 572/4-3.

Abstract
This thesis deals with linear transformations at various stages of the automatic
speech recognition process.
In current state-of-the-art speech recognition systems, linear transformations are
widely used to compensate for a potential mismatch between the training and testing
data and thus enhance the recognition performance. A large number of approaches
have been proposed in the literature, though the connections between them have
been disregarded so far. By developing a unified mathematical framework, close
relationships between the particular approaches are identified and analyzed in detail.
Mel frequency cepstral coefficients (MFCC) are commonly used features for
automatic speech recognition systems. The traditional way of computing MFCCs suffers
from a twofold smoothing, which complicates both the MFCC computation and the
system optimization. An improved approach is developed that does not use any
filter bank and thus avoids the twofold smoothing. This integrated approach allows
a very compact implementation and needs fewer parameters to be optimized.
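The abstract does not spell out the integrated computation, so the following is only a minimal sketch of one plausible reading: the mel warping is folded into the cosine transform and applied directly to the log power spectrum, with no filter bank in between. The mel formula, the trapezoidal weights and the names `mel` and `mel_cepstrum` are assumptions made for illustration and are not taken from the thesis.

```python
import math

def mel(freq_hz):
    """A common mel-scale formula (assumption; the thesis may use another variant)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def mel_cepstrum(log_power, sample_rate, num_coeffs):
    """Cepstral coefficients computed directly on a log power spectrum.

    log_power[n] is the log power at frequency n * nyquist / (len(log_power) - 1).
    No filter bank is used: the warping enters only through the cosine argument
    and the integration weights.
    """
    bins = len(log_power)
    nyquist = sample_rate / 2.0
    # Map every spectral bin to a mel-warped angle in [0, pi].
    theta = [math.pi * mel(n * nyquist / (bins - 1)) / mel(nyquist)
             for n in range(bins)]
    # Trapezoidal integration weights on the warped axis (they sum to pi).
    weights = []
    for n in range(bins):
        left = theta[n] - theta[n - 1] if n > 0 else 0.0
        right = theta[n + 1] - theta[n] if n < bins - 1 else 0.0
        weights.append(0.5 * (left + right))
    # c_k = (1/pi) * integral of log_power * cos(k * theta) over the warped axis.
    return [
        sum(s * math.cos(k * t) * w for s, t, w in zip(log_power, theta, weights))
        / math.pi
        for k in range(num_coeffs)
    ]
```

In this sketch the only free choices are the warping function and the number of coefficients, which is in the spirit of the claim that fewer parameters need to be optimized than with an explicit filter bank.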
Starting from this new computation scheme for MFCCs, it is proven analytically
that vocal tract normalization (VTN) equals a linear transformation in cepstral
space for arbitrary invertible warping functions. The transformation matrix for
VTN is calculated explicitly, by way of example, for three commonly used warping
functions. Based on some general characteristics of typical VTN warping functions,
a common structure of the transformation matrix is derived that is almost
independent of the specific functional form of the warping function. By expressing
VTN as a linear transformation it is possible, for the first time, to take the Jacobian
determinant of the transformation into account for any warping function. The effect
of considering the Jacobian determinant on the warping factor estimation is studied
in detail.
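The linearity claim can be checked numerically. The sketch below assumes a log spectrum written as S(w) = c_0 + 2 * sum_k c_k cos(k w), so that warping the frequency axis with a function psi gives new coefficients c'_n = sum_k A[n][k] c_k with A[n][k] = (s_k / pi) * integral over [0, pi] of cos(n w) cos(k psi(w)) dw, where s_0 = 1 and s_k = 2 otherwise. The normalization, the direction of the warping, the piecewise linear example and all function names are assumptions made for illustration, not the thesis's notation.

```python
import math

def piecewise_linear_warp(omega, alpha, omega0=0.875 * math.pi):
    """Piecewise linear warping of [0, pi] onto itself.

    Assumes 0 < alpha < pi / omega0 so that the mapping stays invertible.
    """
    if omega <= omega0:
        return alpha * omega
    # The second segment joins (omega0, alpha * omega0) to (pi, pi).
    slope = (math.pi - alpha * omega0) / (math.pi - omega0)
    return alpha * omega0 + slope * (omega - omega0)

def vtn_matrix(warp, size, grid=2048):
    """Matrix A that maps cepstral coefficients to those of the warped spectrum."""
    step = math.pi / grid
    omegas = [(j + 0.5) * step for j in range(grid)]  # midpoint rule
    warped = [warp(w) for w in omegas]
    matrix = []
    for n in range(size):
        row = []
        for k in range(size):
            scale = 1.0 if k == 0 else 2.0
            integral = step * sum(
                math.cos(n * w) * math.cos(k * v) for w, v in zip(omegas, warped)
            )
            row.append(scale * integral / math.pi)
        matrix.append(row)
    return matrix

def cosine_coefficients(spectrum, size, grid=2048):
    """c_n = (1/pi) * integral over [0, pi] of spectrum(w) * cos(n w) dw."""
    step = math.pi / grid
    omegas = [(j + 0.5) * step for j in range(grid)]
    return [
        step * sum(spectrum(w) * math.cos(n * w) for w in omegas) / math.pi
        for n in range(size)
    ]
```

For the identity warping the matrix reduces to the unit matrix, and for any other warping the product of `vtn_matrix(...)` with a coefficient vector agrees with `cosine_coefficients` applied to the warped spectrum. Because the transformation is an explicit matrix, its determinant is available for any warping function, which is what makes the Jacobian term accessible.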
The second part of this thesis deals with a special linear transformation for speaker
adaptation, the Maximum Likelihood Linear Regression (MLLR) approach. Based
on the close interrelationship between MLLR and VTN proven in the first part, the
general structure of the VTN matrix is adopted to restrict the MLLR matrix to a
band structure, which significantly improves the MLLR adaptation when only
limited adaptation data are available.
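To illustrate what a band restriction means in practice, the following minimal sketch estimates a mean transformation mu' = A mu + b in which A[i][j] is forced to zero for |i - j| > bandwidth. It makes strong simplifying assumptions that are not from the thesis: a single regression class, a hard alignment of each observation to one model mean, and unit variances, so that the maximum likelihood estimate reduces to a row-wise least-squares fit. The names `solve` and `estimate_band_mllr` are hypothetical.

```python
def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def estimate_band_mllr(means, observations, bandwidth):
    """Estimate (A, b) with A restricted to a band around the diagonal.

    means[t] is the model mean aligned to observations[t] (hard alignment,
    unit variances assumed). Each row of A is fitted independently and only
    uses the mean components inside the band, plus an offset.
    """
    dim = len(means[0])
    A = [[0.0] * dim for _ in range(dim)]
    b = [0.0] * dim
    for i in range(dim):
        cols = [j for j in range(dim) if abs(i - j) <= bandwidth]
        # Regressor: offset term followed by the in-band mean components.
        feats = [[1.0] + [mu[j] for j in cols] for mu in means]
        p = len(cols) + 1
        gram = [[sum(f[a] * f[c] for f in feats) for c in range(p)] for a in range(p)]
        target = [sum(f[a] * o[i] for f, o in zip(feats, observations)) for a in range(p)]
        w = solve(gram, target)
        b[i] = w[0]
        for j, value in zip(cols, w[1:]):
            A[i][j] = value
    return A, b
```

With dimension D and bandwidth B, each row has at most 2B + 2 free parameters instead of D + 1, so far fewer aligned observations are needed before the normal equations become well determined. This is the intuition behind the improvement for limited adaptation data.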
Finally, several enhancements to MLLR speaker adaptation are discussed. One
deals with refined definitions of regression classes, which is of special importance for
fast adaptation when only limited adaptation data are available. Another enhancement
makes use of confidence measures to compensate for recognition errors that decrease
the adaptation performance in the first pass of a two-pass adaptation process.

Summary
This thesis is concerned with linear transformations at various points of the
automatic speech recognition process.
In modern automatic speech recognition systems, linear transformations are a
popular means of counteracting a mismatch between training and test data and
thus improving recognition performance. A large number of approaches have been
proposed in the literature, but the connections between them have so far been
neglected. By developing a unified mathematical description, close relationships
between the individual approaches are identified and examined in detail.
Mel frequency cepstral coefficients (MFCC) are very frequently used as features
in automatic speech recognition systems. The usual approach to computing MFCCs,
however, involves a twofold smoothing, which complicates both the MFCC
computation and the parameter optimization. An improved approach is presented
that dispenses with a filter bank and thus avoids the twofold smoothing. This
integrated approach allows a very compact implementation and requires fewer
parameters to be optimized.
Starting from this new method for computing MFCCs, it is shown analytically
that vocal tract length normalization (VTN) can be represented as a linear
transformation in cepstral space for arbitrary invertible warping functions. The
transformation matrix for VTN is calculated explicitly, by way of example, for three
frequently used warping functions. Based on some general properties of typical VTN
warping functions, a common structure of the transformation matrices is derived
that is largely independent of the functional form of the warping function. Because
VTN can be expressed as a linear transformation, it becomes possible for the first
time to take the Jacobian determinant of the transformation into account for
arbitrary warping functions. The effects of taking the Jacobian determinant into
account in the warping factor estimation are examined in detail.
The second part of this thesis deals with a special linear transformation for
speaker adaptation, the Maximum Likelihood Linear Regression (MLLR) approach.
Based on the close relationship between MLLR and VTN shown in the first part, the
general form of the VTN matrix is carried over to the MLLR matrix in order to
restrict it to a band structure. This considerably improves MLLR adaptation,
especially when only little adaptation data is available.
Finally, several improvements to speaker adaptation by means of MLLR are
presented. One extension aims at an improved definition of the regression classes,
which is of particular importance for fast adaptation with little adaptation data.
A further improvement uses confidence measures to counteract a degradation of
adaptation performance caused by recognition errors in the first pass of a
multi-pass adaptation process.

Published by	rheinisch-westfalischen_technischen_hochschule_-rwth-_aachen
Published on	1 January 2005
Number of reads	16
Language	English
File size	1 MB

Computer science

Mathematics