Discriminative Training and Acoustic Modeling for Automatic Speech Recognition [electronic resource] / submitted by Wolfgang Macherey
215 pages
English

Extract

Discriminative Training and
Acoustic Modeling for Automatic
Speech Recognition
A dissertation accepted by the Faculty of Mathematics, Computer Science, and Natural Sciences
of RWTH Aachen University in fulfillment of the requirements for the degree of
Doctor of Natural Sciences
submitted by
Diplom-Informatiker Wolfgang Macherey
from
Düren-Birkesdorf
Reviewers: Universitätsprofessor Dr.-Ing. Hermann Ney
           Universitätsprofessor Dr.-Ing. Gerhard Rigoll
Date of the oral examination: 9 March 2010
This dissertation is available online on the web pages of the university library.

Poets say science takes away from the beauty of the stars – mere
globs of gas atoms. I, too, can see the stars on a desert night, and
feel them. The vastness of the heavens stretches my imagination –
stuck on this carousel my little eye can catch one-million-year-old
light. It does not hurt the mystery to know a little about it.
Richard P. Feynman
Nobel Prize Laureate in Physics, 1965

Acknowledgments
At this point, I would like to express my gratitude to all the people who supported and
accompanied me throughout the course of this work.
First, I would like to thank my advisor, Univ.-Prof. Dr.-Ing. Hermann Ney, head of the Chair
of Computer Science 6 at RWTH Aachen University. This thesis would not have been possible
without his advice, continuous interest, and support.
I would also like to thank my second reader, Univ.-Prof. Dr.-Ing. habil. Gerhard Rigoll, for
agreeing to review this thesis and for his great interest in this work.
I am also very grateful to my mentor, Dr. rer. nat. Ralf Schlüter, who introduced me to the
complex and fascinating field of discriminative training during my time as an undergraduate at
RWTH Aachen University.
All the people at the Chair of Computer Science 6 deserve my gratitude for many fruitful
discussions, helpful feedback, and for the very good working atmosphere. I want to thank all those
who helped me while I was writing this thesis by proofreading it and requesting clarifications.
Furthermore, I would like to thank the secretaries, and the system administrators for their
continuous support.
Great thanks go to my parents, Ria and Peter Macherey, who made my studies of computer
science possible. Furthermore, I would like to thank my family and friends for providing a
pleasant balance to work.

To Ria and Peter Macherey

Abstract
Discriminative training has become an important means for estimating model parameters in many
statistical pattern recognition tasks. While standard learning methods based on the Maximum
Likelihood criterion estimate the model parameters of each class individually, discriminative
approaches benefit from taking all competing classes into account, leading to enhanced class
separability, which is often accompanied by reduced error rates and improved system performance.
Motivated by learning algorithms that evolved from neural networks, discriminative methods have
become established as training methods for classification problems as complex as automatic
speech recognition.
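To make this contrast concrete, the two kinds of criteria can be sketched as follows (the notation
is a hedged paraphrase, not quoted from the thesis). Given training utterances X_r with spoken
word sequences W_r, Maximum Likelihood training considers each class in isolation, whereas the
MMI criterion, for example, maximizes the posterior probability of the spoken sentence, so that
all competing word sequences W enter the optimization through the denominator:

F_{\mathrm{ML}}(\theta) = \sum_{r=1}^{R} \log p_\theta(X_r \mid W_r),
\qquad
F_{\mathrm{MMI}}(\theta) = \sum_{r=1}^{R} \log
  \frac{p(W_r)\, p_\theta(X_r \mid W_r)}{\sum_{W} p(W)\, p_\theta(X_r \mid W)}.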
In this thesis, an extended unifying approach for a class of discriminative training criteria is
suggested that, in addition to the Maximum Mutual Information (MMI) criterion and the Minimum
Classification Error (MCE) criterion, also captures other, more recently proposed criteria such as
the Minimum Word Error (MWE) criterion and the closely related Minimum Phone Error (MPE)
criterion. The new approach allows a large number of different training criteria to be investigated
within a single framework and thus yields consistent analytical and experimental results about
their training behavior and recognition performance.
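A hedged sketch of what such a unification can look like (the exact definitions in the thesis may
differ): each criterion is written as a smoothing function f applied to an accuracy-weighted sum
of sentence posteriors,

F(\theta) = \sum_{r=1}^{R} f\Big( \sum_{W} p_\theta(W \mid X_r)\, A(W, W_r) \Big),

where choosing the accuracy function A(W, W_r) as the Kronecker delta \delta(W, W_r) together
with f = \log recovers the MMI criterion, a sigmoid-shaped f yields an MCE-style criterion, and
replacing \delta with the word or phone accuracy of W with respect to W_r (with f the identity)
leads to the MWE and MPE criteria, respectively.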
This thesis also presents the first successful implementation of large-scale, lattice-based MCE
training. Experiments conducted on several speech recognition corpora show that the MCE
criterion yields recognition results that are similar to, and in some cases better than, those
obtained with both the MWE and the MPE criterion.
The parameter optimization problem is discussed for Gaussian mixture models whose
covariance matrices can be subject to arbitrary tying schemes. The re-estimation equations, as
well as the choice of the iteration constants that control the convergence rate, are discussed for
the cases of full and diagonal covariance matrices. In the case of full covariance matrices,
the problem of choosing the iteration constants in the Extended Baum (EB) algorithm is shown
to result in the solution of a quadratic eigenvalue problem. Two novel methods for setting the
iteration constants are proposed that provide faster convergence rates across different variance
tying schemes.
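For orientation, a standard form of the EB updates for a Gaussian with mean \mu and covariance
\Sigma reads as follows (a hedged sketch in generic notation: \gamma denotes accumulated
numerator and denominator occupancies, \Gamma(x) and \Gamma(xx^\top) the correspondingly
weighted first- and second-order observation sums, and D the iteration constant):

\hat{\mu} = \frac{\Gamma_{\mathrm{num}}(x) - \Gamma_{\mathrm{den}}(x) + D\,\mu}
                 {\gamma_{\mathrm{num}} - \gamma_{\mathrm{den}} + D},
\qquad
\hat{\Sigma} = \frac{\Gamma_{\mathrm{num}}(xx^\top) - \Gamma_{\mathrm{den}}(xx^\top)
                     + D\,(\Sigma + \mu\mu^\top)}
                    {\gamma_{\mathrm{num}} - \gamma_{\mathrm{den}} + D}
               - \hat{\mu}\hat{\mu}^\top.

D must be chosen large enough that \hat{\Sigma} remains positive definite; for full covariance
matrices this admissibility condition is what leads to the quadratic eigenvalue problem mentioned
above.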
This thesis also suggests a novel framework that models the posterior distribution directly as a
log-linear model. The direct model follows the principle of Maximum Entropy and can effectively
be trained using the Generalized Iterative Scaling (GIS) algorithm. Both the direct model and its
optimization via the GIS algorithm are compared analytically and experimentally with the MMI
criterion and the EB algorithm.
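As an illustration of the optimization procedure, the following is a minimal GIS sketch for a
generic log-linear posterior model p_\lambda(k \mid x) \propto \exp(\sum_d \lambda_d f_d(x, k)).
The data layout, the feature functions, and the handling of the constant C are illustrative choices
of this sketch, not the setup used in the thesis:

    import numpy as np

    def gis(features, labels, n_iter=100):
        """Generalized Iterative Scaling for a log-linear posterior model.

        features: (N, K, D) array of non-negative feature values f_d(x_n, k).
        labels:   (N,) array with the index of the correct class per sample.
        Returns the weight vector lambda of shape (D,).
        """
        N, K, D = features.shape
        # GIS assumes the features of every event sum to a constant C; taking
        # the maximum sum (an implicit slack feature) is a common shortcut.
        C = features.sum(axis=2).max()
        lam = np.zeros(D)
        # Empirical feature expectation over the correct classes; GIS requires
        # every feature to have a non-zero empirical count.
        emp = features[np.arange(N), labels].mean(axis=0)
        for _ in range(n_iter):
            scores = features @ lam                      # (N, K) class scores
            scores -= scores.max(axis=1, keepdims=True)  # numerical stability
            post = np.exp(scores)
            post /= post.sum(axis=1, keepdims=True)      # class posteriors
            # Feature expectation under the current model.
            mod = (post[:, :, None] * features).sum(axis=1).mean(axis=0)
            lam += np.log(emp / mod) / C                 # GIS update
        return lam

Each iteration scales every weight by the ratio of empirical to model feature expectation, which
is the defining multiplicative update of GIS.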
Finally, this thesis presents a novel algorithm to efficiently compute and represent the exact
and unsmoothed error surface over all sentence hypotheses encoded in a word lattice when the
parameter settings of a log-linear model are varied along an arbitrary line in the parameter space.
While the number of sentence hypotheses encoded in a word lattice is exponential in the lattice
size, the complexity of the error surface is shown to be linearly bounded in the number of lattice
arcs. This bound is independent of the underlying error metric.
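The geometric core of such an algorithm can be sketched as follows. Along a line \lambda +
\gamma\, d in parameter space, every hypothesis score of a log-linear model is an affine function
a + b\gamma of the step size \gamma, so the best hypothesis as a function of \gamma is given by
the upper envelope of a set of lines, and the induced error surface is piecewise constant. The
sketch below computes this envelope for an explicitly listed set of hypotheses (an N-best list);
the thesis's algorithm obtains the same result for all hypotheses encoded in a lattice without
enumerating them. Names and data layout are illustrative:

    from typing import List, Tuple

    def error_surface(hyps: List[Tuple[float, float, int]]) -> List[Tuple[float, int]]:
        """Exact error surface along a line in parameter space.

        Each hypothesis is (a, b, e): its score at step size gamma is
        a + b * gamma and its error count is e.  Returns (gamma, e) pairs:
        from each gamma up to the next boundary, the best hypothesis has error e.
        """
        # Sort by slope, then intercept, and sweep to build the upper envelope.
        hyps = sorted(hyps, key=lambda h: (h[1], h[0]))
        env = []  # entries: [gamma_from, (a, b, e)]
        for a, b, e in hyps:
            x = float("-inf")
            dominated = False
            while env:
                g0, (a0, b0, _) = env[-1]
                if b == b0:
                    if a <= a0:          # parallel with lower score: never best
                        dominated = True
                        break
                    env.pop()            # parallel with higher score: replace
                    continue
                x = (a0 - a) / (b - b0)  # where the new line overtakes the top
                if x > g0:
                    break                # the top line keeps a region; stop
                env.pop()                # the top line is never best; discard
            if dominated:
                continue
            if not env:
                x = float("-inf")
            env.append([x, (a, b, e)])
        return [(g, h[2]) for g, h in env]

Reading off, for each segment, the error of the dominating hypothesis yields the exact unsmoothed
error as a function of \gamma, which can then be minimized by inspecting the finitely many
segment boundaries.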
Experiments were conducted on several standardized speech recognition tasks that cover
different levels of difficulty, ranging from elementary digit recognition (SieTill) through read
speech (Wall Street Journal and North American Business news texts) to broadcast news
transcription tasks (Hub-4). Questions pursued in this context address the effect that different
variance tying schemes have on the recognition performance, and the extent to which increasing
the model complexity affects the performance gain of the discriminative training procedure. All
experiments were carried out in the extended, unifying approach for a large number of different
training criteria.

Zusammenfassung (Summary)
Discriminative learning methods have become an important instrument for parameter estimation
in many pattern recognition tasks. While conventional methods based on the Maximum Likelihood
principle estimate the model parameters only class-individually, discriminative methods also take
the training data of competing classes into account and thus lead to improved class separability,
which is often reflected in lower error rates. Motivated by learning methods developed in the
field of neural networks, discriminative methods have meanwhile become established as training
procedures for classification problems as complex as automatic speech recognition.
In this work, an extended, unifying approach for a class of discriminative training criteria is
presented that, in addition to the Maximum Mutual Information criterion and the Minimum
Classification Error criterion, comprises further criteria such as the Minimum Word Error
criterion and the closely related Minimum Phone Error criterion. The new approach makes it
possible to represent these and numerous further criteria within a single theory and thus to
arrive at clear statements in the theoretical as well as experimental performance analysis.
This work furthermore presents the first successful implementation of a purely word-lattice-based
MCE training for large vocabularies. Experiments carried out on numerous speech recognition
corpora show that the performance achieved with the MCE criterion is of the same order as the
error rates obtained with the MWE and the MPE criterion, and can even surpass them.
The parameter optimization problem is formulated for hidden Markov models with Gaussian
mixture densities, where the re-estimation equations as well as the choice of the iteration
constants are discussed for the cases that either full or diagonal covariance matrices are used.
The covariance matrices can enter the mixture densities of arbitrary states as a shared parameter
(so-called variance tying schemes). Specifically for the case of full covariance matrices, it is
shown that the choice of the iteration constants in the Extended Baum (EB) algorithm can be
reduced to the solution of a quadratic eigenvalue problem. Two new methods for choosing the
iteration constants are proposed that, independently of the variance tying scheme used, lead to
a faster convergence rate than is possible with, for example, the traditional EB algorithm.
Beyond this, this work presents a new approach that enables a direct description of the posterior
distribution by means of a log-linear model.
