A log-linear discriminative modeling framework for speech recognition [Electronic resource] / submitted by Georg Heigold
210 pages
English



Information

Published: 01 January 2010
Language: English

Excerpt

A Log-Linear Discriminative Modeling Framework for
Speech Recognition

Dissertation approved by the Faculty of Mathematics, Computer Science, and
Natural Sciences of RWTH Aachen University (Rheinisch-Westfälische Technische
Hochschule Aachen) for the academic degree of Doctor of Natural Sciences

submitted by
Diplom-Physiker Georg Heigold
from
Lucerne, Switzerland

Referees:
Professor Dr.-Ing. Hermann Ney
Professor Dr. Dietrich Klakow

Date of the oral examination: 29 June 2010

This dissertation is available online on the websites of the university library.

Acknowledgments
At this point, I would like to express my gratitude to all the people who supported and
accompanied me during the progress of this work. In particular, I would like to thank:
Prof. Dr.-Ing. Hermann Ney for the opportunity to do research in this interesting and
challenging area. This work would not have been possible without his continuous interest,
advice, and support.
Prof. Dr. Dietrich Klakow from Saarland University, Germany, for kindly taking on the role
of co-referee for this thesis.
Dr. rer. nat. Ralf Schlüter for the introduction to speech recognition and discriminative training,
and his continuous constructive advice.
Patrick Lehnen and Stefan Hahn for the introduction to part-of-speech tagging and their
assistance with the experiments.
Thomas Deselaers and Philippe Dreuw for their support with the experiments in handwriting
recognition.
Muhammad Ali Tahir for performing the experiments with the discriminative feature transforms.
Christian Gollan, Thomas Deselaers, Björn Hoffmeister, Patrick Lehnen, Wolfgang Macherey,
András Zolnay, and all other people from the Chair of Computer Science 6 for the interesting
discussions on various speech recognition-related topics.
Oliver Bender, Thomas Deselaers, Mirko Kohns, Stefan Koltermann, Christian Plahl, and David
Rybach for their excellent support with the computing equipment without which I could not
have done so many experiments.
Stefan Hahn, Björn Hoffmeister, Patrick Lehnen, Markus Nußbaum, Christian Plahl, Muhammad Tahir, and Simon Wiesler for the proofreading.
Volker Steinbiß, Gisela Gillmann, Jessica Kikum, Annette Kopp, Renate Linzenich, Ira Storms,
and Andreas Wergen for their support in financial issues.
Annette, Frederik, Thierry, Rebekka, and Christoph for their encouragement in the evenings and
at the weekends.
This work was partly funded by the European Commission under the integrated projects
TC-STAR (FP6-506738) and LUNA (FP6-033549). It was partly realized as part of the
Quaero Programme, funded by OSEO, the French state agency for innovation, and it is
partly based upon work supported by the Defense Advanced Research Projects Agency
(DARPA) under Contract No. HR001-06-C-0023. Any opinions, findings, and conclusions
or recommendations expressed in this material are those of the author(s) and do not necessarily
reflect the views of DARPA.
Abstract
Conventional speech recognition systems are based on Gaussian hidden Markov models
(HMMs). Discriminative techniques such as log-linear modeling have been investigated in
speech recognition only recently. This thesis establishes a log-linear modeling framework in the
context of discriminative training criteria, with examples from continuous speech recognition,
part-of-speech tagging, and handwriting recognition. The focus will be on the theoretical and
experimental comparison of different training algorithms.
Equivalence relations for Gaussian and log-linear models in speech recognition are derived.
It is shown how to incorporate a margin term into conventional discriminative training criteria
such as minimum phone error (MPE). This permits a direct evaluation of the utility
of the margin concept for string recognition. The equivalence relations and the margin-based
training criteria lead to a unified view of three major training paradigms, namely Gaussian
HMMs, log-linear models, and support vector machines (SVMs). Generalized iterative scaling
(GIS) is traditionally used for the optimization of log-linear models with the maximum mutual
information (MMI) criterion. This thesis suggests an extension of GIS to log-linear models
including hidden variables, and to other training criteria (e.g. MPE). Finally, investigations on
convex optimization in speech recognition are presented. Experimental results are provided
for a variety of tasks, including the European Parliament plenary sessions task and Mandarin
broadcasts.
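The equivalence relations mentioned above can be illustrated with a small numerical check: Gaussian class-conditional models with a shared (pooled) covariance induce exactly a log-linear (softmax) posterior. The sketch below is a toy, single-observation illustration with assumed dimensions and random parameters, not the thesis's general derivation.

```python
import numpy as np

# Toy check: Gaussian class-conditionals with a shared covariance yield the
# same class posterior as a log-linear model with parameters
#   lambda_c = Sigma^{-1} mu_c,
#   alpha_c  = -1/2 mu_c^T Sigma^{-1} mu_c + log p(c).
# All dimensions and values here are illustrative assumptions.
rng = np.random.default_rng(0)
D, C = 3, 4                                 # feature dimension, number of classes
mu = rng.normal(size=(C, D))                # class means
A = rng.normal(size=(D, D))
Sigma = A @ A.T + D * np.eye(D)             # shared covariance (positive definite)
prior = np.full(C, 1.0 / C)                 # uniform class prior
x = rng.normal(size=D)                      # a single observation

# Generative route: Bayes' rule with Gaussian class-conditionals.
# Class-independent terms (the Gaussian normalizer and -1/2 x^T Sigma^{-1} x)
# cancel in the normalization, so they can be dropped.
Sinv = np.linalg.inv(Sigma)
log_joint = np.array(
    [np.log(prior[c]) - 0.5 * (x - mu[c]) @ Sinv @ (x - mu[c]) for c in range(C)]
)
post_gauss = np.exp(log_joint - np.logaddexp.reduce(log_joint))

# Log-linear route: scores are affine in x.
lam = mu @ Sinv                             # row c holds (Sigma^{-1} mu_c)^T
alpha = -0.5 * np.einsum('cd,cd->c', lam, mu) + np.log(prior)
scores = lam @ x + alpha
post_loglin = np.exp(scores - np.logaddexp.reduce(scores))
```

The two posteriors agree to numerical precision, which is the single-observation version of the equivalence: once the covariance is pooled, the generative parameterization is no richer than the log-linear one.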
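The GIS updates referred to above can likewise be sketched on a toy log-linear classifier trained with the MMI (maximum conditional likelihood) criterion. Classical GIS requires non-negative features with a constant sum, enforced here with a slack feature; the data, the feature map, and the smoothing constant `eps` are illustrative assumptions, and the hidden-variable and MPE extensions proposed in the thesis are not shown.

```python
import numpy as np

# Toy GIS sketch for a log-linear classifier under the MMI criterion.
# Data and feature map are illustrative assumptions, not the thesis's setup.
rng = np.random.default_rng(1)
N, D, C = 40, 5, 3                          # samples, feature dim, classes
X = rng.integers(0, 2, size=(N, D)).astype(float)   # binary feature vectors
y = rng.integers(0, C, size=N)              # toy labels

F = D + 1.0                                 # constant feature total via slack

def features(x, c):
    """Class-dependent feature map; the slack entry makes the sum equal F."""
    phi = np.zeros(C * D + 1)
    phi[c * D:(c + 1) * D] = x
    phi[-1] = F - x.sum()
    return phi

Phi = np.array([[features(x, c) for c in range(C)] for x in X])  # (N, C, K)
lam = np.zeros(C * D + 1)                   # log-linear parameters

def posteriors(lam):
    s = Phi @ lam                           # (N, C) class scores
    s = s - s.max(axis=1, keepdims=True)    # numerical stability
    p = np.exp(s)
    return p / p.sum(axis=1, keepdims=True)

def cond_log_lik(lam):
    return np.log(posteriors(lam)[np.arange(N), y]).sum()

obs = Phi[np.arange(N), y].sum(axis=0)      # observed feature counts
eps = 1e-8                                  # guards log(0); an assumption
ll_start = cond_log_lik(lam)
for _ in range(50):
    p = posteriors(lam)
    expected = np.einsum('nc,nck->k', p, Phi)    # expected counts under model
    lam = lam + np.log((obs + eps) / (expected + eps)) / F
ll_end = cond_log_lik(lam)
```

Each update lambda_i += (1/F) log(obs_i / expected_i) is the classical Darroch-Ratcliff step, which under these feature constraints monotonically increases the conditional log-likelihood.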
Zusammenfassung
Conventional speech recognition systems are based on Gaussian HMMs. Discriminative
techniques such as log-linear modeling have only recently been investigated in speech
recognition. This dissertation introduces a log-linear formalism in the context of
discriminative training criteria, with examples from continuous speech recognition,
part-of-speech tagging, and handwriting recognition. The theoretical and experimental
comparison of different training algorithms forms the focus of this work.
Equivalence relations for Gaussian and log-linear models in speech recognition are
derived. It is shown how a margin term can be incorporated into conventional
discriminative training criteria such as minimum phone error (MPE), which allows the
utility of the margin concept for string recognition to be measured directly. The
equivalence relations and the margin-based training criteria lead to a unification of
three important training paradigms (Gaussian HMMs, log-linear models, and support
vector machines (SVMs)). Generalized iterative scaling (GIS) is traditionally employed
to optimize log-linear models with the maximum mutual information (MMI) criterion.
This dissertation proposes an extension of GIS to log-linear models with hidden
variables and to other training criteria (for example, MPE). Finally, convex
optimization in speech recognition is investigated. Experimental results are presented
for a variety of tasks, including the European Parliament plenary sessions task and
Mandarin broadcasts.

Contents
1 Introduction
1.1 Statistical Speech Recognition
1.1.1 Signal analysis/feature extraction
1.1.2 Acoustic modeling
1.1.3 Language modeling
1.1.4 Search
1.2 Discriminative Techniques: State of the Art
1.2.1 Discriminative training criteria
1.2.2 Transducer-based discriminative training
1.2.3 Discriminative models & parameterization
1.2.4 Equivalence relations for generative and log-linear models
1.2.5 Generalization ability
1.2.6 Numerical optimization
2 Scientific Goals
3 A Transducer-Based Discriminative Framework
3.1 Weighted Finite-State Transducers (WFSTs)
3.1.1 WFSTs
3.1.2 Semirings
3.1.3 Algorithms
3.2 Word Lattices
3.3 Unified Training Criterion
3.4 Gradient of Unified Training Criterion
3.5 Efficient Calculation of N-th Order Statistics
3.6 Transducer-Based Implementation
3.7 Error Metrics
3.7.1 Hamming distance
3.7.2 Edit distance between two strings
3.7.3 Edit distances on WFSTs
3.7.4 Approximate accuracies on WFSTs
3.8 Experimental Results
3.8.1 Comparison of conventional training criteria
3.8.2 Comparison of MWE with approximate and exact word errors
3.8.3 Comparison of optimization algorithms
3.8.4 Generative vs. discriminative training (model complexity)
3.9 Summary
4 Equivalence Relations
4.1 Introduction
4.2 Related Work
4.2.1 Single events: Gaus
