Phrase-based statistical machine translation [electronic resource]: models, search, training / submitted by Richard Zens

167 pages


Subject: Computer science (Informatik)

Published by: Rheinisch-Westfälische Technische Hochschule (RWTH) Aachen
Published: 1 January 2008
Language: English
File size: 1 MB

Phrase-based Statistical Machine Translation:
Models, Search, Training
Dissertation approved by the Faculty of Mathematics, Computer Science and
Natural Sciences of the Rheinisch-Westfälische Technische Hochschule Aachen
for the award of the academic degree of Doctor of Natural Sciences
submitted by
Diplom-Informatiker Richard Zens
from Düren
Reviewers:
Professor Dr.-Ing. Hermann Ney
Professor Dr. Francisco Casacuberta
Date of the oral examination: Friday, 29 February 2008
This dissertation is available online on the website of the university library.

Moderation is a fatal thing. Nothing succeeds like excess.
Oscar Wilde – A Woman of No Importance, Third Act, 1893

Acknowledgments
At this point, I would like to express my gratitude to all the people who supported and
accompanied me during the progress of this work.
First, I would like to thank my advisor Professor Dr.-Ing. Hermann Ney, head of the
Chair of Computer Science 6 at the RWTH Aachen University. This thesis would not
have been possible without his advice, continuous interest and support.
I would also like to thank Professor Dr. Francisco Casacuberta from the Universidad
Politécnica de Valencia for agreeing to review this thesis and for his interest in this work.
All the people at the Chair of Computer Science 6 deserve my gratitude for many fruitful
discussions, helpful feedback, and for the very good working atmosphere. I want to
thank all those who helped me when writing this thesis by proofreading it, pointing out
bad formulations and requesting clarifications. Furthermore, I would like to thank the
secretaries and the system administrators for their continuous support.
I am very thankful for the friendly atmosphere and the support I received at the Advanced
Research Institute International, Kyoto, Japan during my stay in 2003. It was a very
interesting and valuable experience.
I would also like to thank all the people who made the CLSP summer research workshop
on the open-source SMT toolkit "Moses" possible: the organizers from CLSP/JHU and
all members of both teams. It was a productive and fun environment.
My sincere thanks go to my parents, who made it possible for me to study computer science.
Furthermore, I would like to thank my family and friends for providing a pleasant
balance to working life.

Abstract
Machine translation is the task of automatically translating a text from one natural
language into another. In this work, we describe and analyze the phrase-based approach
to statistical machine translation. In any statistical approach to machine translation,
we have to address three problems: the modeling problem, i.e. how to structure the
dependencies of source and target language sentences; the search problem, i.e. how to find
the best translation candidate among all possible target language sentences; the training
problem, i.e. how to estimate the free parameters of the model from the training data.
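As a rough orientation, the three problems can be located in the decision rule that is commonly used in statistical machine translation. The notation below is an assumption borrowed from the standard log-linear formulation and is not quoted from this abstract: $f_1^J$ is the source sentence, $e_1^I$ a target language candidate of length $I$, and $h_m$ are feature functions with scaling factors $\lambda_m$.

```latex
\begin{align*}
\hat{e}_1^{\hat{I}}
  &= \operatorname*{argmax}_{I,\, e_1^I} \; p(e_1^I \mid f_1^J) \\
  &= \operatorname*{argmax}_{I,\, e_1^I} \; \sum_{m=1}^{M} \lambda_m \, h_m(e_1^I, f_1^J)
  \qquad \text{if } p(e_1^I \mid f_1^J) \propto \exp\Big( \sum_{m=1}^{M} \lambda_m \, h_m(e_1^I, f_1^J) \Big)
\end{align*}
```

In this reading, modeling concerns the choice of the feature functions $h_m$, search concerns carrying out the maximization over all target sentences, and training concerns estimating the $\lambda_m$ and the parameters inside each $h_m$.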
We present improved alignment and translation models. The alignment
models improve the alignment quality significantly. We describe several phrase
translation models and analyze their contribution to the overall translation quality.
We formulate the search problem for phrase-based statistical machine translation and
present different search algorithms in detail. We analyze the search and show that it
is important to focus on alternative reorderings, whereas a
small number of lexical alternatives is already sufficient to achieve good translation quality.
The reordering problem in machine translation is difficult for two reasons: first, it is
computationally expensive to explore all possible permutations; second, it is hard to select
a good permutation. We compare different reordering constraints to solve this problem
efficiently and introduce a lexicalized reordering model to find better reorderings.
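To illustrate why exploring all permutations is expensive and how a constraint shrinks the search space, the following minimal sketch counts reorderings of `n` source positions by brute force. The window rule used here (at each step, the next position must be one of the first `k` still-uncovered positions) is an assumed, illustrative constraint in the style of window-based reordering restrictions; it is not claimed to be one of the constraints compared in the thesis. The function names are hypothetical.

```python
from itertools import permutations
from math import factorial


def satisfies_window(order, k):
    """Return True if every chosen position is among the first k uncovered ones.

    `order` is the sequence in which source positions are translated.
    The window rule is an illustrative assumption, not taken from the thesis.
    """
    uncovered = sorted(order)
    for pos in order:
        if uncovered.index(pos) >= k:
            return False
        uncovered.remove(pos)
    return True


def count_reorderings(n, k):
    """Count permutations of n positions that satisfy the window rule (brute force)."""
    return sum(1 for order in permutations(range(n)) if satisfies_window(order, k))


def closed_form(n, k):
    """Closed-form count for 1 <= k <= n: k choices while at least k positions
    remain uncovered, then k-1, k-2, ..., 1 choices for the rest."""
    return k ** (n - k) * factorial(k)


if __name__ == "__main__":
    n = 6
    print("unconstrained:", factorial(n))
    for k in range(1, n + 1):
        print("window", k, "->", count_reorderings(n, k))
```

Without a constraint the count grows as n!, whereas for a fixed window size it grows only exponentially in n. Choosing a good permutation among the remaining ones is a separate problem that counting alone does not address.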
We investigate alternative training criteria for phrase-based statistical machine
translation. In this context, we generalize the known word posterior probabilities to n-gram
posterior probabilities.
The resulting machine translation system achieves state-of-the-art performance on the
large-scale Chinese-English NIST task. Furthermore, the system was ranked first in the
official TC-Star evaluations in 2005, 2006 and 2007 for the Chinese-English broadcast
news speech translation task.

Kurzfassung (Summary)
Machine translation deals with the problem of automatically translating a text from
the source language into the target language. In this work, we describe and analyze
the phrase-based statistical approach to machine translation. In statistical
translation, three problems generally have to be addressed: first, the modeling
problem, i.e. how the dependencies between a source language sentence and the
corresponding target language sentence are described; second, the search problem,
i.e. how the best translation is found among all possible target language sentences;
and third, the training problem, i.e. how the free parameters of the model are
determined.
We describe improved alignment and translation models. The alignment models improve
the alignment quality significantly. Several phrase-based translation models are
described and their contribution to the translation quality is analyzed.
We formulate the search problem for phrase-based statistical translation and describe
various search algorithms in detail. We analyze the search and show that it is
particularly important to consider alternative reorderings. On the other hand, a
relatively small number of lexical alternatives is already sufficient to produce good
translations. The reordering problem in machine translation is difficult for two
reasons: first, testing all permutations is a combinatorial problem, and second, it is
difficult to select a suitable permutation. We compare various constraints for solving
the reordering problem efficiently. Furthermore, we describe a lexicalized reordering
model that helps to select better reorderings.
We investigate various training criteria for phrase-based statistical translation. In
this context, we generalize the known word posterior probabilities to n-gram posterior
probabilities.
The translation quality of the resulting translation system corresponds to the current
state of the art. Furthermore, the system achieved first rank in the official
evaluations of 2005, 2006 and 2007 for the Chinese-English translation task, which were
carried out within the TC-Star project of the European Union.

Contents
1 Introduction
1.1 Machine Translation
1.2 The Statistical Approach to Machine Translation
1.2.1 Bayes decision rule for machine translation
1.2.2 Log-linear model
1.2.3 Phrase-based approach
1.3 Related Work
2 Scientific Goals
3 Improved Word Alignment Models
3.1 Introduction
3.2 Review of Statistical Word Alignment Models
3.3 Related Work
3.4 Symmetrized Lexicon Model
3.4.1 Linear interpolation
3.4.2 Log-linear interpolation
3.4.3 Evidence trimming
3.4.4 Improved training initialization
3.5 Lexicon Smoothing
3.6 State Occupation Probabilities
3.7 Alignment Algorithms
3.8 Word Alignment Experiments
3.8.1 Evaluation criterion
3.8.2 Experimental setup
3.8.3 Lexicon symmetrization
3.8.4 Generalized alignments
3.8.5 Lexicon smoothing
3.8.6 Non-symmetric alignments
3.8.7 Symmetric alignments
3.9 Conclusions
4 Phrase-based Translation
4.1 Motivation
4.2