Statistical machine translation with cascaded probabilistic transducers [electronic resource] / submitted by Stephan Vogel


146 pages

English

Statistical Machine Translation
with Cascaded Probabilistic Transducers
Dissertation approved by the Faculty of Mathematics, Computer Science
and Natural Sciences
of the Rheinisch-Westfälische Technische Hochschule Aachen
for the academic degree of Doctor of Natural Sciences
submitted by
Diplom-Physiker
Stephan Vogel
from Steinfeld
Reviewers: Universitätsprofessor Dr.-Ing. Hermann Ney
Universitätsprofessor Dr. Alex Waibel
Date of the oral examination: 9 December 2005
This dissertation is available online on the website of the university library.
Acknowledgments
Once upon a time ...
Yes, it has been a long, very long time since I started to work on statistical machine
translation and on the work reported in this thesis. First of all, I would like to express my gratitude
to Prof. Dr. Hermann Ney. He gave me the opportunity to get into this exciting research field.
In many discussions he introduced me to the principles and techniques of statistical natural
language processing. His questions helped me to develop my ideas. Above all, he gave an example
of how research should be done. As years go by, I see how much impact this has had on me.
I will always have fond memories of I6, the Lehrstuhl für Informatik at RWTH Aachen. I
had many good discussions with my colleagues there, especially with Sonja, Franz, and Hassan
on translation, and often just a good time, like when playing volleyball or badminton, going on
a caving or rafting tour, or chatting over a cup of coffee. My thanks go to Stefan, Andreas,
Ralf, Frank, Jeannette, Stephan, Klaus, Achim, Sirko, Inge, Christiane, Jörg, Michael and many
more.
I would like to thank Prof. Dr. Alex Waibel, not only for agreeing to be co-reviewer for
this thesis, but also for inviting me to come to Carnegie Mellon University, which allowed me to
continue to work on statistical machine translation.
At some point, finishing this thesis became less and less important to me. There was always
some more interesting work ahead. Luckily, it remained important to others, and they pushed
me until I eventually gave in. I would like to thank them for all the friendly reminders and their
patience. I wish I had done this earlier and better.
My special thanks go to my wife Mary. Without her constant encouragement and support
during all those years, this would not have been possible.
... and he lived happily ever after.
Abstract
Statistical machine translation is based on the idea of extracting information from bilingual
corpora, which can then be used to generate new translations. The current work combines aspects
of example-based machine translation and of grammar-based approaches, especially bilingual
regular grammars, to develop a statistical translation system based on cascaded transducers.
These transducers can be constructed manually, semi-automatically, or, in restricted form,
fully automatically. A training method for these cascaded transducers is developed based on an
extension of the HMM alignment model to the alignment of graphs.
To generate new translations using the trained models, a decoder is needed. This is essentially
a search for the translation with the highest probability. A decoder has been developed which is
based on dynamic programming and which allows for pruning to control runtime. Recombination
of hypotheses can be based on different criteria: coverage of the source word positions, the
most recent target words, the number of generated target words, and any combination thereof.
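The recombination criteria named above can be illustrated with a minimal Python sketch. It is not the decoder developed in the thesis: the `Hypothesis` class, the function names, and the parameters `use_coverage`, `history`, and `use_length` are hypothetical names chosen for illustration. Two hypotheses are treated as equivalent when they agree on the selected criteria, and only the better-scoring one is kept.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Tuple


@dataclass(frozen=True)
class Hypothesis:
    """Partial translation (illustrative structure, not from the thesis)."""
    coverage: FrozenSet[int]   # source word positions already translated
    target: Tuple[str, ...]    # target words generated so far
    log_prob: float            # accumulated model score (higher is better)


def recombination_key(hyp: Hypothesis, use_coverage: bool = True,
                      history: int = 2, use_length: bool = False) -> tuple:
    """Build the equivalence key from the selected criteria."""
    key = []
    if use_coverage:
        key.append(hyp.coverage)
    if history > 0:
        key.append(hyp.target[-history:])   # most recent target words
    if use_length:
        key.append(len(hyp.target))         # number of generated target words
    return tuple(key)


def recombine(hyps: List[Hypothesis], **criteria) -> List[Hypothesis]:
    """Keep only the best-scoring hypothesis for each recombination key."""
    best: Dict[tuple, Hypothesis] = {}
    for hyp in hyps:
        key = recombination_key(hyp, **criteria)
        if key not in best or hyp.log_prob > best[key].log_prob:
            best[key] = hyp
    return list(best.values())
```

Adding criteria makes the key more specific, so fewer hypotheses are merged: the search becomes more exact but also more expensive, which is the trade-off that pruning then has to control.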
Additional aspects covered in this dissertation include:
1. Segmentation of long sentences based on minimizing the perplexity of the underlying word
alignment models.
2. This technique is then extended into a new and robust phrase alignment. To find the
target phrase for a given phrase in a source sentence, the algorithm searches for the segmentation
of the target sentence that gives the highest word alignment probability under the constraints
of the segmentation.
3. The use and integration of manual dictionaries, including the addition of automatically
generated word forms for which probabilities are estimated from the bilingual corpora.
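The phrase alignment idea in item 2 can be sketched in Python as follows. This is only an illustration under stated assumptions: the abstract does not say which word alignment model is used, so the sketch assumes a simple IBM Model 1 style lexicon; the names `span_log_prob`, `align_phrase`, `lexicon`, and `floor` are hypothetical. Source words inside the phrase may align only to target words inside the candidate span, and source words outside the phrase only to target words outside it.

```python
import math
from typing import Dict, List, Optional, Tuple

Lexicon = Dict[Tuple[str, str], float]  # assumed: (source word, target word) -> probability


def span_log_prob(src: List[str], tgt: List[str], lexicon: Lexicon,
                  floor: float = 1e-6) -> float:
    """IBM Model 1 style log-probability of src given tgt (assumed model)."""
    if not src:
        return 0.0
    if not tgt:
        return len(src) * math.log(floor)  # nothing left to align to
    total = 0.0
    for s in src:
        p = sum(lexicon.get((s, t), floor) for t in tgt) / len(tgt)
        total += math.log(p)
    return total


def align_phrase(source: List[str], target: List[str], j1: int, j2: int,
                 lexicon: Lexicon) -> Tuple[Optional[Tuple[int, int]], float]:
    """Find the target span (i1, i2), end-exclusive, for the phrase source[j1:j2]."""
    inside_src = source[j1:j2]
    outside_src = source[:j1] + source[j2:]
    best_span, best_score = None, -math.inf
    for i1 in range(len(target)):
        for i2 in range(i1 + 1, len(target) + 1):
            # The segmentation constrains the alignment: inside aligns to inside,
            # outside aligns to outside.
            score = (span_log_prob(inside_src, target[i1:i2], lexicon)
                     + span_log_prob(outside_src, target[:i1] + target[i2:], lexicon))
            if score > best_score:
                best_span, best_score = (i1, i2), score
    return best_span, best_score
```

Scoring the words outside the phrase as well as those inside is what keeps the search from choosing a span that is too wide: every extra target word pulled into the span is no longer available to explain the rest of the source sentence.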
Experiments are described in which these different methods were tested. Corpora of
different sizes and for different language pairs are used. Cascaded transducers are tested
especially on small corpora, while the word-based phrase alignment is applied to large corpora.
In addition, for the situation of very restricted bilingual data, a comparison is made between the
statistical translation approach and an Interlingua-based translation system, and it is shown
that even in this scenario statistical translation can give comparable translation quality.
Zusammenfassung
Statistical machine translation is based on extracting information from existing bilingual
corpora, from which new translations can be constructed. In this thesis, aspects of
example-based translation (Example-based Machine Translation) and of grammar-based approaches,
in particular bilingual regular grammars, are integrated to develop a statistical translation
method based on cascaded transducers. These transducers can be created manually, semi-manually,
or, in simple form, automatically. A training procedure for the cascaded transducers is
developed by extending the HMM word alignment model to the alignment of graph structures.
To generate new translations with the trained models, a decoder is needed. This is essentially
a search for the translation with the highest probability. A decoder was developed that is based
on dynamic programming and allows pruning to limit runtime. It also allows flexible control over
the recombination of hypotheses: coverage of the words in the source sentence, the most recently
generated target words, and the number of target words can be combined in any way during
recombination.
In addition, the thesis covers the following aspects:
1. Splitting of long sentences based on minimizing the perplexity of the word alignment model
used.
2. This method is extended into a new, powerful, and robust phrase alignment. For a phrase in
the source sentence, the translation in the target sentence is found by searching for the
segmentation of the target sentence that yields the highest word alignment probability, with
the word alignment constrained by the segmentation.
3. The use of manual lexica. In particular, it is described how adding automatically generated
word forms, together with probabilities estimated from bilingual corpora, improves the
translation quality achieved.
The experiments investigate the methods presented. Various corpora of different sizes and for
different language pairs are used. The cascaded transducer method is used in particular for
small corpora, while the word-based phrase alignment is used for the very large corpora. In
addition, for the situation of a very limited amount of data, the statistical approach is
compared with an Interlingua-based translation system, and it is shown that even in this
situation a statistical translation system can reach comparable translation quality.
Contents
1 Introduction
1.1 Statistical Translation
1.1.1 The Bayesian Approach
1.1.2 Basic Alignment Models
1.1.3 Current Systems
1.2 Example Based Translation
1.3 Translation with Transducers
1.3.1 Subsequential Transducer
1.3.2 Head Transducer
1.4 Cascading Finite State Models
1.5 Grammar based Approaches
2 Scientific Goals
3 Sentence Splitting
3.1 Objective Function for Sentence Splitting
3.2 Experiments in Sentence Splitting
4 Phrase Alignment
4.1 Introduction
4.2 Phrase Alignment from Viterbi Paths
4.3 Phrase Alignment via Constrained Word Alignment
4.4 Just-in-Time Phrase Alignment
4.5 Phrase Translation Probabilities