Robust machine translation for multi-domain tasks [electronic resource] / submitted by Oliver Bender

154 pages

English

Published by	Rheinisch-Westfälische Technische Hochschule (RWTH) Aachen
Published on	1 January 2010

Robust Machine Translation
for Multi-Domain Tasks
Dissertation approved by the Faculty of Mathematics, Computer Science and Natural Sciences
of RWTH Aachen University for the award of the academic degree
of Doctor of Natural Sciences
submitted by
Diplom-Informatiker Oliver Bender
from Heinsberg
Reviewers: Prof. Dr.-Ing. Hermann Ney
Prof. Dr. Francisco Casacuberta
Date of the oral examination: 11 March 2010
This dissertation is available online on the website of the university library.

To Gaby

Abstract
In this thesis, we investigate and extend the phrase-based approach to statistical machine
translation. Due to improved concepts and algorithms, the quality of the generated translation
hypotheses has been significantly improved in recent years. Still, the translation quality leaves
a lot to be desired when going beyond traditional translation tasks, such as newswire articles,
and when addressing more ambitious translation problems. We extend the state of the art in
phrase-based translation, which enables us to build a robust translation system for multi-domain
input. Robustness is here regarded as the ability to produce high-quality translations for
arbitrary input texts, e.g. automatic transcriptions of recognized speech or other unstructured,
potentially noisy input. In this work, we focus on Arabic-English translation tasks.
We study the search problem for phrase-based statistical machine translation in detail. For
this, we examine the effect of the different models on the translation quality. Moreover, we
make an explicit distinction between reordering (coverage) and lexical hypotheses in the pruning
process and stress the importance of the coverage pruning to adjust the balance between
hypotheses representing different reorderings (coverage hypotheses) and hypotheses with different
lexical representations. We present constraints to solve the reordering problem in machine
translation.
To tune our translation system for multi-domain input and to improve the robustness built
into the decoder, we apply domain adaptation to the language models and rerank the candidate
translations using appropriate rescoring models. We also present our work on adjusting the
vocabularies of the speech recognizer and the machine translation system in a preprocessing
step and on predicting missing punctuation marks for automatically transcribed speech (in the
actual translation process).
Processing morphologically rich languages such as Arabic generally poses high demands on
preprocessing. We show that the choice of the appropriate preprocessing strategy depends on
the translation domain and on the structure of the input data. Experimental results emphasize
how the proper choice of the preprocessing approach helps to increase the translation quality.
In addition, we address the task of improving the translation quality by means of syntactically
motivated feature functions within a reranking concept. Then, we investigate different data-driven
approaches to the task of transliterating proper names. Often, such names are out-of-vocabulary
terms and the intention is to preserve the names by transliteration. Finally, we show
how human translators can be assisted by machine translation systems. We compare search
strategies for interactive machine translation.
The presented machine translation system achieves state-of-the-art performance and has
been successfully applied to the large-scale Arabic-English GALE translation evaluations.
Furthermore, the system was ranked among the top submissions for the NIST Open Machine
Translation Evaluation 2006 and for the series of IWSLT evaluation campaigns.

Summary
In this thesis, we investigate and extend the phrase-based approach to machine translation.
Thanks to improved concepts and refined algorithms, the quality of the generated translations
has improved considerably in recent years. Nevertheless, translation quality still leaves much
to be desired when moving from traditional tasks, such as the translation of newspaper articles,
to more demanding problems. The goal of this thesis is to improve the current state of the art
in phrase-based translation and to develop a translation system that is robust and supports
multiple domains. The focus is on tasks involving translation from Arabic into English.
By robustness we mean the ability to deliver accurate translations even for transcriptions of
automatically recognized speech and other, potentially noisy, input data.
We describe and analyze the search problem of phrase-based statistical translation in full
detail. To this end, we examine the effect of the individual models on the quality of the
translations. In addition, we make an explicit distinction between reordering (coverage)
hypotheses and lexical hypotheses during pruning. We emphasize the importance of pruning the
coverage hypotheses in order to control the number of hypotheses that represent different word
orders (coverage hypotheses) and different lexical representations. We present constraints
that solve the reordering problem in machine translation.
To adapt our translation system to multiple domains and to improve the robustness of the
system, we adapt the language models to the respective domain. Using suitable models, we score
the hypotheses a second time and update the selected translations. We also present our work on
aligning the vocabularies of the speech recognizer and the translation system and on predicting
punctuation marks that are missing from the automatic transcriptions.
In general, processing morphologically rich languages places special demands on the
preprocessing of the data. We show that the choice of a suitable preprocessing strategy depends
on the domain and on the characteristics of the input data. Experimental studies illustrate how
choosing the right preprocessing method can contribute to improving translation quality.
Furthermore, we address the task of improving translation quality by means of syntactically
motivated feature functions. Another aspect is the investigation of different approaches to the
transliteration of proper names, since these are frequently unknown to the translation system.
Finally, we turn to interactive translation and compare search strategies for use in
interactive systems.
The system described in this thesis achieves results comparable to the best results published
to date. It was successfully used in the GALE evaluations for the Arabic-to-English translation
tasks. Moreover, the system was among the best systems in the "NIST Open Machine Translation
Evaluation 2006" and in a series of IWSLT evaluations.

Acknowledgments
At this point, I would like to express my gratitude to all the people who supported and
accompanied me during the progress of this work.
First, I would like to thank Hermann Ney for supervising me during the last years and for
offering me the chance to do research in this interesting and challenging area. In particular,
I want to thank him for all the opportunities he gave me.
I would also like to thank Francisco Casacuberta from the Universidad Politécnica de Valencia
for agreeing to review this thesis and for his interest in this work.
Next, all my colleagues at our lab deserve my gratitude for many fruitful discussions, helpful
feedback, and for the very good working atmosphere. Some of you became real friends: Sasa,
Thomas, Christian, David, Philippe, and Björn. Special thanks go to the SMT group: Richard,
Evgeny, Shahram, Arne, Gregor, Saab, Daniel, Yuqi, Maja, Jia, Stefan, and Patrick. Also to
some of the "old guys": Nicola, András, Klaus, Wolfgang, Daniel, and Franz. Furthermore, I
would like to thank the secretaries for their continuous support.
I am very thankful for the friendly atmosphere and the support I received at SRI International’s
Speech Technology and Research (STAR) Laboratory, Menlo Park, CA during my stay
in 2007. It was a very interesting and valuable experience.
Next, I would like to thank my parents for supporting me and giving me all the chances I
had. Thank you to Esther and Kirsten for always encouraging me but also reminding me that
there are other things than this work.
Last but not least, this work would not have been possible without Gaby. I have to thank
her for her love, encouragement, never-ending patience, and so much more. I also want to
thank Leo for just being there and for pushing me to write down this thesis.