Statistical computer assisted translation [electronic resource] / submitted by Shahram Khadivi
141 pages
English



Information

Published on January 1, 2008
Language: English
File size: 1 MB

Extract

Statistical Computer-Assisted Translation

Dissertation approved by the Faculty of Mathematics, Computer Science and Natural Sciences of RWTH Aachen University (Rheinisch-Westfälische Technische Hochschule Aachen) for the academic degree of Doctor of Natural Sciences

submitted by
Shahram Khadivi, M.Sc. Comp. Eng.
from
Esfahan, Iran

Reviewers: Professor Dr.-Ing. Hermann Ney
Professor Dr. Enrique Vidal
Date of the oral examination: Thursday, July 10, 2008
This dissertation is available online on the website of the university library (Hochschulbibliothek).

In the name of God, the beneficent, the merciful
And of His signs is the creation of the heavens and the earth, and the difference of
your languages and colours. Lo! herein indeed are portents for men of knowledge.
– Quran (The Romans, 30)

To my beloved wife: Maryam
To my beloved sons: Meysam & Mohammad
And to all other beloved ones

Acknowledgments
I am deeply indebted to my advisor, Prof. Dr. Hermann Ney, for his constant support and his invaluable advice and critical comments. Without his help, this work would not have been possible. He gave me the opportunity to attend a variety of conferences and meetings, and gave me the possibility to organize two international projects.
I am very grateful to Prof. Dr. Enrique Vidal for agreeing to take the time to evaluate this thesis as a co-referee, and for his useful comments in the meetings of the TransType2 project. I would also like to thank the members of my committee: Prof. Dr. Dr. h.c. Wolfgang Thomas and Prof. Dr. Thomas Seidl. Their advice and patience are appreciated.
My PhD studies turned out to be an unforgettable experience, mostly thanks to the support from my colleagues and friends. I was lucky to be part of a great group of scientifically ambitious, intelligent, and industrious researchers at the Lehrstuhl für Informatik 6: Oliver Bender, Jan Bungeroth, Thomas Deselaers, Philippe Dreuw, Christian Gollan, Saša Hasan, Björn Hoffmeister, Daniel Keysers, Patrick Lehnen, Gregor Leusch, Jonas Lööf, Wolfgang and Klaus Macherey, Evgeny Matusov, Arne Mauser, Amr Mousa, Franz Och, Christian Plahl, Maja Popović, Sonja Nießen, Ralf Schlüter, Daniel Stein, Tibor Szilassy, Nicola Ueffing, David Vilar, Jia Xu, Morteza Zahedi, Richard Zens, Yuqi Zhang, András Zolnay, and all other individuals. Their kindness helped me to feel at home, and their enthusiasm was contagious.
I thank Jan, Saša, Evgeny, Oliver, David, Arne, and Daniel for taking the time to proofread this thesis. I also thank Tibor, András, Jonas, and Björn for providing the ASR word graphs. Special thanks also go to Richard, who greatly enriched my knowledge with his exceptional insights into statistical machine translation. I am also grateful to Franz and Sonja for their help and encouragement at the beginning of my studies. I also thank the secretaries of Informatik 6, Gisela Gillmann, Katja Bäcker, and Annette Kopp, especially Gisela, who is always patient to listen to me and volunteers to help.
At this point, I would like to express my everlasting gratitude and love to my wife for her patience, support, and love. I wish to extend my heartfelt thanks to my parents, whose continuous encouragement lit my path into higher education. My soulful thanks definitely include my beloved sons, Meysam and Mohammad. They were my main source of energy for finishing this thesis.
This dissertation was written during my time as a researcher with the Lehrstuhl für Informatik 6 of RWTH Aachen University in Aachen, Germany. This work was partially funded by the European Union under the RTD project TransType2 (IST-2001-32091) and the integrated project TC-STAR - Technology and Corpora for Speech to Speech Translation - (IST-2002-FP6-506738, http://www.tc-star.org).

Abstract
In recent years, significant improvements have been achieved in statistical machine translation (MT), but even the best machine translation technology is still far from replacing or even competing with human translators. However, an MT system can help to increase the productivity of human translators. Usually, human translators edit the MT system output to correct the errors, or they may edit the source text to limit the vocabulary. One way of increasing the productivity of the whole translation process (MT plus human work) is to incorporate the human correction activities into the translation process, thereby shifting the MT paradigm to that of computer-assisted translation (CAT). In a CAT system, the human translator begins to type the translation of a given source text; with each typed character, the MT system interactively offers and refines a completion of the translation. The human translator may continue typing or accept the whole completion or part of it. Here, we use a fully fledged translation system, phrase-based MT, to develop computer-assisted translation systems. An important factor in a CAT system is the response time of the MT system. We describe an efficient search space representation using word hypotheses graphs, so as to guarantee a fast response time. The experiments are carried out on a small and a large standard task.
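The interaction loop described above can be illustrated with a minimal sketch. It is not the system of the thesis: a flat list of scored candidate translations stands in for the word hypotheses graph, and the names `complete` and `hypotheses`, as well as the scores, are hypothetical. A real CAT system would also have to handle prefixes that match no stored hypothesis, which this sketch does not attempt.

```python
def complete(prefix, hypotheses):
    """Return the suffix of the best-scoring hypothesis that starts with `prefix`.

    `hypotheses` is a list of (translation, log_score) pairs; a higher score is better.
    An empty string is returned when no hypothesis extends the prefix.
    """
    best = None
    for text, score in hypotheses:
        if text.startswith(prefix) and (best is None or score > best[1]):
            best = (text, score)
    return best[0][len(prefix):] if best else ""


# Hypothetical candidate translations with made-up log scores.
hypotheses = [
    ("the house is small", -1.2),
    ("the home is small", -1.9),
    ("this house is little", -2.5),
]

# Each keystroke extends the prefix and triggers a new completion.
for typed in ["the ho", "the hom"]:
    print(repr(typed), "->", repr(complete(typed, hypotheses)))
```

Because the candidates are precomputed, each keystroke only filters and ranks existing hypotheses instead of rerunning the translation search, which is the intuition behind using a compact search space representation to keep response times short.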
Skilled human translators are faster at dictating than at typing their translations; a desirable feature of a CAT system is therefore the integration of human speech into the CAT system. In a CAT system with integrated speech, two sources of information are available to recognize the speech input: the target language speech and the given source language text. The target language speech is a human-produced translation of the source language text. The main challenge in integrating the automatic speech recognition (ASR) and MT models in a CAT system is the search. The search in the MT system and in the ASR system is already very complex; a full single search that combines the ASR and MT models would therefore considerably increase the complexity. In addition, a full single search becomes more complex since there is neither a specific model nor any appropriate training data. In this work, we study different methods to integrate the ASR and MT models. We propose several new integration methods based on N-best list and word graph rescoring strategies. We study the integration of both single-word based MT and phrase-based MT with ASR models. The experiments are performed on a standard large task, namely the European Parliament plenary sessions.
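To make the idea of N-best list rescoring concrete, here is a small sketch under stated assumptions. Each ASR hypothesis carries a recognizer log score, and a toy single-word translation score in the style of IBM Model 1 measures how well the hypothesis matches the source sentence. The weighted sum of the two scores, the function names, the lexicon entries, and all numbers are illustrative assumptions, not the models or weights used in the thesis.

```python
import math


def model1_logprob(source, target, lexicon, floor=1e-4):
    """Toy single-word translation score: each target word is explained by
    the average lexicon probability over all source words."""
    logprob = 0.0
    for t in target:
        p = sum(lexicon.get((s, t), floor) for s in source) / len(source)
        logprob += math.log(p)
    return logprob


def rescore_nbest(nbest, source, lexicon, asr_weight=1.0, mt_weight=1.0):
    """Return the hypothesis with the best weighted sum of ASR and MT log scores."""
    def combined(entry):
        hypothesis, asr_logprob = entry
        mt_logprob = model1_logprob(source, hypothesis.split(), lexicon)
        return asr_weight * asr_logprob + mt_weight * mt_logprob
    return max(nbest, key=combined)[0]


# Hypothetical data: the recognizer slightly prefers the wrong word "mouse".
source = "das haus ist klein".split()
lexicon = {("das", "the"): 0.9, ("haus", "house"): 0.9,
           ("ist", "is"): 0.9, ("klein", "small"): 0.9}
nbest = [("the mouse is small", -10.0), ("the house is small", -10.5)]

print(rescore_nbest(nbest, source, lexicon))                 # MT evidence included
print(rescore_nbest(nbest, source, lexicon, mt_weight=0.0))  # ASR score only
```

Rescoring a fixed list avoids a joint search over both models: the recognizer proposes candidates, and the translation model only has to score them against the known source text.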
A CAT system might be equipped with a memory-based module that does not actually translate, but finds the translation in a large database of exact or similar matches from sentences or phrases that are already known. Such a database, known as a bilingual corpus, is also essential for training the statistical machine translation models. Therefore, having a larger database means a more accurate and faster translation system. In this thesis, we also investigate efficient ways to compile bilingual sentence-aligned corpora from the Internet. We propose two new methods for sentence alignment. The first one is a typical extension of the existing methods in the field of sentence alignment for parallel texts. We show how we can employ sentence-length based models, word-to-word translation models, cognates, bilingual lexica, and any other features in an efficient way. In the second method, we align sentences based on bipartite graph matching. We show that this new algorithm performs competitively with other methods for parallel corpora, and at the same time it is very useful in handling differences in sentence order between a source text and its corresponding translation text. Further, we propose an efficient way to recognize and filter out wrong sentence pairs from the bilingual corpora.
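The bipartite matching view of sentence alignment can be sketched as follows. This is an illustration under strong assumptions, not the algorithm of the thesis: both texts have the same number of sentences, every sentence has exactly one partner, the cost is a simple character-length mismatch, and the optimal matching is found by brute force over permutations, which only works for a handful of sentences. The names `length_cost` and `align_bipartite` are hypothetical.

```python
from itertools import permutations


def length_cost(src, tgt):
    """Relative difference in character length; 0.0 means equal length."""
    return abs(len(src) - len(tgt)) / max(len(src), len(tgt))


def align_bipartite(src_sents, tgt_sents, cost=length_cost):
    """Return the one-to-one alignment with minimal total cost as a list of
    (source_index, target_index) pairs. Brute force, for tiny inputs only."""
    n = len(src_sents)
    if n != len(tgt_sents):
        raise ValueError("this sketch assumes equally many sentences on both sides")
    best = min(
        permutations(range(n)),
        key=lambda perm: sum(cost(src_sents[i], tgt_sents[j])
                             for i, j in enumerate(perm)),
    )
    return list(enumerate(best))


# The translation lists the same sentences in a different order.
src = ["Guten Morgen.", "Das ist ein ziemlich langer Satz.", "Ja."]
tgt = ["Yes.", "Good morning.", "This is a rather long sentence."]

for i, j in align_bipartite(src, tgt):
    print(src[i], "<->", tgt[j])
```

Because the matching places no monotonicity constraint on the pairs, reordered sentences are aligned as easily as sentences in the same order. For realistic document sizes, a polynomial-time assignment algorithm such as the Hungarian method (available, for example, as `scipy.optimize.linear_sum_assignment`) would replace the brute-force search, and the cost would combine further features such as lexicon or cognate evidence.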
Summary (Zusammenfassung)
In recent years, machine translation with statistical methods (statistical machine translation, MT) has achieved significant improvements, but even the best machine translation is still clearly inferior to a human translator. Nevertheless, an MT system can increase the productivity of a human worker. Usually, human translators edit the MT system output to correct errors, or they edit the source text to restrict the vocabulary. One way to increase the productivity of the whole translation process (MT and human work) is to incorporate the human correction work into the translation process. The MT model thus turns into computer-assisted translation (CAT). When the human translator begins to enter a translation of a given source text, such a CAT system offers an interactive completion of the translation as each character is typed. The translator can then continue typing or accept the completion in whole or in part. For this purpose, we use a full-fledged phrase-based translation system as part of an overall CAT system. A fast response time of the MT system is important for such a system; it is ensured by an efficient representation of the search space using word graphs. The experiments are carried out on a small and a large standard task.
Trained translators can dictate a translation faster than manu…
