Improving statistical machine translation using morpho-syntactic information [Elektronische Ressource] / vorgelegt von Sonja Nießen
123 pages
English



Published 1 January 2002

Improving Statistical Machine Translation using
Morpho-syntactic Information
Dissertation approved by the Faculty of Mathematics, Computer Science
and Natural Sciences of the
Rheinisch-Westfälische Technische Hochschule Aachen
for the academic degree of
Doktorin der Naturwissenschaften (Doctor of Natural Sciences)
submitted by
Diplom-Informatikerin
Sonja Nießen
from Geilenkirchen
Reviewers: Universitätsprofessor Dr.-Ing. Hermann Ney
Professor Dr. Enrique Vidal
Date of the oral examination: 2 December 2002
This dissertation is available online on the website of the university library.

Acknowledgments
This thesis is based on work carried out during my time as a research scientist at the
Department for Computer Science at the University of Technology in Aachen, Germany.
First, I would like to express my gratitude to my advisor Professor Dr.-Ing. Hermann
Ney, head of the Lehrstuhl für Informatik VI at the University of Technology in Aachen.
His advice, his continuous interest, and his support made this thesis ultimately possible.
I would also like to thank my second advisor Professor Dr. Enrique Vidal, from the
Departamento de Sistemas Informáticos y Computación at the Universidad Politécnica
de Valencia, for his interest in this work and the valuable comments on the early drafts
of this thesis.
Special thanks go to Gregor Leusch and Richard Zens for their valuable programming
work. All the other people at the Lehrstuhl für Informatik VI are also deserving of my
thanks for many fruitful discussions, and for the very good working atmosphere. Some of
my colleagues became real friends joining me in my “ups” and helping me through the
inevitable “downs”.
Thanks to the Nespole! consortium, listed on the project’s homepage [Nespole! 00], for
making available part of the Nespole! data. Special thanks to Alon Lavie, Lori Levin,
Stephan Vogel and Alex Waibel (in alphabetical order).
I am very grateful to my parents, Ingeborg Nießen and Karl-Heinz Nießen, who are
always there for me and who taught me honesty and perseverance.
Finally, I would like to thank Torsten Bausch for his understanding, love and support.
Aachen, February 2003. Sonja Nießen

Abstract
In the framework of statistical machine translation, correspondences between the words in
the source and the target language are learned from bilingual corpora, and often little or
no linguistic knowledge is used to structure the underlying models. The work presented
in this thesis is motivated by the well-known observation that training data typically does
not sufficiently represent the range of phenomena in natural languages. In this thesis,
various methods of incorporating morphological and syntactic information into systems
for statistical machine translation are proposed and systematically assessed. The overall
goal is to improve translation quality and to reduce the amount of parallel text necessary
to train the model parameters. The development of the suggested methods is guided by
the analysis of important causes of errors.
Large differences in word order between corresponding sentences are difficult to capture
for automatic alignment algorithms. In this work, a range of sentence level restructuring
transformations is introduced, which are motivated by knowledge about the sentence
structure in the involved languages. These transformations aim at the assimilation of
word orders in related sentences. A detailed analysis of the effect on the corpora and
the translation quality reveals that their application results in better alignments and as a
consequence in less noisy probabilistic lexica, broader applicability of multi-word phrase
pairs and a better coverage of the language model.
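
As a rough illustration of what a sentence-level restructuring transformation might look like, the following Python sketch reattaches a detached German verb prefix to its finite verb, which brings the German word order closer to the English one. The choice of this particular transformation, the tag names `VFIN` and `PREFIX`, and the function name `merge_separated_prefix` are assumptions made for this example; the abstract does not specify which transformations are used or how they are implemented.

```python
from typing import List, Tuple

# (word, tag) pairs; the tag inventory here is hypothetical.
Token = Tuple[str, str]


def merge_separated_prefix(sentence: List[Token]) -> List[str]:
    """Prepend a detached verb prefix ('PREFIX') to the first finite verb ('VFIN').

    The sentence is returned unchanged if either element is missing or if
    the prefix does not follow the verb.
    """
    words = [word for word, _ in sentence]
    verb_idx = next((i for i, (_, tag) in enumerate(sentence) if tag == "VFIN"), None)
    prefix_idx = next((i for i, (_, tag) in enumerate(sentence) if tag == "PREFIX"), None)
    if verb_idx is None or prefix_idx is None or prefix_idx < verb_idx:
        return words
    # Join prefix and verb, then drop the prefix from its old position.
    words[verb_idx] = words[prefix_idx] + words[verb_idx]
    del words[prefix_idx]
    return words


# "ich fahre morgen ab" (roughly: "I leave tomorrow")
example = [("ich", "PRON"), ("fahre", "VFIN"), ("morgen", "ADV"), ("ab", "PREFIX")]
restructured = merge_separated_prefix(example)
```

In such a setup the same transformation would presumably be applied to the training corpus and to the test input, so that the alignment model and the translation search see the same word order.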
Existing statistical systems for machine translation often treat different inflected forms
of the same lemma as if they were independent of each other. A better exploitation of
the bilingual training data can be achieved by explicitly taking into account the
interdependencies of the related inflected forms. In this work a hierarchy of equivalence classes
is defined on the basis of morphological and syntactic information beyond the surface
forms. Features from those hierarchy levels are combined to form hierarchical lexicon
models which can replace the standard probabilistic lexicon used in most statistical
machine translation systems. The benefit from these combined models is twofold: Firstly,
the lexical coverage is improved, because the translation of unseen word forms can be
derived by considering information from lower levels in the hierarchy. Secondly,
category ambiguity can be resolved, because syntactical context information is made locally
accessible by means of annotation with morpho-syntactic tags.
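
The coverage argument can be made concrete with a minimal Python sketch. It assumes that each word comes with a lemma and an ordered tuple of morpho-syntactic tags (most general tag first) and that equivalence classes are formed by dropping tags from the end. The lookup below simply backs off to the first class that has an entry, which is a deliberate simplification: the abstract speaks of combining features from several levels, not of back-off. All names (`hierarchy_keys`, `lookup`), the tag values, and the probabilities are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# (lemma, tags); tags are assumed to be ordered from general to specific.
Analysis = Tuple[str, Tuple[str, ...]]
Lexicon = Dict[tuple, Dict[str, float]]


def hierarchy_keys(word: str, analysis: Analysis) -> List[tuple]:
    """List lookup keys from the full word form down to the bare lemma."""
    lemma, tags = analysis
    keys: List[tuple] = [("form", word)]
    # Drop one tag at a time: each step yields a coarser equivalence class.
    for n in range(len(tags), -1, -1):
        keys.append(("class", lemma, tags[:n]))
    return keys


def lookup(word: str, analysis: Analysis, lexicon: Lexicon) -> Optional[Dict[str, float]]:
    """Return the translation distribution of the most specific class found."""
    for key in hierarchy_keys(word, analysis):
        if key in lexicon:
            return lexicon[key]
    return None


# Toy lexicon with invented probabilities.
lexicon: Lexicon = {
    ("form", "gehe"): {"go": 0.9, "walk": 0.1},
    ("class", "gehen", ("verb",)): {"go": 0.7, "goes": 0.2, "walk": 0.1},
}

seen = lookup("gehe", ("gehen", ("verb", "present", "1sg")), lexicon)
unseen = lookup("gingst", ("gehen", ("verb", "past", "2sg")), lexicon)
```

Here `gingst` never occurs in the toy lexicon, but a distribution is still found through the class that keeps only the lemma and the part-of-speech tag.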
Conventional bilingual dictionaries are often used as additional data to better train
the model parameters. One of the disadvantages of these dictionaries as compared to
full bilingual corpora is the fact that their entries typically contain no context to enable
the distinction between the translations for different readings of a word. In this work a
method for aligning corresponding readings in conventional dictionaries containing pairs
of fully inflected word forms is proposed. The approach uses information deduced from
one language side to resolve category ambiguity in the corresponding entry in the other
language. The resulting disambiguated dictionaries are better suited for improving the
quality of machine translation, especially if they are used in combination with the
hierarchical lexicon models.
It is a costly and time-consuming task to gather large texts and have them translated to
form bilingual corpora suitable for training the model parameters for statistical machine
translation. In this work the amount of bilingual data required to achieve an acceptable
quality of machine translation is systematically investigated. All the methods presented
in this thesis contribute to a better exploitation of the available bilingual data and thus
to improving translation quality in frameworks with scarce resources.
The combination of the suggested methods results in substantial improvements on the
Verbmobil task, the Nespole! task and the Zeres task, for German to English and English
to German translation and for text input and on the output of a speech recognizer.
The second focus of this thesis is on evaluation of machine translation quality. A tool
for the evaluation of translation quality which accounts for the specific requirements in
a research environment is developed. Evaluation criteria which are more adequate than
pure edit distance are defined. The measurement along these quality criteria is performed
semi-automatically in a fast, convenient and consistent way using the tool and the
corresponding graphical user interface. The quality criteria themselves are systematically
assessed.

Summary (Zusammenfassung)
In statistical machine translation, the correspondence between words in the source and
the target language is learned from bilingual corpora, and often little or no linguistic
knowledge goes into structuring the underlying models. The work presented here is
motivated by the widely known observation that the training material typically does not
sufficiently reflect the range of peculiarities of natural languages. Various methods for
embedding morphological and syntactic information into statistical translation systems
are presented and systematically tested. The general goal is to improve translation
quality and to reduce the amount of data needed to train the model parameters. The
development of the proposed methods is guided by the analysis of predominant causes
of errors.
It is difficult for alignment algorithms to handle larger differences in word order
between corresponding sentences. In this work, a series of reordering operations at the
sentence level is introduced, which are based on knowledge about the sentence structure
in the languages involved. The purpose of these transformations is to make related
sentences more similar to each other. A detailed analysis of the effect on corpora and
translation results suggests that their application leads to better word alignments and
consequently to less noisy probabilistic lexica, to broader applicability of multi-word
phrases and to better coverage by the target language model.
The existing statistical translation systems treat different word forms of the same
lemma as independent of each other. The bilingual training material can be exploited
better by explicitly taking into account the mutual dependencies of related word forms.
In this work, a hierarchy of equivalence classes is defined which is based on
morphological and syntactic information beyond the surface forms. By combining features
from the different hierarchy levels, hierarchical lexicon models are formed which can
replace the probabilistic lexica customary in most statistical translation systems. These
combined models have a twofold benefit: Firstly, they improve vocabulary coverage,
because the translations of unseen word forms can be derived from information
originating from lower levels in the hierarchy. Secondly, ...
