Mining relations from the biomedical literature [Electronic resource] / by Jörg Hakenberg

humboldt-universitat_zu_berlin - Dipl. Inf. Jörg Hakenberg


179 pages

English


Subjects

Informatik (computer science)

Information

Published by	humboldt-universitat_zu_berlin
Published on	01 January 2009
Language	English
File size	4 MB

Excerpt

Mining Relations from the Biomedical Literature
DISSERTATION
submitted for the academic degree of
Dr. rer. nat.
in Computer Science (Informatik)
submitted to the
Mathematisch-Wissenschaftliche Fakultät II
Humboldt-Universität zu Berlin
by
Dipl. Inf. Jörg Hakenberg
Düren, 15 July 1975
President of the Humboldt-Universität zu Berlin:
Prof. Dr. Dr. h.c. Christoph Markschies
Dean of the Mathematisch-Wissenschaftliche Fakultät II:
Prof. Dr. Peter Frensch
Reviewers:
1. Prof. Dr. Ulf Leser
2. Prof. Dr. Hans-Dieter Burkhard
3. Prof. Dr. Udo Hahn
Submitted on: 11 March 2009
Date of oral examination: 11 September 2009

Abstract
Text mining deals with the automated annotation of texts and the extraction of facts from textual data for subsequent analysis. Such texts range from short articles and abstracts to large documents, for instance web pages and scientific articles, but also include textual descriptions in otherwise structured databases. This thesis focuses on two key problems in biomedical text mining: relationship extraction from biomedical abstracts (in particular, protein–protein interactions), and a prerequisite step, named entity recognition (again focusing on proteins).

This thesis presents goals, challenges, and typical approaches for each of the main building blocks in biomedical text mining. We present our own approaches for named entity recognition of proteins and relationship extraction of protein–protein interactions. For the first, we describe two methods, one set up as a classification task, the other based on dictionary matching. For relationship extraction, we develop a methodology to automatically annotate large amounts of unlabeled data for relations, and make use of such annotations in a pattern-matching strategy. This strategy first extracts similarities between sentences that describe relations, storing them as consensus patterns. We develop a sentence alignment approach that introduces multi-layer alignment, making use of multiple annotations per word. For the task of extracting protein–protein interactions, empirical results show that our methodology performs comparably to existing approaches that require a large amount of human intervention, either for annotation of data or creation of models.
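
To make the idea of multi-layer sentence alignment concrete, the following is a minimal sketch in Python, not the implementation from the thesis. It assumes each token carries three annotation layers (word, part-of-speech tag, entity type) and scores a global alignment with standard Needleman-Wunsch dynamic programming. The layer names, the weights in `LAYER_WEIGHTS`, the penalties, and the example sentences are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Optional

# One annotation per layer for each token, e.g. word form, POS tag, entity type.
Token = Dict[str, Optional[str]]

# Hypothetical weights and penalties; the thesis's actual scoring is not shown here.
LAYER_WEIGHTS = {"word": 1.0, "pos": 0.5, "entity": 2.0}
GAP_PENALTY = -1.0
MISMATCH_PENALTY = -1.0


def token_score(a: Token, b: Token) -> float:
    """Sum the weights of all layers on which two tokens agree."""
    score = sum(
        weight
        for layer, weight in LAYER_WEIGHTS.items()
        if a.get(layer) is not None and a.get(layer) == b.get(layer)
    )
    return score if score > 0 else MISMATCH_PENALTY


def align_score(s: List[Token], t: List[Token]) -> float:
    """Score of the best global alignment (Needleman-Wunsch)."""
    n, m = len(s), len(t)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * GAP_PENALTY
    for j in range(1, m + 1):
        dp[0][j] = j * GAP_PENALTY
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + token_score(s[i - 1], t[j - 1]),  # align tokens
                dp[i - 1][j] + GAP_PENALTY,                          # gap in t
                dp[i][j - 1] + GAP_PENALTY,                          # gap in s
            )
    return dp[n][m]


# Two made-up sentences that share structure but not protein names.
sent_a = [
    {"word": "RAD51", "pos": "NN", "entity": "PROTEIN"},
    {"word": "binds", "pos": "VBZ", "entity": None},
    {"word": "BRCA2", "pos": "NN", "entity": "PROTEIN"},
]
sent_b = [
    {"word": "p53", "pos": "NN", "entity": "PROTEIN"},
    {"word": "binds", "pos": "VBZ", "entity": None},
    {"word": "MDM2", "pos": "NN", "entity": "PROTEIN"},
]

print(align_score(sent_a, sent_b))
```

The two example sentences differ in every protein name, yet they still align well because the part-of-speech and entity layers agree. This illustrates why multiple annotations per word can help generalize a pattern beyond exact word matches.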

Zusammenfassung (Summary)

Text mining is concerned with the automated annotation of texts and the extraction of individual pieces of information from texts, which are then available for further processing. Such texts can be short summaries or complete articles, for example web pages and scientific articles, but also include textual entries in otherwise structured databases. This dissertation discusses two central topics of biomedical text mining: the extraction of relations between biological entities, with the main focus on recognizing protein-protein interactions, and a necessary preprocessing step, the recognition of protein names.

This thesis describes goals, challenges, and typical approaches for all essential components of biomedical text mining. We present our own methods for recognizing protein names and for extracting protein-protein interactions. Two methods for recognizing protein names are discussed, one based on a classification problem, the other on dictionary lookup. For the extraction of interactions, we develop a method for automatically annotating large amounts of text with respect to relations; these annotations are then used for pattern discovery, so that the patterns found can subsequently be applied to new text. To identify patterns, we compute similarities between previously found sentences that describe the same type of relation or interaction. We store these similarities as so-called 'consensus patterns'. We develop an alignment strategy that allows multi-layer annotations per position in the pattern. In experiments on well-known benchmarks, we show empirically that our fully automatic method achieves results comparable to existing methods that require extensive intervention by experts.

Contents
1. Introduction
1.1. Motivation
1.2. Contribution
1.3. Outline of this Thesis
2. Text Mining
2.1. Information Extraction
2.1.1. Named Entity Recognition
2.1.2. Identification
2.1.3. Word Sense Disambiguation
2.1.4. Relation Mining
2.2. Machine Learning
2.2.1. Classification
2.2.2. Clustering
2.2.3. Sequence Learning
2.3. Natural Language Processing
2.3.1. Sentence Splitting
2.3.2. Tokenization
2.3.3. Part-of-Speech Tagging
2.3.4. Stemming
2.3.5. Sentence Chunking
2.3.6. Shallow Parsing
2.3.7. Full Sentence Parsing
3. Named Entity Recognition
3.1. Token-based Named Entity Recognition
3.1.1. Classification task
3.1.2. Post-processing to find multi-word terms
3.1.3. Evaluation
3.2. Dictionary-Based Named Entity Recognition
3.3. Extension of Dictionary-Based Approaches to Named Entity Identification
3.4. Related Work
3.4.1. Dictionary-based approaches
3.4.2. Rule-based approaches
3.4.3. Classification-based approaches
3.4.4. Sequence-based approaches
3.4.5. Hybrid approaches
3.5. Conclusions
4. Language Patterns
4.1. Pattern representation
4.1.1. Consensus patterns
4.1.2. Multi–layer
4.1.3. Weighted multi–layer patterns
4.2. Matching patterns against text
4.2.1. Patterns as regular expressions
4.2.2. Pattern matching with sentence alignment
4.2.3. Substitution matrices
4.2.4. Dynamic programming to solve the alignment problem
4.2.5. Weighted multi–layer alignment
4.3. Learning patterns with multiple sentence alignment
4.3.1. Clustering of initial patterns
4.3.2. Multiple sentence alignment
4.4. Pattern generation by pairwise alignment
4.5. Optimization using genetic algorithms
5. Applications and Evaluation
5.1. Collecting a large pattern sample
5.1.1. Curated data bases to find examples
5.1.2. Protein–protein interaction patterns using IntAct
5.2. Assigning attributes to relations
5.2.1. Predicting directed relations
5.3. Mining protein-protein interactions
5.3.1. Hand–crafted patterns
5.3.2. Optimized patterns
5.3.3. Patterns learned from large samples
5.4. Related Work