Information Extraction from Text for Improving Research on Small Molecules and Histone Modifications [Electronic resource] / Corinna Klein

rheinische_friedrich-wilhelms-universitat_bonn - Corinna Klein


229 pages

English


Subjects

Biology

Information

Published by	rheinische_friedrich-wilhelms-universitat_bonn
Published on	1 January 2011
Number of reads	13
Language	English
File size	9 MB

Excerpt

Information Extraction from Text for
Improving Research on Small Molecules and
Histone Modifications
Dissertation
for the award of the doctoral degree (Dr. rer. nat.)
of the
Faculty of Mathematics and Natural Sciences
of the
Rheinische Friedrich-Wilhelms-Universität Bonn
submitted by
Corinna Klein
née Kolářik
from
Zittau
Bonn 2011

Prepared with the approval of the Faculty of Mathematics and Natural Sciences
of the Rheinische Friedrich-Wilhelms-Universität Bonn
1st reviewer: Prof. Dr. rer. nat. Martin Hofmann-Apitius
2nd reviewer: Prof. Dr. rer. nat. Holger Fröhlich
Date of doctoral examination: 10 June 2011
Year of publication: 2011

Abstract
The growing number of publications, in particular in the life sciences, requires efficient
methods for the automated extraction of information and semantic information retrieval. The
recognition and identification of information-carrying units in text – concept denominations
and named entities – relevant to a certain domain is a fundamental step. The focus of
this thesis lies on the recognition of chemical entities and the new biological named entity
type histone modifications, both of which are important in the field of drug discovery. As the
emergence of new research fields and the discovery and generation of novel entities
go along with the coinage of new terms, the continual adaptation of the respective named
entity recognition approaches to new domains is an important step for information extraction.
Two methodologies have been investigated in this regard: the state-of-the-art machine
learning method, Conditional Random Fields (CRF), and an approximate string search
method based on dictionaries. Recognition methods that rely on dictionaries depend strongly
on the availability of entity terminology collections as well as on their quality.
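
As an illustration of dictionary-based recognition with approximate string search, the following minimal sketch matches a text mention against a toy dictionary. It is not the system used in the thesis: it relies on Python's standard `difflib` module as a stand-in similarity measure, and the dictionary entries, identifiers (`CHEM:1`, `CHEM:2`), function names, and the threshold of 0.9 are assumptions made for this example only.

```python
from difflib import SequenceMatcher


def normalize(term):
    """Lowercase and keep only letters and digits so spelling variants align."""
    return "".join(ch for ch in term.lower() if ch.isalnum())


def lookup(mention, dictionary, threshold=0.9):
    """Return (identifier, score) of the best-matching entry, or None.

    The similarity is difflib's ratio on normalized strings; the
    threshold is an illustrative assumption, not a tuned value.
    """
    key = normalize(mention)
    best = None
    for term, identifier in dictionary.items():
        score = SequenceMatcher(None, key, normalize(term)).ratio()
        if score >= threshold and (best is None or score > best[1]):
            best = (identifier, score)
    return best


# Toy dictionary with hypothetical identifiers.
toy_dictionary = {"acetylsalicylic acid": "CHEM:1", "ibuprofen": "CHEM:2"}

print(lookup("Acetyl-salicylic acid", toy_dictionary))  # spelling variant
print(lookup("caffeine", toy_dictionary))               # not in the dictionary
```

The sketch also shows why such methods depend on dictionary quality: a mention can only be found if some entry is close enough to it, and a noisy entry would match just as readily as a correct one.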
In the case of chemical entities, the terminology is distributed over more than 7 publicly
available data sources. Merging entries and their accompanying terminology from selected
resources enables the generation of a new dictionary comprising chemical named entities.
Combined with the automatic processing of the respective terminology – the dictionary curation
– the recognition performance reached an F1 measure of 0.54. That is an improvement of
29 % in comparison to the raw dictionary. The highest recall was achieved for the class of
TRIVIAL names with 0.79.
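
For readers unfamiliar with the evaluation measure, the F1 measure is the harmonic mean of precision and recall. The sketch below computes it from counts of true positives, false positives, and false negatives; the function name and the counts are illustrative assumptions and are not taken from the thesis.

```python
def f1_measure(tp, fp, fn):
    """Harmonic mean of precision and recall from entity-level counts."""
    if tp == 0:
        return 0.0  # no correct predictions: precision and recall are both 0
    precision = tp / (tp + fp)  # share of predicted entities that are correct
    recall = tp / (tp + fn)     # share of annotated entities that were found
    return 2 * precision * recall / (precision + recall)


# Illustrative counts only (not results from the thesis).
print(round(f1_measure(tp=60, fp=40, fn=60), 3))
```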
The recognition and identification of chemical named entities is a prerequisite
for the extraction of related pharmacologically relevant information from literature data.
Therefore, lexico-syntactic patterns were defined that support the automated extraction of
hypernymic phrases comprising pharmacological function terminology related to chemical
compounds. It was shown that 29-50 % of the automatically extracted terms can be proposed
for novel functional annotation of chemical entities provided by the reference database
DrugBank. Furthermore, they are a basis for building up concept hierarchies and ontologies
or for extending existing ones. Subsequently, the pharmacological function and biological
activity concepts obtained from text were included in a novel descriptor for chemical
compounds. Its successful application to the prediction of the pharmacological function of
molecules and to the extension of chemical classification schemes, such as the Anatomical
Therapeutic Chemical (ATC) classification, is demonstrated.
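
A lexico-syntactic pattern of the kind mentioned above can be sketched with a regular expression for the construction "X such as A, B and C", which yields hypernym candidates for each listed term. This is a deliberately crude illustration, not the pattern set defined in the thesis: the names `TOKEN`, `SEPARATOR`, `SUCH_AS`, and `extract_hypernym_pairs` are hypothetical, and taking up to three words before "such as" as the hypernym phrase is a simplifying assumption that a real system would replace with part-of-speech tagging or phrase chunking.

```python
import re

TOKEN = r"[A-Za-z][\w-]*"
# List separators: ",", ", and", ", or", " and ", " or ".
SEPARATOR = r"(?:\s*,\s*(?:and\s+|or\s+)?|\s+(?:and|or)\s+)"
# Assumption: the hypernym phrase is at most three words before "such as".
SUCH_AS = re.compile(
    rf"\b(?P<hypernym>(?:{TOKEN}\s+){{0,2}}{TOKEN})\s+such\s+as\s+"
    rf"(?P<hyponyms>{TOKEN}(?:{SEPARATOR}{TOKEN})*)"
)


def extract_hypernym_pairs(sentence):
    """Return (term, hypernym phrase) pairs found by the 'such as' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(sentence):
        hypernym = match.group("hypernym")
        for hyponym in re.split(SEPARATOR, match.group("hyponyms")):
            pairs.append((hyponym, hypernym))
    return pairs


sentence = ("Nonsteroidal anti-inflammatory drugs such as ibuprofen, "
            "naproxen and diclofenac inhibit cyclooxygenase.")
for term, hypernym in extract_hypernym_pairs(sentence):
    print(term, "->", hypernym)
```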
In contrast to chemical entities, no comprehensive terminology resource has been available
for histone modifications. Thus, histone modification concept terminology was first
recognized in text via CRFs with an F1 measure of 0.86. Subsequently, linguistic variants
of extracted histone modification terms were mapped to standard representations that
were organized into a newly assembled histone modification hierarchy. The mapping was
accomplished by a newly developed term mapping approach described in the thesis. The
combination of term recognition and term variant resolution builds up a new procedure for
the assembly of novel terminology collections. It supports the generation of a term list that
is applicable in dictionary-based methods. For the recognition of histone modifications in
text, it could be shown that the named entity recognition method based on dictionaries is
superior to the machine learning approach used.
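
The idea of mapping linguistic variants to a standard representation can be illustrated with a small rule-based sketch that normalizes a few surface forms to the compact notation commonly used for histone modifications (for example, H3K4me3). This is not the term mapping approach developed in the thesis; the lookup tables, regular expressions, and the function name `to_standard_form` are assumptions chosen for the example, and only a handful of residues and modification types are covered.

```python
import re
from typing import Optional

RESIDUES = {"lysine": "K", "arginine": "R", "serine": "S", "threonine": "T"}
MODIFICATIONS = {
    "monomethylation": "me1", "dimethylation": "me2", "trimethylation": "me3",
    "acetylation": "ac", "phosphorylation": "ph",
}
# Compact variants such as "H3K4me3", "H3-K4-me3" or "H3 K4 me3".
COMPACT = re.compile(r"^(h(?:2a|2b|3|4))([krst])(\d+)(me[123]|ac|ph)$")


def to_standard_form(mention: str) -> Optional[str]:
    """Map a histone modification mention to compact notation, or None."""
    text = mention.lower().replace("-", "")
    compact = COMPACT.match(text.replace(" ", ""))
    if compact:
        histone, residue, position, mod = compact.groups()
        return f"{histone.upper()}{residue.upper()}{position}{mod}"
    # Descriptive variants: look for histone, residue + position, modification.
    histone = re.search(r"\bh(2a|2b|3|4)\b", text)
    residue = re.search(r"\b(lysine|arginine|serine|threonine)\s*(\d+)\b", text)
    mod = next((code for word, code in MODIFICATIONS.items() if word in text), None)
    if not (histone and residue and mod):
        return None
    return (f"H{histone.group(1).upper()}"
            f"{RESIDUES[residue.group(1)]}{residue.group(2)}{mod}")


for variant in ("H3K4me3", "tri-methylation of histone H3 at lysine 4",
                "H3 lysine 4 trimethylation"):
    print(variant, "->", to_standard_form(variant))
```

Once variants are collapsed to one standard form, the collected surface forms can serve as synonym entries of a term list for dictionary-based recognition.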
In conclusion, the present thesis provides techniques that enable an enhanced utilization
of textual data and thereby support research in epigenomics and drug discovery.

Acknowledgments
I would like to take the opportunity to thank Prof. Dr. Martin Hofmann-Apitius
for giving me the chance to work on my thesis at the Bioinformatics department of the
Fraunhofer Institute SCAI. Furthermore, I would like to thank Prof. Dr. Holger Fröhlich
for his willingness to act as second reviewer of the thesis. Special thanks go to Dr. Juliane Fluck,
who introduced me to text mining, gave me strong support during my work, and critically
reviewed the thesis.
I am very grateful to Theo Mevissen for his technical support, especially with ProMiner.
I would like to thank Roman Klinger for providing me with his adapted Mallet implementation
of CRFs and for the many hours of good discussions. Furthermore, I appreciate
the good cooperation with Harsha Gurulingappa, whose master's thesis I
supervised. I thank all other colleagues who accompanied me during my time at SCAI and thank
the Bonn-Aachen International Center for Information Technology (B-IT) for the financial
support of my thesis. Last but not least, special thanks go to my husband Adrian
for his encouragement throughout.

Contents
List of Figures
List of Tables
1 Introduction
1.1 Overview on the Biomedical Relevance and Information Resources of Chemical Entities and Histone Modifications
1.1.1 Introduction to Epigenetics and its Role in Biology, Medicine, and Pharmacology
1.2 Overview on Text Processing Methods
1.2.1 Introduction to Information Extraction
1.2.1.1 Introduction to Named Entity Recognition
1.2.2 Challenges of Chemical Entity and Histone Modification Term Recognition
1.2.2.1 Overview on Terminology of Chemical Substances
1.2.2.2 Overview on Designators of Named Entities
1.2.2.3 Overview on Histone Modification Terminology
2 Problem Description and Goal
2.1 Outline of the Thesis
I Fundamentals
3 Overview on Applied and Developed Methods
3.1 Information Extraction Techniques
3.1.1 Overview on Methods applied in IE
3.1.2 Named Entity Recognition
3.1.2.1 Literature Survey on Biomedical and Chemical Named Entity Recognition
3.1.2.2 Dictionary-based NER Approaches
3.1.2.3 Machine Learning-based NER Approaches
3.1.3 Term Normalization
3.1.3.1 Generation of Canonical Term Representatives
3.1.3.2 Mapping of Terms to Reference Identifiers
3.1.4 Corpus Selection and Annotation
3.1.4.1 Active Learning
3.1.5 Evaluation Measures
3.2 Function Annotation of Entities
3.2.1 Impact of Ontologies for Function Annotation and Data Management
3.2.2 Information Extraction for Supporting Function Annotation of Entities
3.2.2.1 Impact of Ontology for Information Extraction
3.2.2.2 Ontology Learning
3.2.3 Function Prediction of Chemical Entities
3.2.3.1 Related Work – Class Prediction of Chemical Compounds
3.2.3.2 Classification Methods
3.3 Data Visualization
II Developed Systems and Implementations
4 Building a Framework for the Information Aggregation of Chemical Entities
4.1 Recognition of Chemical Named Entities in Text
4.1.1 Generation of an Evaluation Text Corpus and Annotation of Chemical Entity Classes
4.1.2 Generation of a Chemical Named Entity Dictionary
4.1.2.1 Raw Dictionary Generation and Performance Analysis
4.1.2.2 Improvement of the Dictionary Quality by Curation
4.1.2.3 Adjusting the Approximate String Matching to the Chemical Domain
4.1.2.4 Evaluation of the Dictionary Curation and the Approximate String Matching