GoWeb: semantic search and browsing for the life sciences [Electronic resource] / submitted by Heiko Dietze
185 pages
English



Published 1 January 2010

GoWeb: Semantic Search and Browsing for the Life Sciences
Dissertation
for the award of the academic degree Doktoringenieur (Dr.-Ing.)
submitted to
Technische Universität Dresden
Faculty of Computer Science
by
Dipl.-Inf. Heiko Dietze
born 1 February 1980 in Dresden
Supervisor: Prof. Dr. Michael Schroeder, TU Dresden, Faculty of Computer Science
Reviewer: Dr. Albert Burger, Heriot-Watt University, School of Mathematical and Computer Sciences, Edinburgh
Date of defense: 20 October 2010
Dresden, 11 August 2010

ABSTRACT
Searching is a fundamental task to support research. Current search engines are keyword-based.
Semantic technologies promise a next generation of semantic engines, which will be able
to answer questions. Current approaches either apply natural language processing to unstructured
text or they assume the existence of structured statements over which they can reason.
This work provides a system for combining the classical keyword-based search engines with
semantic annotation. Conventional search results are annotated using a customized annotation
algorithm, which takes the textual properties and requirements such as speed and scalability into
account. The biomedical background knowledge consists of the Gene Ontology, the Medical Subject
Headings, and other related entities, e.g. protein/gene names and person names. Together
they provide the relevant semantic context for a search engine for the life sciences. We develop
the system GoWeb for semantic web search and evaluate it using three benchmarks. It is shown
that GoWeb is able to aid question answering with success rates up to 79%.
Furthermore, the system also includes semantic hyperlinks that enable semantic browsing of
the knowledge space. The semantic hyperlinks facilitate the use of the eScience infrastructure,
including complex workflows of composed web services.
To complement the web search of GoWeb, other data sources and more specialized information
needs are tested in different prototypes. These include patent and intranet search. Semantic search
is applicable to these usage scenarios, but the developed systems also show limits of the semantic
approach: the size, applicability and completeness of the integrated ontologies, as well as
technical issues of text extraction and metadata gathering.
Additionally, semantic indexing, an alternative approach to semantic search, is
implemented and evaluated with a question answering benchmark. A semantic index can help to
answer questions and address some limitations of GoWeb. Still, the maintenance and optimization
of such an index are a challenge, whereas GoWeb provides a straightforward system.

CONTENTS
1 Motivation
1.1 Definition of Open Problems
1.2 Thesis Outline
2 Background
2.1 Semantic Search
2.1.1 RDF-based semantic search engines
2.1.2 Web-based semantic search
2.1.3 Literature-based semantic search engines
2.1.4 Summary
2.2 Question Answering
2.2.1 Types of Questions and Answers
2.2.2 Answering Strategies
2.2.3 Competitions
2.3 Ontological Background Knowledge
2.4 Data and Text Sources
2.4.1 Document and Content Types
2.4.2 Web search
2.4.3 Automated Content Retrieval
2.4.4 Summary
2.5 Text Mining and Entity Recognition
2.5.1 Finding Ontology Concepts in Text
2.5.2 Entity Recognition
2.5.3 Algorithmic principles for concept and entity recognition
2.5.4 Summary
2.6 Indexing and Ranking for Information Extraction
2.6.1 Indexing Techniques
2.6.2 Static Similarity Measures for Ranking
2.6.3 Summary
3 GoWeb – Semantic Web Search for the Life Science
3.1 Algorithm
3.1.1 Choosing an Algorithm for Annotation
3.1.2 Implementing a Radix Tree for
3.1.3 Runtime Assessment for the Annotation Algorithms
3.1.4 Description of the Entity Recognition Algorithm
3.1.5 Co-Occurrence-Based Filter
3.2 Architecture
3.3 Evaluation: Answering Research Questions
3.3.1 Genes and Functions
3.3.2 Symptoms and Diseases
3.3.3 Proteins, Diseases and Evidences
3.4 Comparison of GoWeb to other Approaches
3.5 Summary
4 Semantic Browsing
4.1 Semantic Hyperlinks
4.1.1 Semantic Web Services
4.1.2 Compositions of Web Services
4.1.3 Semantic Hyperlinks in GoPubMed Extended
4.1.4 Website Annotation with GoWeb
4.2 Architecture
4.3 Evaluation: Usability Study
4.4 Summary
5 Specialized Semantic Search Engines
5.1 GoPatents
5.1.1 Patents
5.1.2 Patent Classification Schemes
5.1.3 Patent Mining and Retrieval
5.1.4 GoFreePatentsOnline
5.1.5 GoPIZ
5.2 GoCell
5.3 MousePubMed
5.3.1 Extracting Gene Names, Anatomy Concepts and Developmental Stages
5.3.2 Experiment Designs
5.3.3 Conclusion
5.4 GoECDC
5.4.1 Examples Knowledge-based Searching
5.5 Summary
6 Towards Answering Questions with Semantic Indexing
6.1 Semantic Index using a Sentence-based Approach
6.2 TREC Genomics 2006 Question Answering Benchmark
6.3 Evaluation
6.4 Conclusion
7 Conclusion and Future Work
7.1 Future Work
Bibliography
A Evidence Data for GoWeb
A.1 Evidences for GoWeb Results with Google questions
A.2 TREC Genomics 2006 Question and GoWeb Answers
B Additional Data for TREC Genomics Question Answering
B.1 Mapping of TREC 2006 to Query Parameters
B.2 Answers to Genomics 2006 using TREC Genomics 2006 Corpus

PUBLICATIONS
• Peer-Reviewed Article
Heiko Dietze and Michael Schroeder
GoWeb: A semantic search engine for the life science web
In: Proceedings of the Intl. Workshop on Semantic Web Applications and Tools for the Life
Sciences SWAT4LS, Editors: Albert Burger, Adrian Paschke, Paolo Romano and Andrea
