Exploiting Social Semantics for Multilingual Information Retrieval [electronic resource] / Philipp Sorg. Supervisor: R. Studer

Karlsruher Institut für Technologie - Philipp Sorg


209 pages

English


Subject: Computer science (Informatik)

Published by: Karlsruher Institut für Technologie
Published on: 1 January 2011
Language: English

Exploiting Social Semantics for Multilingual
Information Retrieval
Dissertation submitted for the academic degree of
Doctor of Economic Sciences (Doktor der Wirtschaftswissenschaften, Dr. rer. pol.)
to the Fakultät für W
at the Karlsruher Institut für Technologie
by
Dipl.-Inf. Philipp Sorg
Date of the oral examination: 22 July 2011
Referee: Prof. Dr. Rudi Studer
Co-referee: Prof. Dr. Philipp Cimiano
Examiner: Prof. Dr. Andreas Oberweis
Chair of the examination committee: Prof. Dr. Christof Weinhardt

Abstract
Information Retrieval (IR) deals with delivering relevant information items given
the specific information needs of users. As retrieval problems are defined in various
environments such as the World Wide Web, corporate knowledge bases or even
personal desktops, IR is an everyday problem that concerns almost everybody in our
society. In this thesis, we present research results on the problem of Multilingual
IR (MLIR), which defines retrieval scenarios that cross language borders. MLIR is a
real-world problem which we motivate using different application scenarios, for
example search systems whose users have reading skills in several languages, or expert
retrieval.
As the main topic of this thesis, we consider how user-generated content that
is assembled by different popular Web portals can be exploited for MLIR. These
portals, of which Wikipedia and Yahoo! Answers are prominent examples, are built
from the contributions of millions of users. We define the knowledge that can be
derived from such portals as Social Semantics. Further, we identify important features of
Social Semantics, namely the support of multiple languages, the broad coverage of
topics and the ability to adapt to new topics. Based on these features, we argue that
Social Semantics can be exploited as background knowledge to support multilingual
retrieval systems.
Our main contribution is the integration of Social Semantics into multilingual
retrieval models. To this end, we present Cross-lingual Explicit Semantic Analysis, a
semantic document representation that is based on interlingual concepts derived from
Wikipedia. Further, we propose a mixture language model that integrates different
sources of evidence, including the knowledge encoded in the category structure of
Yahoo! Answers.
For evaluation, we measure the benefit of the proposed retrieval models that
exploit Social Semantics. In our experiments, we apply these models to different
established datasets, which allows for comparison with standard IR baselines and with
related approaches that are based on different kinds of background knowledge. As
standardized settings were not available for all the scenarios we considered, in
particular for multilingual Expert Retrieval, we further organized an international retrieval
challenge that allowed the evaluation of our proposed retrieval models, which were
not covered by existing challenges.

Contents
I Introduction
I.1 Multilingual Retrieval Scenario
I.2 Definition of Semantics
I.2.1 Historical Overview
I.2.2 Social Semantics defined by Category Systems
I.3 Definition of Information Retrieval
I.3.1 Multilingual IR
I.3.2 Entity Search
I.4 Research Questions
I.5 Overview of the Thesis
II Preliminaries of IR
II.1 Document Preprocessing
II.1.1 Document Syntax and Encoding
II.1.2 Tokenization
II.1.3 Normalization
II.1.4 Reference to this Thesis
II.2 Monolingual IR
II.2.1 Document Representation
II.2.2 Index Structures
II.2.3 Retrieval Models
II.2.4 Query Expansion
II.2.5 Document a priori Models
II.2.6 Reference to this Thesis
II.3 Cross-lingual IR
II.3.1 Translation-based Approaches
II.3.2 Machine Translation
II.3.3 Interlingual Document Representations
II.3.4 Reference to this Thesis
II.4 Multilingual IR
II.4.1 Language Identification
II.4.2 Index Construction for MLIR
II.4.3 Query Translation
II.4.4 Aggregation Models
II.4.5 Reference to this Thesis
II.5 Evaluation in IR
II.5.1 Experimental Setup
II.5.2 Relevance Assessments
II.5.3 Evaluation Measures
II.5.4 Established Datasets
II.5.5 References to this Thesis
II.6 Tools, Software and Resources
III Semantics in IR
III.1 Semantic Vector Spaces
III.1.1 Generalized Vector Space Model
III.1.2 Latent Semantic Indexing
III.1.3 Latent Dirichlet Allocation
III.1.4 Semantic Smoothing Kernels
III.1.5 Explicit Semantic Analysis
III.1.6 Ontology-based Document Representations
III.2 Semantic Relatedness
III.2.1 Classification of Semantic Relatedness
III.2.2 Structured Knowledge Sources
III.2.3 Text Corpora as Knowledge Source
III.3 Semantic Retrieval Models
III.3.1 Language Models
III.3.2 Learning to Rank
III.3.3 Extended Query Models
IV Cross-lingual Explicit Semantic Analysis
IV.1 Preliminaries
IV.2 Explicit Semantic Analysis
IV.2.1 Definition of Concepts
IV.2.2 Original ESA Model
IV.2.3 Explicit Semantic Analysis (ESA) applied to IR
IV.2.4 Historical Overview
IV.3 Definition of Cross-lingual Explicit Semantic Analysis (CL-ESA)
IV.3.1 Definition
IV.3.2 CL-ESA applied to CLIR/MLIR
IV.3.3 Example for CL-ESA
IV.4 Design Choices
IV.4.1 Dimension Projection
IV.4.2 Association Strength
IV.4.3 Relevance Function
IV.4.4 Concept Spaces
IV.5 Experiments
IV.5.1 Methodology and Evaluation Measures
IV.5.2 Test Datasets
IV.5.3 Reference Corpus
IV.5.4 Evaluation of ESA Model Variants
IV.5.5 Concept Spaces for Multilingual Scenarios
IV.5.6 External vs. intrinsic Concept Definitions
IV.5.7 Experiments on the CLEF Ad-hoc Task
V Category-based LMs for Multilingual ER
V.1 Expert Retrieval
V.2 Language Models for IR
V.2.1 Theory of Language Models
V.2.2 Smoothing
V.2.3 Language Models for Expert Search
V.2.4 Historical Overview
V.3 Extensions of Language Models
V.3.1 Language Models for MLIR
V.3.2 Language Models for Category Systems
V.4 Combining Sources of Evidence
V.4.1 Mixture Language Models
V.4.2 Discriminative Models
V.4.3 Learning to Rank
V.5 Experiments
V.5.1 Yahoo! Answers
V.5.2 Dataset
V.5.3 Evaluation Measures
V.5.4 Baselines
V.5.5 Results of Baselines
V.5.6 Results of the Mixture Language Models
V.5.7 Feature Analysis
V.6 Summary of Results
V.6.1 Results of the Experiments
V.6.2 Lessons Learned
VI Enriching the CL Structure of Wikipedia
VI.1 Motivation
VI.1.1 Statistics about German/English Cross-Language Links
VI.1.2 Chain Link Hypothesis
VI.2 Classification-based Approach
VI.2.1 Feature Design
VI.3 Evaluation
VI.3.1 Baseline
VI.3.2 Evaluation of the RAND1000 Dataset
VI.3.3 Learning New Cross-language Links
VI.4 Discussion
VI.4.1 Self-correctiveness of Wikipedia
VI.4.2 Quality and Future of Web 2.0 Resources
VII Conclusion
VII.1 Summary
VII.2 Outlook
VII.2.1 Open Questions
VII.2.2 Future of the Web 2.0
References
List of Figures
List of Tables

Chapter I
Introduction
The field of Information Retrieval (IR) is concerned with satisfying information
needs of users. The IR approach therefor