Exploiting Social Semantics for Multilingual Information Retrieval [Electronic resource] / Philipp Sorg. Supervisor: R. Studer
209 pages
English



Information

Published on 1 January 2011
Language: English
File size: 2 MB

Extract

Exploiting Social Semantics for Multilingual
Information Retrieval
Dissertation submitted for the academic degree of
Doctor of Economic Sciences
(Dr. rer. pol.)
to the Faculty of
Economics (Fakultät für Wirtschaftswissenschaften)
at the Karlsruhe Institute of Technology
by
Dipl.-Inf. Philipp Sorg
Date of oral examination: 22 July 2011
Referee: Prof. Dr. Rudi Studer
Co-referee: Prof. Dr. Philipp Cimiano
Examiner: Prof. Dr. Andreas Oberweis
Chair of the examination committee: Prof. Dr. Christof Weinhardt

Abstract
Information Retrieval (IR) deals with delivering relevant information items given
the specific information needs of users. As retrieval problems arise in various
environments such as the World Wide Web, corporate knowledge bases or even
personal desktops, IR is an everyday problem that concerns almost everybody in our
society. In this thesis, we present research results on the problem of Multilingual
IR (MLIR), which covers retrieval scenarios that cross language borders. MLIR is a
real-world problem, which we motivate using different application scenarios, for
example search systems whose users have reading skills in several languages, or
expert retrieval.
As the main topic of this thesis, we consider how user-generated content that
is assembled by different popular Web portals can be exploited for MLIR. These
portals, of which Wikipedia and Yahoo! Answers are prominent examples, are built
from the contributions of millions of users. We define the knowledge that can be
derived from such portals as Social Semantics. Further, we identify important
features of Social Semantics, namely the support of multiple languages, the broad
coverage of topics and the ability to adapt to new topics. Based on these features,
we argue that Social Semantics can be exploited as background knowledge to support
multilingual retrieval systems.
Our main contribution is the integration of Social Semantics into multilingual
retrieval models. To this end, we present Cross-lingual Explicit Semantic Analysis,
a semantic document representation that is based on interlingual concepts derived
from Wikipedia. Further, we propose a mixture language model that integrates
different sources of evidence, including the knowledge encoded in the category
structure of Yahoo! Answers.
For evaluation, we measure the benefit of the proposed retrieval models that
exploit Social Semantics. In our experiments, we apply these models to different
established datasets, which allows for comparison with standard IR baselines and
with related approaches that are based on different kinds of background knowledge.
As standardized settings were not available for all the scenarios we considered, in
particular for multilingual Expert Retrieval, we further organized an international
retrieval challenge that allowed the evaluation of those proposed retrieval models
that were not covered by existing challenges.

Contents
I Introduction
I.1 Multilingual Retrieval Scenario
I.2 Definition of Semantics
I.2.1 Historical Overview
I.2.2 Social Semantics defined by Category Systems
I.3 Definition of Information Retrieval
I.3.1 Multilingual IR
I.3.2 Entity Search
I.4 Research Questions
I.5 Overview of the Thesis
II Preliminaries of IR
II.1 Document Preprocessing
II.1.1 Document Syntax and Encoding
II.1.2 Tokenization
II.1.3 Normalization
II.1.4 Reference to this Thesis
II.2 Monolingual IR
II.2.1 Document Representation
II.2.2 Index Structures
II.2.3 Retrieval Models
II.2.4 Query Expansion
II.2.5 Document a priori Models
II.2.6 Reference to this Thesis
II.3 Cross-lingual IR
II.3.1 Translation-based Approaches
II.3.2 Machine Translation
II.3.3 Interlingual Document Representations
II.3.4 Reference to this Thesis
II.4 Multilingual IR
II.4.1 Language Identification
II.4.2 Index Construction for MLIR
II.4.3 Query Translation
II.4.4 Aggregation Models
II.4.5 Reference to this Thesis
II.5 Evaluation in IR
II.5.1 Experimental Setup
II.5.2 Relevance Assessments
II.5.3 Evaluation Measures
II.5.4 Established Datasets
II.5.5 References to this Thesis
II.6 Tools, Software and Resources
III Semantics in IR
III.1 Semantic Vector Spaces
III.1.1 Generalized Vector Space Model
III.1.2 Latent Semantic Indexing
III.1.3 Latent Dirichlet Allocation
III.1.4 Semantic Smoothing Kernels
III.1.5 Explicit Semantic Analysis
III.1.6 Ontology-based Document Representations
III.2 Semantic Relatedness
III.2.1 Classification of Semantic Relatedness
III.2.2 Structured Knowledge Sources
III.2.3 Text Corpora as Knowledge Source
III.3 Semantic Retrieval Models
III.3.1 Language Models
III.3.2 Learning to Rank
III.3.3 Extended Query Models
IV Cross-lingual Explicit Semantic Analysis
IV.1 Preliminaries
IV.2 Explicit Semantic Analysis
IV.2.1 Definition of Concepts
IV.2.2 Original ESA Model
IV.2.3 Explicit Semantic Analysis (ESA) applied to IR
IV.2.4 Historical Overview
IV.3 Definition of Cross-lingual Explicit Semantic Analysis (CL-ESA)
IV.3.1 Definition
IV.3.2 CL-ESA applied to CLIR/MLIR
IV.3.3 Example for CL-ESA
IV.4 Design Choices
IV.4.1 Dimension Projection
IV.4.2 Association Strength
IV.4.3 Relevance Function
IV.4.4 Concept Spaces
IV.5 Experiments
IV.5.1 Methodology and Evaluation Measures
IV.5.2 Test Datasets
IV.5.3 Reference Corpus
IV.5.4 Evaluation of ESA Model Variants
IV.5.5 Concept Spaces for Multilingual Scenarios
IV.5.6 External vs. intrinsic Concept Definitions
IV.5.7 Experiments on the CLEF Ad-hoc Task
V Category-based LMs for Multilingual ER
V.1 Expert Retrieval
V.2 Language Models for IR
V.2.1 Theory of Language Models
V.2.2 Smoothing
V.2.3 Language Models for Expert Search
V.2.4 Historical Overview
V.3 Extensions of Language Models
V.3.1 Language Models for MLIR
V.3.2 Language Models for Category Systems
V.4 Combining Sources of Evidence
V.4.1 Mixture Language Models
V.4.2 Discriminative Models
V.4.3 Learning to Rank
V.5 Experiments
V.5.1 Yahoo! Answers
V.5.2 Dataset
V.5.3 Evaluation Measures
V.5.4 Baselines
V.5.5 Results of Baselines
V.5.6 Results of the Mixture Language Models
V.5.7 Feature Analysis
V.6 Summary of Results
V.6.1 Results of the Experiments
V.6.2 Lessons Learned
VI Enriching the CL Structure of Wikipedia
VI.1 Motivation
VI.1.1 Statistics about German/English Cross-Language Links
VI.1.2 Chain Link Hypothesis
VI.2 Classification-based Approach
VI.2.1 Feature Design
VI.3 Evaluation
VI.3.1 Baseline
VI.3.2 Evaluation of the RAND1000 Dataset
VI.3.3 Learning New Cross-language Links
VI.4 Discussion
VI.4.1 Self-correctiveness of Wikipedia
VI.4.2 Quality and Future of Web 2.0 Resources
VII Conclusion
VII.1 Summary
VII.2 Outlook
VII.2.1 Open Questions
VII.2.2 Future of the Web 2.0
References
List of Figures
List of Tables

Chapter I
Introduction
The field of Information Retrieval (IR) is concerned with satisfying information
needs of users. The IR approach therefor
