On Social Semantics
In Information Retrieval
Faculty of Technology
University of Bielefeld
Dissertation submitted in partial fulfilment of the requirements
for the academic degree of
Dr. rer. nat.
by
Ulli Waltinger
from
Regensburg
Date of disputation: 2010/10/08
Reviewers:
Prof. Dr. Alexander Mehler
Prof. Dr. Ipke Wachsmuth

Acknowledgement
This work would not have been possible without the support of many people.
I deeply thank Prof. Dr. Alexander Mehler for supervising this thesis and
for supporting me for the past three years. He provided an inspiring and
stimulating environment, resulting in many fruitful discussions that helped
and improved my research ambitions. I am grateful and indebted not only for
his guidance and mentorship, but also for his patience and support. I want to
express my gratitude to Prof. Dr. Ipke Wachsmuth, who agreed to review this
thesis on a very tight schedule and who provided useful comments and
suggestions for further improvements. I would also like to thank the board of
examiners: Dr.-Ing. Britta Wrede and PD Dr. Katharina J. Rohlfing. Many thanks
are due to all of my colleagues at the Text Technology Department at Bielefeld
University, in particular Dr. Armin Wegner for valuable discussions, not to
mention Rüdiger Gleim, Alexandra Ernst, Olga Pustylnikov, Dietmar Esch and
Tobias Feith. Finally, I would like to thank my family, Liese, Magdalena and
Theresa, my beloved girls, for the enormous support and encouragement they
offered me. Thanks for the endless patience over the past few years.
Ulli Waltinger

Abstract
In this thesis we analyze the performance of social semantics in textual
information retrieval. By means of collaboratively constructed knowledge
derived from web-based social networks, which induces both common-sense and
domain-specific knowledge as constructed by a multitude of users, we
establish an improvement in the performance of selected tasks within
different areas of information retrieval. This work connects the concepts and
methods of social networks and the semantic web to support the analysis of a
social semantic web that combines human intelligence with machine learning
and natural language processing. In this context, social networks, as
instances of the social web, are capable of delivering social network data
and document collections on a tremendous scale, inducing thematic dynamics
that cannot be achieved by traditional expert resources. The question of
automatic conversion, annotation and processing, however, is central to the
debate on the benefits of the social semantic web: which technologies and
methods are available, adequate, and able to contribute to the processing of
this rapidly rising flood of information, while at the same time being
capable of using the wealth of information in this large, but more
importantly decentralized, internet? The present work examines the
performance of social semantics-induced categorization by means of different
document models. We shed light on the question of to what extent social
networks and social ontologies contribute to selected areas of information
retrieval, such as automatically determining term and text associations,
identifying topics, text and web genre categorization, and sentiment
analysis. We show in extensive evaluations against the classical apparatus of
text categorization (Vector Space Model, Latent Semantic Analysis and Support
Vector Machine) that significant improvements can be obtained by considering
the collaborative knowledge derived from the social web.
Keywords: Social Semantics, Information Retrieval, Machine Learning,
Text Technology, Text Categorization, Topic Identification, Text Clustering,
Sentiment Analysis, Web Genre Classification

Contents
1 Introduction
1.1 Moving from Text to the Web
1.2 Classification, Categorization and Clustering
1.3 About the Bag-of-Words
1.4 Ontology vs. Knowledge Base
1.5 Words, Concepts And Topics
1.6 Open And Closed Content Models
1.7 Thesis Contributions
1.8 Thesis Outline
2 Document Representation and Text Classification
2.1 Text Preprocessing
2.1.1 Token vs. Words
2.1.2 Sentence Segmentation
2.1.3 PoS Tagging
2.1.4 Lemmatization and Stemming
2.1.5 Named Entity Recognition
2.1.6 Document Structure Processing
2.2 Text Representation
2.2.1 Vector Space Models
2.2.2 Feature Weighting
2.2.3 Similarity Coefficients
2.2.4 Index Term Selection
2.3 Text Classification
2.3.1 Naive Bayes Classifier
2.3.2 K-Nearest Neighbor Classifier
2.3.3 Support Vector Machine Classification
2.3.4 Evaluation Metrics
2.4 Summary
3 Social Semantics in Information Retrieval
3.1 Overview
3.1.1 Information Retrieval
3.1.2 Semantics
3.1.3 Social Semantics
3.2 The Knowledge Acquisition Bottleneck
3.2.1 Utilizing Feature Construction
3.2.2 Towards the BOW by Topic Concepts
3.3 Social Semantic Concept Clouds
3.3.1 Constructing Concept Knowledge
3.3.2 Inducing the Space
3.4 From Social Networks To Social Semantic Vectors
3.4.1 Graph Structure of Social Networks
3.4.2 Constructing Social Semantic Vectors
3.4.3 Concluding Remarks
4 Evaluation of Social Network-induced Content Models
4.1 Social Semantic Relatedness
4.1.1 Related Work
4.1.2 The Measure of Wiki Semantic Relatedness
4.1.3 Experimental Evaluation
4.1.4 Concluding Remarks
4.2 Social Network-induced Topic Identification
4.2.1 Related Work
4.2.2 A Method for Open Topic Identification
4.2.3 Experimental Evaluation
4.2.4 Concluding Remarks
4.3 On Social Semantics-induced Closed Topic Categorization
4.3.1 Related Work
4.3.2 Generalized Topic Concepts for Text Categorization
4.3.3 Document-based Experimental Evaluation
4.3.4 OAI-based Experimental Evaluation
4.4 Concluding Remarks
5 Application Scenarios of Social Semantics
5.1 Named Entity Instance Recognition
5.1.1 The Algorithm to Named Entity Instance Recognition
5.1.2 Experimental Evaluation
5.1.3 Concluding Remarks
5.2 Social Semantics-induced Sentiment Analysis
5.2.1 Related Work
5.2.2 The Social Network-induced Polarity Enhancement
5.2.3 The German Polarities Clues
5.2.4 Experimental Evaluation
5.2.5 Concluding Remarks
5.3 Web Genre Classification
5.3.1 Related Work
5.3.2 Hypertext Type Classification Algorithm
5.3.3 Experimental Evaluation
5.3.4 Remarks
6 Conclusion
6.1 Outlook

List of Figures
1.1 The potential evolution scenario of web technology, moving from the Web 1.0 to a Web 4.0, by knowledge connectivity and social interaction after Mills [214]
1.2 Content intelligence by user interaction after Blumberg and Atre [30]
1.3 Outline of an RDF-graph connecting article contribution and authorship (cf. Burleson [39])
1.4 Different text classification methods by automation granularity (cf. Blumberg and Atre [30])
2.1 Vector representation of document space as visualized by Salton et al. [273]
2.2 Cosine angles between query and two document vectors
2.3 SVM-hyperplane with maximal margin using linear and non-linear kernel constructed by the support vector machine algorithm: (a) linear separable, (b) non-linear separable, (c) schematic transformation of the input data (non-linear separable) in the higher dimensional feature space
3.1 The three disciplines of semiotics after Morris [218, pp. 94]
3.2 Semantic relationships between the components of a lexical semantic model
3.3 Application flow of an enhancement of a document representation by means of concepts derived from a reference ontology
3.4 Outline of the hyperlink structure of Wikipedia article contributions and the associated category taxonomy