Machine learning for text indexing [electronic resource]: concept extraction, keyword extraction and tag recommendation / submitted by Hendri Murfi



108 pages

English


Subjects

Informatik

Information

Published by	technische_universitat_berlin
Published on	1 January 2010
Language	English
File size	1 MB

Excerpt

Machine Learning for Text Indexing
Concept Extraction, Keyword Extraction and Tag Recommendation
submitted by
Master of Science
Hendri Murfi
from Jakarta, Indonesia
to Faculty IV - Electrical Engineering and Computer Science
of the Technische Universität Berlin
for the award of the academic degree
Doktor der Naturwissenschaften
Dr. rer. nat.
approved dissertation
Doctoral committee:
Chair: Prof. Dr. rer. nat. Volker Markl
Reviewers: Prof. Dr. rer. nat. Klaus Obermayer
Prof. Dr.-Ing. Sahin Albayrak
Date of the scientific defense: 31 August 2010
Berlin 2010
D 83
Abstract
Full-text indexing suffers from several drawbacks, mainly semantic issues
such as synonymy and polysemy, and several alternative approaches have been
proposed to improve its performance. These alternatives include latent
semantic indexing, keyword indexing, social indexing (Web 2.0) and linked
data-based indexing (Semantic Web). The aim of this dissertation is to
investigate the application of machine learning methods to these
alternative approaches. The application areas are concept extraction,
keyword extraction and tag recommendation.
First, we propose a new learning method called two-level learning
hierarchy (TLLH) to extract concepts from tagged textual contents. This
learning method processes the two existing textual sources, i.e. the
user-created tags and the textual contents, separately. At the lower level,
concepts and concept-document relationships are discovered by the
non-negative matrix factorization (NMF) algorithm based on the user-created
tags. Given these relationships, the concepts are populated at the higher
level by terms occurring in the textual contents. We expect this method to
be successful because the hidden document structures are discovered based
on tags collectively created by users who understand the semantic content
of the documents. Another advantage is that the NMF algorithm operates on
more compact and cleaner data representations. The concept extraction from
the textual contents, on the other hand, is handled by the non-negative
least squares (NNLS) algorithm, which is much more efficient than the NMF
algorithm. Moreover, the TLLH approach may have richer vocabularies because
it can combine vocabularies from the user-created tags and the textual
contents. Therefore, this approach is not only more reliable but also more
efficient than the standard one-level learning hierarchy (OLLH), which
extracts concepts only from the textual contents. Next, we apply the
extracted concepts to keyword extraction: we propose a new keyword
extraction method called concept-based keyword extraction (CBKE). Its basic
idea is that a term of a document is important if it is associated with
important concepts of the document and is itself important in the document.
One advantage of the method is its flexibility regarding the
characteristics of the learning data: it can operate on learning data
either with or without manually assigned keywords. Finally, we apply the
proposed CBKE method to content-based tag recommendation in folksonomies.
The results show that the tag recommendations achieve competitive
performance in the ECML PKDD Discovery Challenge 2009.
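The two-level hierarchy described in the abstract can be illustrated with a small sketch. This is a minimal illustration under stated assumptions, not the implementation used in the dissertation: NMF is computed here with standard multiplicative updates, the NNLS step uses `scipy.optimize.nnls`, and the toy matrices as well as the function names `nmf`, `populate_concepts` and `cbke_scores` are hypothetical. In particular, the product form of the CBKE-style score is only one plausible reading of the idea that a term matters when it is important both in the document itself and in the document's concepts.

```python
import numpy as np
from scipy.optimize import nnls


def nmf(A, k, n_iter=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix A (rows x docs) as W @ H using
    multiplicative updates for the Frobenius loss."""
    rng = np.random.default_rng(seed)
    W = rng.random((A.shape[0], k))
    H = rng.random((k, A.shape[1]))
    for _ in range(n_iter):
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H


def populate_concepts(B, H):
    """With the concept-document weights H (k x docs) held fixed, solve one
    NNLS problem per term row b of B (terms x docs): min ||H.T w - b||, w >= 0."""
    return np.vstack([nnls(H.T, row)[0] for row in B])


def cbke_scores(B, W_terms, H, doc):
    """Hypothetical CBKE-style score (an assumption, not the thesis formula):
    a term's own weight in the document times its weight summed over the
    document's concepts."""
    concept_support = W_terms @ H[:, doc]
    return B[:, doc] * concept_support


# Toy data: rows are tags or terms, columns are four documents.
# Tag rows: "ml", "cooking".
tags = np.array([[1, 1, 0, 0],
                 [0, 0, 1, 1]], dtype=float)
# Term rows: "kernel", "classifier", "recipe", "oven".
terms = np.array([[2, 1, 0, 0],
                  [1, 2, 0, 0],
                  [0, 0, 2, 1],
                  [0, 0, 1, 2]], dtype=float)

W_tags, H = nmf(tags, k=2)             # lower level: concepts from tags
W_terms = populate_concepts(terms, H)  # higher level: terms per concept
scores = cbke_scores(terms, W_terms, H, doc=0)

print(np.round(W_terms, 2))            # term-concept weights
print(np.round(scores, 2))             # keyword scores for document 0
```

The expensive iterative factorization runs only on the small tag matrix, while the larger term matrix needs just one independent NNLS solve per term. This mirrors the efficiency argument made for TLLH over OLLH.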
Zusammenfassung (Summary)
Because of several drawbacks, chiefly semantic problems such as synonymy
and polysemy, a number of approaches are being considered to improve the
performance of full-text indexing. The alternative approaches include
latent semantic indexing, keyword indexing, social indexing (Web 2.0) and
linked data-based indexing (Semantic Web). The aim of this dissertation is
to investigate machine learning methods for these alternative approaches.
The application areas are concept extraction, keyword extraction and tag
recommendation.
First, a new learning method is presented with which concepts can be
extracted from textual contents that are accompanied by user-entered tags.
The learning consists of two levels, which process the two kinds of text
sources separately. At the lower level, the concepts and the
concept-document relationships are discovered from the user-created tags by
non-negative matrix factorization (NMF). Based on these relationships, the
concepts are populated at a higher level with words from the textual
contents. This method is expected to be successful because the hidden
document structures are based on tags created by users who understand the
semantic content of the documents. A further advantage of this approach is
that NMF leads to a compact and clean document representation. On the other
hand, concept extraction from the textual contents by the non-negative
least squares (NNLS) method is much more efficient than the NMF method.
Therefore, this two-level learning hierarchy (TLLH) is not only more
reliable but also more efficient than the one-level learning hierarchy
(OLLH), which extracts the concepts only from the textual content. In
addition, the method can have a richer vocabulary because vocabulary from
the user-created tags is combined with that of the textual contents. Next,
we apply the extracted concepts to keyword extraction. In other words, we
present a new keyword extraction method called concept-based keyword
extraction (CBKE). The basic idea of the method is that a term of a
document is important if it is associated with important concepts of the
document and is itself important for the document. The flexibility with
respect to the characteristics of the learning data is an advantage of the
method: it can work on training data either with or without manually
assigned keywords. Finally, CBKE is applied to content-based tag
recommendation in folksonomies. The results show that the tag
recommendations achieve competitive performance in the ECML PKDD Discovery
Challenge 2009.
Acknowledgment
I would like to thank my adviser, Prof. Klaus Obermayer, for his great
support and for giving me the opportunity to conduct my research in the
multidisciplinary environment of the Neural Information Processing (NI)
group of Technische Universität Berlin. He gave me the opportunity to work
in the DFG-funded Advance Learning Framework (ALF) project, which provided
the basis for my research. I would also like to thank Prof. Sahin Albayrak
from the DAI-Labor of Technische Universität Berlin for his support of my
scholarship.
I would also like to thank my colleague and roommate, Nicolas Neubauer, for
the valuable discussions and fruitful collaboration. His suggestions
enriched and improved the quality of this work, and I learned a lot from
him. I would also like to thank Andre Paus, who introduced me to the ALF
project. Other members of the project, Dr.-Ing. Dragan Milosevic and
Christian Schee, gave much valuable feedback on my work throughout the
project period. Thanks also to all the NI group members for their
cooperation, help and understanding.
Special thanks go to my parents for their constant support and
encouragement throughout my studies. I would also like to express my
special thanks to my wife, Munaya Fauziah, who has always been beside me
throughout my studies. I would like to express my gratitude to the German
Academic Exchange Service (DAAD) for funding the years of my research and
thus providing me with the opportunity to finish my doctoral studies in
Germany. Finally, many other people have given me great support and help
during my studies and my stay in Berlin. A big thank you to all of them,
whom I cannot mention here by name.
Contents
Abstract
Zusammenfassung
Acknowledgment
Contents
1 Introduction
2 Machine Learning for Text Indexing
2.1 Information Retrieval
2.1.1 Text Indexing
2.1.2 Text Ranking
2.1.3 Semantic Issues
2.2 Concept Extraction
2.2.1 Latent Semantic Analysis
2.2.2 Latent Semantic Indexing
2.3 Keyword Extraction
2.3.1 Candidate Selection
2.3.2 Filtering
2.3.3 Keyword Indexing
2.4 Tag Recommendation
2.4.1 Folksonomy
2.4.2 Social Indexing Limitations
2.4.3 Purposes
2.4.4 Methods
2.4.5 Social Linked Data-based Indexing
3 NMF-Based Soft Clustering for Optimizing Concept Indexing
3.1 Introduction