Anytime algorithms for stream data mining [Elektronische Ressource] / Philipp Kranen

rheinisch-westfalischen_technischen_hochschule_-rwth-_aachen - Philipp Kranen

Découvre YouScribe en t'inscrivant gratuitement

Je m'inscris

Obtenez un accès à la bibliothèque pour le consulter en ligne
En savoir plus

323 pages

English

Obtenez un accès à la bibliothèque pour le consulter en ligne
En savoir plus

A propos
Informations
Extrait

Description

Sujets

Informatik

Informations

Publié par	rheinisch-westfalischen_technischen_hochschule_-rwth-_aachen
Publié le	01 janvier 2011
Nombre de lectures	9
Langue	English
Poids de l'ouvrage	8 Mo

Extrait

Anytime Algorithms
for Stream Data Mining
Von der Fakultat¨ fur¨ Mathematik, Informatik und Naturwissenschaften
der RWTH Aachen University zur Erlangung des akademischen Grades eines
Doktors der Naturwissenschaften genehmigte Dissertation
vorgelegt von
Diplom-Informatiker
Philipp Kranen
aus Willich, Deutschland
Berichter: Universitatsprofessor¨ Dr. rer. nat. Thomas Seidl
Visiting Professor Michael E. Houle, PhD
Tag der mundlichen¨ Prufung:¨ 14.09.2011
Diese Dissertation ist auf den Internetseiten
der Hochschulbibliothek online verfugbar¨ .Contents
Abstract / Zusammenfassung 1
I Introduction 5
1 The Need for Anytime Algorithms 7
1.1 Thesis structure . . . . . . . . . . . . . . . . . . . . . . . . . 16
2 Knowledge Discovery from Data 17
2.1 The KDD process and data mining tasks . . . . . . . . . . . 17
2.2 Classiﬁcation . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.3 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3 Stream Data Mining 43
3.1 General Tools and Techniques . . . . . . . . . . . . . . . . . 43
3.2 Stream Classiﬁcation . . . . . . . . . . . . . . . . . . . . . . 52
3.3 Clustering . . . . . . . . . . . . . . . . . . . . . . . . 59
II Anytime Stream Classiﬁcation 69
4 The Bayes Tree 71
4.1 Introduction and Preliminaries . . . . . . . . . . . . . . . . . 72
4.2 Indexing density models . . . . . . . . . . . . . . . . . . . . 76
4.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
iii CONTENTS
5 The MC-Tree 99
5.1 Combining Multiple Classes . . . . . . . . . . . . . . . . . . 100
5.2 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
5.3 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
6 Bulk Loading the Bayes Tree 117
6.1 Bulk loading mixture densities . . . . . . . . . . . . . . . . . 117
6.2 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
6.3 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
7 The Classiﬁer Family: Learn from your Relatives 129
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
7.2 Learning from Relatives . . . . . . . . . . . . . . . . . . . . 131
7.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
7.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
8 Application: Anytime Classiﬁcation in HealthNet Scenarios 151
8.1 Scenario and Prototype . . . . . . . . . . . . . . . . . . . . . 152
8.2 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
9 Anytime Algorithms on Constant Streams 159
9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
9.2 Novel Approaches for Constant Data Streams . . . . . . . . 161
9.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
9.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
10 Future Work 181
III Anytime Stream Clustering 183
11 Self-adaptive Anytime Stream Clustering 185
11.1 The ClusTree Algorithm . . . . . . . . . . . . . . . . . . . . 186
11.2 Analysis and experiments . . . . . . . . . . . . . . . . . . . 196
11.3 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205CONTENTS iii
12 Exploiting additional time in the ClusTree 207
12.1 Alternative descent strategies . . . . . . . . . . . . . . . . . 207
12.2 Evaluation of descent . . . . . . . . . . . . . . . . 213
12.3 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
13 Robust Anytime Stream Clustering 217
13.1 The LiarTree . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
13.2 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
13.3 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
14 Application: Using Modeling for Anytime Outlier Detection 237
14.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
14.2 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . 238
14.3 Detecting outliers in streaming data . . . . . . . . . . . . . . 240
14.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
14.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
15 MOA and CMM 251
15.1 The MOA Framework . . . . . . . . . . . . . . . . . . . . . . 252
15.2 Evaluation Measures for Stream Clustering . . . . . . . . . . 260
16 Future Work 263
IV Summary and Outlook 265
V Appendices I
Bibliography III
Statement of Originality XLI
List of Publications XLIII
Curriculum Vitae XLVIIAbstract
Data is collected and stored everywhere, be it images or audio ﬁles on
private computers, customer data in traditional or electronic businesses, per-
formance or control data in production sites, web trafﬁc and click streams
at internet providers, statistical data at government agencies, sensor mea-
surements in scientiﬁc experimentation, surveillance data, etc. There are
countless examples, and the amount of data is tremendous. Data mining is
the process of ﬁnding useful and previously unknown patterns in data. In the
examples listed above, data mining can be used for automated recommen-
dation of audio ﬁles, business analysis and target marketing, or performance
optimization and hazard warnings. While early mining algorithms only con-
sidered static data sets, research and practice in data mining must nowadays
deal with continuous, possible inﬁnite streams of data, which are prevalent
in most real world applications and scenarios.
Anytime algorithms constitute a special type of algorithm that is well
suited to work on data streams. They inherit their name from their ability
to provide a result after any amount of processing time. The amount of
time available is not known to the algorithm in advance: anytime algorithms
quickly compute an initial result and strive to improve it as long as time
remains. When interrupted they deliver the best result obtained until that
point in time.
In this thesis anytime classiﬁcation is studied in depth for the Bayesian
approach. New algorithmic solutions for anytime classiﬁcation are devel-
oped and evaluated in extensive experimentation. The ﬁrst anytime stream
clustering algorithm is proposed, and an application to anytime outlier de-
tection is presented. In addition to the algorithmic contributions, new meta-
approaches are described that signiﬁcantly widen the area of applications for
anytime algorithms. The solutions and results of this thesis contribute to the
state of the art in anytime algorithms and stream data mining research.
1Zusammenfassung
Die rasante Entwicklung der Informationstechnologie hat zur Folge, dass
in allen Bereichen der Gesellschaft und des taglichen¨ Lebens große Mengen
an Daten erzeugt und gespeichert werden. Beispiele reichen von Multimedia-
Daten auf privaten Computern bis hin zu Messdaten in wissenschaftlichen
Experimenten. Data Mining beschreibt die Aufgabe, in solchen Daten neue
und interessante Muster zu ﬁnden. Diese konnen¨ beispielsweise zur automa-
tischen Empfehlung von Filmen genutzt werden oder helfen neue Zusam-
menhange¨ aufzudecken und Prozesse zu verstehen. Seit Beginn der Data
Mining Forschung wachst¨ die Große¨ der zu verarbeitenden Datensatze.¨
¨ ¨ ¨ ¨Wahrend Datensatze zunachst als statisch und vollstandig gegeben angenom-
men wurden, generieren viele Anwendungen heute kontinuierliche und teil-
weise unendliche Datenstrome.¨
Anytime-Algorithmen stellen eine Klasse von Algorithmen dar, welche
sich besonders gut zum Einsatz auf Datenstromen¨ eignet. Ihr Name ruhrt¨
von ihrer Eigenschaft her, zu jeder Zeit ein Ergebnis liefern zu konnen.¨ Die
zur Verfugung¨ stehende Zeit ist dem Algorithmus dabei nicht bekannt: er
berechnet ein initiales Ergebnis und verbessert dieses solange zusatzliche¨
Rechenzeit vorhanden ist. Wird der Algorithmus unterbrochen, so liefert er
das beste Ergebnis zuruck,¨ welches bis zu diesem Zeitpunkt erzielt wurde.
In dieser Dissertation werden neue Anytime-Verfahren fur¨ die Bayes
Klassiﬁkation entwickelt, intensiv untersucht und evaluiert. Der erste
Anytime-Algorithmus zum Clustern von Datenstromen¨ wird vorgestellt und
eine Anwendung fur¨ die Erkennung von Ausreißern wird diskutiert. Neben
¨neuen Algorithmen werden zwei ubergeordnete Verfahren entwickelt, die
den Anwendungsbereich fur¨ Anytime-Algorithmen signiﬁkant erweitern. Die
in dieser Dissertation vorgestellten Ansatze¨ und Resultate tragen zum Stand
der Forschung im Bereich und Data Mining auf Daten-
stromen¨ bei.
3