Efficient density based methods for knowledge discovery in databases [Elektronische Ressource] / vorgelegt von Ralph Krieger

rheinisch-westfalischen_technischen_hochschule_-rwth-_aachen

Découvre YouScribe en t'inscrivant gratuitement

Je m'inscris

Obtenez un accès à la bibliothèque pour le consulter en ligne
En savoir plus

249 pages

Obtenez un accès à la bibliothèque pour le consulter en ligne
En savoir plus

A propos
Informations
Extrait

Description

Sujets

Informatik

Informations

Publié par	rheinisch-westfalischen_technischen_hochschule_-rwth-_aachen
Publié le	01 janvier 2008
Nombre de lectures	28
Poids de l'ouvrage	9 Mo

Extrait

E cient density-based methods for
knowledge discovery in databases
Von der Fakult at fur Mathematik, Informatik und Naturwissenschaften
der Rheinisch-Westf alischen Technischen Hochschule Aachen
zur Erlangung des akademischen Grades eines Doktors der
Naturwissenschaften genehmigte Dissertation
vorgelegt von
Diplom-Informatiker
Ralph Krieger
aus K oln
Berichter: Universit atsprofessor Dr. rer. nat. Thomas Seidl
Professor Dr. Bart Goethals
Tag der mundlic hen Prufung: 09. Juli 2008
Diese Dissertation ist auf den Internetseiten
der Hochschulbibliothek online verfugbarContents
Acknowledgements 1
Abstract 2
Zusammenfassung 3
Density-based methods for knowledge discovery in
databases 5
I E cient density-based methods for clustering 13
1 Clustering multi-dimensional data 14
1.1 Full-space clustering . . . . . . . . . . . . . . . . . . . . . . . 14
1.2 Problems for multi-dimensional data . . . . . . . . . 16
1.3 Solutions for clustering m data . . . . . . . . . 19
1.4 Basic notions . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2 Unbiased density-based subspace clustering 31
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.2 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.3 Dimensionality bias . . . . . . . . . . . . . . . . . . . . . . . . 34
2.4 An unbiased density estimator . . . . . . . . . . . . . . . . . . 36
2.5 DUSC subspace clustering . . . . . . . . . . . . . . . . . . . . 39
2.6 Apriori . . . . . . . . . . . . . . . . . . . . 48
2.7 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
2.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3 E cient unbiased subspace clustering 62ii CONTENTS
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.2 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.3 Density-based subspace clusters . . . . . . . . . . . . . . . . . 68
3.4 Depth- rst multistep subspace clustering . . . . . . . . . . . . 71
3.5 Indexing subspace clusters . . . . . . . . . . . . . . . . . . . . 83
3.6 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
3.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
4 Visualization of subspace clusters 115
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
4.2 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
4.3 VISA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
4.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
5 E cient subspace clustering for multidimensional sequences 128
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
5.2 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
5.3 Subspace subsequence cluster model . . . . . . . . . . . . . . . 131
5.4 E cient cluster mining . . . . . . . . . . . . . . . 137
5.5 A subspace subsequence clustering algorithm . . . . . . . . . . 143
5.6 Experimental evaluation . . . . . . . . . . . . . . . . . . . . . 149
5.7 Conclusion and further work . . . . . . . . . . . . . . . . . . . 158
II E cient density-based methods for classi ca-
tion 159
6 Classifying multi-dimensional data 160
6.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
6.2 Basic notations . . . . . . . . . . . . . . . . . . . . . . . . . . 162
6.3 Bayes classi cation . . . . . . . . . . . . . . . . . . . . . . . . 163
6.4 Nearest neighbor classi ers . . . . . . . . . . . . . . . . . . . . 169
7 Anytime stream classi cation using the Bayes tree 171
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
7.2 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
7.3 Anytime stream classi cation . . . . . . . . . . . . . . . . . . 175
7.4 The Bayes tree . . . . . . . . . . . . . . . . . . . . . . . . . . 177CONTENTS iii
7.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
7.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
8 Classi cation using subspace clusters 203
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
8.2 Subspace classi cation . . . . . . . . . . . . . . . . . . . . . . 206
8.3 Algorithmic concept . . . . . . . . . . . . . . . . . . . . . . . 211
8.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
8.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
Conclusion and future research directions 221
Bibliography 226List of Figures
1 Knowledge Discovery in Databases (KDD) process . . . . . . 7
2 Clustering a data set . . . . . . . . . . . . . . . . . . . . . . 8
3 Classi er learned from data set . . . . . . . . . . . . . . . . . 10
E cient density-based methods for clustering
1.1 Probability of di erent standard normal distributions . . . . 17
1.2 Problems of dimensionality reduction and projected cluster-
ing techniques . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.3 Lattice of subspace clusters . . . . . . . . . . . . . . . . . . . 20
1.4 Categorization of existing subspace clustering concepts . . . . 22
1.5 Projection and area of In uence . . . . . . . . . . . . . . . . 26
1.6 Density distribution within neighborhood . . . . . . . . . . . 27
1.7 Density-based clustering . . . . . . . . . . . . . . . . . . . . . 28
1.8 Densit Subspace Clusters . . . . . . . . . . . . . . . . 29
2.1 Gauss, Rectangular, Epanechnikov and Triangular Kernel . . 36
2.2 Density thresholds and ! . . . . . . . . . . . . . . . . . . . 42
2.3 Redundancy in subspace clusterings . . . . . . . . . . . . . . 43
2.4 Redundant clustering result . . . . . . . . . . . . . . . . . . . 44
2.5 Five-dimensional cluster splitting into two six-dimensional
clusters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.6 Concept of apriori algorithm for DUSC subspace cluster model 50
2.7 Quality vs. density threshold F . . . . . . . . . . . . . . . . . 59
2.8 Quality vs. redundancy parameter r . . . . . . . . . . . . . . 60
2.9 Comparing accuracy of DUSC with other algorithms . . . . . 60
2.10 F1-value on real world data using di erent density measures . 61
3.1 Breadth rst and depth rst subspace clustering . . . . . . . 65
3.2 Rectangle, Triangular, and Epanechnikov kernels . . . . . . . 69LIST OF FIGURES v
3.3 Multistep architecture . . . . . . . . . . . . . . . . . . . . . . 72
3.4 Traditional grid discretization, with connectivity borders . . . 74
3.5 Cells, connectivity borders and hypercubes projected to di-
mension 1 and 2 . . . . . . . . . . . . . . . . . . . . . . . . . 75
3.6 Multistep eDSUC algorithm . . . . . . . . . . . . . . . . . . . 78
3.7 Induced and performed merges . . . . . . . . . . . . . . . . . 79
3.8 Extending a MICH to next dimension . . . . . . . . . . . . . 80
3.9 Example data set: transforming objects to cell representations 84
3.10 Creating the initial SC-tree for the example data set . . . . . 88
3.11 Restricting the SC-tree . . . . . . . . . . . . . . . . . . . . . 91
3.12 Inducing merges using the SC-tree . . . . . . . . . . . . . . . 93
3.13 Processing the next cell using the SC-tree . . . . . . . . . . . 94
3.14 Merging cells using the SC-tree . . . . . . . . . . . . . . . . . 95
3.15 Pruning based on the . . . . . . . . . . . . . . . . . . 96
3.16 SC-tree availability during mining . . . . . . . . . . . . . . . 98
3.17 Scalability vs. dimensions . . . . . . . . . . . . . . . . . . . . 103
3.18y vs. database size . . . . . . . . . . . . . . . . . . . 104
3.19 Selectivity vs. dimensions . . . . . . . . . . . . . . . . . . . . 105
3.20 MinPoints vs. runtime . . . . . . . . . . . . . . . . . . . . . . 106
3.21 F parameter vs. runtime . . . . . . . . . . . . . . . . . . . . 107
3.22 Gridsize vs. runtime . . . . . . . . . . . . . . . . . . . . . . . 107
3.23 Three kernels vs. SUBCLU and SCHISM . . . . . . . . . . . 108
3.24 Runtimes for sorting heuristic . . . . . . . . . . . . . . . . . . 109
3.25 Tree size for heuristic . . . . . . . . . . . . . . . . . . 110
3.26 Redundancy . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
3.27 Quality . . . . . . . . . . . . . . . . . . . . . . . 112
3.28 Quality on real data . . . . . . . . . . . . . . . . . . . . . . . 113
3.29 Runtime on real data . . . . . . . . . . . . . . . . . . . . . . 113
4.1 Subspace clustering overview . . . . . . . . . . . . . . . . . . 120
4.2 Detailed view for one subspace cluster . . . . . . . . . . . . . 121
4.3 Bracketing redundancy in MDS images . . . . . . . . . . . . 122
4.4 Matrix of subspace clusters groups . . . . . . . . . . . . . . . 123
4.5 User Interface for a Subspace Clusters Matrix . . . . . . . . . 125
4.6 Subspace clusters matrix for di erent settings . . . . . . . . . 126
5.1 GIS illustration of river attributes . . . . . . . . . . . . . . . 129vi LIST OF FIGURES
5.2 Example pattern . . . . . . . . . . . . . . . . . . . . . . . . . 135
5.3 for a subspace subsequence cluster . . . . . . . . . . 136
5.4 Hierarchical index structure for nding subsequence clusters . 141
5.5 Subsequence clustering algorithm . . . . . . . . . . . . . . . . 144
5.6 Creating Candidate Clusters . . . . . . . . . . . . . . . . . . 144
5.7 Transformation e