New techniques for clustering complex objects [Elektronische Ressource] / von Karin Kailing

ludwig-maximilians-universitat_munchen - Kailing , Karin

Découvre YouScribe en t'inscrivant gratuitement

Je m'inscris

Obtenez un accès à la bibliothèque pour le consulter en ligne
En savoir plus

229 pages

English

Obtenez un accès à la bibliothèque pour le consulter en ligne
En savoir plus

A propos
Informations
Extrait

Description

Sujets

Informatik

Informations

Publié par	ludwig-maximilians-universitat_munchen
Publié le	01 janvier 2004
Nombre de lectures	14
Langue	English
Poids de l'ouvrage	3 Mo

Extrait

New Techniques for Clustering
Complex Objects
Dissertation im Fach Informatik
an der Fakulatt fur Mathematik, Informatik und Statistik
der Ludwig-Maximilians-Universit at Munchen
von
Karin Kailing
Tag der Einreichung: 18.08.2004
Tag der mund lichen Pruf ung: 15.11.2004
Berichterstatter:
Prof. Dr. Hans-Peter Kriegel, Ludwig-Maximilians-Universit at Munchen
Prof. Dr. Johannes Gehrke, Cornell University (USA)iiAcknowledgment
A lot of people supported and encouraged me while I was working at my
thesis. I am very grateful for all the help I got during this time. Unfortu-
nately, I can only mention some of them here, but my dearest thanks goes,
of course, to all of them.
First of all, I want to express my warmest thanks to my supervisor
Professor Dr. Hans-Peter Kriegel who convinced me to come to his group
andthenmadesurethatIneverregrettedthisstep. Withouttheproductive
and inspiring working atmosphere he created, this work could never have
been done. I also want to thank Professor Johannes Gehrke who readily
agreed to act as the second referee to this thesis.
This work was inspired by many fruitful discussions and cooperations
with my colleagues. Without them this work could never have grown. I’m
verygratefulforallthesupportIgotduringthepastyearsandofcourse,not
to forget, for all the fun we had. A special thanks goes to my colleagues Dr.
Peer Kr oger and Dr. Stefan Schonauer who were always willing to answer
all my questions.
ThebackgroundhelpofSusanneGrienberger,whomanagedmuchofthe
administrative burden, was another reason that working in this group was
so comfortable.
Furthermore I want to thank Franz Krojer who was always on the spot
whenIhadtroublewithtechnicalthings. Hedidamarvellousjobinkeeping
our computers running and additionally gave me a lot of new insights into
astrophysics.
I also want to thank all the students whose work helped to prove and
establish my ideas. To name just a few of them: Alex Bosshammer, Adham
iiiiv 0 Acknowledgment
Haddad, Heribert Muh lberger, Claudia Plant and Thomas Muller.
Last but not least, I like to express my deepest thanks to my husband,
my family and my friends who always believed in me, sometimes more than
I did.
Karin Kailing
Munich, August 2004.Abstract
The tremendous amount of data produced nowadays in various application
domains such as molecular biology or geography can only be fully exploited
by e cient and e ective data mining tools. One of the primary data mining
tasks is clustering, which is the task of partitioning points of a data set into
distinct groups (clusters) such that two points from one cluster are similar
to each other whereas two points from distinct clusters are not.
Due to modern database technology, e.g. object relational databases, a
huge amount of complex objects from scienti c, engineering or multimedia
applications is stored in database systems. Modelling such complex data
often results in very high-dimensional vector data (”feature vectors”). In
the context of clustering, this causes a lot of fundamental problems, com-
monly subsumed under the term ”Curse of Dimensionality”. As a result,
traditional clustering algorithms often fail to generate meaningful results,
because in such high-dimensional feature spaces data does not cluster any-
more. But usually, there are clusters embedded in lower dimensional sub-
spaces, i.e. meaningful clusters can be found if only a certain subset of fea-
turesisregardedforclustering. Thesubsetoffeaturesmayevenbedierent
for varying clusters.
In this thesis, we present original extensions and enhancements of the
density-based clustering notion to cope with high-dimensional data. In
particular, we propose an algorithm called SUBCLU (density-connected
Subspace Clustering) that extends DBSCAN (Density-Based Spatial Clus-
tering of Applications with Noise) to the problem of subspace clustering.
SUBCLU e ciently computes all clusters of arbitrary shape and size that
would have been found if DBSCAN were applied to all possible subspaces
vvi 0 Abstract
ofthefeaturespace. TwosubspaceselectiontechniquescalledRIS(Ranking
InterestingSubspaces)andSURFING(SUbspacesRelevantForclusterING)
are proposed. They do not compute the subspace clusters directly, but gen-
erate a list of subspaces ranked by their clustering characteristics. A hier-
archical clustering algorithm can be applied to these interesting subspaces
in order to compute a hierarchical (subspace) clustering. In addition, we
propose the algorithm 4C (Computing Correlation Connected Clusters)
that extends the concepts of DBSCAN to compute density-based correla-
tion clusters. 4C searches for groups of objects which exhibit an arbitrary
but uniform correlation.
Often, the traditional approach of modelling data as high-dimensional
feature vectors is no longer able to capture the intuitive notion of similar-
ity between complex objects. Thus, objects like chemical compounds, CAD
drawings, XML data or color images are often modelled by using more com-
plex representations like graphs or trees. If a metric distance function like
the edit distance for graphs and trees is used as similarity measure, tradi-
tional clustering approaches like density-based clustering are applicable to
those data. However, we face the problem that a single distance calculation
can be very expensive. As clustering performs a lot of distance calculations,
approaches like lter and renement and metric indices get important. The
second part of this thesis deals with special approaches for clustering in
application domains with complex similarity models. We show, how appro-
priate lters can be used to enhance the performance of query processing
and, thus, clustering of hierarchical objects. Furthermore, we describe how
the two paradigms of ltering and metric indexing can be combined. As
complex objects can often be represented by using di erent similarity mod-
els, a new clustering approach is presented that is able to cluster objects
that provide several di erent complex representations.Abstract (in German)
InnovativeTechnologienundneuesteMethodenzurDatengewinnunginver-
schiedenenTeilbereichenderWissenschaft,wiez.B.Biowissenschaften,Medi-
zin, Astronomie, Geographie, erzeugen eine wahre Flut an Rohdaten. Um
dieser Flut Herr werden zu k onnen, werden dringend e ziente und e ek-
tive Methoden zur automatischen Datenanalyse und zur Wissensextraktion
(Data Mining) ben otigt. Ein au erst wichtiges Teilproblem des Data Min-
ing ist das Clustering. Beim Clustering sollen die Objekte einer Datenbank
in (a priori unbekannte) Gruppen, auch Cluster genannt, eingeteilt werden,
so dass zwei Objekte aus einem gleichen Cluster m oglichst ahnlich zueinan-
der und zwei Objekte aus unterschiedlichen Clustern moglichst un ahnlich
zueinander sind.
Dank moderner Datenbanktechnologien, z.B. objekt-relationale Daten-
banken, lassen sich heute beliebig komplexe Objekte in gro en Mengen ver-
walten. Die Modellierung solch komplexer Objekte fuhr t h au g zu sehr
hochdimensionalen Merkmalsvektoren. Dies verursacht im Kontext Clus-
tering eine Menge von grundsatzlic hen Problemen, die mit dem Ausdruck
”Curse of Dimensionality” (Fluch der Dimensionalitat) zusammengefasst
werden. Die Folge ist, dass die Objekte in hochdimensionalen R aumen
stark streuen, und nicht mehr sinnvoll zu clustern sind. Dennoch gibt es
meistens Cluster in verschiedenen Teilraume n niedrigerer Dimensionalit at
(Unterraume), d.h. die Daten clustern, wenn man gewisse (teilweise unter-
schiedliche) Attribute ausblendet.
Im ersten Teil dieser Arbeit werden spezielle Verfahren vorgestellt, die
auf das Problem des Clusterings hochdimensionaler Daten zugeschnitten
sind. EswirdeindichtebasiertesUnterraum-Clusteringverfahrenpaser ntiert,
viiviii 0 Abstract (in German)
das in der Lage ist, Cluster beliebiger Form und Gr o e in Teilaumenr zu
nden. Zusatzlich werden zwei Verfahren vorgestellt, die die Unterraum-
cluster nicht direkt berechnen, sondern eine Liste geeigneter Unterr aume
erzeugen. IndieseninteressantenUnteraumenr kanndanneinhierarchischer
Clustering-Algorithmuseingesetztwerden,umeinhierarchischesUnterraum-gzuerzeugen. DahochdimensionaleDatensehrhau gauchstarken
Korrelationenunterliegen,wirdau erdemeinClusteringverfahreneingefuhrt,
das gezielt nach Gruppen von Objekten mit einer einheitlichen Korrelation
sucht.
Oft reicht die traditionelle Modellierung der Daten als Merkmalsvek-
toren nicht mehr aus, um die intuitive Ahnlichkeit der Objekte adaquat
auszudruc ken. Daher verwendet man fur komplexe Daten wie Molekule,
CAD-Zeichnungen,XML-Daten,WebsitesoderFarbbilderh au gkomplexere
ModellierungsformenwiezumBeispielGraph-oderBaumdarstellungen, um
diese Datengeeignetzurepaser ntieren. L asstsichaufdieserRepras entation
eine metrische Abstandsfunktion nden, so konnen fur ”punktartige” Ob-
jekte konzipierte Clustering-Algorithmen weiter genutzt werden. Einzelne
Distanzberechnungen konnen in diesem Fall jedoch sehr teuer sein und
muss en daher geeignet unterstutzt werden. Da beim Clustering meist sehr
viele Distanzberechnungen notig sind, spielen Ansatze wie Filterverfeine-
rungstechniken und metrische Indizes hier eine gro e Rolle. Im Rahmen
der Arbeit wurden spezielle Verfahren fur komplex modellierte Objekte en-
twickelt. Es wird gezeigt, wie d