Similarity search and data mining techniques for advanced database systems [Electronic resource] / by Alexey Pryakhin

ludwig-maximilians-universitat_munchen - Alexey Pryakhin

318 pages

English

Subjects

Computer science

Information

Published by	ludwig-maximilians-universitat_munchen
Published on	1 January 2006
Language	English
File size	7 MB

Excerpt

Similarity Search and Data
Mining Techniques for
Advanced Database Systems.
Dissertation in Computer Science
at the Faculty of Mathematics, Computer Science and Statistics
of Ludwig-Maximilians-Universität München
by
Alexey Pryakhin
Date of submission: 24.11.2006
Date of oral examination: 21.12.2006
Referees:
Prof. Dr. Hans-Peter Kriegel, Ludwig-Maximilians-Universität München
Prof. Dr. Daniel A. Keim, Universität Konstanz

Acknowledgement
I would like to express my warmest gratitude to all the people who supported
me during the past three years while I have been working on this thesis. I
avail myself of the opportunity to thank them, even if I cannot mention all
of their names here.
First of all, I would like to express my warmest and sincerest thanks to my
supervisor, Professor Dr. Hans-Peter Kriegel, who provided the productive
and inspiring environment and created a great working atmosphere within
our group. I warmly thank Professor Dr. Daniel Keim for his immediate
willingness to act as a second referee for my thesis.
This work could not have grown and matured without the discussions with
my colleagues in the database research group. In particular, I would like to
give my thanks to Elke Achtert, Johannes Aßfalg, Karsten Borgwardt,
Professor Dr. Christian Böhm, Stefan Brecheisen, Dr. Karin Kailing, Dr. Peer
Kröger, Peter Kunath, Dr. Matthias Schubert, Matthias Renz, Arthur Zimek
for their help, support, interesting hints, constructive and productive
teamwork. Furthermore, I want to thank Alexander Harhurin, Otmar Hilliges,
Florian Vorberger for other fruitful multidisciplinary discussions about
software engineering, similarity of multimedia objects, and music genres which
were useful for this work. Last, but not least, I had the pleasure to supervise
and to work with several students who supported my work and who have
been beneficial for this work. In particular, I would like to mention here
Oleg Galimov, Franz Graf, Michael Gruber, Georg Straub, Michael Kats,
Sergey Wetzstein, Andrew Zherdin, and Karina Zöhrer.
I would like to express my deep appreciation to Susanne Grienberger.
Besides bearing much of the administrative burden for me, she helped me
a lot by carefully reading the thesis and by polishing the English. I
especially wish to thank Franz Krojer for taking care of our hardware and
software environment and for his invaluable technical hints that allowed me
to save a lot of time during experimental evaluation.
I owe special thanks to my wife Anna for her love, care, and patience
during the period of my PhD thesis. Without her encouragement and
understanding, it would have been impossible for me to complete this work. I
would also like to thank the rest of my family and my friends.
Alexey Pryakhin
Munich, October 2006.

Abstract
Modern automated methods for measurement, collection, and analysis of
data in industry and science are providing more and more data with
drastically increasing structure complexity. On the one hand, this growing
complexity is justified by the need for a richer and more precise description
of real-world objects, on the other hand it is justified by the rapid progress
in measurement and analysis techniques that allow the user a versatile
exploration of objects. In order to manage the huge volume of such complex
data, advanced database systems are employed. In contrast to conventional
database systems that support exact match queries, the user of these
advanced database systems focuses on applying similarity search and data
mining techniques.
Based on an analysis of typical advanced database systems, such as
biometric, biological, multimedia, moving-object, and CAD-object database
systems, the following three challenging characteristics of complexity are
detected: uncertainty (probabilistic feature vectors), multiple instances (a
set of homogeneous feature vectors), and multiple representations (a set of
heterogeneous feature vectors). Therefore, the goal of this thesis is to develop
similarity search and data mining techniques that are capable of handling
uncertain, multi-instance, and multi-represented objects.
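The three characteristics named above can be pictured as three simple data
structures. The following is only a minimal sketch: the class names, field
names, and example values are hypothetical and not taken from the thesis, and
the Gaussian-style mean and standard deviation per dimension is an assumption
about what a probabilistic feature vector contains.

```python
from dataclasses import dataclass
from typing import Dict, List

Vector = List[float]


@dataclass
class UncertainVector:
    """Probabilistic feature vector (assumed form: mean and std per dimension)."""
    mean: Vector
    std: Vector

    def __post_init__(self) -> None:
        if len(self.mean) != len(self.std):
            raise ValueError("mean and std must have the same dimensionality")
        if any(s <= 0.0 for s in self.std):
            raise ValueError("standard deviations must be positive")


@dataclass
class MultiInstanceObject:
    """A set of homogeneous feature vectors: all from the same feature space."""
    instances: List[Vector]

    def __post_init__(self) -> None:
        if len({len(v) for v in self.instances}) > 1:
            raise ValueError("all instances must share one dimensionality")


@dataclass
class MultiRepresentedObject:
    """A set of heterogeneous feature vectors, one per named representation."""
    representations: Dict[str, Vector]


# Hypothetical examples (values are made up for illustration only).
frame = UncertainVector(mean=[0.2, 0.7], std=[0.05, 0.10])
clip = MultiInstanceObject(instances=[[0.2, 0.7], [0.3, 0.6], [0.25, 0.65]])
song = MultiRepresentedObject(
    representations={"timbre": [0.1, 0.4, 0.9], "rhythm": [120.0, 0.8]}
)
```

The representations of a multi-represented object may differ in dimensionality,
which is why they are keyed by name rather than stored in one list.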
The first part of this thesis deals with similarity search techniques.
Object identification is a similarity search technique that is typically used for
the recognition of objects from image, video, or audio data. Thus, we develop
a novel probabilistic model for object identification. Based on it, two novel
types of identification queries are defined. In order to process the novel query
types efficiently, we introduce an index structure called Gauss-tree. In
addition, we specify further probabilistic models and query types for uncertain
multi-instance objects and uncertain spatial objects. Based on the index
structure, we develop algorithms for an efficient processing of these query
types. Practical benefits of using probabilistic feature vectors are
demonstrated on a real-world application for video similarity search. Furthermore,
a similarity search technique is presented that is based on aggregated
multi-instance objects, and that is suitable for video similarity search. This
technique takes multiple representations into account in order to achieve better
effectiveness.
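To make the idea of probabilistic object identification more concrete, here is
a minimal sketch under stated assumptions. It is not the model defined in the
thesis: it assumes independent Gaussian uncertainty per dimension, a uniform
prior over database objects, and Bayes normalization over the whole database.
It also uses a plain linear scan instead of the Gauss-tree index. The two query
functions (top-k and threshold) are illustrative choices, since the abstract
does not name its two query types, and all function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

# An uncertain vector is a pair (means, standard deviations).
Uncertain = Tuple[Sequence[float], Sequence[float]]


def gaussian_pdf(x: float, mu: float, sigma: float) -> float:
    """Density of a normal distribution N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def match_density(query: Uncertain, candidate: Uncertain) -> float:
    """Density that both observations stem from the same true vector.

    Assumption: independent Gaussian noise per dimension, so the variances
    of query and candidate add up.
    """
    q_mu, q_sd = query
    c_mu, c_sd = candidate
    density = 1.0
    for qm, qs, cm, cs in zip(q_mu, q_sd, c_mu, c_sd):
        density *= gaussian_pdf(qm, cm, math.sqrt(qs * qs + cs * cs))
    return density


def identification_probabilities(
    query: Uncertain, database: Sequence[Uncertain]
) -> List[float]:
    """Posterior probability of each database object (uniform prior assumed)."""
    densities = [match_density(query, c) for c in database]
    total = sum(densities)
    if total == 0.0:
        return [0.0] * len(database)
    return [d / total for d in densities]


def k_most_likely(
    query: Uncertain, database: Sequence[Uncertain], k: int
) -> List[Tuple[int, float]]:
    """Indices and probabilities of the k most probable database objects."""
    probs = identification_probabilities(query, database)
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


def above_threshold(
    query: Uncertain, database: Sequence[Uncertain], threshold: float
) -> List[Tuple[int, float]]:
    """All database objects whose probability reaches the threshold."""
    probs = identification_probabilities(query, database)
    return [(i, p) for i, p in enumerate(probs) if p >= threshold]


# Made-up example data: two stored objects and one noisy observation.
database = [([0.0, 0.0], [1.0, 1.0]), ([5.0, 5.0], [1.0, 1.0])]
query = ([0.5, 0.2], [0.5, 0.5])
best = k_most_likely(query, database, k=1)
```

A linear scan like this touches every database object for every query, which is
the cost that an index structure is meant to avoid.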
The second part of this thesis deals with two major data mining
techniques: clustering and classification. Since privacy preservation is a very
important demand of distributed advanced applications, we propose using
uncertainty for data obfuscation in order to provide privacy preservation
during clustering. Furthermore, a model-based and a density-based clustering
method for multi-instance objects are developed. Afterwards, original
extensions and enhancements of the density-based clustering algorithms DBSCAN
and OPTICS for handling multi-represented objects are introduced. Since
several advanced database systems like biological or multimedia database
systems handle predefined, very large class systems, two novel classification
techniques for large class sets that benefit from using multiple
representations are defined. The first classification method is based on the idea of
a k-nearest-neighbor classifier. It employs a novel density-based technique
to reduce training instances and exploits the entropy impurity of the
local neighborhood in order to weight a given representation. The second
technique addresses hierarchically-organized class systems. It uses a novel
hierarchical, supervised method for the reduction of large multi-instance
objects, e.g. audio or video, and applies support vector machines for efficient
hierarchical classification of multi-represented objects. User benefits of this
technique are demonstrated by a prototype that performs a classification of
large music collections.
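The idea of weighting a representation by the entropy impurity of its local
neighborhood can be sketched as follows. This is an illustration under
assumptions, not the classifier from the thesis: the weight formula (one minus
normalized entropy), the Euclidean distance, the vote combination, and all
names are hypothetical, and the density-based reduction of training instances
is left out entirely.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Tuple

MultiRep = Dict[str, Sequence[float]]  # representation name -> feature vector


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def entropy(labels: Sequence[str]) -> float:
    """Entropy impurity (in bits) of a list of class labels."""
    n = len(labels)
    counts = Counter(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def classify(
    query: MultiRep,
    training: List[Tuple[MultiRep, str]],
    k: int,
    num_classes: int,
) -> str:
    """k-NN vote per representation, weighted by neighborhood purity.

    Assumed weighting: weight = 1 - entropy / log2(num_classes), so a pure
    neighborhood gets weight 1 and a maximally mixed one gets weight 0.
    """
    max_entropy = math.log2(num_classes) if num_classes > 1 else 1.0
    votes: Counter = Counter()
    for rep, q_vec in query.items():
        neighbors = sorted(
            training, key=lambda item: euclidean(q_vec, item[0][rep])
        )[:k]
        labels = [label for _, label in neighbors]
        weight = 1.0 - entropy(labels) / max_entropy
        for label in labels:
            votes[label] += weight / len(labels)
    return votes.most_common(1)[0][0]


# Made-up example: representation "a" separates the classes, "b" does not.
training = [
    ({"a": [0.0], "b": [0.0]}, "x"),
    ({"a": [0.1], "b": [1.0]}, "x"),
    ({"a": [1.0], "b": [0.1]}, "y"),
    ({"a": [1.1], "b": [0.9]}, "y"),
]
query = {"a": [0.05], "b": [0.95]}
predicted = classify(query, training, k=2, num_classes=2)
```

In this toy data the neighborhood in representation "b" mixes both classes, so
its vote is weighted down and the decision rests on representation "a".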
The effectiveness and efficiency of all proposed techniques are discussed
and verified by comparison with conventional approaches in versatile
experimental evaluations on real-world datasets.

Summary (Zusammenfassung)
Modern methods for the automated collection, measurement, and analysis
of data in all areas of industry and research deliver more and more data
whose structure, moreover, exhibits increasing complexity. This growth in
complexity has two causes: first, the need for more precise descriptions of
real-world objects, and second, rapid progress in measurement and analysis
techniques that allow a more versatile examination of objects.
To manage very large volumes of such complex objects, advanced database
systems are employed. In contrast to conventional database systems, which
process exact queries on objects, the users of advanced database systems
focus on similarity search and data mining.
Starting from an analysis of typical advanced database systems that manage
biometric, biological, moving, multimedia, and CAD objects, the following
three fundamental characteristics are identified: uncertainty (probabilistic
feature vectors), multiple instances (sets of homogeneous feature vectors),
and multiple representations (sets of heterogeneous feature vectors). The
goal of this doctoral thesis is to develop methods for similarity search and
data mining that can work with uncertain, multi-instance, and
multi-represented objects.
The first part of the thesis deals with methods of similarity search.
Object identification, e.g. the identification of persons by means of
biometric features, is a similarity search method that is typically used
for the recognition of objects in image, video, and audio data. We develop a
new probabilistic model for object identification that supports two novel
types of queries. For more efficient processing of these novel query types,
an index structure is introduced. In addition, further probabilistic models
and query types are specified for probabilistic multi-instance objects and for
probabilistic descriptions of spatial objects. Using the index structure,
algorithms are presented that allow efficient processing of these query
types. In a real-world application for similarity search, the practical
relevance of probabilistic object descriptions is