Efficient algorithms for robust pattern mining on structured objects with applications to structure-based drug design [Elektronische Ressource] / von Nils Weskamp
177 pages
English

Découvre YouScribe en t'inscrivant gratuitement

Je m'inscris

Efficient algorithms for robust pattern mining on structured objects with applications to structure-based drug design [Elektronische Ressource] / von Nils Weskamp

Découvre YouScribe en t'inscrivant gratuitement

Je m'inscris
Obtenez un accès à la bibliothèque pour le consulter en ligne
En savoir plus
177 pages
English
Obtenez un accès à la bibliothèque pour le consulter en ligne
En savoir plus

Description

Efficient Algorithms for Robust Pattern Miningon Structured Objects with Applications toStructure-Based Drug DesignDissertationzur Erlangung des Doktorgradesder Naturwissenschaften(Dr. rer. nat.)dem Fachbereich Mathematik und Informatikder Philipps-Universität MarburgvorgelegtvonNils WeskampausSiegenMarburg/Lahn, 2006Vom Fachbereich Mathematik und Informatik der Philipps-Universität Marburgals Dissertation angenommen am: 16.02.2007Erstgutachter: Prof. Dr. Eyke HüllermeierZweitgutachter: Prof. Dr. Gerhard KlebeTag der mündlichen Prüfung: 23.02.2007ContentsI Foundations 11 Introduction 22 Related Work 102.1 Graph Matching and Pattern Recognition . . . . . . . . . . . . . 102.2 Graph Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122.2.1 Frequent Subgraph Mining . . . . . . . . . . . . . . . . . 122.2.2 Inductive Logic Programming . . . . . . . . . . . . . . . . 132.2.3 Kernel-based Methods . . . . . . . . . . . . . . . . . . . . 132.2.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 142.3 Structural Bioinformatics . . . . . . . . . . . . . . . . . . . . . . 152.3.1 Sequence-based Methods . . . . . . . . . . . . . . . . . . . 152.3.2 Fold-based Methods . . . . . . . . . . . . . . . . . . . . . 162.3.3 (Sub)structure-based Methods . . . . . . . . . . . . . . . 172.3.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 18II Methods 193 Graph Alignments 203.1 Graphts . . . . . . . . . . . . . . .

Informations

Publié par
Publié le 01 janvier 2007
Nombre de lectures 26
Langue English
Poids de l'ouvrage 7 Mo

Extrait

Efficient Algorithms for Robust Pattern Mining
on Structured Objects with Applications to
Structure-Based Drug Design
Dissertation
zur Erlangung des Doktorgrades
der Naturwissenschaften
(Dr. rer. nat.)
dem Fachbereich Mathematik und Informatik
der Philipps-Universität Marburg
vorgelegt
von
Nils Weskamp
aus
Siegen
Marburg/Lahn, 2006Vom Fachbereich Mathematik und Informatik der Philipps-Universität Marburg
als Dissertation angenommen am: 16.02.2007
Erstgutachter: Prof. Dr. Eyke Hüllermeier
Zweitgutachter: Prof. Dr. Gerhard Klebe
Tag der mündlichen Prüfung: 23.02.2007Contents
I Foundations 1
1 Introduction 2
2 Related Work 10
2.1 Graph Matching and Pattern Recognition . . . . . . . . . . . . . 10
2.2 Graph Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.1 Frequent Subgraph Mining . . . . . . . . . . . . . . . . . 12
2.2.2 Inductive Logic Programming . . . . . . . . . . . . . . . . 13
2.2.3 Kernel-based Methods . . . . . . . . . . . . . . . . . . . . 13
2.2.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3 Structural Bioinformatics . . . . . . . . . . . . . . . . . . . . . . 15
2.3.1 Sequence-based Methods . . . . . . . . . . . . . . . . . . . 15
2.3.2 Fold-based Methods . . . . . . . . . . . . . . . . . . . . . 16
2.3.3 (Sub)structure-based Methods . . . . . . . . . . . . . . . 17
2.3.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
II Methods 19
3 Graph Alignments 20
3.1 Graphts . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.1.1 The Concept of Graph Alignment . . . . . . . . . . . . . 21
3.1.2 Evaluation of Graphts . . . . . . . . . . . . . . 24
3.2 Characteristic Patterns . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2.1 Derivation of Consensus Graphs . . . . . . . . . . . . . . 26
3.2.2 Classification based upon Consensus Graphs . . . . . . . 30
3.3 Discriminative Patterns . . . . . . . . . . . . . . . . . . . . . . . 33
3.4 Automated Identification of Outliers, Errors and Exceptions . . 37
4 Algorithms 41
4.1 Heuristic Strategies . . . . . . . . . . . . . . . . . . . . . . . . . . 42
iCONTENTS ii
4.2 Continuous Problem Formulation . . . . . . . . . . . . . . . . . 47
4.3 Incremental Construction of Multiple Graph Alignments . . . . . 52
4.4 Evolutionary Strategies . . . . . . . . . . . . . . . . . . . . . . . 54
5 Simulations & Experiments 62
5.1 Synthetic Graph-Alignment Problems . . . . . . . . . . . . . . . 62
5.2 Empirical Comparison of the Graph Alignment Strategies . . . . 64
5.3 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
III Applications 77
6 Graph Alignments in Drug Design 78
6.1 Cavbase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
6.2 Generation of Random Cavities . . . . . . . . . . . . . . . . . . . 85
6.3 Structure-based Mapping of Protein Binding Pocket Space . . . . 97
6.4 Classification and Function Prediction . . . . . . . . . . . . . . . 114
6.5 Decomposition of Cavities into Subpockets . . . . . . . . . . . . . 128
7 Conclusions and Outlook 144
7.1 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
7.1.1 Motivation of the Work . . . . . . . . . . . . . . . . . . . 144
7.1.2 Results obtained using Graph Alignments . . . . . . . . . 146
7.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
7.2.1 General Applicability . . . . . . . . . . . . . . . . . . . . 147
7.2.2 Use of Heuristics . . . . . . . . . . . . . . . . . . . . . . . 148
7.2.3 Statistical Significance . . . . . . . . . . . . . . . . . . . . 149
7.2.4 Possible Extensions . . . . . . . . . . . . . . . . . . . . . . 150Acknowledgements
I would like to thank a number of people who contributed to this work in many
different ways and made the last three years in Marburg very stimulating and
enjoyable for me:
† My first advisor, Prof. Dr. Eyke Hüllermeier for his support during the
last three years, for numerous suggestions, answers and his dedication to
methodological soundness;
† My second advisor, Prof. Dr. Gerhard Klebe also for his support during
the last three years, for keeping an eye on practical applicability and for
opening some heavy doors for me;
† My colleague Dr. Katrin Kupas for the very nice collaboration, for many
fruitful discussions during the last three years, for the proofreading of this
thesis and for a great time in binding pocket space and elsewhere;
† My colleague Dr. Daniel Kuhn for his support in the early phase of this
work, his introduction into Relibase and many fruitful discussions;
† My colleague Dr. Matthias Zentgraf for many fruitful discussions about
science, for common fights against nfs-kernel-server and for things we
could never sell as science at all;
† My colleague Katrin Silber for just being herself, for a great common time
and for far too many other things to name them all;
† My other colleagues in the Klebe Group and in the Department of Math-
ematics and Computer Science for the nice working atmosphere and for
many non-scientific activities;
† The Deutsche Forschungsgemeinschaft (DFG) and the Studienstiftung des
Deutschen Volkes for funding;
† My parents, Petra and Dr. Peter Weskamp for their continuous and com-
pletely non-scientific kind of support.
iiiSummary
Many complex real-world objects cannot be mapped onto “flat” feature vector
representations without a significant loss of information. Recently, the use of
graph-based models gained increasing attention in the field of data mining. This
thesis presents efficient methods for fuzzy pattern mining in databases of such
graph representations. It is assumed that relatively homogeneous sets of graphs
originating from common classes or clusters are given. The class assignment
can either result from background knowledge or from a previously performed
cluster analysis.
Theaimofthedevelopedmethodsistoderivepatternsthatarecharacteristic
for a given cluster (i.e., that occur in all or most of the members of a cluster).
Additionally, it is of interest to derive patterns that are discriminative among
different clusters (i.e., that occur in most of the members of one cluster but are
missing in most of the members of a different cluster). As a certain variability
can be observed also among members of the same cluster, one cannot expect
that a characteristic pattern is present in all members of the cluster in identical
form. The methods presented in this thesis thus allow also for variations of a
characteristic pattern in its different occurrences. This is achieved by arranging
thedifferentgraphrepresentationsinthecommonframeworkofamultiplegraph
alignment. In a graph alignment, nodes with different labels can be assigned
to each other (“mismatch”) and dummy nodes can be inserted into the graphs
to compensate “missing” nodes. This increases the robustness of the approach
against variations in the considered graphs. A scoring system is used to penalize
such constellations as they should be allowed only if no “valid” match for a
certain node exists in the remaining graph instances. The calculation of a
graph alignment can thus be interpreted as an optimization problem and a
numberofdifferentstrategiesforthecalculationofoptimalgraphalignmentsare
examined. Once an optimal graph alignment has been calculated, characteristic
and discriminative patterns can be derived and a number of different analyses
can be performed. In particular, the concept of multiple graph alignment yields
a mechanism for mapping the aligned graphs onto “flat” feature vectors that
ivCONTENTS v
preserves the correspondence among the individual features of the graph and
vector representation.
Relibase+/Cavbase, a huge database of putative protein binding sites serves
as the main motivation for the methods developed in this thesis. In Cavbase,
the protein binding pockets are represented through pseudocenters that describe
the spatial position of certain physicochemical properties. In a first-order ap-
proximation, graphs can be used as descriptors for the Cavbase entries. The
methods presented in this thesis can therefore be used to analyze families of
related protein binding pockets. The derived patterns can be used to guide
the development of novel and selective inhibitors. Regions of a binding pocket
that are common to the members of a protein family (i.e., characteristic pat-
terns) could be addressed to gain affinity for the whole protein family. Regions
that vary among the different members of a protein family (i.e., discriminative
patterns) indicate features that can be addressed to gain selectivity within the
examined protein family.Zusammenfassung
Viele Objekte aus der realen Welt können nicht auf „flache“ Eigenschaftsvek-
toren abgebildet werden, ohne dass es dabei zu einem erheblichen Verlust an
Information kommt. In den letzten Jahren sind daher graphbasierte Modelle in
den Fokus der Forschung im Bereich des Data Mining geraten. In dieser Arbeit
werden effiziente Methoden zur Entdeckung unscharfer Muster in Datenbanken
aus solchen Graphrepräsentationen vorgestellt. Dabei wird angenommen, dass
eine relativ homogene Menge von Graphen gegeben ist, die zum gleichen Clus-
ter beziehungsweise zur gleichen Klasse gehören. Die Klassenzuordnung ka

  • Univers Univers
  • Ebooks Ebooks
  • Livres audio Livres audio
  • Presse Presse
  • Podcasts Podcasts
  • BD BD
  • Documents Documents