Efficient algorithms for robust pattern mining on structured objects with applications to structure-based drug design [Elektronische Ressource] / von Nils Weskamp

philipps-universitat_marburg

Découvre YouScribe en t'inscrivant gratuitement

Je m'inscris

Obtenez un accès à la bibliothèque pour le consulter en ligne
En savoir plus

177 pages

English

Obtenez un accès à la bibliothèque pour le consulter en ligne
En savoir plus

A propos
Informations
Extrait

Description

Informations

Publié par	philipps-universitat_marburg
Publié le	01 janvier 2007
Nombre de lectures	26
Langue	English
Poids de l'ouvrage	7 Mo

Extrait

Eﬃcient Algorithms for Robust Pattern Mining
on Structured Objects with Applications to
Structure-Based Drug Design
Dissertation
zur Erlangung des Doktorgrades
der Naturwissenschaften
(Dr. rer. nat.)
dem Fachbereich Mathematik und Informatik
der Philipps-Universität Marburg
vorgelegt
von
Nils Weskamp
aus
Siegen
Marburg/Lahn, 2006Vom Fachbereich Mathematik und Informatik der Philipps-Universität Marburg
als Dissertation angenommen am: 16.02.2007
Erstgutachter: Prof. Dr. Eyke Hüllermeier
Zweitgutachter: Prof. Dr. Gerhard Klebe
Tag der mündlichen Prüfung: 23.02.2007Contents
I Foundations 1
1 Introduction 2
2 Related Work 10
2.1 Graph Matching and Pattern Recognition . . . . . . . . . . . . . 10
2.2 Graph Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.1 Frequent Subgraph Mining . . . . . . . . . . . . . . . . . 12
2.2.2 Inductive Logic Programming . . . . . . . . . . . . . . . . 13
2.2.3 Kernel-based Methods . . . . . . . . . . . . . . . . . . . . 13
2.2.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3 Structural Bioinformatics . . . . . . . . . . . . . . . . . . . . . . 15
2.3.1 Sequence-based Methods . . . . . . . . . . . . . . . . . . . 15
2.3.2 Fold-based Methods . . . . . . . . . . . . . . . . . . . . . 16
2.3.3 (Sub)structure-based Methods . . . . . . . . . . . . . . . 17
2.3.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
II Methods 19
3 Graph Alignments 20
3.1 Graphts . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.1.1 The Concept of Graph Alignment . . . . . . . . . . . . . 21
3.1.2 Evaluation of Graphts . . . . . . . . . . . . . . 24
3.2 Characteristic Patterns . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2.1 Derivation of Consensus Graphs . . . . . . . . . . . . . . 26
3.2.2 Classiﬁcation based upon Consensus Graphs . . . . . . . 30
3.3 Discriminative Patterns . . . . . . . . . . . . . . . . . . . . . . . 33
3.4 Automated Identiﬁcation of Outliers, Errors and Exceptions . . 37
4 Algorithms 41
4.1 Heuristic Strategies . . . . . . . . . . . . . . . . . . . . . . . . . . 42
iCONTENTS ii
4.2 Continuous Problem Formulation . . . . . . . . . . . . . . . . . 47
4.3 Incremental Construction of Multiple Graph Alignments . . . . . 52
4.4 Evolutionary Strategies . . . . . . . . . . . . . . . . . . . . . . . 54
5 Simulations & Experiments 62
5.1 Synthetic Graph-Alignment Problems . . . . . . . . . . . . . . . 62
5.2 Empirical Comparison of the Graph Alignment Strategies . . . . 64
5.3 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
III Applications 77
6 Graph Alignments in Drug Design 78
6.1 Cavbase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
6.2 Generation of Random Cavities . . . . . . . . . . . . . . . . . . . 85
6.3 Structure-based Mapping of Protein Binding Pocket Space . . . . 97
6.4 Classiﬁcation and Function Prediction . . . . . . . . . . . . . . . 114
6.5 Decomposition of Cavities into Subpockets . . . . . . . . . . . . . 128
7 Conclusions and Outlook 144
7.1 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
7.1.1 Motivation of the Work . . . . . . . . . . . . . . . . . . . 144
7.1.2 Results obtained using Graph Alignments . . . . . . . . . 146
7.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
7.2.1 General Applicability . . . . . . . . . . . . . . . . . . . . 147
7.2.2 Use of Heuristics . . . . . . . . . . . . . . . . . . . . . . . 148
7.2.3 Statistical Signiﬁcance . . . . . . . . . . . . . . . . . . . . 149
7.2.4 Possible Extensions . . . . . . . . . . . . . . . . . . . . . . 150Acknowledgements
I would like to thank a number of people who contributed to this work in many
diﬀerent ways and made the last three years in Marburg very stimulating and
enjoyable for me:
† My ﬁrst advisor, Prof. Dr. Eyke Hüllermeier for his support during the
last three years, for numerous suggestions, answers and his dedication to
methodological soundness;
† My second advisor, Prof. Dr. Gerhard Klebe also for his support during
the last three years, for keeping an eye on practical applicability and for
opening some heavy doors for me;
† My colleague Dr. Katrin Kupas for the very nice collaboration, for many
fruitful discussions during the last three years, for the proofreading of this
thesis and for a great time in binding pocket space and elsewhere;
† My colleague Dr. Daniel Kuhn for his support in the early phase of this
work, his introduction into Relibase and many fruitful discussions;
† My colleague Dr. Matthias Zentgraf for many fruitful discussions about
science, for common ﬁghts against nfs-kernel-server and for things we
could never sell as science at all;
† My colleague Katrin Silber for just being herself, for a great common time
and for far too many other things to name them all;
† My other colleagues in the Klebe Group and in the Department of Math-
ematics and Computer Science for the nice working atmosphere and for
many non-scientiﬁc activities;
† The Deutsche Forschungsgemeinschaft (DFG) and the Studienstiftung des
Deutschen Volkes for funding;
† My parents, Petra and Dr. Peter Weskamp for their continuous and com-
pletely non-scientiﬁc kind of support.
iiiSummary
Many complex real-world objects cannot be mapped onto “ﬂat” feature vector
representations without a signiﬁcant loss of information. Recently, the use of
graph-based models gained increasing attention in the ﬁeld of data mining. This
thesis presents eﬃcient methods for fuzzy pattern mining in databases of such
graph representations. It is assumed that relatively homogeneous sets of graphs
originating from common classes or clusters are given. The class assignment
can either result from background knowledge or from a previously performed
cluster analysis.
Theaimofthedevelopedmethodsistoderivepatternsthatarecharacteristic
for a given cluster (i.e., that occur in all or most of the members of a cluster).
Additionally, it is of interest to derive patterns that are discriminative among
diﬀerent clusters (i.e., that occur in most of the members of one cluster but are
missing in most of the members of a diﬀerent cluster). As a certain variability
can be observed also among members of the same cluster, one cannot expect
that a characteristic pattern is present in all members of the cluster in identical
form. The methods presented in this thesis thus allow also for variations of a
characteristic pattern in its diﬀerent occurrences. This is achieved by arranging
thediﬀerentgraphrepresentationsinthecommonframeworkofamultiplegraph
alignment. In a graph alignment, nodes with diﬀerent labels can be assigned
to each other (“mismatch”) and dummy nodes can be inserted into the graphs
to compensate “missing” nodes. This increases the robustness of the approach
against variations in the considered graphs. A scoring system is used to penalize
such constellations as they should be allowed only if no “valid” match for a
certain node exists in the remaining graph instances. The calculation of a
graph alignment can thus be interpreted as an optimization problem and a
numberofdiﬀerentstrategiesforthecalculationofoptimalgraphalignmentsare
examined. Once an optimal graph alignment has been calculated, characteristic
and discriminative patterns can be derived and a number of diﬀerent analyses
can be performed. In particular, the concept of multiple graph alignment yields
a mechanism for mapping the aligned graphs onto “ﬂat” feature vectors that
ivCONTENTS v
preserves the correspondence among the individual features of the graph and
vector representation.
Relibase+/Cavbase, a huge database of putative protein binding sites serves
as the main motivation for the methods developed in this thesis. In Cavbase,
the protein binding pockets are represented through pseudocenters that describe
the spatial position of certain physicochemical properties. In a ﬁrst-order ap-
proximation, graphs can be used as descriptors for the Cavbase entries. The
methods presented in this thesis can therefore be used to analyze families of
related protein binding pockets. The derived patterns can be used to guide
the development of novel and selective inhibitors. Regions of a binding pocket
that are common to the members of a protein family (i.e., characteristic pat-
terns) could be addressed to gain aﬃnity for the whole protein family. Regions
that vary among the diﬀerent members of a protein family (i.e., discriminative
patterns) indicate features that can be addressed to gain selectivity within the
examined protein family.Zusammenfassung
Viele Objekte aus der realen Welt können nicht auf „ﬂache“ Eigenschaftsvek-
toren abgebildet werden, ohne dass es dabei zu einem erheblichen Verlust an
Information kommt. In den letzten Jahren sind daher graphbasierte Modelle in
den Fokus der Forschung im Bereich des Data Mining geraten. In dieser Arbeit
werden eﬃziente Methoden zur Entdeckung unscharfer Muster in Datenbanken
aus solchen Graphrepräsentationen vorgestellt. Dabei wird angenommen, dass
eine relativ homogene Menge von Graphen gegeben ist, die zum gleichen Clus-
ter beziehungsweise zur gleichen Klasse gehören. Die Klassenzuordnung ka