Applicability domain of QSAR models [Electronic resource] / Iurii Sushko

149 pages

English

TECHNISCHE UNIVERSITÄT MÜNCHEN
Lehrstuhl für Genomorientierte Bioinformatik
Applicability domain of QSAR models
Iurii Sushko
Complete reprint of the dissertation approved by the Fakultät Wissenschaftszentrum Weihenstephan für Ernährung, Landnutzung
und Umwelt of the Technische Universität München for the award of the academic degree of
Doktor der Naturwissenschaften (Doctor of Natural Sciences).
Chair: Univ.-Prof. Dr. Langosch
Examiners of the dissertation:
1. Univ.-Prof. Dr. H.-W. Mewes
2. Univ.-Prof. Dr. K. Suhre
(Ludwig-Maximilians-Universität München)
The dissertation was submitted to the Technische Universität München on 24.11.2010 and accepted by the
Fakultät Wissenschaftszentrum Weihenstephan für Ernährung, Landnutzung und Umwelt on 17.02.2011.
Acknowledgements
I would like to express my gratitude to my research colleagues, who were always
willing to help me with advice and gave me the opportunity to work in a friendly and
creative atmosphere: Anil Pandey, Robert Körner, Sergii Novotarskyi, Matthias Rupp,
Simona Kovarich, Stefan Brandmaier, Wolfram Teetz, Eva Schlosser, Vlad Kholodovych and
Ahmed Abdelaziz. I thank Benoit Mathieu for his advice.
I would like to thank my thesis advisor, Dr. Igor Tetko, for his help and creativity,
for introducing me to the scientific way of thinking, and for his ideas, which have
contributed significantly to my thesis work.
I am very grateful to my supervisor Prof. Hans-Werner Mewes for giving me the
opportunity to work on my thesis at the Institute of Bioinformatics and Systems Biology and for
supporting my research. I am also very grateful to Prof. Karsten Suhre for his interest in my
work.
I would like to thank my family back in Ukraine, my mother Valentyna Sushko, my
father Alexander Sushko and my brother Ievgenii Sushko for supporting me.
Iurii Sushko
Abstract
In recent decades, computational models have gained popularity for predictions of
biological activities and physicochemical properties. This new and rapidly developing field
of research is referred to as QSAR/QSPR (Quantitative Structure-Activity/Property
Relationship) and is especially applicable in drug design and in environmental risk
assessment (ecotoxicology), where screening of large datasets of compounds is required.
The major limitation of computational models is the questionable reliability of their
predictions. Computational models are not guaranteed to give equally accurate predictions across
the whole chemical space; in other words, they have a limited domain of
applicability. At present, the lack of a proper definition for the applicability domain (AD) of
a model is one of the major issues restraining the practical application of computational
models. The problem of the AD assessment is addressed in this work.
This work introduces a methodology for AD assessment and presents a
comprehensive benchmarking analysis of existing and new approaches. Practical AD
assessment is demonstrated in a number of studies predicting properties such as
mutagenicity (Ames test), toxicity (growth inhibition concentration), lipophilicity and
cytochrome inhibition. It is shown that the AD approaches make it possible to estimate the prediction
accuracy for every compound individually and, thereby, to identify highly accurate
predictions whose accuracy is close to that of experimental measurements. All the introduced
AD methods are implemented as a part of a new platform for chemical modeling (OCHEM)
and are publicly available online at http://ochem.eu.
Table of Contents
1 Introduction
1.1 Motivation
1.2 Thesis roadmap
2 Methodology
2.1 QSAR research
2.1.1 Overview
2.1.2 Molecular descriptors
2.1.3 Machine learning methods
2.1.4 Meta-learning techniques
A. Model ensembles and bagging
B. LIBRARY model correction
2.1.5 Validation of models
2.1.6 Prediction accuracy
A. Regression models
B. Classification models
2.1.7 Detection of statistical significancy
2.1.8 Representation of molecules
2.2 Applicability domain of QSAR models
2.2.1 Basic definitions
2.2.2 Distances to models
A. Leverage
B. Standard deviation of the ensemble predictions (STD)
C. Tanimoto similarity
D. Correlation of prediction vectors (CORREL)
E. Rounding effect (CLASS-LAG)
F. Concordance of a classification ensemble
G. Rounding effect and standard deviation combined (STD-PROB)
H. Descriptor-based and property-based DMs
2.2.3 Analysis of prediction accuracy
A. Accuracy averaging
B. Estimation of prediction accuracy
2.2.4 Comparison of applicability domains
A. Discriminative power of DM
B. Fitness of probability distribution
2.2.5 Interpretation of applicability domains
2.3 Analyzed datasets
2.3.1 Datasets of experimental measurements
A. Ames test dataset
B. T. pyriformis toxicity dataset
C. Platinum complexes lipophilicity dataset
D. CYP450 inhibitors dataset
2.3.2 Datasets of chemical compounds
A. Enamine dataset
B. EINECS dataset
C. HPV dataset
2.4 Summary
3 Online chemical modeling environment – OCHEM
3.1 Motivation
3.2 The database of experimental measurements
3.2.1 Structure overview
3.2.2 Sources of information
3.2.3 Data access and management
3.3 Modeling framework
3.3.1 Overview
3.3.2 Calculation of models