Learning diagnostic rules with multivariate classification algorithms [electronic resource]: specific needs and challenges / submitted by Raluca Ilinca Schmitt

technischen_universitat_dortmund

179 pages

English

Subjects

Statistics

Information

Published by	technischen_universitat_dortmund
Published on	01 January 2009
Number of reads	76
Language	English
File size	121 MB

Excerpt

LEARNING DIAGNOSTIC RULES WITH
MULTIVARIATE CLASSIFICATION
ALGORITHMS
SPECIFIC NEEDS AND CHALLENGES
DISSERTATION
submitted for the degree of
Doctor of Natural Sciences (Doktor der Naturwissenschaften)
at the Universität Dortmund
to the Faculty of Statistics
of the Technische Universität Dortmund
by
Raluca Ilinca Schmitt
Dortmund, June 2009
Abstract
Considerable effort in diagnostic research is spent on finding biomarker panels that have
a high potential to accurately identify a complex disease at an early stage.
This thesis addresses the feasibility of specific requirements that a diagnostic rule should
meet in order to be accepted and useful within diagnostic workflows. Besides high accuracy,
the major aims in building rules for diagnostic purposes are the simplicity and
interpretability of the rules. Diagnostic rules have to provide accurate and reproducible
results in order to be reliable. They have to be simple, so that they can be assessed easily
in diagnostic practice, and easily interpretable, so that medical practitioners accept them.
Meeting these quality standards simultaneously is difficult because of the trade-off between
accuracy and model complexity.
For instance, Logic Regression might be a suitable method for diagnostic classification problems,
as it provides very simple and interpretable discriminant rules. These are defined as and-or
combinations of binary predictors. However, a loss of performance is expected because continuous
predictors have to be dichotomized.
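To make the idea of an and-or rule on dichotomized predictors concrete, here is a minimal Python sketch. It is not the Logic Regression algorithm itself, which searches for such rules; it only shows what one rule looks like once found. The marker values, the cutoffs and the rule `(A and B) or C` are hypothetical and chosen purely for illustration.

```python
def dichotomize(values, cutoff):
    """Turn a continuous predictor into a binary one: True if above the cutoff."""
    return [v > cutoff for v in values]


def logic_rule(a, b, c):
    """Hypothetical and-or rule on three binary predictors: (A and B) or C."""
    return [(ai and bi) or ci for ai, bi, ci in zip(a, b, c)]


# Hypothetical continuous marker measurements for four subjects.
marker_a = [0.4, 1.6, 1.2, 0.2]
marker_b = [2.5, 2.1, 1.0, 0.3]
marker_c = [0.1, 0.2, 0.9, 0.4]

# Hypothetical cutoffs; in practice they would be estimated from training data.
a = dichotomize(marker_a, 1.0)
b = dichotomize(marker_b, 2.0)
c = dichotomize(marker_c, 0.5)

predicted_case = logic_rule(a, b, c)
print(predicted_case)
```

The information lost in the step from `marker_a` to `a` is the reason a performance loss is expected: two subjects just above and far above a cutoff become indistinguishable.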
The advantages and disadvantages of simple, easily interpretable classification models (e.g. Logic
Regression) compared with established but more complex and powerful ones (e.g. Regularized
Discriminant Analysis, Random Forests) are highlighted and discussed.
Another major challenge is to ensure a fair comparison of classification algorithms and
diagnostic rules in order to select the most promising candidates. For a general diagnostic
task, the algorithm that leads to the most stable and unbiased results should be selected.
For a specific diagnostic question, the most accurate discriminant rule should be selected.
Adequate designs for evaluating and optimizing classification algorithms and rules are presented.
This thesis also deals with the problem of accurately estimating rules and their performance
when the target population is heterogeneous but the training data are not representative.
Learning the diagnostic rule on an excerpt of the target population whose observed subclass
prevalences differ from the true ones might be a source of severe bias in both the selected
rule and its estimated accuracy.
Four weighting classification algorithms that account for the subclass prevalence structure of
the target population during rule building and rule validation are presented.
Their feasibility across various practical settings is assessed both empirically and theoretically.
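The bias described above can be illustrated with a small Python sketch. It does not reproduce any of the four weighting algorithms of the thesis, which are not specified in this excerpt. It only shows the general principle of reweighting a performance estimate by known target prevalences, assuming that the accuracy within each subclass is estimated well by the sample. The subclass names "early" and "late", the data and the function `prevalence_weighted_accuracy` are hypothetical.

```python
def prevalence_weighted_accuracy(subclasses, correct, target_prevalence):
    """Weight per-subclass accuracy by the subclass prevalence in the target population."""
    counts = {}
    hits = {}
    for subclass, is_correct in zip(subclasses, correct):
        counts[subclass] = counts.get(subclass, 0) + 1
        hits[subclass] = hits.get(subclass, 0) + int(is_correct)
    missing = [s for s in target_prevalence if s not in counts]
    if missing:
        raise ValueError(f"no observations for subclasses: {missing}")
    return sum(p * hits[s] / counts[s] for s, p in target_prevalence.items())


# Hypothetical sample: the 'early' subclass is under-represented (2 of 10 subjects).
subclasses = ["early"] * 2 + ["late"] * 8
correct = [True, False] + [True] * 8

# Unweighted estimate, dominated by the over-represented 'late' subclass.
naive = sum(correct) / len(correct)

# Hypothetical true prevalences in the target population.
weighted = prevalence_weighted_accuracy(
    subclasses, correct, {"early": 0.5, "late": 0.5}
)
print(naive, weighted)
```

In this toy sample the unweighted accuracy is 0.9, while the prevalence-weighted estimate is 0.75, because the subclass on which the rule performs poorly is more common in the target population than in the sample.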
All investigated methods are applied to real data sets of rheumatoid arthritis cases and
controls provided by Roche Diagnostics GmbH, Penzberg. Supplementary information is gained
from simulated data.
Acknowledgement
I would like to thank everyone who supported me during my dissertation work.
First of all, many special thanks to my supervisor Ursula Garczarek and to Andrea Geistanger
for keeping me motivated, for precious mentoring and for fruitful discussions. I appreciate not
only their professional but also their human support and friendship.
Thanks to my professor Claus Weihs for useful discussions, cooperativeness and interest in
the topic of my dissertation.
Thanks to Veit Peter Grunert, Friedemann Krause and Christoph Berding for giving me the
challenging opportunity to collaborate with Roche Diagnostics GmbH for almost five years and to
work on interesting projects, which also inspired the subject of my dissertation.
Thanks to my colleagues Geraldine Rauch and Insa Winzenborg for their useful proofreading
of the text. Special thanks to my friend Geraldine for always being prepared to lend a helping
hand.
Thanks to the Roche Ausländer Community for cheering up my last year of self-chosen Penzberg
exile. Muchas gracias por todo to Micaela Molina Navarro, vielen Dank to Daniela Behling,
bolshoe spasibo to Kirill Bessoonov, and thank you very much to Hillary Workman for reviewing
my English in this work. Thanks for your friendship and patience!
Finally, I want to thank the really special people in my life, who have all contributed
more or less to my successful way through the last years. Vielen Dank an Niels Schmitt für
seine Liebe und Hilfsbereitschaft. Un grand merci à Yasser Gaou pour son amour précieux et
le support dans les bons et mauvais moments.
Last but not least, thanks to my greatest fans, my parents Elena and Radu-Grațian Pepene, and
above all to God. I owe them everything. I dedicate this work to them as well as to the memory
of my grandmother, Dr. Gabriela Balea-Pepene. Mulțumesc!
Contents
1 Overview
2 Adequate designs to optimize and evaluate classification algorithms and rules
2.1 Formal preliminaries
2.2 Designs for the comparison of algorithms
2.3 Simulation designs
3 Interpretability of diagnostic rules
3.1 Methods
3.1.1 Logic Regression (LogicR)
3.1.1.1 Logic model and its size
3.1.1.2 Dichotomization
3.1.1.3 Logic rule optimization
3.1.2 Regularized Discriminant Analysis (RDA)
3.1.3 Random Forests (RF)
3.2 Data
3.2.1 Real data set
3.2.2 Simulated data sets
3.2.2.1 RDA data structure
3.2.2.2 LogicR data
3.2.2.3 Real data structure
3.3 Comparative study design and results
3.3.1 Comparative study design on real data
3.3.2 Comparative study design on simulated data
3.3.3 Results
3.3.3.1 Real data results
3.3.3.2 Simulation results
3.4 Conclusions
4 Learning diagnostic rules in case-control studies in the presence of known subclasses
4.1 Motivation
4.2 Learning RDA classifiers with misrepresented subclasses
4.2.1 Definitions and notations
4.2.2 RDA classifiers in general
4.2.3 Weighting of parameter estimates for rule building
4.2.3.1 Corrected CV Training
4.2.3.2 Inflated CV Training
4.2.4 Weighting of performance estimates for optimization and validation
4.2.4.1 Corrected CV Test
4.2.4.2 Test
4.3 Benchmark of algorithms on simulated data
4.3.1 Selected weighting algorithms
4.3.2 Simulation design
4.3.3 Simulated data
4.3.4 Comparative design on simulated data
4.3.5 Application and results
4.3.5.1 Graphical approach DOE1
4.3.5.2 Linear Model DOE1
4.3.5.3 Graphical approach DOE2
4.3.5.4 CART model DOE1 ∪ DOE2
4.3.5.5 Quadratic model DOE1 ∪ DOE2
4.3.5.6 Negative bias discussion
4.3.5.7 Summary
4.4 Benchmark of weighting algorithms on real data
4.4.1 Data description
4.4.2 Comparative study on real data
4.5 Conclusions
5 Validating diagnostic rules