Analyzing biological expression data based on decision tree induction [Elektronische Ressource] / von André Flöter

universitat_potsdam

Le téléchargement nécessite un accès à la bibliothèque YouScribe
Tout savoir sur nos offres

108 pages

English

Le téléchargement nécessite un accès à la bibliothèque YouScribe
Tout savoir sur nos offres

A propos
Informations
Extrait

Description

Sujets

Institut fu¨r Informatik Institut de Recherche en Informatique
Wissensverarbeitung und Informationssysteme et Syste`mes Ale´atoires
Analyzing biological expression data based on
decision tree induction
Dissertation
zur Erlangung des akademischen Grades
“Doctor rerum naturalium”
(Dr.rer.nat.)
in der Wissenschaftsdisziplin Bioinformatik
eingereicht an der
Mathematisch-Naturwissenschaftlichen Fakulta¨t
der Universita¨t Potsdam
von
Andre´ Flo¨ter
Potsdam, den 27.05.2005Acknowledgements
Beside my compassionate ﬁance´e Sophie I would like to thank the following people for their
support,
• Torsten Schaub and Jacques Nicolas, as the tutors of this thesis, for their invaluable advice
and vast help in coping with all kinds of technical, formal, and ﬁnancial problems,
• Joachim Selbig for the ﬁnancial aid and his kind assistance in writing publications,
• Joachim Kopka and Georg Leggewie for the biological data and their inexhaustible patience
with my repetitive enquiries,
• Philippe Besnard for his advice and the revision of all French text I had to produce,
• Wolfgang Severin for his substantiated help whenever my JAVA interpreter was of a different
opinion than myself,
• Matthias Mo¨hlig for his cooperation and scientiﬁc enthusiasm which motivated me to the
end of my thesis,
• and all the rest of the staff who has helped me during the last years.
Additionally, I would like to thank Peter-Uwe Zettie´r, Matthias Scholz, Gert Zo¨ller and Carsten
Haustein for comforting me in times of fatigue and doubt.
2Contents
1 Observing life at the molecular level: Systems biology 10
1.1 Available data sources of Systems biology . . . . . . . . . . . . . . . . . . . . . 11
1.1.1 Metabolomics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.1.2 Transcriptomics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.1.3 Other types of data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.2 Cleaning biological data: data preprocessing . . . . . . . . . . . . . . . . . . . . 16
1.2.1 Describing empirical data: Statistical standard measures . . . . . . . . . 17
1.2.2 Normalisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.2.3 Dimension reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.2.4 Discretisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.3 Finding interdependencies between attributes: correlation analysis . . . . . . . . 22
1.3.1 Pearson’s correlation coefﬁcients . . . . . . . . . . . . . . . . . . . . . 22
1.3.2 Cluster analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
1.4 Network induction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
1.4.1 Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
1.4.2 Putting information in Graphical models . . . . . . . . . . . . . . . . . . 24
1.4.3 Causal networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
1.4.4 Bayesean approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
1.5 Public databases related to Systems biology . . . . . . . . . . . . . . . . . . . . 26
1.5.1 KEGG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
1.5.2 BRENDA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
1.5.3 ExPASy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
1.6 Summary and conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2 A tool for the identiﬁcation of structure in data: Decision trees 28
2.1 Machine learning on attributes . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.1.1 Data types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.1.2 Types of predictive models . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.1.3 Graphical and rule-based representations of classiﬁers . . . . . . . . . . 31
2.2 Basics of Decision Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.2.1 Graphical representation of trees . . . . . . . . . . . . . . . . . . . . . . 32
2.2.2 Propositional rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.2.3 Learning decision trees . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.2.4 ID3/C4.5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.2.5 Alternative developments of decision tree learners . . . . . . . . . . . . 38
2.3 Advanced issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.3.1 Overﬁtting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.3.2 Pruning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
32.3.3 Cross-validation, Jackknife, Bootstrapping . . . . . . . . . . . . . . . . 40
2.3.4 Missing values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.3.5 Continuous data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.3.6 Oblique hyperplanes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.3.7 Ensemble techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.3.8 Decision lists . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.3.9 Hybrid decision tree approaches . . . . . . . . . . . . . . . . . . . . . . 48
2.4 Previous applications in biological data analysis . . . . . . . . . . . . . . . . . . 49
2.4.1 Classiﬁcation of biological tissue samples . . . . . . . . . . . . . . . . . 49
2.4.2 Reconstruction of gene networks . . . . . . . . . . . . . . . . . . . . . . 49
2.5 Summary and conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3 From raw data to biological networks: a contribution to the analysis of dependencies
among sparse and noisy continuous data 51
3.1 Revealing stable states of an organism . . . . . . . . . . . . . . . . . . . . . . . 51
3.1.1 Established Methods considered in this approach . . . . . . . . . . . . . 53
3.1.2 Modelling states of an organism . . . . . . . . . . . . . . . . . . . . . . 54
3.1.3 Growing decision forests . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.1.4 Threshold extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.1.5 Parameters of the threshold extraction technique . . . . . . . . . . . . . 56
3.2 Revealing combinatorial dependencies . . . . . . . . . . . . . . . . . . . . . . . 59
3.2.1 Partial correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.2.2 Mutual information and conditional mutual information . . . . . . . . . 59
3.2.3 Conditional mutual information on artiﬁcial data . . . . . . . . . . . . . 62
3.2.4 Dependency network inference . . . . . . . . . . . . . . . . . . . . . . . 65
3.3 A heuristic approach: Estimating conditional mutual information through decision
forests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.3.1 Exploiting decision tree heuristics . . . . . . . . . . . . . . . . . . . . . 68
3.3.2 Making classiﬁers robust with decision forests . . . . . . . . . . . . . . 68
3.3.3 An illustrative example: Interpreting a decision forest . . . . . . . . . . . 69
3.3.4 Characteristics and discussion of the output structure . . . . . . . . . . . 71
3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4 An experiment in the analysis of metabolite concentration data for potatoes 73
4.1 A new tool: the provided software package . . . . . . . . . . . . . . . . . . . . 73
4.1.1 Implementation details on the state identiﬁer . . . . . . . . . . . . . . . 73
4.1.2 Implementation details on tools for the calculation of MI . . . . . . . . . 74
4.1.3 Implementation details on the dependency inducer . . . . . . . . . . . . 74
4.1.4 User scenario . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.2 Application of the introduced techniques on metabolite concentration data . . . . 78
4.2.1 Metabolite concentration data of transgenic potato . . . . . . . . . . . . 78
4.2.2 Interpreting the discovered stable states . . . . . . . . . . . . . . . . . . 81
4.2.3 Interpreting the dependency structure . . . . . . . . . . . . . . . . . . . 89
4.2.4 Summary of the analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5 Appendix 104
5.1 Novel tools for the application of the introduced techniques . . . . . . . . . . . . 104
5.2 Complexity of the calculations . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
5.3 Code for generating artiﬁcial data . . . . . . . . . . . . . . . . . . . . . . . . . 108
4Re´sume´
Les techniques modernes d’analyse biologique fournissent aux scientiﬁques diverses formes de
donne´es. Une cate´gorie parmi d’autres est appele´e “ donne´es d’expression ”. Ces donne´es
indiquent les quantite´s des compose´s biochimiques pre´sents dans les tissus, e´chantillon par
e´chantillon.
Depuis peu, les “ donne´es d’expression ” peuvent eˆtre ge´ne´re´es tre`s rapidement. Ce qui aboutit
a` des masses de donne´es qui ne sont plus analysables par les techniques statistiques classiques. Le
“ Systems biology ” est un nouveau domaine de´die´ a` la mode´lisation de ces informations.
Actuellement, il y a diverses approches qui sont appli