La lecture en ligne est gratuite
Le téléchargement nécessite un accès à la bibliothèque YouScribe
Tout savoir sur nos offres
Télécharger Lire

Algorithms and a software system for comparative genome analysis [Elektronische Ressource] / vorgelegt von Mohamed Mostafa Mohamed Ibrahim Abouelhoda

201 pages
Universit at UlmAbteilung Theoretische InformatikArbeitsgruppehe BioinformatikLeiter: Prof. Dr. Uwe Sch oningALGORITHMS AND A SOFTWARE SYSTEMFORCOMPARATIVE GENOME ANALYSISDISSERTATIONzur Erlangung des Doktorgrades Dr. rer. nat.der Fakult at fur Informatik der Universit at Ulmvorgelegt vonMOHAMED MOSTAFA MOHAMED IBRAHIM ABOUELHODAUlm, 2005Amtierender Dekan: Prof. Dr. H. PartschGutachter: Prof. Dr. Enno OhlebuschProf. Dr. Uwe Sch oningProf. Dr. Robert GiegerichTag der Promotion: 13 Juli, 2005University of UlmTheoretical Computer Science DepartmentTheoretical Bioinformatics GroupChairman: Prof. Dr. Uwe Sch oningALGORITHMS AND A SOFTWARE SYSTEMFORCOMPARATIVE GENOME ANALYSISA thesis submitted to theFaculty of Computer Science of University of Ulmin ful llmen t of the requirements forthe degree of Doctor of Science Dr. rer. nat.byMOHAMED MOSTAFA MOHAMED IBRAHIM ABOUELHODAUlm, Germany2005Dean: Prof. Dr. H. PartschReviewers: Prof. Dr. Enno OhlebuschProf. Dr. Uwe Sch oningProf. Dr. Robert GiegerichDay of Conferral of Doctorate: 13 July 2005AcknowledgmentsForemost, I am grateful to my supervisor, Enno Ohlebusch. His con-tinuous support and guidance have greatly in uenced my doctoral studies,research, and this work. I am sure that his teachings will always accompanyme in the trip of scienti c research. I am grateful to Robert Giegerich forintroducing the problem to me, and for refereeing the thesis.
Voir plus Voir moins

Universit at Ulm
Abteilung Theoretische Informatik
Arbeitsgruppehe Bioinformatik
Leiter: Prof. Dr. Uwe Sch oning
ALGORITHMS AND A SOFTWARE SYSTEM
FOR
COMPARATIVE GENOME ANALYSIS
DISSERTATION
zur Erlangung des Doktorgrades Dr. rer. nat.
der Fakult at fur Informatik der Universit at Ulm
vorgelegt von
MOHAMED MOSTAFA MOHAMED IBRAHIM ABOUELHODA
Ulm, 2005Amtierender Dekan: Prof. Dr. H. Partsch
Gutachter: Prof. Dr. Enno Ohlebusch
Prof. Dr. Uwe Sch oning
Prof. Dr. Robert Giegerich
Tag der Promotion: 13 Juli, 2005University of Ulm
Theoretical Computer Science Department
Theoretical Bioinformatics Group
Chairman: Prof. Dr. Uwe Sch oning
ALGORITHMS AND A SOFTWARE SYSTEM
FOR
COMPARATIVE GENOME ANALYSIS
A thesis submitted to the
Faculty of Computer Science of University of Ulm
in ful llmen t of the requirements for
the degree of Doctor of Science Dr. rer. nat.
by
MOHAMED MOSTAFA MOHAMED IBRAHIM ABOUELHODA
Ulm, Germany
2005Dean: Prof. Dr. H. Partsch
Reviewers: Prof. Dr. Enno Ohlebusch
Prof. Dr. Uwe Sch oning
Prof. Dr. Robert Giegerich
Day of Conferral of Doctorate: 13 July 2005Acknowledgments
Foremost, I am grateful to my supervisor, Enno Ohlebusch. His con-
tinuous support and guidance have greatly in uenced my doctoral studies,
research, and this work. I am sure that his teachings will always accompany
me in the trip of scienti c research. I am grateful to Robert Giegerich for
introducing the problem to me, and for refereeing the thesis. His invalu-
able discussions and suggestions have been re ected in many places in this
research. Thanks to Uwe Sch oning for reading the whole thesis and referee-
ing it. My gratitude to Stefan Kurtz for the fruitful discussions that always
attracted my attention to many practical aspects of sequence analysis.
I am thankful to Dirk Strothmann for his fruitful observations that have
improved parts of this research, and to Janina Reeder for providing me with
pieces of codes. I also thank Michael Beckstette and Alexander Sczyrba for
discussions about the su x array and cDNA/EST mapping. My thanks go
also to Kathrin Hockel, Jan Stallkamp, Randall Pruim, and Jakir H. Ullah
for their careful readings and the helpful comments to improve this script.
I am always indebted to my teachers, from school to university, but un-
fortunately the few pages I am restricted to are not enough to mention them
all. Although they were not directly involved in this work, their teachings
have in uenced and will in uence my research and life.
Finally, I am indebted to my parents, my wife, and my family for their
love, encouragement, and support.
This work was supported by Graduiertenkolleg Bioinformatik, Bielefeld,
and by DFG-grant Oh 53/4-1.Contents
1 Introduction 1
1.1 Comparative genomics . . . . . . . . . . . . . . . . . . . . . . 1
1.2 This Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.1 The enhanced su x arrays . . . . . . . . . . . . . . . . 3
1.2.2 The chaining algorithms . . . . . . . . . . . . . . . . . 4
1.2.3 Publications . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2.4 Implementations . . . . . . . . . . . . . . . . . . . . . 6
1.2.5 Thesis Organization . . . . . . . . . . . . . . . . . . . 6
2 The Genome 9
2.1 DNA: The molecule of life . . . . . . . . . . . . . . . . . . . . 9
2.2 Structure of eukaryotic genomes . . . . . . . . . . . . . . . . . 11
2.2.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.2 Chromosomes . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.3 Genes . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.4 Intergenic regions . . . . . . . . . . . . . . . . . . . . . 18
2.3 Structure of prokaryotic genomes . . . . . . . . . . . . . . . . 20
2.3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3.2 The main genome . . . . . . . . . . . . . . . . . . . . . 20
2.3.3 Genes . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3.4 Intergenic regions . . . . . . . . . . . . . . . . . . . . . 21
2.4 Genome dynamics and evolution . . . . . . . . . . . . . . . . . 21
2.5 Comparative genomics and perspectives . . . . . . . . . . . . . 23
3 Comparative Genomics Tools: A Survey 27
3.1 Sequence alignment algorithms . . . . . . . . . . . . . . . . . 28
3.2 Classi cation scheme of the methods . . . . . . . . . . . . . . 31
3.2.1 The semi-manual strategy . . . . . . . . . . . . . . . . 31
3.2.2 The window-based . . . . . . . . . . . . . . . 32
3.2.3 The anchor-based strategy . . . . . . . . . . . . . . . . 33
3.3 Techniques in the anchor-based strategy . . . . . . . . . . . . 33
iCONTENTS
3.3.1 Fragment generation techniques . . . . . . . . . . . . . 33
3.3.2 Anchors nding techniques . . . . . . . . . . . . . . . . 38
3.4 The tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.4.1 WABA . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.4.2 PipMaker and BLASTZ . . . . . . . . . . . . . . . . . 43
3.4.3 LSH-ALL-PAIRS . . . . . . . . . . . . . . . . . . . . . 45
3.4.4 ASSIRC . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.4.5 DIALIGN . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.4.6 GLASS . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.4.7 AVID . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.4.8 CHAOS and LAGAN . . . . . . . . . . . . . . . . . . . 49
3.4.9 MUMmer . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.4.10 MGA . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.4.11 Very recent tools: EMAGEN and Mauve . . . . . . . . 53
3.5 Remarks on the surveyed techniques . . . . . . . . . . . . . . 54
3.5.1 Choice of fragments . . . . . . . . . . . . . . . . . . . . 54
3.5.2 Setting of parameters . . . . . . . . . . . . . . . . . . . 55
3.5.3 The multiple alignment methods . . . . . . . . . . . . 55
3.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4 The Enhanced Su x Array 61
4.1 Basic notions . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.2 The su x tree and the su x array . . . . . . . . . . . . . . . 61
4.3 The enhanced su x array . . . . . . . . . . . . . . . . . . . . 64
4.3.1 Basic concepts and de nitions . . . . . . . . . . . . . . 64
4.3.2 Applications . . . . . . . . . . . . . . . . . . . . . . . . 65
4.4 Algorithms based on properties of the enhanced su x arrays . 66
4.4.1 Repeat analysis and genome comparison . . . . . . . . 66
4.4.2 The lcp-intervals . . . . . . . . . . . . . . . . . . . . . 67
4.4.3 Computation of supermaximal repeats . . . . . . . . . 68
4.4.4 of maximal unique matches . . . . . . . . 69
4.5 Bottom-up traversals . . . . . . . . . . . . . . . . . . . . . . . 70
4.5.1 The lcp-interval tree of a su x array . . . . . . . . . . 71
4.5.2 Computation of maximal repeated pairs . . . . . . . . 73
4.5.3 of exact matches . . . . . . . . 75
4.5.4 of maximal multiple exact matches . . . 78
4.5.5 Other applications of the bottom-up traversal . . . . . 78
4.6 Top-down traversals . . . . . . . . . . . . . . . . . . . . . . . 83
4.6.1 Exact pattern matching . . . . . . . . . . . . . . . . . 84
4.6.2 Other applications of the top-down traversal . . . . . . 90
4.7 Incorporating su x links . . . . . . . . . . . . . . . . . . . . . 92
iiCONTENTS
4.7.1 Su x links . . . . . . . . . . . . . . . . . . . . . . . . 92
4.7.2 Space e cien t computation of MEMs . . . . . . . . . . 97
4.8 Implementation details . . . . . . . . . . . . . . . . . . . . . . 102
4.8.1 The lcp-table . . . . . . . . . . . . . . . . . . . . . . . 102
4.8.2 The child-table . . . . . . . . . . . . . . . . . . . . . . 103
4.8.3 The su x link table . . . . . . . . . . . . . . . . . . . 103
4.9 Experimental results . . . . . . . . . . . . . . . . . . . . . . . 104
4.9.1 Algorithms based on properties of enhanced su x arrays105
4.9.2 based on bottom-up traversal . . . . . . . . 106
4.9.3 Algorithms based on top-down traversal . . . . . . . . 108
4.9.4 for traversals using su x links . . . . . . . 110
4.10 Conclusions and future perspectives . . . . . . . . . . . . . . . 111
5 Chaining Algorithms 115
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.2 Basic concepts and de nitions . . . . . . . . . . . . . . . . . . 116
5.3 The global chaining problem . . . . . . . . . . . . . . . . . . . 118
5.3.1 Graph based formulation . . . . . . . . . . . . . . . . . 118
5.3.2 Geometric based formulation . . . . . . . . . . . . . . . 119
5.4 Two dimensional global chaining without gap costs . . . . . . 121
5.4.1 The Basic chaining algorithm . . . . . . . . . . . . . . 121
5.4.2 Answering RMQ with activation . . . . . . . . . . . . . . 122
5.5 Higher dimensional chaining without gap costs . . . . . . . . . 126
5.5.1 Chaining with range trees . . . . . . . . . . . . . . . . 127
5.5.2 with kd-trees . . . . . . . . . . . . . . . . . . 132
5.6 Incorporating gap costs into the algorithm . . . . . . . . . . . 135
5.6.1 Costs in the L metric . . . . . . . . . . . . . . . . . . 1361
5.6.2 The sum-of-pairs gap cost . . . . . . . . . . . . . . . . 138
5.7 Local chaining: a basic variation . . . . . . . . . . . . . . . . . 143
5.8 Variations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
5.8.1 Including gap constrains . . . . . . . . . . . . . . . . . 146
5.8.2 cDNA/EST mapping . . . . . . . . . . . . . . . . . . . 148
5.9 Related problems . . . . . . . . . . . . . . . . . . . . . . . . . 149
5.9.1 1-dimensional chaining . . . . . . . . . . . . . . . . . . 149
5.9.2 A more general problem . . . . . . . . . . . . . . . . . 151
5.10 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
6 CHAINER: the software system 155
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
6.2 Finding regions of high similarity . . . . . . . . . . . . . . . . 157
6.2.1 Detecting transposition events . . . . . . . . . . . . . . 157
iiiCONTENTS
6.2.2 Detecting inversions . . . . . . . . . . . . . . . . . . . 158
6.2.3 A large scale comparison between the human and mouse
genomes . . . . . . . . . . . . . . . . . . . . . . . . . . 160
6.3 Multiple global alignment . . . . . . . . . . . . . . . . . . . . 162
6.4 Comparing draft genomes . . . . . . . . . . . . . . . . . . . . 162
6.5 cDNA mapping . . . . . . . . . . . . . . . . . . . . . . . . . . 165
6.6 Parameter Estimation . . . . . . . . . . . . . . . . . . . . . . 166
6.6.1 Parameters for comparing genomes . . . . . . . . . . . 167
6.6.2 P for cDNA/EST mapping . . . . . . . . . . 169
6.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
7 Conclusions 171
Bibliography 176
iv

Un pour Un
Permettre à tous d'accéder à la lecture
Pour chaque accès à la bibliothèque, YouScribe donne un accès à une personne dans le besoin