La lecture à portée de main
Description
Sujets
Informations
Publié par | universitat_ulm |
Publié le | 01 janvier 2005 |
Nombre de lectures | 88 |
Langue | English |
Poids de l'ouvrage | 2 Mo |
Extrait
Universit at Ulm
Abteilung Theoretische Informatik
Arbeitsgruppehe Bioinformatik
Leiter: Prof. Dr. Uwe Sch oning
ALGORITHMS AND A SOFTWARE SYSTEM
FOR
COMPARATIVE GENOME ANALYSIS
DISSERTATION
zur Erlangung des Doktorgrades Dr. rer. nat.
der Fakult at fur Informatik der Universit at Ulm
vorgelegt von
MOHAMED MOSTAFA MOHAMED IBRAHIM ABOUELHODA
Ulm, 2005Amtierender Dekan: Prof. Dr. H. Partsch
Gutachter: Prof. Dr. Enno Ohlebusch
Prof. Dr. Uwe Sch oning
Prof. Dr. Robert Giegerich
Tag der Promotion: 13 Juli, 2005University of Ulm
Theoretical Computer Science Department
Theoretical Bioinformatics Group
Chairman: Prof. Dr. Uwe Sch oning
ALGORITHMS AND A SOFTWARE SYSTEM
FOR
COMPARATIVE GENOME ANALYSIS
A thesis submitted to the
Faculty of Computer Science of University of Ulm
in ful llmen t of the requirements for
the degree of Doctor of Science Dr. rer. nat.
by
MOHAMED MOSTAFA MOHAMED IBRAHIM ABOUELHODA
Ulm, Germany
2005Dean: Prof. Dr. H. Partsch
Reviewers: Prof. Dr. Enno Ohlebusch
Prof. Dr. Uwe Sch oning
Prof. Dr. Robert Giegerich
Day of Conferral of Doctorate: 13 July 2005Acknowledgments
Foremost, I am grateful to my supervisor, Enno Ohlebusch. His con-
tinuous support and guidance have greatly in uenced my doctoral studies,
research, and this work. I am sure that his teachings will always accompany
me in the trip of scienti c research. I am grateful to Robert Giegerich for
introducing the problem to me, and for refereeing the thesis. His invalu-
able discussions and suggestions have been re ected in many places in this
research. Thanks to Uwe Sch oning for reading the whole thesis and referee-
ing it. My gratitude to Stefan Kurtz for the fruitful discussions that always
attracted my attention to many practical aspects of sequence analysis.
I am thankful to Dirk Strothmann for his fruitful observations that have
improved parts of this research, and to Janina Reeder for providing me with
pieces of codes. I also thank Michael Beckstette and Alexander Sczyrba for
discussions about the su x array and cDNA/EST mapping. My thanks go
also to Kathrin Hockel, Jan Stallkamp, Randall Pruim, and Jakir H. Ullah
for their careful readings and the helpful comments to improve this script.
I am always indebted to my teachers, from school to university, but un-
fortunately the few pages I am restricted to are not enough to mention them
all. Although they were not directly involved in this work, their teachings
have in uenced and will in uence my research and life.
Finally, I am indebted to my parents, my wife, and my family for their
love, encouragement, and support.
This work was supported by Graduiertenkolleg Bioinformatik, Bielefeld,
and by DFG-grant Oh 53/4-1.Contents
1 Introduction 1
1.1 Comparative genomics . . . . . . . . . . . . . . . . . . . . . . 1
1.2 This Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.1 The enhanced su x arrays . . . . . . . . . . . . . . . . 3
1.2.2 The chaining algorithms . . . . . . . . . . . . . . . . . 4
1.2.3 Publications . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2.4 Implementations . . . . . . . . . . . . . . . . . . . . . 6
1.2.5 Thesis Organization . . . . . . . . . . . . . . . . . . . 6
2 The Genome 9
2.1 DNA: The molecule of life . . . . . . . . . . . . . . . . . . . . 9
2.2 Structure of eukaryotic genomes . . . . . . . . . . . . . . . . . 11
2.2.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.2 Chromosomes . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.3 Genes . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.4 Intergenic regions . . . . . . . . . . . . . . . . . . . . . 18
2.3 Structure of prokaryotic genomes . . . . . . . . . . . . . . . . 20
2.3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3.2 The main genome . . . . . . . . . . . . . . . . . . . . . 20
2.3.3 Genes . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3.4 Intergenic regions . . . . . . . . . . . . . . . . . . . . . 21
2.4 Genome dynamics and evolution . . . . . . . . . . . . . . . . . 21
2.5 Comparative genomics and perspectives . . . . . . . . . . . . . 23
3 Comparative Genomics Tools: A Survey 27
3.1 Sequence alignment algorithms . . . . . . . . . . . . . . . . . 28
3.2 Classi cation scheme of the methods . . . . . . . . . . . . . . 31
3.2.1 The semi-manual strategy . . . . . . . . . . . . . . . . 31
3.2.2 The window-based . . . . . . . . . . . . . . . 32
3.2.3 The anchor-based strategy . . . . . . . . . . . . . . . . 33
3.3 Techniques in the anchor-based strategy . . . . . . . . . . . . 33
iCONTENTS
3.3.1 Fragment generation techniques . . . . . . . . . . . . . 33
3.3.2 Anchors nding techniques . . . . . . . . . . . . . . . . 38
3.4 The tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.4.1 WABA . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.4.2 PipMaker and BLASTZ . . . . . . . . . . . . . . . . . 43
3.4.3 LSH-ALL-PAIRS . . . . . . . . . . . . . . . . . . . . . 45
3.4.4 ASSIRC . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.4.5 DIALIGN . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.4.6 GLASS . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.4.7 AVID . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.4.8 CHAOS and LAGAN . . . . . . . . . . . . . . . . . . . 49
3.4.9 MUMmer . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.4.10 MGA . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.4.11 Very recent tools: EMAGEN and Mauve . . . . . . . . 53
3.5 Remarks on the surveyed techniques . . . . . . . . . . . . . . 54
3.5.1 Choice of fragments . . . . . . . . . . . . . . . . . . . . 54
3.5.2 Setting of parameters . . . . . . . . . . . . . . . . . . . 55
3.5.3 The multiple alignment methods . . . . . . . . . . . . 55
3.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4 The Enhanced Su x Array 61
4.1 Basic notions . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.2 The su x tree and the su x array . . . . . . . . . . . . . . . 61
4.3 The enhanced su x array . . . . . . . . . . . . . . . . . . . . 64
4.3.1 Basic concepts and de nitions . . . . . . . . . . . . . . 64
4.3.2 Applications . . . . . . . . . . . . . . . . . . . . . . . . 65
4.4 Algorithms based on properties of the enhanced su x arrays . 66
4.4.1 Repeat analysis and genome comparison . . . . . . . . 66
4.4.2 The lcp-intervals . . . . . . . . . . . . . . . . . . . . . 67
4.4.3 Computation of supermaximal repeats . . . . . . . . . 68
4.4.4 of maximal unique matches . . . . . . . . 69
4.5 Bottom-up traversals . . . . . . . . . . . . . . . . . . . . . . . 70
4.5.1 The lcp-interval tree of a su x array . . . . . . . . . . 71
4.5.2 Computation of maximal repeated pairs . . . . . . . . 73
4.5.3 of exact matches . . . . . . . . 75
4.5.4 of maximal multiple exact matches . . . 78
4.5.5 Other applications of the bottom-up traversal . . . . . 78
4.6 Top-down traversals . . . . . . . . . . . . . . . . . . . . . . . 83
4.6.1 Exact pattern matching . . . . . . . . . . . . . . . . . 84
4.6.2 Other applications of the top-down traversal . . . . . . 90
4.7 Incorporating su x links . . . . . . . . . . . . . . . . . . . . . 92
iiCONTENTS
4.7.1 Su x links . . . . . . . . . . . . . . . . . . . . . . . . 92
4.7.2 Space e cien t computation of MEMs . . . . . . . . . . 97
4.8 Implementation details . . . . . . . . . . . . . . . . . . . . . . 102
4.8.1 The lcp-table . . . . . . . . . . . . . . . . . . . . . . . 102
4.8.2 The child-table . . . . . . . . . . . . . . . . . . . . . . 103
4.8.3 The su x link table . . . . . . . . . . . . . . . . . . . 103
4.9 Experimental results . . . . . . . . . . . . . . . . . . . . . . . 104
4.9.1 Algorithms based on properties of enhanced su x arrays105
4.9.2 based on bottom-up traversal . . . . . . . . 106
4.9.3 Algorithms based on top-down traversal . . . . . . . . 108
4.9.4 for traversals using su x links . . . . . . . 110
4.10 Conclusions and future perspectives . . . . . . . . . . . . . . . 111
5 Chaining Algorithms 115
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.2 Basic concepts and de nitions . . . . . . . . . . . . . . . . . . 116
5.3 The global chaining problem . . . . . . . . . . . . . . . . . . . 118
5.3.1 Graph based formulation . . . . . . . . . . . . . . . . . 118
5.3.2 Geometric based formulation . . . . . . . . . . . . . . . 119
5.4 Two dimensional global chaining without gap costs . . . . . . 121
5.4.1 The Basic chaining algorithm . . . . . . . . . . . . . . 121
5.4.2 Answering RMQ with activation . . . . . . . . . . . . . . 122
5.5 Higher dimensional chaining without gap costs . . . . . . . . . 126
5.5.1 Chaining with range trees . . . . . . . . . . . . . . . . 127
5.5.2 with kd-trees . . . . . . . . . . . . . . . . . . 132
5.6 Incorporating gap costs into the algorithm . . . . . . . . . . . 135
5.6.1 Costs in the L metric . . . . . . . . . . . . . . . . . . 1361
5.6.2 The sum-of-pairs gap cost . . . . . . . . . . . . . . . . 138
5.7 Local chaining: a basic variation . . . . . . . . . . . . . . . . . 143
5.8 Variations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
5.8.1 Including gap constrains . . . . . . . . . . . .