Communication theory applied to selected problems in computational genetics [Elektronische Ressource] / Janis Dingel
215 pages
English

Découvre YouScribe en t'inscrivant gratuitement

Je m'inscris

Communication theory applied to selected problems in computational genetics [Elektronische Ressource] / Janis Dingel

Découvre YouScribe en t'inscrivant gratuitement

Je m'inscris
Obtenez un accès à la bibliothèque pour le consulter en ligne
En savoir plus
215 pages
English
Obtenez un accès à la bibliothèque pour le consulter en ligne
En savoir plus

Informations

Publié par
Publié le 01 janvier 2009
Nombre de lectures 30
Langue English
Poids de l'ouvrage 2 Mo

Extrait

¨ ¨TECHNISCHE UNIVERSITAT MUNCHEN
Lehrstuhl fu¨r Nachrichtentechnik
Communication theory applied to selected problems
in computational genetics
Janis Dingel
Vollst¨andiger Abdruck der von der Fakult¨at fu¨r Elektrotechnik und Informationstechnik
der Technischen Universit¨at Mu¨nchen zur Erlangung des akademischen Grades eines
Doktor–Ingenieurs
genehmigten Dissertation.
Vorsitzender: Univ.–Prof. Dr.–Ing. J. Ebersp¨acher
Pru¨fer der Dissertation:
1. Univ.–Prof. Dr.–Ing. Dr.–Ing. E.h. J. Hagenauer (i.R.)
2. Prof. O. Milenkovic, Ph.D.,
University of Illinois at Urbana Champaign, USA
DieDissertationwurdeam25.06.2009beiderTechnischenUniversit¨atMu¨ncheneingereicht
unddurchdieFakult¨atfu¨rElektro-undInformationstechnikam06.10.2009angenommen.DEDICATED TO
LONI ELISABETH DINGEL
b d23.05.1950 - 23.01.2010Preface
This thesis is the result of my work as a research and teaching assistant at the Institute
for Communications Engineering of the Technische Universit¨at Mu¨nchen. There are a
number of people I would like to thank.
First of all, I would like to thank Professor Dr.–Ing. Dr.–Ing. E.h. Joachim Hagenauer for
givingmetheopportunitytoworkinhisgroupandforhiscontinuoussupportthroughout
the course of my research. His interest and advice have been essential to this work.
Interdisciplinary, collaborative work can only be successful with strong partners from
the other discipline involved. Therefore, I am very grateful to PD Dr. Jakob Mu¨ller,
Dr. Vladimir Kuryshev, and Dr. Ju¨rgen Zech for their invaluable comments and our
fruitful discussions during our regular ComInGen group meetings, and their patience in
explaining biology to me.
Chapter 7 of this work is the result of my collaboration with Professor Olgica Milenkovic
fromtheUniversityofIllinoisatUrbanaChampaign. MyvisitattheCoordinatedScience
Laboratory was an unforgettable experience from which I have benefited a lot. I would
like to thank Professor Milenkovic for this invitation, her advice, and for serving as the
co-examiner of this thesis.
I am very thankful to all my former colleagues and students for the enjoyable atmosphere
and the inspiring working environment at the LNT. I thank Pavol Hanus and Dr. Jo-
hanna Weindl for our scientific discussions and for proofreading this work. Special thanks
to Stehphan Hellerbrand, Manfred Danzer, Martin Kontny and Professor Dr.-Ing. ha-
bil. Gu¨nter S¨oder for making the work in the system administration group enjoyable. I
would also like to thank Professor Dr.–Ing. Norbert Hanik for the possibility to finish my
work at the Institute for Communications Engineering.
Finally I would like to thank my family, my friends, and especially Laura for her love and
support.
Janis DingelMunich, February 2010Contents
1 Introduction 1
2 Information theory and statistical methods 5
2.1 Mathematical notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Information theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2.1 Entropy, divergence and mutual information . . . . . . . . . . . . . 7
2.2.2 Transmission channels . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.3 Example: insertion and deletion channels . . . . . . . . . . . . . . . 13
2.3 Coding theory and error correction . . . . . . . . . . . . . . . . . . . . . . 14
2.3.1 Basic principles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.2 Example: convolutional codes . . . . . . . . . . . . . . . . . . . . . 17
2.4 Markov processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.4.1 Discrete time Markov process (DTMP) . . . . . . . . . . . . . . . . 18
2.4.2 Continuous time Markov process (CTMP) . . . . . . . . . . . . . . 21
2.4.3 Properties of continuous time Markov processes . . . . . . . . . . . 24
2.4.4 Example: continuous evolution of a quarternary symbol . . . . . . . 26
2.4.5 Hidden Markov processes . . . . . . . . . . . . . . . . . . . . . . . . 27
2.5 Maximum likelihood estimation . . . . . . . . . . . . . . . . . . . . . . . . 29
2.5.1 Basic principle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.5.2 Example: ML distance of temporally diverged sequences . . . . . . 31
2.5.3 Properties of the MLE . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.6 Expectation maximization algorithm . . . . . . . . . . . . . . . . . . . . . 33
2.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36vi CONTENTS
3 Communication principles of the living cells 37
3.1 The physical layer - DNA and RNA . . . . . . . . . . . . . . . . . . . . . . 37
3.1.1 Genomic material as digital signals . . . . . . . . . . . . . . . . . . 37
3.1.2 From DNA to RNA . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.2 Transmission and maintenance of DNA . . . . . . . . . . . . . . . . . . . . 41
3.2.1 DNA replication . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.2.2 Error control mechanisms in the cell . . . . . . . . . . . . . . . . . 42
3.2.3 Characterization of genetic errors . . . . . . . . . . . . . . . . . . . 43
3.3 Information packets - the genes . . . . . . . . . . . . . . . . . . . . . . . . 44
3.3.1 What is a gene? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.3.2 Protein coding genes and the genetic code . . . . . . . . . . . . . . 45
3.4 Network layer: gene interaction . . . . . . . . . . . . . . . . . . . . . . . . 47
3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4 Bioinformatics 51
4.1 Analysis of two DNA sequences . . . . . . . . . . . . . . . . . . . . . . . . 51
4.1.1 Sequence homology and scoring schemes . . . . . . . . . . . . . . . 52
4.1.2 Models of sequence evolution . . . . . . . . . . . . . . . . . . . . . 54
4.1.3 Gaps and pairwise sequence alignment . . . . . . . . . . . . . . . . 57
4.2 Analysis of multiple sequences . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.2.1 Evolution model and phylogenetic trees . . . . . . . . . . . . . . . . 58
4.2.2 Felsenstein algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.2.3 Multiple sequence alignment . . . . . . . . . . . . . . . . . . . . . . 62
4.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5 Identification of highly conserved DNA sequences 65
5.1 Background and notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.2 Identification of phylogenetic systems . . . . . . . . . . . . . . . . . . . . . 67
5.2.1 Substitution process estimation . . . . . . . . . . . . . . . . . . . . 67
5.2.2 Tree reconstruction and evolutionary distance . . . . . . . . . . . . 70
5.3 Extended models of evolution . . . . . . . . . . . . . . . . . . . . . . . . . 72CONTENTS vii
5.3.1 A space-time process accounts for variable rates . . . . . . . . . . . 72
5.3.2 Models of rate heterogeneity . . . . . . . . . . . . . . . . . . . . . . 73
5.4 Detection of potentially functional elements . . . . . . . . . . . . . . . . . 74
5.4.1 Identification problem and overview of existing methods . . . . . . 76
5.4.2 KuLcons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.4.3 Run time of the algorithm . . . . . . . . . . . . . . . . . . . . . . . 84
5.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.5.1 Small sample sizes and error distribution . . . . . . . . . . . . . . . 84
5.5.2 Sliding window estimation of a Markov gamma process . . . . . . . 85
5.5.3 In silico comparison of methods . . . . . . . . . . . . . . . . . . . . 87
5.5.4 Application to ENCODE data . . . . . . . . . . . . . . . . . . . . . 89
5.5.5 Extendingψ to model insertion and deletion events . . . . . . . . . 93
5.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.8 Future research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6 On the DNA code reverse engineering problem 101
6.1 Is there an error correcting code in the DNA? . . . . . . . . . . . . . . . . 101
6.1.1 DNA error correction from a coding theoretic perspective . . . . . . 102
6.1.2 Theoretical and experimental evidence . . . . . . . . . . . . . . . . 104
6.1.3 Limitations of Battail’s model . . . . . . . . . . . . . . . . . . . . . 105
6.1.4 Previous work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
6.2 The code reverse engineering problem . . . . . . . . . . . . . . . . . . . . . 107
6.2.1 Problem statement . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
6.2.2 EM encoder tap estimation . . . . . . . . . . . . . . . . . . . . . . 109
6.2.3 Transformation into the log-likelihood domain . . . . . . . . . . . . 111
6.2.4 Derivation of the gradient . . . . . . . . . . . . . . . . . . . . . . . 113
6.2.5 Simulation results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
6.2.6 Inference via global optimization . . . . . . . . . . . . . . . . . . . 116
6.2.7 Extension to quarternary alphabet . . . . . . . . . . . . . . . . . . 117viii CONTENTS
6.2.8 Simulation results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
6.3 Application to DNA sequence data . . . . . . . . . . . . . . . . . . . . . . 119

  • Univers Univers
  • Ebooks Ebooks
  • Livres audio Livres audio
  • Presse Presse
  • Podcasts Podcasts
  • BD BD
  • Documents Documents