Selected communications theoretic aspects in genetics [Elektronische Ressource] / Pavol Hanus
174 pages
English

Découvre YouScribe en t'inscrivant gratuitement

Je m'inscris

Selected communications theoretic aspects in genetics [Elektronische Ressource] / Pavol Hanus

Découvre YouScribe en t'inscrivant gratuitement

Je m'inscris
Obtenez un accès à la bibliothèque pour le consulter en ligne
En savoir plus
174 pages
English
Obtenez un accès à la bibliothèque pour le consulter en ligne
En savoir plus

Description

¨ ¨TECHNISCHE UNIVERSITAT MUNCHENLehrstuhl fu¨r NachrichtentechnikSelected Communications Theoretic Aspectsin GeneticsPavol HanusVollst¨andiger Abdruck der von der Fakult¨at fu¨r Elektrotechnik und Informationstechnikder Technischen Universit¨at Mu¨nchen zur Erlangung des akademischen Grades einesDoktor–Ingenieursgenehmigten Dissertation.Vorsitzender: Univ.–Prof. Dr.–Ing. Jo¨rg Ebersp¨acherPru¨fer: 1. Univ.–Prof. Dr.–Ing. Dr.–Ing. E.h. Joachim Hagenauer2. Univ.–Prof. Dr.–Ing. Martin Bossert, Universit¨at UlmDie Dissertation wurde am 21.04.2010 bei der Technischen Universit¨at Mu¨nchen einge-reicht und durch die Fakult¨at fu¨r Elektrotechnik und Informationstechnik am 13.07.2010angenommen.PrefaceThis thesis was written during my time as a research assistant at the Institute for Com-munications Engineering (LNT) at the Technische Universit¨at Mu¨nchen (TUM). Manypeople, whom I am very grateful to, have contributed in different ways to the success ofthis work.Inthe firstplace, Iwould like tothankmy supervisor Prof.Dr.-Ing.Dr.-Ing.E.h.JoachimHagenauer forhisguidance andsupport. Hegave me the opportunitytoconduct researchin the novel interdisciplinary field of information and communication theory in molecularbiology. The work at his institute was a very fruitful and enriching experience. Theinterdisciplinary and open nature of my research gave me the opportunity to be creativeand learn a lot.

Sujets

Informations

Publié par
Publié le 01 janvier 2010
Nombre de lectures 24
Langue English
Poids de l'ouvrage 6 Mo

Extrait

TECHNISCHEUNIVERSITA¨TMU¨NCHEN
Lehrstuhlfu¨rNachrichtentechnik

SelectedCommunicationsTheoreticAspects
inGenetics

PavolHanus

Vollsta¨ndigerAbdruckdervonderFakulta¨tfu¨rElektrotechnikundInformationstechnik
derTechnischenUniversita¨tMu¨nchenzurErlangungdesakademischenGradeseines

genehmigtenDissertation.

Doktor–Ingenieurs

Vorsitzender:Univ.–Prof.Dr.–Ing.Jo¨rgEberspa¨cher
Pru¨fer:1.Univ.–Prof.Dr.–Ing.Dr.–Ing.E.h.JoachimHagenauer
2.Univ.–Prof.Dr.–Ing.MartinBossert,Universita¨tUlm

DieDissertationwurdeam21.04.2010beiderTechnischenUniversita¨tMu¨ncheneinge-
reichtunddurchdieFakulta¨tfu¨rElektrotechnikundInformationstechnikam13.07.2010
angenommen.

-

oT

ym

parents

-

iv

Contents

3.3.1DNADamageRepairandErrorCorrection..............22
3.3.2DNAMismatchRepairandErrorCorrection.............22
3.4MolecularBiology-GeneticInformationProcessing.............23
3.4.1Transcription..............................24
3.4.2Splicing.................................25
3.4.3Translation...............................26
3.5GeneticInformationProcessingandEngineering...............27
3.5.1TranscriptionandErrorCorrection..................27
3.5.2TranslationandErrorCoding.....................27
3.5.3GeneExpressionandMarkerSynchronization............28
3.6EvolutionaryGenetics-GeneticInformationTransmission.........28
3.6.1GenomeEvolution...........................29
3.6.2TypesofMutations...........................29
3.7GeneticInformationTransmissionandEngineering.............32
3.7.1SubstitutionChannel-ContinuousTimeMarkovProcess......32
3.7.2NucleotideEvolutionisSIMOTransmission.............36
3.7.3EstimationontheTree-FelsensteinAlgorithm...........37
3.7.4SubstitutionSpaceTimeModel....................38
3.7.5SummaryoftheNucleotideEvolutionModel.............39
3.8Summary....................................39

ISourceCodingAspectsinGenetics

14

4CompressioninEngineering43
4.1EntropyCoding.................................44
4.1.1HumanCoding............................45
4.1.2ArithmeticCoding...........................46
4.2UniversalStatisticalPredictionBasedCoding................49
4.2.1AdaptiveArithmeticCoding......................49

Contents

vii

4.2.2ContextTreeWeighting........................51
4.2.3DecompositionContextTreeWeighting................55
4.3UniversalDictionaryBasedCoding......................56
4.3.1SlidingWindowLempelZiv-LZ77..................56
4.3.2TreeStructuredLempelZiv-LZ78..................57
4.4Summary....................................58

5CompressioninGenetics59
5.1DNASequenceCompression..........................60
5.1.1DNACompressionAlgorithms.....................60
5.1.2PerformanceComparisonofDNACompressors...........62
5.2MultipleGenomeAlignment..........................63
5.2.1MotivationforMultipleGenomeAlignments.............64
5.2.2TheMultipleGenomeAlignmentProblem..............64
5.2.3PairwiseAlignment-SmithWatermanAlgorithm..........65
5.2.4MultipleSequenceAlignment(MSA).................67
5.2.5WholeGenomeAlignmentDatasets..................68
5.3MultipleSequenceAlignmentCompression..................68
5.3.1CompressionofNucleotides......................69
5.3.2CompressionofGaps..........................76
5.3.3MultipleSequenceAlignmentCompressor(MSAc)..........78
5.3.4PerformanceandComparison.....................79
5.4MultipleAlignmentFormatFilesCompression................81
5.4.1MultipleAlignmentFormat(MAF)..................81
5.4.2CompressionofMAFSupplementaryData..............81
5.4.3PerformanceandComparison.....................84
5.5Summary....................................87

iiiv

IIMarkerSynchronizationinGenetics

Contents

98

6MarkerSynchronizationinEngineering91
6.1SynchronizationModel.............................91
6.1.1ThresholdSynchronizer.........................92
6.1.2MaximizationSynchronizer......................92
6.2LikelihoodFunctionsforMarkerSynchronization..............93
6.2.1OptimalGeneralLogLikelihoodRatioFunction...........93
6.2.2LikelihoodFunctionforAWGNChannel...............96
6.2.3LikelihoodFunctionsforDiscreteMemorylessChannels.......99
6.2.4LikelihoodFunctionsforSubstitutionChannelModels.......99
6.3MarkerPerformanceandSyncwordChoice..................100
6.3.1ThresholdBasedSynchronization...................101
6.3.2MaximizationBasedSynchronization.................110
6.4Summary....................................112

7MarkerSynchronizationinGenetics115
7.1SequenceSpecicBinding...........................116
7.1.1ProteinDNABinding.........................116
7.1.2BindingMotifs.............................116
7.1.3PositionwiseIndependenceofBindingSites..............120
7.2BindingSiteInferenceandSynchronization..................121
7.2.1LikelihoodFunctionforBindingSiteInference............121
7.2.2ComparisontotheSynchronizationLikelihoodFunction......123
7.2.3VicinityExtendedBindingSiteInference...............124
7.2.4LimitationsofBindingSiteInference.................126
7.3MolecularSynchronization...........................127
7.3.1BindingSiteDetection.........................127
7.3.2ParallelstoThresholdSynchronization................128

Contents

xi

7.3.3GeneralPropertiesofMolecularMarkers...............128
7.3.4SynchronizationPerformanceofTranscriptionMarkers.......130
7.4Summary....................................133

8Conclusion

APublications

153

931

BDerivations143
B.1MatrixExponential...............................143
B.2SynchronizationSuccessProbability......................143

Nomenclature

Bibliography

541

151

Zusammenfassung

EswerdenneueAnwendungenderKommunikations-undInformationstheorieaufProb-
lemeindermolekularenBiologiebehandelt.ImerstenTeilwirdQuellencodierunggenutzt
umdieSpeicheranforderungenvongenomweitenSequenzalignmentDatensa¨tzenzuver-
ringern.EinhochezienterKompressionsalgorithmusbasierendaufstatistischenMod-
ellenderEvolutionundTechnikenausderbina¨renBildcodierungwirdvorgeschlagen.
ImzweitenTeilwerdenParallelenzwischenderMarkerSynchronisationu¨berverrauschte
Kana¨leundderProtein-DNABindungsstellensuchestudiert.StatistischeBindungsstellen
ModelleundInferenztechnikenwerdenausinformationstheoretischerSichtanalysiertund
erweitert.Synchronisationseigenschaftenvonausgewa¨hltenmolekularenMarkernwerden
evaluiertundEvidenzfu¨rSelektionsdruckzugunstenezienterMarkergefunden.

Abstract

Thisthesiscoversnovelapplicationsofconceptsfromcommunicationsengineeringtoprob-
lemsinmolecularbiology.Intherstpartthefocusisplacedonapplyingsourcecoding
techniquestoreducethestoragerequirementofmultiplegenomealignmentdatasetsused
incomparativegenomics.Ahighlyecientlosslesscompressionalgorithmusingwell
establishedmodelsofgenomeevolutionandbinaryimagecompressiontechniquesisin-
troduced.Thesecondpartstudiesparallelsbetweensequencespecicproteinbindingon
themolecularlevelandmarkersynchronization.Theengineeringconceptofthreshold
basedmarkersynchronizationovernoisychannelsisrevisedandextended.Bindingsite
modelsandinsilicoinferencetechniquesarereviewedusinginformationtheory.Synchro-
nizationpropertiesofselectedmolecularmarkersareanalysedandevidencefornatural
selectionpressuretowardsgoodmarkersisfound.

2

Chapter1Introduction

arethatforalongtimeitwasbelievedthatproteincodinggenesaremoreorlessthe
onlyfunctionalregionsinthegenomeandthefactthatonlyaverylimitedamountof
sequencedatawasavailable.Recentadvancesinhigh-throughputsequencingtechnology
havemadeitpossibletosequencewholegenomesofcomplexspecies.Thecompletion
oftherstdraftofthehumangenomein2001[LLB+01]hasrevealedthatonlyseveral
percentofitisproteincodingandthattherearelessproteincodinggenesthanexpected.
Thesequencingofothervertebrategenomescouldbecompletedsoonafterandcompar-
ativegenomicsstudieshaveuncoveredthatthereexistmanyhighlyconservedputatively
functionalregionsunderstrongevolutionarypressurenotcodingforproteins[BPM+04].
Theirfunctionisstilllargelyunknown,butithasbeensuggestedthattheymightplay
aroleingeneregulation[Mat07],whichgovernswhenandinwhatamountsparticular
genesgetexpressedintoproteins.
Therecentndingshaveraisedmanynewopenquestionsaboutthegeneticinformation
storedinthegenomes,itsfunctionality,storageandprocessing.Togetherwiththeex-
ponentialgrowthofsequencedataavailableinpublicdatabases,thishasreignitedthe
interestofinformationtheoristsandcommunicationsengineersformolecularbiology.To
nameafewcontributions:G.Battailhaspostulatedthehypothesisthaterrorcorrecting
codesareusedontheDNAsequencelevel[Bat97].J.Hagenaueretal.appliedinforma-
tiontheorytogenemappingofcomplexdiseases[HDG+04,SGD+07]andtoclassica-
tionofgeneticsequencedata[DHHM05]aswellastocomparativegenomics[DHL+08].
O.Milenkovicetal.introducedmethodsfromchannelcodingtogeneregulation[MV04].
W.Szpankowskietal.usedsourcecodingtechniquestoidentifyproteincodingre-
gions[SRS05].Theinitialresearchresultsconrmthatmodellingandanalysingtheway
naturedealswithgeneticinformationfromacommunicationsengineeringperspective
canleadtoitsbetterunderstanding.Thenewinsightsgeneratedbythisinterdisciplinary
researchmightevenhavethepotentialtohelpadvancefuturecommunicationssystems.
Furthermore,methodsfrom

  • Univers Univers
  • Ebooks Ebooks
  • Livres audio Livres audio
  • Presse Presse
  • Podcasts Podcasts
  • BD BD
  • Documents Documents