La lecture à portée de main
Découvre YouScribe en t'inscrivant gratuitement
Je m'inscrisDécouvre YouScribe en t'inscrivant gratuitement
Je m'inscrisDescription
Sujets
Informations
Publié par | technische_universitat_munchen |
Publié le | 01 janvier 2010 |
Nombre de lectures | 24 |
Langue | English |
Poids de l'ouvrage | 6 Mo |
Extrait
TECHNISCHEUNIVERSITA¨TMU¨NCHEN
Lehrstuhlfu¨rNachrichtentechnik
SelectedCommunicationsTheoreticAspects
inGenetics
PavolHanus
Vollsta¨ndigerAbdruckdervonderFakulta¨tfu¨rElektrotechnikundInformationstechnik
derTechnischenUniversita¨tMu¨nchenzurErlangungdesakademischenGradeseines
genehmigtenDissertation.
Doktor–Ingenieurs
Vorsitzender:Univ.–Prof.Dr.–Ing.Jo¨rgEberspa¨cher
Pru¨fer:1.Univ.–Prof.Dr.–Ing.Dr.–Ing.E.h.JoachimHagenauer
2.Univ.–Prof.Dr.–Ing.MartinBossert,Universita¨tUlm
DieDissertationwurdeam21.04.2010beiderTechnischenUniversita¨tMu¨ncheneinge-
reichtunddurchdieFakulta¨tfu¨rElektrotechnikundInformationstechnikam13.07.2010
angenommen.
-
oT
ym
parents
-
iv
Contents
3.3.1DNADamageRepairandErrorCorrection..............22
3.3.2DNAMismatchRepairandErrorCorrection.............22
3.4MolecularBiology-GeneticInformationProcessing.............23
3.4.1Transcription..............................24
3.4.2Splicing.................................25
3.4.3Translation...............................26
3.5GeneticInformationProcessingandEngineering...............27
3.5.1TranscriptionandErrorCorrection..................27
3.5.2TranslationandErrorCoding.....................27
3.5.3GeneExpressionandMarkerSynchronization............28
3.6EvolutionaryGenetics-GeneticInformationTransmission.........28
3.6.1GenomeEvolution...........................29
3.6.2TypesofMutations...........................29
3.7GeneticInformationTransmissionandEngineering.............32
3.7.1SubstitutionChannel-ContinuousTimeMarkovProcess......32
3.7.2NucleotideEvolutionisSIMOTransmission.............36
3.7.3EstimationontheTree-FelsensteinAlgorithm...........37
3.7.4SubstitutionSpaceTimeModel....................38
3.7.5SummaryoftheNucleotideEvolutionModel.............39
3.8Summary....................................39
ISourceCodingAspectsinGenetics
14
4CompressioninEngineering43
4.1EntropyCoding.................................44
4.1.1HumanCoding............................45
4.1.2ArithmeticCoding...........................46
4.2UniversalStatisticalPredictionBasedCoding................49
4.2.1AdaptiveArithmeticCoding......................49
Contents
vii
4.2.2ContextTreeWeighting........................51
4.2.3DecompositionContextTreeWeighting................55
4.3UniversalDictionaryBasedCoding......................56
4.3.1SlidingWindowLempelZiv-LZ77..................56
4.3.2TreeStructuredLempelZiv-LZ78..................57
4.4Summary....................................58
5CompressioninGenetics59
5.1DNASequenceCompression..........................60
5.1.1DNACompressionAlgorithms.....................60
5.1.2PerformanceComparisonofDNACompressors...........62
5.2MultipleGenomeAlignment..........................63
5.2.1MotivationforMultipleGenomeAlignments.............64
5.2.2TheMultipleGenomeAlignmentProblem..............64
5.2.3PairwiseAlignment-SmithWatermanAlgorithm..........65
5.2.4MultipleSequenceAlignment(MSA).................67
5.2.5WholeGenomeAlignmentDatasets..................68
5.3MultipleSequenceAlignmentCompression..................68
5.3.1CompressionofNucleotides......................69
5.3.2CompressionofGaps..........................76
5.3.3MultipleSequenceAlignmentCompressor(MSAc)..........78
5.3.4PerformanceandComparison.....................79
5.4MultipleAlignmentFormatFilesCompression................81
5.4.1MultipleAlignmentFormat(MAF)..................81
5.4.2CompressionofMAFSupplementaryData..............81
5.4.3PerformanceandComparison.....................84
5.5Summary....................................87
iiiv
IIMarkerSynchronizationinGenetics
Contents
98
6MarkerSynchronizationinEngineering91
6.1SynchronizationModel.............................91
6.1.1ThresholdSynchronizer.........................92
6.1.2MaximizationSynchronizer......................92
6.2LikelihoodFunctionsforMarkerSynchronization..............93
6.2.1OptimalGeneralLogLikelihoodRatioFunction...........93
6.2.2LikelihoodFunctionforAWGNChannel...............96
6.2.3LikelihoodFunctionsforDiscreteMemorylessChannels.......99
6.2.4LikelihoodFunctionsforSubstitutionChannelModels.......99
6.3MarkerPerformanceandSyncwordChoice..................100
6.3.1ThresholdBasedSynchronization...................101
6.3.2MaximizationBasedSynchronization.................110
6.4Summary....................................112
7MarkerSynchronizationinGenetics115
7.1SequenceSpecicBinding...........................116
7.1.1ProteinDNABinding.........................116
7.1.2BindingMotifs.............................116
7.1.3PositionwiseIndependenceofBindingSites..............120
7.2BindingSiteInferenceandSynchronization..................121
7.2.1LikelihoodFunctionforBindingSiteInference............121
7.2.2ComparisontotheSynchronizationLikelihoodFunction......123
7.2.3VicinityExtendedBindingSiteInference...............124
7.2.4LimitationsofBindingSiteInference.................126
7.3MolecularSynchronization...........................127
7.3.1BindingSiteDetection.........................127
7.3.2ParallelstoThresholdSynchronization................128
Contents
xi
7.3.3GeneralPropertiesofMolecularMarkers...............128
7.3.4SynchronizationPerformanceofTranscriptionMarkers.......130
7.4Summary....................................133
8Conclusion
APublications
153
931
BDerivations143
B.1MatrixExponential...............................143
B.2SynchronizationSuccessProbability......................143
Nomenclature
Bibliography
541
151
Zusammenfassung
EswerdenneueAnwendungenderKommunikations-undInformationstheorieaufProb-
lemeindermolekularenBiologiebehandelt.ImerstenTeilwirdQuellencodierunggenutzt
umdieSpeicheranforderungenvongenomweitenSequenzalignmentDatensa¨tzenzuver-
ringern.EinhochezienterKompressionsalgorithmusbasierendaufstatistischenMod-
ellenderEvolutionundTechnikenausderbina¨renBildcodierungwirdvorgeschlagen.
ImzweitenTeilwerdenParallelenzwischenderMarkerSynchronisationu¨berverrauschte
Kana¨leundderProtein-DNABindungsstellensuchestudiert.StatistischeBindungsstellen
ModelleundInferenztechnikenwerdenausinformationstheoretischerSichtanalysiertund
erweitert.Synchronisationseigenschaftenvonausgewa¨hltenmolekularenMarkernwerden
evaluiertundEvidenzfu¨rSelektionsdruckzugunstenezienterMarkergefunden.
Abstract
Thisthesiscoversnovelapplicationsofconceptsfromcommunicationsengineeringtoprob-
lemsinmolecularbiology.Intherstpartthefocusisplacedonapplyingsourcecoding
techniquestoreducethestoragerequirementofmultiplegenomealignmentdatasetsused
incomparativegenomics.Ahighlyecientlosslesscompressionalgorithmusingwell
establishedmodelsofgenomeevolutionandbinaryimagecompressiontechniquesisin-
troduced.Thesecondpartstudiesparallelsbetweensequencespecicproteinbindingon
themolecularlevelandmarkersynchronization.Theengineeringconceptofthreshold
basedmarkersynchronizationovernoisychannelsisrevisedandextended.Bindingsite
modelsandinsilicoinferencetechniquesarereviewedusinginformationtheory.Synchro-
nizationpropertiesofselectedmolecularmarkersareanalysedandevidencefornatural
selectionpressuretowardsgoodmarkersisfound.
2
Chapter1Introduction
arethatforalongtimeitwasbelievedthatproteincodinggenesaremoreorlessthe
onlyfunctionalregionsinthegenomeandthefactthatonlyaverylimitedamountof
sequencedatawasavailable.Recentadvancesinhigh-throughputsequencingtechnology
havemadeitpossibletosequencewholegenomesofcomplexspecies.Thecompletion
oftherstdraftofthehumangenomein2001[LLB+01]hasrevealedthatonlyseveral
percentofitisproteincodingandthattherearelessproteincodinggenesthanexpected.
Thesequencingofothervertebrategenomescouldbecompletedsoonafterandcompar-
ativegenomicsstudieshaveuncoveredthatthereexistmanyhighlyconservedputatively
functionalregionsunderstrongevolutionarypressurenotcodingforproteins[BPM+04].
Theirfunctionisstilllargelyunknown,butithasbeensuggestedthattheymightplay
aroleingeneregulation[Mat07],whichgovernswhenandinwhatamountsparticular
genesgetexpressedintoproteins.
Therecentndingshaveraisedmanynewopenquestionsaboutthegeneticinformation
storedinthegenomes,itsfunctionality,storageandprocessing.Togetherwiththeex-
ponentialgrowthofsequencedataavailableinpublicdatabases,thishasreignitedthe
interestofinformationtheoristsandcommunicationsengineersformolecularbiology.To
nameafewcontributions:G.Battailhaspostulatedthehypothesisthaterrorcorrecting
codesareusedontheDNAsequencelevel[Bat97].J.Hagenaueretal.appliedinforma-
tiontheorytogenemappingofcomplexdiseases[HDG+04,SGD+07]andtoclassica-
tionofgeneticsequencedata[DHHM05]aswellastocomparativegenomics[DHL+08].
O.Milenkovicetal.introducedmethodsfromchannelcodingtogeneregulation[MV04].
W.Szpankowskietal.usedsourcecodingtechniquestoidentifyproteincodingre-
gions[SRS05].Theinitialresearchresultsconrmthatmodellingandanalysingtheway
naturedealswithgeneticinformationfromacommunicationsengineeringperspective
canleadtoitsbetterunderstanding.Thenewinsightsgeneratedbythisinterdisciplinary
researchmightevenhavethepotentialtohelpadvancefuturecommunicationssystems.
Furthermore,methodsfrom