Bioinformatic approaches for genome finishing [Elektronische Ressource] / Peter Husemann. Technische Fakultät - AG Genominformatik

Publié par

Bioinformatic Approaches forGenome FinishingPh. D. Thesissubmitted to theFaculty of Technology,Bielefeld University, Germanyfor the degree of Dr. rer. nat.byPeter HusemannJuly, 2011Referees:Prof. Dr. Jens StoyePD Dr. Andreas TauchGedruckt auf alterungsbeständigem Papier nach DIN-ISO 9706.Printed on non-aging paper according to DIN-ISO 9706.To my brother.Contents1 Introduction 12 Sequencing Technologies – Biological Background 52.1 DNA – The Backbone of Life on Earth . . . . . . . . . . . . . . . . . . . 52.2 Technologies to Assess the Sequence of DNA . . . . . . . . . . . . . . . 82.2.1 Traditional: Sanger Sequencing . . . . . . . . . . . . . . . . . . . 82.2.2 The ‘Next’ Generation: Massively Parallel Sequencing . . . . . 92.2.3 Third Single Molecule Sequencing . . . . . . . . . 132.3 Genome Sequencing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152.3.1 Shotgun Sequencing . . . . . . . . . . . . . . . . . . . . . . . . . 152.3.2 Assembly Phase . . . . . . . . . . . . . . . . . . . . . . . . . . . 172.3.3 Genome Finishing . . . . . . . . . . . . . . . . . . . . . . . . . . 183 Efficient Matching of Contigs 233.1 Finding Local Similarities . . . . . . . . . . . . . . . . . . . . . . . . . . 243.1.1 Smith-Waterman . . . . . . . . . . . . . . . . . . . . . . . . . . . 253.1.2 Seed and Extend Heuristics . . . . . . . . . . . . . . . . . . . . . 273.1.3 Search Space Filtering . . . . . . . . . . . . . . . . . . . . . . . . 283.
Publié le : samedi 1 janvier 2011
Lecture(s) : 27
Tags :
Source : D-NB.INFO/101537087X/34
Nombre de pages : 115
Voir plus Voir moins

maticorBioinf

Approaches

FinishingGenome

DPh.Thesis.

thetosubmitted

FacultyofTechnology,

BielefeldUniversity,Germany

forthedegreeofDr.rer.nat.

yb

PHusemanneter

erees:Ref

2011,ulyJ

Prof.Dr.JensStoye

PDDr.AndreasTauch

orf

Gedruckt

Printed

auf

on

alterungsbeständigem

non-aging

paper

Papier

dingaccor

to

nach

DIN-ISO

DIN-ISO

9706.

9706.

oT

my

br.other

Contents

1

2

3

oductionIntr

SequencingTechnologies–BiologicalBackground
2.1DNA–TheBackboneofLifeonEarth...................
2.2TechnologiestoAssesstheSequenceofDNA...............
2.2.1Traditional:SangerSequencing...................
2.2.2TheNextGeneration:MassivelyParallelSequencing.....
2.2.3ThirdGeneration:SingleMoleculeSequencing.........
2.3GenomeSequencing.............................
2.3.1ShotgunSequencing.........................
2.3.2AssemblyPhase...........................
2.3.3GenomeFinishing..........................

ContigsofhingMatcEfficient3.1FindingLocalSimilarities..........................
3.1.1Smith-Waterman...........................
3.1.2SeedandExtendHeuristics.....................
3.1.3SearchSpaceFiltering........................
3.2MatchingbyFilteringwithq-Grams....................
3.2.1GeneralIdea..............................
3.2.2BuildinganIndexoftheReferenceGenome...........
3.2.3FilteringforSimilarities.......................
3.3r2cat–TheRelatedReferenceContigArrangementTool........
3.3.1Matching................................
3.3.2Visualization.............................
3.3.3SimpleContigMapping.......................

1

558891315151718

23242527282929303133343739

i

4

5

6

7

AdvancedContigLayoutingusingMultipleReferenceGenomes43
4.1TheContigAdjacencyGraph........................44
4.1.1Notation................................44
4.1.2WeightingtheAdjacencyEdges..................47
4.1.3CreatingaBasicContigAdjacencyGraph............48
4.2FindingaLayoutoftheContigs.......................50
4.2.1TravelingSalesmanTourThroughtheGraph..........50
4.2.2FastAdjacencyDiscoveryAlgorithm...............52
4.3EnhancementsoftheGraphCreation...................53
4.3.1IncludingPhylogeneticDistances.................53
4.3.2IntegratingAdditionalInformation................54
4.4VariationsoftheContigLayouting.....................54
4.4.1HandlingRearrangements.....................55
4.4.2Repeat-awareLayouting.......................55
63SoftwaretheofRealization5.1ImplementationalMilestones........................63
5.1.1r2cat..................................63
5.1.2treecat..................................64
5.1.3repcat..................................65
5.1.4htscat..................................66
5.2ExternalSoftwareandLibraries.......................68
5.2.1FreeHEPGraphicsExport......................68
5.2.2Graphviz................................69
5.2.3NetBeansPlatform..........................69
5.2.4PrefuseGraphVisualization....................69
LayoutingCorynebacteriaContigs71
6.1BackgroundandPreparatorySteps.....................71
6.1.1DescriptionoftheDatasets.....................71
6.1.2DeterminingaReferenceLayout..................74
6.1.3ParameterEstimationfortheContigAdjacencyGraph.....74
6.1.4OtherSoftwareforContigLayouting...............77
6.2EvaluationonRealSequencingData....................79
6.2.1SingleReferenceBasedOrdering..................80
6.2.2MultipleReferenceBasedLayouting................83
6.2.3LayoutingRepetitiveContigs....................90
95OutlookandySummar99wledgmentsknoAc101yliographBib

1Chapter

oductionIntr

AdvMolecularancesinbiologybiologyandprovidebioinformormaticseandareotherentangledtypesbyofadatafortunatewhichsyneragaingeticdrivefefect:the
developmentofcomputeraidedanalyses.Advancedanalyses,inturn,helptoin-
terpretandunderstandthedataandthusleadtomoresophisticatedtheories,and
evgenomeentuallyanalysisresult,forinnewexample,prexperimentsogressedprotrvidingemendouslyfurtherthrdata.oughThethisrefesearfect.chfieldof
Genomicsequencesareconsideredhavingahugepotentialtogaindeeperinsight
intoAccorthedinglylife,ofortechniquesganisms,forandthetoshedsequencinglightofonthegenomes–inheritancethatisofthetheirdeterprminationoperties.
oftheirnucleotidesequence–haveevolvedduringthelastdecades.Especially
therecenthighthroughputtechniquesincreasedtheavailabilityandenhancedthe
.considerablysequencingofsimplicityAlthoughthepopularpressoccasionallyconfusesthedeterminationofacomplete
genomesequencewithadecipheringofthecorrespondinggenome,thereisabig
difther,fertheence.plainGenomicsequencedataofarenucleotidescomplexisandnottheiranalysisall-encompassing,isoftenasthechallenging.findingFurof-
epigeneticinformation,suchasmethylationpatterns,demonstrates.
onlyYet,theknovisiblewingthecharacteristicsgenomicofasequencespecies.allowsInthemoreinvdetailedestigationofanalysesanewlythanobsersequencedving
genome,thelocationsofputativegenescanbesearched,oralreadyknowngenes
canbeannotated.Approachesexistwhichtrytocomputationallyfoldtheproteins
thatthesegenesencodeinordertopredicttheirfunctions.
Besidesgenes,otherfunctionalsequencesexistinagenomewhicharesometimes
rdiscovecognizableeryprbyogramsspecialcandetect,nucleotideforpatterexample,ns,sotranscriptioncalledmotifs.factorSbindingophisticatedsiteswhichmotif
playaroleintheregulationofgeneexpression.
Thesequencedsubjectofgenomes.comparativeAsearlyasgenomicsthepr18thofitsfrcenturomy,theCarlincrLinnæuseasingtriednumbertooforalrdereadyand

1

1.ChapterIntroduction

categorizelife.However,theavailabilityofgenomicsequencesincurrenttimes
undoubtedlybooststheaccuracyatwhichatreeoflifeaccordingtoDarwins
theoryofevolutioncanbe(re)constructed.Opposedtoadistinctionofspeciesbased
onvisiblecharacteristics,nucleotidesequencesprovideamuchfinergranularity.
Thisevenresultedintaxonomicreclassificationsofsomespeciesaftersequencing
theirgenomerevealedthattheyactuallybelongtoanothergenus.
Comparingthegenecontentofdifferentspeciescanalsoprovidehintsontheir
evolutionaryrelationship.Tothisend,similaritiesofgenomicregionsaredeter-
minedbyaligningthecorrespondingsequences.Geneswithasimilarsequence
thatoccurindistinctgenomesarecommonlypresumedtobehomologouswhich
meansthattheywereobtainedfromacommonancestor.
Ifhomologousgenesofasetofspeciesaregiven,then,onamoreabstractlevel,
clustersofgenescanbesearchedinwhichthegenesoftenappearinnearproximity
acrossspecies.Thesegeneclustersmightindicateacertainevolutionarypressure
tostaytogether,andpossiblyevenimplyafunctionalrelation.
Duetoevolutionaryprocesses,sometimesgenomesbecomerearranged.Thatis,
blocksofDNAaredisplaced,orreversed,foreignpiecesareincluded,orfragments
ofthegenomegetlostorareduplicated.Theserearrangementscanbestudiedat
thelevelofhomologoussequencestoanswerthequestionwhichoperationsmay
havecausedarearrangementstructurethatisobservedbetweentwospecies.
Rearrangementscanalsobeinvestigatedatnucleotidelevel.Inbacteria,forin-
stance,aninversioncanpossiblydisableagene,renderinganoriginallypathogenic
germcompletelyharmless.Viceversa,atoxinproducinggenecanbeintroduced
species.innocuoususuallyintoComparingthegenomesofindividualsbelongingtothesamespeciescanhelpto
gaindeeperunderstandingaswell.Forexample,inhumansminorchangesinthe
genome,thesocalledsinglenucleotidepolymorphisms(SNPs),aresuspectedtobe
involvedincancerdevelopment.
Inbiotechnology,bacteriaaregrowntoproducedesiredcompoundsorenzymes,
andusuallytheproductionstrainshaveundergoneaprocessofartificialselection
toincreasetheirefficiency.Thecomparisonofsuchaproductionstrainwiththe
wildtypeitoriginatedfrommayhelptocomprehendwhichgenomicchangesled
toanenhancedefficiency.
Firstattemptsweresuccessfultocreateasyntheticminimalgenomethatcon-
tainsonlythegenesnecessarytoproliferate.Thisopensthedoortofabricated
speciescontainingseveralgenesofdifferentspecies,almostlikebuildingblocks,to
generatecertainwantedphenotypesorproperties.
Alltheseexamplesshowanimpressivedevelopment.Nevertheless,theanalysis
ofgenomeshasnottappeditsfullpotential.Comingbacktotheinitiallymen-
tionedsynergeticeffect,maybesomeofcomputeraidedanalyseswillhelptofind
newaspectsworthinvestigatingwhichthenopennewfieldsofresearch.Genomic

2

sequencesaresurelyakeycomponentinthisprocess,andhavingmoreofthem
availablealsodrivesthedevelopmentofmethodstoanalyzethem.
Sequencingitselfisbecomingaroutinetask.Yet,allcurrentsequencingap-
Tprodate,oachesonlystillahafevwehundrlimitations,edbasesandcanthebermosteadstrikingconsecutivoneelyisfrtheomsmallgenomesreadthatlength:are
typicallyseveralordersofmagnitudelarger.
mentsImprovleadementstolongerinrchemistreads,yan,easiermethodologyhandling,andofthemachineryexperiments,intheandlatestadevmassivelop-e
ithasparallelizationnotchangedofthethatinsequencinganassemblyandtherphaseeforemanyalsootovhighererlappingthrreadsoughput.ofaThough,sequenc-
ingTherungoalhaveoftothisbestitchedassemblytogetherphase.istoobtainasingleandcompletenucleotide
species.sequenceoftheUnfortunatelygenome,–veroryofofteneachthischrgoalomosomeisinimpossiblethecasetoofreach,multi-chroratleastomosomalits
calledoutcomecontigsis,unrareeliable.reconstructed.Instead,severalMissingorcontiguousunreliablestretchesreadofinfornucleotides,mationleadsthesoto
contigs.theseparatethatgapstheInthecontigsprtoocesstheofdesirfinishing,edcompletethesegapssequence.arerItisesequencedhelpfulintotheknolabwinorwhichdertoofjointhe
Insteadcontigsarofetryingadjacentalltopossiblespecificallyadjacencies,picktheDNAsometimesinbetwtheeeninitialforfurthersequencingalrsequencing.eady
providesAnotherhintsasestablishedtowhichapprofoachtheistocontigsusearesequencesadjacent.ofothergenomesasaguiding
rthateferareence.assumedThesertoeferhavenceeahighgenomesdegreeoriginateofinsynteny.generalByfrmatchingomcloselythercontigselatedontospeciesthe
referenceandanalyzingthematches,adjacenciesofthecontigscanbeestimated.
nomesThewtoorkprestimateesentedinadjacenciesthisofthesisiscontigs.basedWeonprtheesentconceptalgorithmsofusingandreferapprenceoachesge-
anevapplicablealuationforwithonerorealseveralsequencingreferencedata.Somegenomespartsandofthiscomparthesiseourhaveperforalreadymancebeenin
publishedinadvance[41,42,43].

grOveroundvieofwofthiswthisorkbThesisydescribingThefollotherwingoleofChaptergenomic2prosequencesvidestheandthebiologicalprocessesback-
todures,acquirasewellthem.asaThisdescriptionincludesoftheinfortypicalmationwork-oaboutwrtoecentsequencehighthrwholeoughputgenomes.proce-
andprChapteroposes3givaesmethodanintrwhichoductionweusehowfortomatchingcomputecontigssimilaritiesontorbetwefereenencesequencesgenomes.
plotsAdditionallyandto,thearrangeaapplicationsetofr2catcontigsispraccoresenteddingtothatmatchescanbeontousedartoefercrenceeatesyntenygenome.

3

1.ChapterIntroduction

ThesubsequentChapter4containsthemainmethodicalcontributionofthisthe-
forsis.Itmationdefinesabouttheneighboringconceptofacontigscontigfromradjacencyelatedrgrapheferencewhichwegenomes.usetoWecollectgivethein-
algorithmstocomputethisgraphandaheuristictoextractthemostpromisingad-
jacenciesofthecontigs.Additionally,wediscussvariationstoincludeadditional
informationandprovideenhancementstodealwiththespecialcaseofrepetitive
contigsthatmapnon-uniquelytothereferencegenomes.
andInwhichChapter5featur,wesearedescribecurrhoentlywathevailable.softwareFurther,implementingwebrieyourintrideasoducehaseextervolvnaled
softwarethatweemploy.
Afterthis,weshowanevaluationofoursoftwareinChapter6.Tothisend,we
compareourimplementationstootherrelatedapproachesusingcontigsfromreal
data.sequencingFinally,Chapter7concludesandgivesanoverviewofpossibleimprovementsand
methods.ourofapplicationsunintended

4

2Chapter

BackgrSequencingoundTechnologies–Biological

Inthischapter,wewanttoprovidethebiologicalbackgroundforthecontentsof
thelifersciences,emainingandthesis.goonWewithstartprwithoceduranesintrtooductionaccessthetoDNAinforandmationitsrolecontainedforthein
plainDNAinthismolecules,contextnamelyhowtheirgenomenucleotidesequencingprsequence.ojectswFinallyork,,inandSwhyectionthe2.3y,weusuallyex-
matics.bioinforeolvinvSomebasicbiologicalinformationpresentedinthischapterhavebeenextracted
fromsequencingatextbooktechnologiesonmolecularwereobtainedgeneticsfr[52om],arfurtherecentlyinforpublishedmationPh.aboutD.thethesisne[w78er].

2.1DNA–TheBackboneofLifeonEarth

ingTheorganismsdeoxyribosecontainnucleicthracidead-like(DNA)DNAisakeymoleculescomponentwhichforencodelifeonmost,earth.ifAllnotliv-all,
inheritableinformation.WhilefirstexperimentsandfindingswithrespecttoDNA
canpublishedbedatedinback1944tobytheOsw18thaldAvcentureryy,andtherolecolleaguesasa[8material].Thisofinheritancepublicationwassparkedfirst
theinterestofErwinChargaffandhebegantoinvestigatethispromisingmolecule.
Headeninedisco(vAer)edandthatthyminetheDNA(T)ofasallwellorasganismsthesamefeatureamountsthesameofcytosineamounts(Cof)theandbasesgua-
nine(G)andpostulatedthatthesebasesoccurpairwise[21].Thechemicalstructure
ofthebasesisshownintheupperpartofFigure2.1onthefollowingpage.In1953,
JamesWatsonandFrancisCrick[102]publishedasuggestionforthestructureof
DNAmoleculesthatisacceptedtobevaliduntiltoday.InspiredbyX-Raycrystal-
lographydataofMauriceWilkins[32]andRosalindFranklin[107]–publishedin
thesameissueofNature–WatsonandCrickproposedadoublehelixasshownin

5

Chapter2.SequencingTechnologies–BiologicalBackground

FigureImage2.1:islicensedChemicalstrundeructureCCofBYthe-SAfour3.0nandwucleotidesasadoptedandfrommolecular[106str].uctureofDNA.

thelowerpartofFigure2.1.Thesugar-phosphatebackbonesofthehelixareheld
togetherbyhydrogenbondsofthebasesontheinsideofthehelices,comparable
tothestepsofaladder.Ineachbasepair,theyproposed,apurinebase(AorG)
wouldbindtoapyrimidinebase(CorT).TogetherwithChargaffsobservations,it
wasclearthatAhastopairwithT,andCpairswithG.Insomeunderstatement
theauthorswrote:Ithasnotescapedournoticethatthespecificpairingwehave
postulatedimmediatelysuggestsapossiblecopyingmechanismforgeneticmate-
rial.Thispairingidealaidthefoundationsforseveralimportantdiscoverieslike
theprocessofDNAreplication,thedecipheringofthegeneticcode,andeventually
thefirstsequencingofahumangenomein2001[55,99].
EachcellofanyknownorganismonearthcontainsgenomicDNAthatisoften
giveninseveraldistinctmoleculescalledchromosomes.Thenucleotidesequencesof
thechromosomescanbeseparatedintonon-codingandcodingstretches.Thelatter
aretermedgenesandtheirnucleotidesequencesdescribeblueprintsforproteinsor
otherfunctionalmoleculeswhichcanhaveavarietyoffunctions,liketheenzymatic
digestionofotherproteinsortheregulationoftheexpressionofothergenes.Collec-
tively,thesefunctionsofexpressedgenesdeterminetheproperties,thecapabilities,
andtheappearanceofthatorganism.Inanutshell,thesequenceofnucleotidesen-
codesthepropertiesofitsorganism.Itisthusdesirabletousegenomicsequencesas
adatasourcetopredictgenesandtheirfunctions,todiscoverregulatorypathways,
ortoinvestigateevolutionaryrelationshipsofthespeciesonamolecularlevel.
Theprocessofacquiringthesequenceofbasesiscalledsequencingandwillbe
addressedinthenextsections.Pleasenotethatthetermsequencingisoftenused
ambiguously.Dependingonthecontext,iteitherreferstothetechnicalprocessof
readingsinglebasesfromfragmentsofDNAmoleculesasaddressedinthenext

6

2.1.DNA–TheBackboneofLifeonEarth

Figure2.2:SizeofnucleotidesequencesstoredinGenBank.DatafromSection2.2.8
el.txtv/genbank/gbrftp://ftp.ncbi.nih.goof

section,sequenceorofitanstandsorganismfortheascombineddescribedwinSork-oectionwof2.3.obtainingthecompletegenomic
AlthoughgenomicDNAmoleculesusuallyconsistoftwostrandsofnucleotides,
itsufficestosequenceandstoreoneofthembecausethesecondstrandisthereverse
complementofthefirstone.Thenumberofavailablegenomicsequencesstoredin
shothiswswaytheisgrowthconstantlyofgroGenBankwing.[12This,is104in],aparticularcentraldatabaseconsolidatedforbygeneticFigure2.2sequencesthat
hostedbytheNationalCenterforBiotechnologyInformation(NCBI).
TheinterestofscientistsintheseDNAsequencesisdiverse:Intheprokaryotic
domainexamplethatbacteria,containssomeunicellularspeciesaroreganismssequencedwithouttoincraeasetruetheircellnucleus,biotechnologicallikefor
efficiency.InCorynebacteriumglutamicumthegenomesequencehelpedtounder-
standthemetabolomicpathwaysuchthattheglutamateproductioncouldbeopti-
teriummized[50tuber].culosisFurther[23]morare,einvpathogenicestigatedtobacteriagainlikeinsightBacillusabouttheiranthracis[deathly77]oreffects.Mycobac-
Eukaryoticgenomesfeature,incontrasttoprokaryotes,amuchlargerandmore
complexgenome.Sincetheefforttosequencesuchgenomesisstillratherhigh,only
fewaresequencedcompletely,whichthenforexampleserveasmodelorganisms.
AnalyzingforinstancethegenomicsequenceofthemodelorganismArabidopsis
thaliana[64]hasthepotentialtorevealdeeperinsightsintometabolicpathways
whichmightbecommoninotherplants.Intheevolvingfieldofpersonalized
medicine,itisimaginabletousethegenomeofpatientstodesigndrugsspecifically
needs.theirtoedtailorAby-productoftheavailabilityofgenomicsequencesisanewdimensionof
accuracyinevolutionarytrees.Whileinitiallythetreestoorderandcategorizelife
wereconstructedbasedonphenotypicalobservations,todaytheavailablegenome
sequencesprovideamuchfinergranularityforphylogeneticreconstruction.

7

Chapter2.SequencingTechnologies–BiologicalBackground

2.2TechnologiestoAssesstheSequenceofDNA

Asmentionedbefore,knowingthegenomicsequencesofspecieshasaninvaluable
impactontheunderstandingofbiology.Unfortunately,itisnotpossiblewiththe
currenttechnologiestoacquirethewholegenomeatonce.Allavailablemethodsare
restrictedtosequenceonlyverysmallpartsofaDNAmoleculeintoso-calledreads.
Thesecommonlyhavealengthofupto1,000nucleotidesdependingonboththe
technologyusedandalsoonpropertiesthatarespecifictothenucleotidesequence.
Thissectiondescribes,inhistoricalorder,themostwidelyusedapproachesfor
sequencingDNA,andthefollowingSection2.3explainsthegeneralwaytocope
withthereadsizelimitationsoftheseapproaches.Foramoredetailedoverview,the
interestedreadermaylookat[44]or[47].Sincethereisarapidevolutionregarding
techniquesandmethods,thisoverviewcanonlybeasnapshotofthecurrentstate.

2.2.1Traditional:SangerSequencing
OneofthefirstapproachestosequenceDNAwastheso-calledchain-termination
method,alsoknownasSangersequencing[83].Althoughthetechniquewasdevel-
opedover30yearsago,itis–withsomemodifications–stillusedtoday.Wewill
firstelucidatethetraditionalapproachandthenbrieyaddressthemodifications.
isInfirsttheenrichedtraditionalinaprprocedurocesse,calledtheDNAamplificationfragmentsuchtobethatraead,highalsoternumbermedoftemplatecopies,
exists.Oneachsinglestrandedtemplatecopy,aprimersequenceishybridizedsuch
thatmerase.theStartingcomplementarfromtheystrandprimercanbesequence,insynthesizedabpolymeraseyanchainenzyme,reactiontheDNA(PCR)poly-the
sequenceofcomplementarynucleotidesiselongatedifsufficientdeoxynucleotides
(dNTP:dATP,dGTP,dCTP,anddTTP)areavailable.If,however,adideoxynucleotide
(ddNTP)isincorporated,thechainreactionstopsanditisnotpossibletoinclude
furtherdNTPs.Thiscanbeexploitedinastatisticalwaytoreadthenucleotide
template.theofsequenceTothisend,fourdifferentreactionsetsarepreparedinparallel.Eachonecontains
thefourdNTPswhereusuallyoneisradioactivelymarked.Additionally,onetype
ofnucleotideisprovidedasddNTPinasmallfractiontooneofthefourreaction
sets.WhenaddingthetemplateDNAandthepolymerase,thechainreactionbegins.
SincerandomlyandwithasmallchanceaddNTPisincorporated,acollectionof
theDNAfragmentsdideoxynucleotidewithdifofaferrenteactionsizessetofistheforexamplecomplementarythymine,stranditisisprclearoduced.thattheIf
originalstrandhasacomplementaryadenineateachstopposition.
Thestrandsarethendenaturatedandsubsequentlyseparatedbysizewithgel
electrophoresis,foreachreactionsetonasinglelane.Fragmentsofthesamesize
formabandwheretheshortestfragmentshavetraveledthelongestwayonthe
gel.Thebandsoftheradioactivelymarkedcomplementarystrandscanthenbe

8

2.2.TechnologiestoAssesstheSequenceofDNA

visualizedbyautoradiography.Eachvisiblebandcorrespondstoasinglebaseata
certainposition.Bycomparingtherelativepositionsofthebandsofallfourlanes,
thenucleotidesequenceofthetemplatecanberecovered.
Today,theradioactivemarkershavebeenreplacedbyuorescentdyesattached
totheddNTPs,featuringdifferentcolorsforeachbase.Thisallowstocombine
thedifferentlanestoasinglecapillarywherethenucleotidesarereadwithalaser
scanner.Withallimprovements,thistechniqueisabletoproducereadsofupto
1,000basesanditisthusstillinuseinmanylabstoproducehighqualityreads.
Newertechnologies,however,outperformtheSangermethodintermsoftotally
sequencedbasesperinstrumentrun.Thenexttwosectionsintroducesomeofthese
high-throughputsequencing(HTS)techniquesandelaboratetheiradvantagesanddis-
antages.adv

2.2.2TheNextGeneration:MassivelyParallelSequencing
Inthelastfewyears,severalnewapproachestoreadDNAhavebeenestablishedin
thelabs.Theintentionsforthedevelopmentofthesemethodsweretoreducethe
costandtoincreasethethroughputofasinglesequencingrunincomparisontothe
Sangermethod.Thustheyareoftenreferredtoasnext-generationsequencing(NGS)
technologies.ThistermisadequatewithrespecttoSangersequencingbut,aswe
willseeinSection2.2.3,therearealreadysuccessortechniquesofthenextgenera-
tion.Thatiswhywecallthetechnologiesdescribedinthissectionsecond-generation
oaches.apprsequencingAllofthemodernapproacheshaveincommonthatmanyfragmentsoftheDNA
areprocessedinparallel,yieldingaveryhighnumberofreadspersequencing
run.AccordingtoMagietal.[62],secondgenerationapproachesfollowthegeneral
patternoffirstduplicatingeachfragmentmanytimesinanamplificationstep,while
fixingthecopiestoasurface,andthenreadingthesequencesofthecopiesby
iterativecyclesofenzymaticreactionsand(mostly)imaging-baseddatacollection.
Themaindisadvantageisthatthegeneratedreadsareconsiderablyshorterthan
theonesfromthetraditionalapproach.Butitseemsonlytobeaquestionoftimeun-
tilimprovementsinthechemistryandmethodologyenablethefirstHTSapproaches
tokeepupwithorevenoutperformSangerwithrespecttothereadsize.
Table2.1liststhemainfeaturesofthreewellestablishedsecond-generationap-
proaches.Eachofthemandanewcomertechnology,forwhichthesedatawerenot
availableyet,willbeexplainedinthefollowingparagraphs.

Sequencer454–heRocIntroducedin2005,theGenomeSequencerwasthefirstcommerciallyavailable
second-generationsequencingplatform.Untilnow,althoughwithrefinedchemi-

9

Chapter2.SequencingTechnologies–BiologicalBackground

Table2.1:Comparisonofthemainfeaturesofthreewidelyavailablesecondgenera-
tionsequencingtechnologies;basedonMagietal.[62],updatedaccordingtothe
companywebsitesasofJune2011.
Roche454IlluminaGAABISOLiD
AmplificationmethodEmulsionPCRBridgePCREmulsionPCR
SequencingmethodPyrosequencingReversibledyeSequencingby
ligationminatorsterReadlengths400–800bp35–150bp35–75bp
ThroughputSequencingrperundatimey10.4Gbpd6–255–14Gbpd10–301–7Gbpd
Errorrate0.1%1.5%4%

cals,asequencingbysynthesisapproach[63]isusedthatwasdevelopedby454Life
Science1whichisnowasubsidiaryofRocheAppliedScience.
The454sequencingworksasfollows:FirsttheDNAisfragmentedrandomly
insequencessmallpartsareofthen500–appended1,000bases,toeachforexamplefragmentbysuchthatnebulization.thesecanSpecificbindonadaptorthe
surfaceofspeciallypreparedbeads.Startingwithasinglefragmentonabead,
thefragmentisamplifiedtomillionsofcopiesinaprocesscalledemulsionPCR.
There,thebeadswithfragmentaredistributedwithPCRreagentsinwater-in-oil
microreactorssuchthatallfragmentsareamplifiedseparately.Thebeadsarethen
placed,togetherwithDNApolymeraseandfurtherenzymes,onanopticalarray
offibers.Thearrayconsistsofseveralmillionwellswhereasinglewellhasthe
appropriatedimensionstoholdexactlyonebead.
Afterthispreparatorystep,thereadingofthebasesemploysthepyrosequencing
techniquethatwasdevelopedalreadyinthe1980es[45].Pyrosequencingisbasedon
theprinciplethateachincorporationofanucleotideviaDNApolymerasereleases
pyrophosphate.Thepyrophosphatecanthenbedetectedbyutilizingtheenzymes
sulfurylaseandluciferasethatproduce,inacascadeofreactions,finallyalight
].70[emissionToreadthesequenceoftheDNAtemplatesthatareimmobilizedtothebeads,the
differentnucleotidesareowedcyclicallyinafixedorder(T,A,C,G)overtheoptical
array.IfinacycleanAisprovided,alltemplateDNAswithacomplementaryT
asnextunpairednucleotidewillproducealightash.Theemittedlightfromthe
millionsofcopiesofthetemplateonabeadisstrongenoughtobedetectedwith
aCCDcamera.Whilethepicturetakenisthenusedforbasecalling,theremaining
nucleotidesarewashedfromtheplatesuchthattheydonotinterferewiththenext
cle.cytheinstep

1.454.comhttp://www

10

2.2.TechnologiestoAssesstheSequenceofDNA

gionsPrwheroblematicetheforsamethisapprnucleotideoacharoccurserhomopolymersepeatedly.onForfethewtemplates,nucleotides,whichthearemittedere-
lightisapproximatelyproportionaltothenumberofintegratedbases.Whiletwoor
threebasesareclearlydistinguishable,basecallingisnotreliableforhomopolymers
oflengthsixormorebasesduetoasaturationinthedetector.Thus,454reads
tendtohaveinsertionordeletionerrorsinhomopolymerstretches[63].However,
substitutionerrorsarerare[62].

Illumina/Solexa–GenomeAnalyzer

SOneolexayearwasafteracquir454,edbtheySolexaIllumina,2sequencingwhichstartedplatforanmextensivcameeonadvtheertisingmarket.Incampaign2007,
resultingintheGenomeAnalyzercurrentlybeingthemostfrequentlyusedsecond-
generationsequencingdevice[62].
similarThetosequencing454sbuthasatechnologyfew,essentialbasedondifferthewences.orkofTheTurDNAcattiisetalsoal.[96],fragmentedislargelybut
theamplificationoccursdirectlyonthesurfaceofasolidplanarglassslide,theow
cell,withatechniquecalledbridgePCR[1,31].Duringthisprocess,thetemplates
arecovalentlyattachedtothesurfaceandformbridge-likestructures.Afterseveral
repetitionsofbridgePCR,theowcellconsistsofspotsthatcontainmillionsof
fragment.eachofcopiesThereadingofthebasesisdone,foreachspotoftheowcell,againinahighly
parallelfashion.Thedifferenceisthatareversibleterminatorchemistryisemployed
afordifferentsequencing.color,Forwhicheachallotypewsoftoprbase,ovidetheallterbasesminatorisuorsimultaneouslyescentlyinonelabeledreadingwith
cycle.Afterincorporatingasinglelabeledbase,thepolymerasefunctionstopsdue
totheterminatorwhicheffectivelyavoidstheproblemswithhomopolymersthat
arepresentinthe454technique.Inthenextstep,alluorescentdyesareactivated
byalaserandanimageisrecordedwhichisthenusedtocallthebasesinthe
ardifeferentsubsequentlyspots.rWhileemovedthesuchnucleotidethatthestaprysinocessplace,canbetherlabelepeatedandforthetheternextminatorbase.
Unfortunately,thereisanincreasingrateoferroneousbasecallstowardstheendof
Illuminareadssinceanincompleteremovalofthelabelorterminatorcanhappen,
whichhindersareliabledetectionofthefollowingbases.
Althoughthereadsizesgeneratedwiththisapproacharesubstantiallyshorter
comparedto454,seeTable2.1,IlluminasGenomeAnalyzerfindsavarietyofappli-
cationsandthusalsocustomerssincethesequencingcostsarelowerandthetotal
throughputperrunismuchhigher.Preferredapplicationsincluderesequencing,
geneexpressionanalysisandsinglenucleotidepolymorphism(SNP)detection.

2http://www.illumina.com

11

Chapter2.SequencingTechnologies–BiologicalBackground

Wehavedescribedthetwoabovementionedsequencingtechniquesinlittlemore
felddetailUnivsinceersitythe.yarOurelabsactivelysequenceusedinmostlythetheCenterforgenomesofBiotechnologyprokaryotes(CeBiTbutec)curratBiele-ently
thepurchasedIlluminaand454sequencingdevicesarealsousedtosequencethe
eukaryoticgenomeofaChinesehamster.
TheBielefeld.Nesequencingvertheless,methodswegivedescribedashortinrtheeviewfollotowingaddrareessnotadv(yet)ancesandestablisheduniquein
pointsoftheotheravailableorcurrentlydevelopedmethods.

SOLiD–BiosystemsAppliedTheSOLiDsystem[85]isbasedonsequencingbyligation;itsacronymstandsfor
SupportedOligonucleotideLigationandDetection.TheSOLiDsequencingdevice
iscommerciallyavailablesince2007andliketheprevioustechniqueswidelyspread.
Currently,AppliedBiosystems3thatisapartofLifeTechnologiesperformsthe
marketingandsellstheABISOLiDsequencer.
Like454,anemulsionPCRisusedfortheamplificationofthefragmentsbutin-
steadoftheenzymepolymerase,aDNAligaseisemployedinordertoreadthe
bases.Theligaseisabletoincorporateoligonucleotides–forexampleoctamers
thatconsistofeightnucleotides–intothecomplementarysinglestrandedtem-
plateDNA.Whilethefirsttwobasesofeachoctamerareknownandcolorcoded
withfouruorescentdyesofdifferentcolor,theremainingsixrandombasesare
neededtostabilizethebinding.Althoughthefourcolorsambiguouslyencodethe
16possibledi-nucleotides,theyarechoseninawaythatknowingthefirstbaseand
thecolorofadi-nucleotideuniquelyidentifiesthesecondbase.
Toassessthesequence,thelabeledoctamersareligatedtothetemplateDNAsand
theircolorcodeiscalled.Thishastoberepeatedinseveralroundswithvarying
startingpositionssuchthateverybaseiscoveredtwice.Togetherwiththecolor
coding,thisservesasanerrorcorrectiontoimprovetheaccuracyofthebasecalls.
Comparedtotheothersecond-generationtechniquesinTable2.1,ABIproduces
averyhighthroughputperdaybutalsofeaturestheshortestreads.Theratherhigh
errorrateinthetableisgivenwithrespecttotherawcolorcallsandcanbereduced
viatheimpliciterrorcorrection.

IonTorrent–PGMSequencer
ThecompanyIonTorrent,4alsoacquiredbyLifeTechnologies,launchedtowards
thetionendandofthe2010cyclicthePnucleotideersonalwGenomeashingstepsMachineofthe(PGM)PGM.WhileSequencerthearfragmenteprcomparableepara-
43http://wwwhttp://www.iontorrent.com.appliedbiosystems.com

12

2.2.TechnologiestoAssesstheSequenceofDNA

tothe454technology,theincorporatednucleotidesarenotrecognizedoptically.In-
stead,hydrogenionsaremeasuredthatarereleasedasaby-producteverytimethe
polymeraseincorporatesanucleotide.Insequentialwashingstepsofthedifferent
mentsnucleotideinparalleltypes,ontheanchararrageyofchiptherthateleasedcontainsionsaislayermeasurofedforsemiconductormanyDNAsensors.frag-
Sinceinonewashingstepseveralnucleotidescanbeincludedintothefragment,
thistechniquesuffersalsofromthehomopolymerproblem.

2.2.3ThirdGeneration:SingleMoleculeSequencing
Intheaforementionedprocedures,everypieceofDNAhastobeamplifiedsuchthat
theresultingcopiesproduceasignalstrongenoughtobedetected.However,the
amplificationprocesshassomeseveredrawbacks.Besidesbeingtime-demanding
andalsorequiringcostlychemicals,theamplificationprocessshowsaneffectcalled
amplificationbias.DependingontheprimersequencesandthePCRsettings,some
nucleotidesequencesaremoredifficulttoamplifythanothers.Thisresultsina
varyingabundanceofthecopiedfragmentsthatisalsomanifestedinavarying
coverageofreads.Withfewerreadsitishardertoeliminatesequencingerrors.A
techniqueofreadingthebasesdirectlyfromasinglemoleculeisthusdesirablesince
itavoidsamplificationbiases.FirstideasofsuchaSingleMoleculeSequencing(SMS)
decanvbeelopment.datedbackInthetor1989[emainder49].Neofvthisertheless,section,therweeareprseesentveralthrappreeoachestechniquesstillwithundera
highMilosgivpotentialeamortoebecomedetailedthereviethirdwaboutgenerationSMSoftechniquessequencing[92de].vices.Thompsonand

Heliscope–HelicosisThethefirstHeliscopecommerfrciallyomavHelicosailableBiosciences.sequencing5TheinstrumentsequencingthatusesapprtheoachSMSwastechniqueinitially
describedin2003[16]andhasalreadybeenappliedtosequenceaviralgenome[38]
aswellasanindividualhumangenome[75].
mentsTheofthesequencinggenomicissimilarDNAartoeIlluminasimmobilizedapprtoaoach:glassFirst,slide,millionsthedifofferencerandomishofrag-w-
evverersiblethatternominatorsamplificationareincorisporatedneeded.andAgain,animageuorisescentlytakenwithlabeledabaseshigh-rwithesolutionre-
opticalmicroscope[78].
byReadsdeletionsofthis[75],methodsinceitareiswithlikelyanathatveragetheofemitted32basessignalratherofashortsingleandbasealsoisafmissedfected
.detectortheyb5.helicosbio.comhttp://www

13

Chapter2.SequencingTechnologies–BiologicalBackground

PacificBiosciences–PacBioRS
ThecompanyPacificBiosciences6hasdevelopedasinglemoleculesequencingtech-
nologyincludethatwnucleotidesorksinintoreal-tsingleime(SMRstrandedT),DNAmeaning[30].asThefastdeasaviceDNAiscalledPpolymeraseacBioRScan;
firstcommercialunitsweredeliveredinmid2011.
TheSMRTapproachimmobilizes,incontrasttotheotherHTSprocedures,the
DNApolymeraseinsteadoftheDNAtemplate.Eachpolymeraseisanchoredtothe
withbottomaofDNAatemplate.nanoscalerTheeactionwellswarellecalledoodedzerwithomodeuorescentlywaveguidelabeled(ZMW)andnucleotides,loaded
ofeachthetypeZMWsinadifareferentchosencolorin,andsuchawilluminatedaythatbythealaserlaserfromlightbelocannotw.Thetraveldirdimensionsectly
throughthewellandonlyilluminatesitsbottom.However,thewavelengthsofthe
uorescentdyes,whenactivatedbythelaser,areabletoreachthetopofthewell
andcanberecordedbyacamera.DuringitsincorporationbyDNApolymerase,
difanyfusionlabeledalone.nucleotideThisallostawsystheSMRsignificantlyTtechniquelongerattothetrackbottomtheofanucleotidesZMWthatthanarbey
difincorferentporatedatnucleotides,thesamenorterspeedminatorsoftheareemploneeded.yedThepolymerase.uorescentNeitherlabelsarweashingcleavedof
duringtheincorporationsuchthattheydonotinterferewithnewnucleotides.
Thisratherdistincttechnologytoreadnucleotidesoffersinterestingvariationsto
thebeengeneralloadedtothesequencingpolymerase.process:ItDoingisthispossibleallotowscirtocularizesequencethethetemplatesamephysicalafteritpiecehas
ofDNAseveraltimessuchthatsequencingerrorscaneasierbedetected.Another
novelideaisthesocalledstrobereading.Here,thelaserisswitchedoffatcertaintime
pointswhilethepolymerasecontinuestoincludenucleotides.Althoughobviously
nucleotidesaremissed,theadvantageisthatthetotalprocessedmoleculelength
ismuchlongersincethepolymerasegetslessdamagedbythelaser.Strobereads
prareovidingconsequentlyadditionalasetinforofmationsmallerrabouteadstheiroriginatingorderandfromtheorientation.sametemplate,thus

OxfordNanoporeTechnologies

TheNanoporetechnologydoesneitheruseuorescentlabelsnoranopticalrecog-
nitionsystemlikemostoftheotherapproaches.Instead,itmeasuresindividual
baseselectronicallywhiletheytravelthroughnanoporeswhichareproteinsthatcre-
atenanoscaleholes.Whenanelectricalcurrentisappliedtothenanopore,nu-
cleotidescauseadisruptionofthiscurrentastheypassthroughthepore.Different
nucleotidescausedistinguishablechangesincurrentandcanthusbeclassified.

6.pacificbiosciences.comhttp://www

14

SequencingGenome2.3.

Eventhedistinctionbetweenthefourstandardbasesandmethylatedcytosine
ispossible.Thisisaremarkablefeaturebecausemethylationpatternshavebeen
showntohaveaneffectontheexpressionofgenes[14].
Toprovidethenecessarysinglenucleotidestothenanopore,thissequencingtech-
niqueemploysaDNAexonucleasewhichcleavesindividualbasesfromtheendof
asinglestrandtemplateDNA.Thebasesarethenintroducedtothenanoporeand
theircurrentismeasured.Clarkeetal.describetheprocedureandthechemical
modificationstothenanoporethatareneededforsequencing[22].
OxfordNanoporeTechnologiesLtd.7developedanarraychiptousenanoporesin
aparallelfashion.Thelabelfreedetectionandtheuseofanexonucleasepromises
comparablylongreads.Atthesametime,thesignalprocessingissimplerthan
anopticaldetectionofbasesandtheelectronicsneededcanpossiblybefabricated
cheaplyaftersomefurtherdevelopment.Though,acommerciallaunchisunknown
2011.JuneofasInthefuture,theexistingapproacheswillbedevelopedfurtherandwilllikely
becheaperandproducelongerreads.Maybenewtechnologieswillbeinvented
acceleratingtheadvancesevenfurther.However,asequencingprocedurethatreads
thewholeDNAofagenomeatoncecannotbeexpectedinthenearfuture.It
isthereforestillnecessarytohavemethodstoacquirethecompletesequenceof
agenomedespiteofthelimitationscausedbysmallreads.Thesemethodsare
introducedinthenextsection.

SequencingGenome2.3AsmentionedinSection2.2,allcurrentsequencingtechnologiesfacetheseverelim-
itationthattheproducedreadsoftheDNAmoleculesaremuchshorterthaneven
smallgenomes.Itissurprisingthatcompletegenomescanbeobtainednevertheless.
Themethodtoachievethisiscalledwholegenomeshotgun(WGS)sequencing.Fig-
ure2.3onthenextpagegivesanoverviewofthethreebasicstepsofWGSsequenc-
ing:Firsttheshotgunsequencing(a),thenthecomputationalassemblyphase(b),
andintheendthefinishingofthegenome(c).Theseprocessesareelaboratedin
moredetailinthenextthreesections.

SequencingShotgun2.3.1isTheoverprinciple-sampledofbyshotgungeneratingsequencirngeads[5,at88]israndomthatthepositions.DNAThemoleculenametobestemsfrsequencedomthe
pingrandompartsofpatterthenrthateadscanshotgunbeprusedtoojectilesrpreconstructoducethewhengenomefiredonsequenceatarget.asOvdescribederlap-
.2.3.2ectionSin7etech.com.nanoporhttp://www

15

Chapter2.SequencingTechnologies–BiologicalBackground

Figure2.3:Majorstepsinwholegenomeshotgunsequencing:(a)Generatingasetof
readsduringshotgunsequencing,(b)mergingthereadstocontigsintheassembly
phase,and(c)completingthegenomesequenceinthefinishingstep.

Togeneratethesetofoverlappingreads,firsttheDNAhastobeextractedand
purifiedfromthecellsoftheorganism,seeFigure2.3(a).Severalcopiesofthege-
nomicDNAarethenfragmented–mostlyusingphysicalshearinglikesonification
ornebulization.Followingthat,therandomfragmentsaresequenced,forexample
withoneofthemethodsdescribedinSection2.2.Thereadsgeneratedthiswayare
ideallyevenlydistributedoverthewholegenomeandtheircoverageishighenough
suchthateachbaseofthegenomewillbefoundinseveralreads.Inthiscontext,
currenthigh-throughputsequencingtechniquesfitperfectlysincetheycancheaply
generateahighnumberofreadsandthusprovidethedesiredcoverage.
However,shotgunsequencingwasinventedalreadyearlierintimesofSanger
sequencingwheneachfragmentwasclonallyamplifiedseveraltimessuchthatthe
sequencingwouldwork.Thiswasdoneusingcloningvectorswhicharepieces
ofDNAthatcanbecopiedinhostcellslikeEscherichiacoli.ForeignDNAthatis
includedintothevector,asaso-calledinsert,iscopiedalongwithit.
Thiscloningtechniqueissometimesemployedeventodaywhensequencingcom-
plexgenomes:Inthehierarchicalshotgunsequencing,largefragmentsofthegenome
arestoredinalibraryofvectors,forexampleBACs(bacterialartificialchromo-
somes).TheinsertsofallthoseBACsthatareneededtocoverthewholegenome
arethensequencedindependentlywiththeshotgunapproach.Thiscanhelpto
reduceproblemsintheassemblyphase.
Anotherhelpfulvariationisthedoublebarreledshotgunsequencingthatemploysthe
so-calledpairedendsequencing.Here,thefragmentsaresequencedfrombothends
andthecorrespondingreadsofonefragmentarecalledmatepair.Ifallfragments
havespecifiedanduniformlengths,thenthematepairinformationcanbevery
valuableforthefollowingtwophasesofWGSsequencing.Inthefirstversionsof

16

SequencingGenome2.3.

therecenthigh-throughputdevices,matepairswerenotavailable,howevermostof
thecompaniesdevelopedchangesintheirsequencingprotocolstoprovidethem.

PhaseylAssemb2.3.2Whenthegenomeiscoveredsufficientlybyreads,theiroverlapscanbeusedto
reconstructcontiguousstretchesofthegenomicsequencethatarecalledcontigs.The
processofgeneratingthecontigsfromasetofreads,asdepictedinFigure2.3(b),is
referredtoastheassemblyofthegenome,anditisatypicalexampleforadvancesin
biologygainedthroughbioinformatics.Withoutthedevelopedassemblerprograms,
thesequencingofacompletegenomicsequencewouldmostlikelynothavebeen
successfuluntiltoday.Whileacomprehensivereviewofthechallengesofassembly
andhowcurrentassemblersdealwiththemisgiveninMilleretal.[65],wehere
onlygiveanoverviewoftwodistinctclassesofassemblyalgorithms.
Firstassemblyalgorithmsdatebacktothetimeswhenfewbutrelativelylong
Sangerreadswerepredominant.Theyarecalledoverlap-layout-consensusapproaches
becausetheycomparethereadsall-against-allinordertomergeoverlappingparts
intolongercontiguousconsensussequences.Themergingofreadstocontigsis
oftendoneinagreedyfashion,sometimesbybuildinganoverlapgraph.Some
wellknownprogramsofthiseraaretheCeleraassembler[66]whichwasusedto
assembleoneofthefirstavailablehumangenomes,ARACHNE[11],orBambus[74].
Oneoftheearliestassemblersthatwasabletocopewithhighthroughputdata,
wastheNewblerassembler[63,Supplem.material]thatisshippedwith454se-
quencingdevices.Subsequently,SSAKE[100],VCAKE[48],andSHARCGS[29]
weredevelopedtohandleshortreadsaswell.Themostproblematicissueofthese
assemblersisthecomputationaltimeneededforcomparingthereadsall-against-all.
Adifferentclassofassemblerssolvesthisproblemelegantlybyusingaso-called
deBruijngraph[17]forassembly[46],ormorepreciselyasubgraphofit.Thead-
vantageisthatthegraphcanbebuiltintimelineartotheinputsize,asopposed
tothequadratictimethattheoverlap-layout-consensusapproachesneedingeneral.
AdeBruijnsubgraphconsistsofnodesrepresentingallsubstringsoflengthk,called
k-mers,ofthereads.Thenodesareconnectedbyanedgeiftwooverlappingk-mers
occuradjacentlyinoneofthereads.Thisway,commonsubstringsarecondensed
andtheoverlapsofthereadsarecollectedimplicitlyinthegraph.Ifthesequencing
datawereperfect–thatiswithoutsequencingerrorsandwithreadslongerthanre-
petitiveregionsofthegenome–thenthedeBruijngraphwouldrevealthecomplete
genomicsequence:ByfollowinganEulerianpaththattraverseseveryedgeonce,the
desiredcompletesequencecouldbeobtained.ThedeBruijngraphsgeneratedfrom
realsequencingdataare,however,muchmorecomplex,suchthatheuristicshave
beendevelopedtocopewiththelimitations.
Theincreasingthroughputofcurrentsequencingtechnologiespushedthedevel-
opmentofdeBruijngraphbasedassemblersforwardsincetheyarebettersuitedto

17

Chapter2.SequencingTechnologies–BiologicalBackground

handlerelativelysmallreadsandcancopewithahigheramountofdataduetothe
condensednatureofthegraph.AnincompletelistofwellknowndeBruijngraph
assemblersis:Velvet[108],EULER-SR[20],ALLPATHS[18],andSOAPdenovo[60].
gleNosequencematterthatwhichreprsoftwesentsareistheactuallywholeused,genome.theIntheassemblycasethatideallytheroresultsganisminahassin-
severalchromosomes,ofcourseeachofthemshouldbegivenbyasinglecontig.
Hokenweintover,setheveraldesiredcontigsforungappedavarietysequenceofrofeasons:thegenomeFragmentsisinofthepracticegenomeusuallycouldbro-
ajustsinglebychancestrandednotbefragmentpresentbindsduringtoitselvtheescrsequencingeatingarun.hairpinAlso,loopitthatmighthindersoccurthatthe
readingofthebases.Bothcasesresultinagapofapossibleconsensussequence
sinceno(overlapping)readsdoexist.
Oneofthemostcommonreasonsforafragmentationoftheconsensussequence
arerepetitivepartsofthegenome[89].Readsoriginatingfromdifferentcopiesofa
repetitiveregionwithinthegenomecannotsafelybedistinguishedandthusareas-
sembledtoasinglecontig.Wecallthoserepetitivecontigs.Itisofteneasytodetecta
repetitivecontigbylookingatthecoverageofreadsthatwereusedtoassembleit.A
contigislikelytoberepetitive,ifitscoverageisconsiderablyhigherthantheaverage
(ormedian)ofallcontigs.Sometimes,itcanevenbeestimatedhowoftenarepetitive
contigoccurswithinthegenomebycomparingitscoveragetothemediancoverage.
Unlikesequencesinthecasesrwhenepetitivepiecescontigofbelongssequencetoareisinmissing,generalheregivtheen.inforThedifmationficultywhichfor
theassembleristountangletherepeatingcopiesbecausetheconsensussequences
thatconnecttonon-repetitivecontigsarenotuniquelydistinguishable.However,
matepairinformationofthereads,ifavailable,canbeincludedtodisambiguatethe
occurrencesofthereadsandthustoreducethenumberofcontigs.

FinishingGenome2.3.3toTheobtainassemblythephasecompleteusuallysequenceendsupwithoutinasetgaps,ofseecontigsFigurthate2.3need(c).tobeThisphaseconnectedis
calledthefinishingofthegenomeand,ingeneral,itcomesalongwithadditional
labwork.AccordingtoNagarajanetal.[67],aninitialdraftassemblyofabacterial
genomecanbeproducedinweeks,whileitscompletioncurrentlytakesmonthsor
evenyears.Theprocessoffinishingthegenomeisverycostly,andthustherehave
alreadybeendebateswhethergooddraftgenomesaresufficientformostgenomic
analysesoriftheincreasedeffortforfinishingisworthit.Anargumentinfavorof
theexamplelatterinisthatcomparativfinishedeapprgenomesoachesalloandworadermuchbasedrichergenomicanalysisofstudies.thegenome,Additionallyfor,
thefinishingprocesscanalsohelptoimprovethequalityofthesequences,iffor
examplemis-assembledorlowcoverageregionsareexaminedcloser.

18

2.3.SequencingGenome

Genomefinishingtypicallyconsistsofthreepartswhichmightbeapplieditera-
tively.First,thecontigsneedtobeorderedsuchthatgapsbetweenadjacentcontigs
canbespotted.Inthegapclosurephasemissingsequenceinformationtoclosepo-
tentialgapsissequencedwiththeaimtomergethecorrespondingcontigs.Finally,
andsometimesoptionally,theresultingassemblyisvalidatedinapolishingphase.

deringOrContigInthestepofcontigordering,informationabouttheorderandrelativeorientation
ofthecontigsisgathered.Thisisvaluableforthegapclosurewherethegaps
areselectedbyspecificprimersequencesattheendofthecontigs.Pairsofthese
primersdetermineapartofthegenomethatcanbeamplifiedandsubsequently
besequenced.Givenncontigsandnofurtherinformationabouttheirorder,there
areO(n2)possibilitiesforprimerpairswhichwouldamplifydifferentlysizedparts
ofthegenomewhichdonotnecessarilybelongtothemissinggapsequences.If,
however,theorderofthecontigsisknown,thenumberofnecessaryprimerpairsto
investigatecanbereducedtoO(n)[79].Adecreaseofpairstobeconsideredresults
inlesstimeandmoneyforthelabworkthatisnecessarytofillthegaps.
Inthefollowing,werefertoinformationregardingtheorderandtherelative
orientationofasetofcontigsastheirlayout.Synonymsforlayoutaresupercontig,
metacontigorscaffoldwhichmighthaveasubtledifferenceinmeaningbutalways
refertolinkinginformationofthecontigs.Thereisavarietyofmethodsandsources
thathelptoestimateasuitablelayoutofthecontigs.Here,wedescribetheonesmost
used:commonly•Linkinginformationbetweencontigscanbeextractedfromtheoutputofcer-
tainassemblyprogramslikeNewblerortheCeleraassembler[67].Ifthepar-

•tainticularassemblyassemblyprogramsalgorithmlikebuildsNewbleraorgraphtheofCelerathereadsassemblerand[their67].ovIftheerlapsparin-
ordertomergethemtocontigs,thenthisgraphoftenimplicitlycontainshints
vforeryalayconseroutvofativethefashioncontigs.suchThethattheirassembledconsensuscontigsaresequenceusuallyiscrsupportedeatedinbay
ahighcoverageofreads.However,thegraphmightalsocontainfainthintsfor
theadjacencyofcontigsifoverlappingreadsdoconnectthembutwithacover-
agethatistoolowtobereliable.Whilethecautiousbehavioroftheassembly
algorithmsmakessensetoavoidmisassemblies,insomeofthesimplercases
intheresilicois.Ofactuallycourse,enoughthisapprinforoachmationdoestonotclosehelptheifgapspiecesbetwofeenthesomesequencecontigsare
sequencing.aftermissingcompletely•toInthefindacaselaythatoutofmatethepairscontigsoftheifrtheeadsyspanareavtheailable,gapoftwsomeoofcontigs.themcanbeDependingused
onthegapsizeaswellasthedistanceofthepairedreadstheselocalconnec-

19

Chapter2.SequencingTechnologies–BiologicalBackground

tionshelptofindappropriatecontigadjacencies.Unfortunately,sequencing
withmatepairsisusuallymoreexpensivesuchthatitisoftenomitted.
•Afteraninitialassemblyofthereads,sometimesfosmidlibrariesareemployed
toaccomplishthefinishingofagenome.Fragmentsofthegenomewithasize
between35and40kbpareusedasinsertsforthefosmidsinordertosequence
theendsofeachinsertedfragment.Ifthoseendsequencescanbemappedto
differentcontigs,itispossibletoinferthedistanceandorientationofcontigs
towardseachother.Fosmidlibrarieshavetheadditionaladvantagethatthey
canbeusedforprimerwalkingevenifthegapsbetweenthecontigsexceedthe
usualsizetodoprimerwalkingonthegenome.Butofcoursethisadvantage
ispaidforwithahighamountofworktocreatethelibrary.
•Amethodtoordercontigs[56]thatworksintheglobalcontextofagenomeis
opticalrestrictionmapping[82].Here,justlikeinthesequencingapproaches,
alibraryofoverlappingfragmentsiscreated.Thefragmentsarefixedtoa
glassslide,digestedbyarestrictionenzymeandthenuorescentlylabeled.
Theorderoftherestrictedpiecesismaintainedduringthisprocessandtheir
sizecanbeestimatedwithuorescentmicroscopy.Similartotheassembly
phaseinshotgunsequencing,amapofoverlappingrestrictionsizescanbe
constructedforthewholegenomicsequence.Thisso-calledopticalrestriction
mapisspecifictotheemployedrestrictionenzymethatcutsatdefiniteloca-
tions.Toordercontigs,aninsilicogeneratedrestrictionmapofthecontigs
canbecomparedtotheopticalrestrictionmap.Thisisespeciallyusefulfor
longercontigswhereasverysmallcontigsmightbesmallerthantheaverage
restrictionpiecesthusnotallowinganorderingofthem.
•Moreandmoregenomeshavealreadybeenfinishedandtheirsequencesare
mostlyavailabletothepublic.Ifthegenomeofarelatedspeciesisathand,
itcanbeusedasareferencetolayoutthecontigs.Tothisend,thecontigs
orpartsofthemarematchedontotherelatedreferenceallowingtoestimate
whichofthecontigsareadjacent.Whendoingso,onehastobearinmind
thattherelatedsequencemighthaveundergonelargerrearrangementevents
likeinsertionsordeletionsoreveninversionsofpartsofthegenome.Recently,
therelatedreferencebasedorderinghasbeenappliedusingasingle[79]and
alsomultiple[109]references.
Inthisthesis,webuildontheideaofreferencebasedcontiglayouting.Besides
presentingasimplemethodinChapter3thatusesasinglereferencegenome,we
alsoestablishaformalwaytocollectcontigadjacencyinformationfrommultiple
referencesinChapter4.Additionally,weshowhowtointegratethephylogenyof
thespeciesintoourapproachandprovideanalgorithmtotreatrepetitivecontigs
inaspecialway.However,itisingeneraladvisabletocombinedifferentsourcesof
informationtofindamostlikelylayout.

20

SequencingGenome2.3.

ClosureGapalrTheeadynextstepindicated,inthisgenomecanfinishingpartiallyisbethedoneclosurinesilicoofthebutgapsmostlybetwthiseenrequircontigs.esaddi-As
tionallabwork.Iftwocontigsareknowntobeadjacentitispossibletoamplify
theDNAinbetweenthecontigswithPCRbydesigningspecificprimerpairsthat
enclosethegap.ThepiecesofDNAthatarecopiedthiswaycanthenbesequenced.
Often,thegapislargerthanthereadlengthoftheappliedsequencingprocedure.
Then,atechniquecalledprimerwalkingcanbeapplied.Here,primerpairsare
prdesignedocessisforiteratedthepartthusofwthealkinggapinthatstepshasofbeenthesequencedsequencinginreadthelastlengthrfroundomandprimerthe
.pairprimertopairprWhileoaches,thethecontiglaboriousorwderingorkofcangapbedoneclosureinislittlethetimemainusingbottleneckcomputationalwithrespectap-
toestimatedtimeandinthemoney.Thercomputationalefore,itapprisoachesespeciallyisreliableimportantsuchthatthatthethelaynumberoutofwhichnec-is
essaryPCRreactionsandsequencingstepsisreduced.

olishingPTogetherwiththeformerstepsofcontigorderingandgapclosing,itispossibleto
checktheconsistencyandcorrectnessofthecurrentassembly.Aspecialtoolforthis
taskisBACCardI[9]thatusesBAClibrariestovalidateanassembly.Additionally,
matewhereapairsorcontiguousmatchessequencetorfrelatedomrtheeferenceassemblerisgenomesinfactmightnotrevealcontiguous.Inmisassembliesthelab,
PCRreactionscanamplifyregionswhichhadatoolowcoveragetobeassembled
inthefirstplace.Besidesautomaticsuggestions,itisoftenadvisabletomanually
curateanassembly.AnestablishedtoolthathelpsinthistaskisConsed[35].It
allowstolookatanalignmentofthereadsthatwereusedtobuildthecontigthus
allowingtoassesshowmanyreadssupportsinglebasesandwhetherthereare
contigswhererepetitivepartsoftwooriginsarealigned.Incaseofmistakesinthe
assembly,itispossibletomodifyitbysplittingsinglecontigsorjoiningothers.

21

Chapter

22

2.

Sequencing

echnologiesT

Biological

Bacroundkg

3Chapter

ofhingMatcEfficientContigs

rInitialelativeordersequencingandofaorientationgenomeisnotusuallyknorwn.esultsToinaclosesettheofgapscontigsitisforwhichnecessarytheirto
knowwhichofthecontigsareadjacent.Thisinformationisgiveninalayoutofthe
arecontigs,basedandontherinforearemationseveralprovidedtechniquesbythetoestimatesequencingarun,suitablelaothersyout.relySomeonofexterthemnal
information.Themethodsdescribedinthisthesisusethegenomicsequencesof
relatedspeciestofindalayoutforthecontigs.Ourapproachisbasedonfinding
correspondingregionsofcontigsandreferencesequences,wherecorresponding
prmeansobablyinanderividealedfrcaseomathatcommontheseregionsancestor.sharHoeweanvere,vitisolutionarnotypossiblehistortoyandelucidatewere
thetrueevolutionaryhistoryoftheDNAmolecules.Acommoncircumventionis
tousesequencecomparisonmethodstofindsimilarregionsbetweenthereference
contigs.theandgenomesThegeneraltermsequencecomparisonsubsumesmanydifferentquestionsandalso
applicationsthatdealwithcomputingdistancesorfindingsimilaritiesbetweense-
quences.Whilewewanttogiveashortoverviewhere,amoredetailedintroduction
tosequencecomparison,especiallyinbioinformatics,canbefoundinareviewby
].10[BatzoglouOurbriefhistoricaloverviewstartsin1950withtheHammingdistancethatsim-
plycountsthedifferinglettersoftwoequallysizedsequences[37].In1966,this
conceptwasextendedtotheLevenshtein,oreditdistancewhichisdefinedasthemin-
imumnumberofeditoperations–insertions,deletions,andsubstitutionsofletters–
transformingonesequenceintoanother[57].
Equivalentlytoaminimaldistance,amaximalscorecanbecalculated[84]thatpe-
nalizessubstitutionsandrewardsmatchingcharacters.Afirstpracticalalgorithm
forscoringapairofsequenceswaspublishedin1970byNeedlemanandWun-
sch[68].Althoughtheirdynamicprogrammingapproachwasdesignedtocompare
proteinsequences,itcanbeappliedtosequencesingeneral.

23

Chapter3.EfficientMatchingofContigs

Oneimportantaspectoftheunderlyingdynamicprogrammingdatastructureis
thatitallowstoderiveanalignmentofthecharactersthatisoptimalwithrespectto
theappliedscores.Thusitisnotonlypossibletotellthattwoproteinsequencesare
similar,butalsowhichoftheaminoacidsmatchandwhereinsertionsordeletions
couldhaveoccurred.
TheNeedleman-Wunschalgorithmproducesglobalalignmentsthatconsiderall
substringscharactersofofthehighrespectivsimilarityearesequences.ofinterInest.molecularTheselocalbiology,alignmentshowevercan,besometimesfound
withtheSmith-Watermanalgorithmthatwaspublishedin1981[87].Appliedto
twosequencesoflengthnandmrespectively,thealgorithmshaveatimeandspace
requirementofO(nm)[36].
withEventhegrothoughwingthisbiologicalrunningtimesequenceisefficient,databasesitandbecametheneedunpracticaltofindinapproximatecombinationoc-
currencesofproteinorDNAsequencesinthesedatabases.Tofacethis,twokindsof
fastheuristicsbutmighthavemissbeenrdeelevvanteloped:matches,Seedandandextendsearchspaceheuristicsfiltersthatarwhichearetypicallyruntimevery
heuristicsthatusuallyguaranteetofindallmatchesbelowaspecifiederrorrate.
Althoughthelatterareextremelyfastintheexpectedcase,theirworstcasecom-
plexitystaysO(nm).
Inthischapter,wewillfirstreviewdifferentalgorithmstofindlocalsimilarities
betweenDNAsequences.Theapproachthatwaschosentobeusedformatching
contigsisthenexplainedinmoredetailinSection3.2.Towardstheendofthis
chapter,wewillpresentoursoftwarer2cat,therelatedreferencecontigarrangement
tool,thatusesthismatchingtechniquetovisualizesimilaritiesbetweenthecontigs
andareferencegenomeandthatisabletoarrangethecontigswithrespecttothese
matches.Amoresophisticatedapproachtofindalayoutofthecontigsbasedon
theinformationprovidedbyseveralreferenceswilllaterbegiveninChapter4.

SimilaritiesLocalFinding3.1

Similaritiestoareferencegenomecanhelptofindalayoutforasetofcontigs.In
thefollowing,wewillrefertosimilarregionsbetweencontigandreferencegenome
asmatches,andtheprocessoffindingthesewillbetermedcontigmatching.
Pleasenotethatusuallyincomputerscienceamatchreferstoanexactandcom-
pleteoccurrenceofasearchpatterninatext.Here,weallowapproximateoccur-
rencestoaccountforsmallscaleevolutionaryeventslikeinsertions,deletionsor
singlenucleotidepolymorphisms.Also,wedonotdemandthatthecontigsare
matchedcompletelyonthereferencesincelargescaleeventsliketranslocationsor
inversionsmighthavescrambledthesequences.Amatchishenceequivalenttoa
localalignmentwithahighscoreor,inotherwords,toaverysimilarregionshared
bycontigandreferencegenome.

24

3.1.itiesSimilarLocalFinding

Inbiologicalsequencecomparison,theterminologyisoftentosearchaqueryin
atargetordatabase,insteadofasearchpatterninatextasincomputerscience.
Inthefollowing,thecontigsarethequeriesthataretobematchedontoatarget
referencegenome.Regardlessofthedifferentnames,allsequencesarecomposed
oflettersfromagivenfiniteandorderedalphabetΣ.WedenotebyΣ∗thesetofall
finitestringsoverΣ,by|s|:=thelengthofstrings=s1...s,andbys[i,j]:=si...sj
with1≤i≤j≤thesubstringofsthatstartsatpositioniandendsatpositionj.
Fortwosequencessandt,amatchisatuplem=(i,j),(k,l),ifthesubstrings
s[i,j]andt[k,l]havehighsimilarity.Analignmentisastringoverthealignment
alphabetA(Σ):=Σ∪{–}2\––where–∈/Σisagapcharacter.Eachletterof
A(Σ)impliesaneditoperation:a–and–astandfortheinsertionordeletionof
acharactera∈Σ,andbaisasubstitutionthatisalsocalledmatchifa=band
wise.othermismatch

atermanSmith-W3.1.1TheSmith-Watermanalgorithm[87]isawellknownmethodtosearchlocalalign-
byments.ItNeedlemanisaclevandervWunschariation[68of]theforglobaldynamicpralignments.ogrammingTheideakeythatwdatastructuraspreoposedisa
scorematrixSthatisusedtotabulatetheoptimallocalalignmentscoresforallpairs
ofoptimalpossiblelocalprefixesalignmentoftwscoroegivforentheprsequencesefixesxx[1,andi]yand.y[The1,j].entrAfteryS(i,j)computingstoresthethe
completematrixS,itsentrywiththehighestvaluerevealsthebestlocalalignment.
Thecomputationitselfisbasedonthefollowingrecurrence:
S(0,0):=0
S(i,0):=0
S(0,j):=0
0()
S(i,j):=maxS(i−1,j−1)+score(x[i],y[j])()
S(i−1,j)−γ(↓)
S(i,j−1)−γ(→)
foralli,1≤i≤|x|,
andj,1≤j≤|y|
whereγisthecostforinsertionsordeletionsofsinglecharacters,andscore(a,b)is
ascoresscoringshouldfunctionbechosenforthesuchsimilaritythatofrandomthesequencescharactersonaaandvberage.Torbeesultinsensible,negativthee
es.scoralignmentglobalThethreetopdefinitionsinitializethefirstrowandcolumnofthematrixand
arethebasecasesfortherecursion.Theremainingcellsofthematrixareusually

25

Chapter3.EfficientMatchingofContigs

computedinarow-orcolumn-wisefashionwhereeachentryS(i,j)iscalculatedby
maximizingoverfourdifferentcases:
()Includingzerointhemaximizationpreventsacumulationofnegativescores.
Thisallowstoskipbadscoringprefixesandstartanewalignmentfromthe
indices.sequenceentcurr()Thiscasereferstothesubstitutionofx[i]byy[j].Ifthecharactersmatch,this
willberewardedbyapositivescore.Althoughmismatchesarepunishedbya
negativescore,theycanalsogivethemaximalvalueifS(i−1,j−1)contains
ahighscorevalue.
(↓)Amaximalvalueinthiscaseindicatesthatthedeletionofx[i]providesthe
bestlocalalignmentuptothecurrentprefixes.
(→)Thisissimilartothepreviouscase,exceptforindicatingthattheinsertionof
y[j]isthebestchoice.
Figure3.1showsanexampleofacompletedscorematrix.Thearrowsinthis
picturecorrespondtothoseoftheabovecasesprovidingthemaximalvaluefora
cell.Byfollowingthesearrowsbackwards,fromthehighestentryinthematrixtoa
zeroentry,anoptimallocalalignmentcanberetrieved.Leta=x[i],b=y[j],andthe
arrowendin(i,j).Then,aadiagonalarrowcorrespondstothematchaa,–ifa=b,
otherwisetoamismatchb.Ahorizontalarrowstandsfortheinsertionband
averticaloneforthedeletion–a.
Thedescribedalgorithmonlyfindsabestlocalalignmentoftwosequences.Inour
case,wearealsointerestedinothersimilarregions,forexampleifpartsofthecontig
occurrearrangedinthereference.Alllocalalignmentsthathaveascoregreaterthan

26

-ACATGGj
-0000000
A0101000
T0000210
G0000132
iFigureMatching3.1:Localcharactersalignmentreceivescoreascorematrixoffor+1the,whilesequencesmismatchesx=,ATGinserandtionsy=andACATGGdele-.
tionswherearematches‘punished’arebblacyk−1.andTheallarrootherswsareindicategray.scoremaximizingeditoperations

itiesSimilarLocalFinding3.1.

acertainthresholdcaneasilybefoundbybacktracingallappropriateentriesofthe
scorematrix.However,carehastobetakenonoverlappingalignments.Tothisend,
amoreadvancedalgorithmthatfindsnon-overlappingsuboptimalalignmentswas
proposedbyWatermanandEggert[101].
ThecalculationofthescorematrixneedsO(|x|∙|y|)timesincethemaximum
valueineachcellcanbecalculatedinconstanttime.Theextractionoftheoptimal
localalignmentthenneedstimeproportionaltothelengthofthealignment.Espe-
cially,thematrixcomputationbecomesacomputationalbottleneckifmanysmaller
sequencesneedtobesearchedinalargerdatabase.Toalleviatethetimedemandin
suchcases,heuristicscanbeappliedinsteadoftheexactSmith-Watermanalgorithm.
Thenextsectionintroducestheconceptofseedandextendheuristics,whichtrade
speedforsensitivity.Thismeansthattheymightmisssometruematchesfoundby
theSmith-Watermanalgorithmbutareusuallymuchfaster.

HeuristicsExtendandSeed3.1.2Asphases:theFirstnameseedsalreadyaresearsuggests,chedwhichseedarandehighlyextendsimilarheuristicsregionsareofperforquerymedandintartwgeto
sequence.Inthesecondphase,theseedsareextendedtoalocalalignment.
Inordertoefficientlyretrievehighscoringseeds,differenttechniquescanbe
usedthatingeneralbuildanindexofoneofthesequences.Indexingtechniques
arecalledonlineifonlythequeryisindexed.Ifitisalsopossibletopreprocessthe
targetsequence,theyarereferredtoasindexed.
Aseedisapairofsubstringsofqueryandtargetsequencethathavehighsimi-
oflaritythe.InscortheelocalmatrixwithalignmentmanyincrexampleeasinginFigurentries,e3.1,aindicatedseedcorrbytheespondsblacktoarrows.diagonalsIn
theliterature,differentkindsofseedsareconsideredwhichrequiretosomeextent
methods:indexingentferdif•Exactseedsrefertoanexactoccurrenceofapartofthequeryinthetargetse-
quence.Usuallyaminimumlengthisdemandedtoavoidunspecificmatches.
Ahashindexoffixedsizesubstringscanbeusedtoefficientlylookupmatches,
asforexampleappliedinBLAT[51]tofindtheseeds.Forvariablelengthseeds,
suffixarraysorsuffixtreescanbeused,asforinstanceinMUMmer[54].A
spaceefficientalternativeofasuffixtreecanberealizedusingtheBurrows-
Wheelertransformation,likeinBWA[58].
•Spacedseeds,orgappedseeds,arelikeexactseedsofafixedsize,howeverwith
artheedifusuallyferencelongerthatthannotallexactcharacterseedsandpositionscontainhadonvettocarematch.charactersThesethatseedsare
ignoredwhencomparingqueryandtargetsequence.Similarlytoexactseeds,
spacedseedscanbecollectedinahashbasedindexalthoughitneedstobe

27

Chapter3.EfficientMatchingofContigs

slightlymodified.Algorithmsusingoneortwospacedseedsareimplemented
inPatternHunterIandII[61,59].
•Approximateseedsallowthatthecorrespondingsequencesaredifferentinany
modifiedposition,althoughhash-indexthecanbesequencesusedthatshouldstorhaesveforahigheachsubstringsimilaritythescore.positionsAgain,ofa
ofalltstringsallowsofathetrade-ofsamefsizebetwhaeenvingaspeedscoreandaboveasensitivitygiven.thrIftisesholdhigh,t.Thethechoiceindex
containsfewerentriesresultinginfewerseedsfound.Ontheonehand,this
bemakesmissingthesuchextensionthataphasematchisfasternot,onfound.theAotherlowhand,thresholdimportantt,ontheseedscontrarmighty,
willresultinmanyseedsfoundthathaveweakersimilarities.Thiscausesthe
extensionphasetotakemoretime,althoughadditionalmatchescanbefound.
Whenusingapproximateseeds,mostlythequeryisindexedsincetheindex
wuseouldapprgrowoximatetoobigseedsiftariseloforwandexampletheFtarASTgetA[is72]larandge.BLAST[3Implementations,2].that
Afterphase.aOvsetoferlappingseedsorhasnearbbeenyseedsfound,artheeusuallyseedsaremergedpostprandocessedinsubsequentlytheextensionaligned
withaSmith-Watermanlikealignmentprocedure.
terTheestingwholelocalseedalignmentandextendmustprcontainocedurateisleastbasedoneonhighthescoringassumptionseed.thatHoweeachver,itin-
ispossiblethatanoptimallocalalignmentcomputedwithSmith-Watermandoes
notcontainanappropriateseedsuchthatheuristicsofthiskindwillmissit.The
sensitivityislowerthanaSmith-Watermanalignmentthatissometimesreferredto
alignment.full-sensitivityasteinOneofsequencestheismostprBLAST.ominentItisantoolsonlineforappralgorithmoximatethatprmatchingeprocessesofthenucleotidequery.orAfterpro-
that,theexpectedrunningtimeforsearchingapreprocessedqueryisproportional
tothelengthofthetarget.
Whenmatchingmanysmallercontigsontoalargereferencegenome,itcanbe
advantageoustocreateanindexofthereference,sincethentheexpectedsearch
timealgorithmsforeachdescribedcontiginwilltheonlynextbesectionproportionalcommonlytortheelyoncontigssuchanlength.indexingTheoffilterthe
sequence.gettargerlar

3.1.3SearchSpaceFiltering
Tofindlocalsimilaritiesbetweentwosequences,aSmith-Watermanscorematrix
likeinFigure3.1canbecomputed.Thismatrixcorrespondstothesearchspace
ofallpossiblelocalalignmentsbetweenthetwosequences.Whileinthispicture
theseedandextendapproachesstrivetofindhighscoringdiagonals,searchspace
filteringdoesinprincipletheoppositebydiscardingregionsofthematrixthatare

28

3.2.MatchingbyFilteringwithq-Grams

toodissimilar.Searchspacefilteringisalsoknownasanexclusionmethodforapprox-
imatestringmatching[36,Ch.12.3].
Althoughthematrixisnotnecessarilyexplicitlycreatedforfilteringthesearch
space,theconceptualideapartitionsitintoequallysizedregionswhicharepossi-
asblyovpossibleerlapping.thatdoThenotgoalfulfillofaafiltercertainfilteralgorithmcriterionis.thenThistodiscarcriteriondasismanyusuallyraegionsnec-
givessareny,thralthoughesholdt,notorsufsimilarlyficient,thatitconditionhaslessthatathanreegionmismatches.yieldsascorTheefilterhigheristhancalleda
losslessAllrifallemainingsuchrpartsegionsofthearekeptmatrix,andwhichnonewiserefalselynotfilterdiscaredded.out,havetobeverified
inapostprocessingphasethatissimilartotheextensionphasedescribedinthe
previoussection.Incontrasttotheseedandextendapproaches,filteringwitha
losslessfilterwillfindallregionshavingascorehigherthanthespecifiedthreshold.
Aftertheindexisbuilt,therunning-timeofexclusionmethodsforsearchingis
intheexpectedcasetypicallylinearorevensub-linearwithrespecttothelengthof
thethusrqueryesulting.Ininthetheworstsamecase,timehowevercomplexity,theascompletetheSmith-Wscoreatermatrixmanhastoalgorithm.beverified

3.2MatchingbyFilteringwithq-Grams
FormatchingthecontigswechosetousetheSWIFT1algorithm,alosslesssearch
rspaceegionsfilterofathatspecifiedwasdevminimumelopedinsizeBielefeldthatha[v76e].lessThethanfilteragivenguaranteeserrortorate.findTheall
analgorithmindexingisofverytherfasteferwhenencematchingsequence.manyMostinforcontigsmationonarpreferesentedenceingenomethisduesectionto
isbasedontheworkofRasmussenetal.[76].

IdeaGeneral3.2.1Filteringmethodstofindapproximatematchesusuallyfollowaso-calledq-gram
countingstrategy.Aq-gramissimplyastrings∈Σqoffixedsizeq;theliteratureuses
frbetwequentlyeenlsequences-tupleorallokws-mertodetersynonymouslymine.whetherCountinganalignmenttheq-gramswithlessthatthanareeerrsharorsed
ispossible.Thesocalledq-gramlemma[71,98]statesthattwosequencesoflength
withHammingdistanceeshareatleastT(,q,e):=+1−q∙(e+1)q-grams.Ifthe
sequencesarecompletelyidentical,theysharealloverlapping(−q+1)q-grams.
Witheachmismatch,uptoqmatchingq-gramsaredestroyed.Thelemmaalsoholds
forThetheqedit-gramdistance,lemmaifcanisbetheusedlengthtoefoftheficientlyshorterdiscardsequencemanyinandissimilaralignment.regionsof
querysequenceandtarget.Inanaiveapproachthatwecallthebasicalgorithm,both
1Theacronymstandsfor:SequenceSearchingandAlignmentwithIndexingandFiltering

29

Chapter3.EfficientMatchingofContigs

sequencesarepartitionedintooverlappingpartsofequalsize.Thisisequivalent
topartitionanimplicitscorematrix,liketheoneinFigure3.1onpage26,into
overlappingboxes.Foreachbox,theq-gramscanbecountedthataresharedby
bothsequences.Wecallthesesharedq-gramsinthefollowingq-hits.Iftheq-hit
countofaboxishigherthanthethresholddefinedbytheq-gramlemma,thenthis
isanecessaryconditionthatanapproximatematchwithlessthanthespecified
numberofmismatchesexists.Ifthecountisbelowthatthreshold,thereisdefinitely
match.suchnoTheSWIFTalgorithmusesfurtherobservationstopartitionthescorematrixmore
cleverlythantheabovebasicalgorithm.Additionally,itprovidesasophisticated
calculationofthepartitiondimensionsandtheneededq-hitthreshold.Beforewe
explaindetailsoftheSWIFTfilterinSection3.2.3,wewillfirstaddresshowthe
q-hitscanbefoundefficientlyusinganindexdatastructure.

3.2.2BuildinganIndexoftheReferenceGenome
Intheabovebasicalgorithmicideaitiscrucialtofindexactmatchesofq-grams,the
q-hits,withintheboxesveryquickly.Thiscanbeachievedwithasocalledq-gram
indexwhichisasimplehash-basedindexofallq-gramsandtheirpositionsinthe
sequence.indexedInpossibleorderqto-gramaccesstoathenaturalq-gramsnumberefinficientlytheinrangememor0,|yΣ,|aq−1hash.Afunctionhashnumbermapsofeacha
q-grams∈Σqcanbecalculated,forexample,withthefollowingrankingfunction:
r(s)=∑qrΣs[i]∙|Σ|i−1(3.1)
1=iwhererΣ(c)mapseachcharacterc∈Σtoanintegerin0,...,|Σ|−1.
rWealizediththewithhelptwoofthistables.hashThefunction,occurrenceantableOCCimplementationthatofcontainsaq-gramtheindexconcatenatedcanbe
listsofoccurrencesforallq-gramsandthesecondoffsetlookuptableOFFwhich
containsforeachq-gramtheoffsetwhereitslistofoccurrencesqbeginsinthefirst
intable.memorThey,alltableoccurrOCCisencesofofsizeaqO(n-gram),andscanOFFbeisrofetrievsizeedO(|Σquickly|).bWyithlookingbothattablesthe
offsetsOFF[r(s)]toOFF[r(s)+1]−1intableOCC.
indexed.TheqIn-gramtheindexfirstcanrun,beeachbuiltbyoccurringgoingqtw-gramoistimescounted.throughBasedtheonsequencethetocounts,be
wecanspecifytheoffsetsinOFF,andaddeachoccurrenceofaq-gramduringthe
secondruntotheappropriatelocationinOCC.
Whencomputingthehashnumberoftheoverlappingq-gramsinbothaboveruns
withrankingfunction(3.1),wecanefficientlyretrievetherankofanextq-gramthat
overlapsatq−1positions:Iffortwooverlappingq-gramsazandzbwitha,b∈Σ

30

3.2.MatchingbyFilteringwithq-Grams

andz∈Σq−1therankr(az)isknown,thenitcanbeupdatedinconstanttimeto
r(zb)usingthefollowingequation:
r(zb)=r(az)+rΣ(b)∙|Σ|q−1
Σ||Whenusingthisconstant-timeupdate,theconstructionofaq-gramindextakes
O(n+|Σ|q)timeforasequenceoflengthn.
Thereareafewpointsworthmentioningforapracticalapplicationofaq-gram
index.catenatedFirstlistofofall,occurrindexingenceslarOCCgeneedssequences(n−qcan+1)leadtimestothememorsizeyprofanoblems.integerThevalue.con-
theDependingsequenceontothebearindexed.chitecturThee,themaincompletememorylistofcurrneedsentaboutdesktopfourtimescomputersthesizemightof
thusbealimitingfactorforindexinglargereukaryoticgenomes.Asaworkaround,
theindexcanbebuiltforpartitionsofthesequence,ormaybewrittentodisk.Both
willbemoretimeconsumingwhenqueryingforq-grams.
integerBesidestothepointtomemorayrlocationestrictioninthefororiginaltheoccurrsequenceenceistablelimited.OCC,Thealsoprtheuseogrammingofan
languageJava,forexample,allowsamaximumvalueof231−1≈2.1∙109foran
integer.Thisinhibitstoindexthehumangenomethathasover3∙109bases.
Asecondpracticalissueisaproperchoiceofqq.Ontheonehand,ahighvalue
ofqintegerletsvthealue.highestThenithashisnotnumberpossibleforatoqaccess-gram(|theΣ|OFF−1)arragetyprbiggeroperlythan.Sincethealsomaximalthe
likesizeoffortheDNAalphabetsequences.Σmatters,If,onthisthekindotherofindexhand,isqisbettertoosuitedsmall,forthensmallco-occurringalphabets
q-gramsmightnotbeveryspecificandaremerelyobservedbycoincidence.The
expectednumberofoccurrencesofaq-graminauniformlyi.i.d.randomsequence
oflengthis(−q+1)/|6Σ|q.IfweconsiderDNAsequenceswith|Σ|=4andan
exemplarysizeof=6∙10basesforprokaryoticgenomes,avalueofq=11results
ineachq-gramoccurringaboutonceintheexpectedcase.

SimilaritiesorfFiltering3.2.3UnlikethebasicfilteringalgorithmofSection3.2.1thatpartitionsthesearchspace
intoblocks,theSWIFTalgorithmpartitionsthescorematrixdiagonallyintoparallel-
ograms.Thisreectsmoreappropriatelytheshapeofanareainwhichhighscoring
.appearmatchesInthesimplecaseoffindingexactmatches,theseparallelogramswouldhave
awidthofonediagonalandalloverlappingq-gramswouldhavetomatch.For
approximatematches,weconsiderepsilon-matcheswhicharebasicallyoptimallocal
alignmentswheretheinvolvedquerysubstringhasalengthofatleastn0andthe
alignmenthasamaximalerrorrateof.Thelattermeansthatthealignmentcontains

31

Chapter3.EfficientMatchingofContigs

lessthan∙n0insertions,deletions,ormismatches.Rasmussenetal.providea
lemmaforthedimensionsofaparallelogramcontainingsuchan-match:
Lemma1(Rasmussenetal.[76])Letxbeasubstringofthequeryoflengthn0orgreater
thathasan-matchtoasubstringyofthetarget.LetU(n,q,):=(n+1)−q∙(n+1),
andassumethattheq-gramsizeqandthethresholdτhavebeenchosensuchthat
q<1/andτ≤min{U(n0,q,),U(n1,q,)},
wheren1=(n+1)/.Then,thereisguaranteedtoexistaw×eparallelogram
containingatleastτq-hitsinxandy,where
w=(τ−1)+q∙(e+1)ande=2(τ−1)+(q−1).
q/1−Forthechoiceofn0theauthorsgivethecorollarythataw×eparallelogramaccord-
ingtoLemma1existsifn0≥q1τ/+q−−q1.Rasmussenetal.givethecorresponding
proofs,togetherwithanefficientalgorithmtofindallw×eparallelogramsfor-
matchesoflengthn0orgreater.Theoutlineofthealgorithmisasfollows:
Withthedimensionsfortheparallelogramsgiven,thesearchofappropriatere-
gionsusesaslidingwindowapproach.Thereforethescorematrixisseparated
intooverlappingbins,eachcontaininge+1diagonals.Thenawindowofsizew
isslidoverthequerysequence.Theintersectionofthewindowandthediagonals
ofthebinsdefineparallelogramregionsforwhichallq-hitsarecountedthatoccur
withinthem.Itsufficestoaddtheenteringandsubstracttheleavingq-hitsofeach
parallelogram.Thepositionsofqueryq-gramsonthetargetareretrievedwiththe
q-gramindexofSection3.2.2.Ifaparallelogramcounterreachesτq-hits,thenthe
parallelogramisreportedtobepostprocessed.Insteadofreportingoverlapping
parallelograms,theyaremergedonthey.
Toimprovethespacerequirementsofthisalgorithm,thenumberofcounters
neededcanbereduced[76,Section3.2.1].Tothisend,w×(e+Δ)parallelograms
areconsideredthatoverlapbyediagonals.Figure3.2(a)showstheconceptofthe
w×(e+Δ)parallelograms.IfΔisapoweroftwoandgreaterthane,thenthebin
indicescanbeefficientlycomputedwithfastbitshiftoperations.
Therunningtimeofthisalgorithmcanbeimprovedbyprocessingeachq-gram
ofthequeryonlyonce[76,Section3.2.2].Usually,itisprocessedonetimewhen
enteringandasecondtimewhenleavingthewindow.Theimprovedversion,as
illustratedinFigure3.2(b),remembersforeachparallelogram,besidestheq-hit
counter,alsotheminimumandmaximumstartingpositionofaq-hitwithrespect
tothequery.Thisallowsthatonlyenteringq-hitsatthecurrentpositionneed
tobeconsidered.Ifaq-hitismorethanw−qawayfromthepreviousone,then
an-matchcannotcontainbothq-hits.Hence,themaximalparallelogramwithout
thisnewq-hit,definedbyminimumandmaximumpositionandofwidthe(or

32

3.3.r2cat–TheRelatedReferenceContigArrangementTool

Figure3.2:ImprovementstotheSWIFTfilteralgorithm:(a)Spacereductionforthe
bycountersprocessingduetoeachthequse-gramofwof×the(e+querΔ)yparonlyallelogonceramsinstead,andof(b)twice.speedFiguresimprovadaptedement
fromunpublishedworkofK.Rasmussen.

e+Δrespectively)isreported,ifitcontainsatleastτq-hits.Inanycase,theq-hit
counterpositionisonresettheandquery.theWhenminimumtheendandofthemaximumqueryispositionsreached,areallsettotheparallelogramscurrent
arecompletecheckedalgorithmandraseportedwellifastheyfurthercontainconsiderationsenoughqar-hits.eThediscussedinpseudocodedetailofinthethe
aforementionedpublication[76].
proThevidesnextthesectionmatchesintrtooducesvisualizeourandimplementationarrangetheofcontigs.theSWIFTfilteralgorithmthat

3.3r2cat–TheRelatedReferenceContigArrangementTool

Ifreferencegenomesareavailable,ithastobeassessedwhethertheyarerelated
enoughtoprovideinformationforcontiglayoutingandthusmighthelpinthegap
referclosingencephase.genomeAgivvisualesfirstinspectioncluesinofthiscorrregard.espondingregionsbetweencontigsand
Weimplementedtheprogramr2cat(relatedreferencecontigarrangementtool)
thattheirisablesimilaritiestoinquicklyanmatchinteractivaesetofvisualization.contigsontoThearspeedelatedofourgenomematchingandrdisplaoutiney
iscanbecompetitivperforetomedotherthatisestablishedhelpfulinprtheograms,finishingandanphaseofautomatedasequencingcontigprarrangementojectby
givingvaluablehintsonlayoutofthecontigs.
Inseveralsequencingprojects,r2catwasalreadysuccessfullyappliedtohelpin
thefinishingprocess:CorynebacteriumpseudotuberculosisFRC41[95],Rhizobiumlupini
(nomanywAgrin-houseobacteriumuserssp.attheH13.3)[CeBiT105],ecandappreciateCorynebacteriumthecapabilityulceransto[93].quicklyAdditionallygenerate
syntenyplotsforprokaryoticgenomeswithr2cat.
Incontigsthewithfollor2catwinginwemoreexplaindetail.thestepsofmatching,visualizing,andarranging

33

Chapter3.EfficientMatchingofContigs

hingMatc3.3.1Toassesstheirrelatedness,similarregionsbetweenthecontigsandarelatedrefer-
encegenomehavetobedetermined.Weimplementedforthistasktheq-gramfilter
algorithmdescribedinSection3.2.
AsmotivatedinSection3.2.2,weuseadefaultsizeofq=11fortheq-grams.To
calculatetheparallelogramdimensionsofthefilter,wechoseaminimumlengthof
n0=450bases.Thischoiceisbasedontheobservationthatsmallercontigsareoften
discardedsincethesedonotprovidemuchinformationandmightevenbearesult
ofasequencingerror.Furthermore,wesetthemaximumerrorrateto8%inorder
toallowmutationstohavehappenedbetweenrelatedsequences,butatthesame
timekeepthefilteralgorithmfast.
AccordingtoLemma1,thechosenvaluesyieldthethresholdτ=44q-hitsthat
havetooccurinaregionofwidthe=64basesandheightw=758bases.Whilefilter-
ingweconsiderw×(e+Δ)parallelogramswithΔ=128asproposedinSection3.2.3
toreducespacerequirements.
Ourmatchingroutineiscapableofhandlingmulti-chromosomalgenomes,pro-
videdinmultipleFASTAfiles,andalsofindsmatchesforthereversecomplement
ofeachcontig.Toperformthematching,firstaq-gramindexiscreatedfortheref-
erencegenome.Onlyq-gramsconsistingofthelettersA,C,G,andTareconsidered,
notdistinguishingbetweenlower-andupper-caseletters.Ifthesequencescontain
othercharacters,forexampleoftheIUPACnucleotideambiguitycode,thesechar-
actersareignoredandq-gramscontainingthemarenotindexed.
Afterbuildingtheindex,eachcontigisprocessedinforwardandreversecomple-
mentarydirectiontofindallparallelogramsfulfillingthefiltercriterion.Thereverse
complementmatchingisnecessarysincetheorientationofthecontigswithrespect
tothereferencegenomeisusuallynotknown,andadditionallythegenomesmight
havebeenshufedbyinversioneventsduringevolution.
Normally,allfoundparallelogramsarealignedwithaSmith-Watermanlikeal-
gorithmtoverifywhethertheyaretruematches,andtofindthelocalalignments
ofthesequences.Inourimplementation,weomitthistimeconsumingstepand
considertheparallelogramsasmatches.Adisadvantageofthisis,thatourmatches
mightnotstartattheexactpositionofatruematchonthereference.Thepositions
maydifferbyuptoΔbasesbecauseoftheoverlappingw×(e+Δ)parallelograms
usedbytheimplementedmemoryimprovement.Additionally,itmighthappen
thatparallelogramsareshorterthan450basesandalsodonotstrictlyfulfillthe
filtercriterionduetothespeedimprovementmentionedinSection3.2.3thatwe
alsoimplemented.However,Rasmussenetal.remarkforthelattercasethatwhen
searchingforlocalalignmentsinbiologicalsequencestheregionstriggeringsuch
parallelogramsareoftenofgreatinterestanyway[76].Thatiswhywekeepthem.
Despiteoftheseobjections,thematchesofourfilterprovideanacceptablyac-
curateimpressionoftherelatednessoftwoDNAsequences,asFigure3.3demon-

34

3.3.r2cat–TheRelatedReferenceContigArrangementTool

Figure3.3:TwoCorynebacteriumurealyticumcontigsmatchedonthereferencege-
nomeCorynebacteriumjeikeium:(a)withtheq-gramfilterimplementedwithinr2cat,
and(b)generatedwithBLAST.

ontostrates.arForelatedthisreferfigurencee,wegenome:arbitrarilyfirstwithpickedourtwolarmatchinggercontigsoutineandandmatchedthesecondthem
timeusingBLAST[2].Bothplots,whichvisualizethematchesfoundbythepro-
Ongrams,theagrothereelarhand,gely.Sourfilterometimes,frequentlyBLASTshofindswssmallasinglematchescombinedthatourmatch,filterwhermisses.e
BLASThasseveralsmaller.Ingeneral,webelievethatthequalityofthematchesis
acceptableforthepurposeofdisplayingsynteniesandalsoforlayoutingcontigs.
*.r2cAllasmatchesextension.foundIncanthesebefiles,cachedalsointheorhumanderroftheeadablecontigstextisfilesstorfored,if,whichforweexample,chose
theautomaticarrangementofSection3.3.3hasbeenused.Besidesloadingand
savingmatchesinr2cat,itisalsopossibletoparseandmodifythefileswithother
programs,ortoeditthemmanually.
Toshowthatthematchingimplementedinr2catiscompetitiveinrunningtime,
wecomparedittothethreewellknownmatchingprogramsBLAST[2],BLAT[51]
andMUMmer[54].ForBLAST,weusedtwodifferentversions:Theprogramblastn
wascasesthancompletelyitsprrepredecessorogrammedblastallinC++implementedandinseemsC.toWehavechangedbecomethefasteroutputinforsomemat
ofbothBLASTversionsfrompairwisealignmenttothelessextensivetabular
outputthatislikewiseproducedbytheotherprograms.Thisslightlyimprovedthe
runningtimeoftheBLASTprograms.
Eachprogramwasusedontwoprokaryoticdatasetstomatchasetofcontigs
ontoareferencegenome:Thefirst,smalldatasetofStreptococcussuis,takenfrom[7],
nomeconsistsofofanother281contigsstrainofwiththisatotalspecies.sizeofDetails2.1ofMbpthethatcontigswerareelistedmatchedinonTablethe3.1ge-,
detailsofthereferencegenomearegiveninTable3.2.Thesecond,largerdataset
ofSinorhizobiummelilotihas446contigsandwasmatchedonanotherstrainwitha
totalsizeof6.7Mbp,giveninthreereplicons.Detailsaregiveninthetables.

35

Chapter3.EfficientMatchingofContigs

Tablepairs3.1:(bp).ContigTheN50datathatcontigwsizaseisusedtodefinedtestasthethesizematching.oftheSizlargestesarecontiggivensuchinbasethat
atleasthalfofthetotalsizeiscoveredbycontigslargerthanthatcontig.
ContigOrganism#ContigsTotalSizeN50ContigSizeMedianSize
SinorhizStreptococcusobiumsuismelilotiSM112814462,071,6017,187,91032,95934,2167742121

Table3.2:Thereferencegenomestotestthematching.Sizesaregiveninbase
(bp).pairsReferenceGenomeRepliconTypeSizeNCBINumber
Sinorhizobiummeliloti1021chromosome3,654,135NC_003047
SinorhizSinorhizobiumobiummelilotimeliloti10211021megaplasmidmegaplasmidpSymBpSymA1,683,3331,354,226NC_003078NC_003037
StreptococcussuisSC84chromosome2,095,898NC_012924

TheresultsforeachprogramanddatasetcanbefoundinTable3.3.Itshows
thetimethatwasneededformatchingandadditionallythenumberofcontigsthat
couldnotbematchedandthuscanneitherbevisualizednorarranged.
Mostprogramsdonotgivematchesforallcontigsexceptforblastallthatoutputs
matchesasshortas12bases.Itisarguableifthosematchescanhelptofindalayout
forthecontigssincethesematchesmightoccuronlybychance.
Thecompetingprogramsfindrealalignmentsandthushaveahardertaskthan
r2catthatonlyfindsparallelogramswhichcouldcontainamatch.However,when
lookingatFigure3.3,thereducedworkseemstobeadequateforaquickimpression
onthesyntenyofthesequences.Usingourownmatchingroutinehastheadvantage
thatr2cathasnodependenciesonotherprogramswhichmightbedifficulttoinstall
andmaintain.Touseexactmatchesnevertheless,wewroteanadditionalPerlscript
thatisabletoreformatthetab-separatedoutputofBLASTinordertocreatea*.r2c
filethatcanbeopenedwithr2cat.Inprinciple,matchesofanyotherprogramcan
beusedaslongastheyareprovidedinasimilartabformat.
Thematchingwasprimarilyimplementedtobeusedondesktopcomputersand
withprokaryoticgenomesinmindthattypicallyaresmallerthan12Mbp.Oncom-
puterswithsufficientmemory,thematchingisalsopossibleforsmallereukaryotic
genomes.Todemonstratethis,wematchedascaffoldoftheDrosophilaananassae
genome(Dana,revision1.3,13749sequenceswithatotalsizeof231Mbp)ontothe
genomeofDrosophilamelanogaster(Dmel,revision5.35,15chromosomeswithatotal
sizeof169Mbp).BothgenomeswereacquiredfromthewebsitesoftheFlyBase
project[97]inmultipleFASTAformat.ThescaffoldofDanaseemstobeindraft
statusandcontainsmanysmallsequences.Incontrast,Dmeliscompletelyfinished

36

3.3.r2cat–TheRelatedReferenceContigArrangementTool

Table3.3:Timesformatchingasetofcontigsonareferencegenome(averageof
twoconsecutiveruns).Additionally,thenumberofcontigshavingnomatchatall
isgiven.Theexperimentswereperformedonasparcv9processoroperatingat
MHz.1593melilotiobiumSinorhizsuisStreptococcusProgramVersionTime(s)UnmatchedTime(s)Unmatched
blastall2.2.2412.30110.90
blastn2.2.24+4.886121.272
bnlatucmer153.0744.99.010994674.240.89284
7539.31027.7r2cat

suchthatitcanbeusedasareferencegenome.Thematchingtook6.9hourson
thesamemachinethatwasusedfortheresultsinTable3.3.Thepeakmemorycon-
sumptionwasat3.1GBytewhilecreatingtheq-gramindex.Wealsomatchedthis
datasetwithblastallandsurprisinglythiswaswith6.6hoursafewminutesfaster.
Thismightbeexplainedbyourobservationthatr2catfoundfivetimesasmany
matches,measuredinthetotallengthofcontigsubstringsinvolvedinmatches.

Visualization3.3.2Thereisavarietyofgraphicaltoolstoexploreandanalyzegenomicdata.Foran
overview,seethereviewofNielsenetal.[69].
Weuseasyntenyplottovisualizesimilarparallelogramsofcontigsandreference
genome.Insuchaplot,thesequencesarerepresentedbythexandy-axisofa
coordinatesystemandmatchingregionsaredisplayedasdiagonallines.Thisvisu-
alizationisrelatedtoadotplotthatcanbederivedifinascorematrix,liketheone
ofSection3.1.1,everyscore-increasingdiagonalisplottedasadot.Syntenyplots
usuallycombineseveraldotstodiagonalsandareoftenippedvertically.
Figure3.4onthenextpageshowsanexemplarysyntenyplotthatwasgenerated
withr2cat.Forthisplot,thecontigsofCorynebacteriumurealyticumwerematched
ontothecloselyrelatedgenomeofCorynebacteriumjeikeiumasreferencegenome.
Thehorizontalaxisrepresentsthereferencegenome,andontheverticalaxisall
contigsarestackedintheorderoftheunderlyingFASTAfile.
EachparallelogramthatwasfoundbytheimplementedmatchingroutineofSec-
tion3.3.1isdisplayedasadiagonalline.Lineswithnegativeslopecorrespondto
reversecomplementarymatches.
Thehorizontalbaratthebottomoftheplothelpstoassessthecoverageofthe
matches:maximumcoverageisdisplayedinblackandfadestolightgreywithless
coverage.Uncoveredregionsarehighlightedinred.

37

Chapter3.EfficientMatchingofContigs

Figure3.4:Syntenyplotgeneratedbyr2cat.ThecontigsofCorynebacteriumure-
alyticumarematchedontothereferencesequenceofCorynebacteriumjeikeium.
DetailsofthesequencesaregiveninSection6.1.1.

Ourimplementationfeaturesanexportofthedisplayedsyntenyplotstoeither
bitmaporvectorbasedgraphicsformats.Sinceeditableformatslikescalablevector
graphics(SVG)aresupported,r2catiswellsuitedtoproducehighqualitysynteny
plotstobeusedinpublicationsandotherprintmedia.
Intheprogram,theviewareaiszoomableandpanablesuchthatmatchescan
beexaminedmoreclosely.Matchescanbeselectedintheplotandalsodisplayed
inaseparatetablewindow.Besidesthestartandstoppositionsofthematchesin
contigsandreferencegenome,alsothenumberofexactlymatchingq-gramsandan
estimatefortherepetitivenessofamatchisshown.Selectedmatchesaredisplayed
inredandmatchesbelongingtoacontigthatisselected,appearinorangeinside
theplot(notshowninFigure3.4).
Initially,thecontigsarestackedintheorderastheyappearintheFASTAfilethat
wasusedformatching.Therearetwopossibilitiestochangetheirorder:Eitherwith
theautomatedapproachthatisdescribedinSection3.3.3,ormanuallyinaseparate
windowshowingthecontigsinatable.Inadditiontomovingcontigsperdragand
dropinthistable,itisalsopossibletoreversecomplementacontigifthisseems
appropriate.Aftermatching,r2catautomaticallyreversecomplementsacontigif
themajorityofmatchesbelongtothereversecomplement.Contigsthathavebeen
reversedaredisplayedintheplotinbluecolor.
Asanadditionaltablecolumn,r2catshowsanestimationforeachcontighow
muchofitssequenceisrepetitiveaccordingtothereferencegenome.Detailsofthe
repeatdetectionaregiveninSection4.4.2.
Whilethemainfocusofourtoolistoarrangeasetofcontigs,thesyntenyvi-
sualizationcanalsobeusedtoinvestigatetherelationshipbetweentwospeciesif,

38

3.3.r2cat–TheRelatedReferenceContigArrangementTool

Figure3.5:Syntenyplotoftwogenomes.ThereferenceisthesameasinFigure3.4,
butinsteadofthecontigsofCorynebacteriumurealyticum,itsalreadyfinishedge-
nomeisusedasquery.

insteadofthecontigs,thegenomicsequenceofanothergenomeischosenformatch-
seing.veralExemplarsmallery,Figurinsertionse3.5andshowsdeletionssuchacanplotbewherobserevaed.largescaleAdditionallyinv,rersionepetitivande
elementscanbeseenthatappearinregularpatternsasidefromthemaindiagonals.

Simple3.3.3MappingContig

Alternativelytochangingtheorderofthecontigsmanually,r2catfeaturesanauto-
maticarrangementofthecontigsbasedontheirmatchestothedisplayedreference
Togenome.perforThemorthederofmapping,thewcontigseisconsiderinferrforedbeachymappingcontigthethemcovontoeragetheofritseference.matches
–tofrtheomrefercontigencectothegenome:referLetencecov(cgenome,p)be–thatthecoversmaximumasinglenumberbaseofqposition-hitsofp.aWematchuse
themaximuminsteadofasumtoavoidcountingtheq-hitsinoverlappingmatches
severaltimes.Tofindwhereacontighasthebestcoverage,weuseaslidingwindow
approach.Inawindowofthesizeofacontig,welookattheaveragewindowcoverage
itsthatsize.isBydefinedslidingasthethesumwindoofwcovo(vc,erp)theovrerefertheence,wpositionseincalculatethatthewindoavwerage,dividedwindobwy
covpositionerageswherforeallityieldspossiblethepositions.highestavEacheragecontigwindoiswcothenverage.mappedFigurtoe3.6thatshowswindothew
resultofapplyingthisproceduretothematchesdisplayedinFigure3.4.
Afteranautomatedarrangement,thedisplayedorderofthecontigsinr2catcan
abe+orexported−tosign,textfilesindicatingwheretheeachinferrlineedgivesorientationtheidentifierwithrofespectacontig,totheasworiginalellas

39

Chapter3.EfficientMatchingofContigs

Figure3.6:ThecontigsdisplayedinFigure3.4areorderedwiththesimplemapping
approachimplementedinr2cat.Matchesinblueindicatethatthecontigisreverse
complemented.

FASTAfile.IftheoriginalFASTAfileisstillavailable,itisalsopossibletoexport
thecontigs,orderedandoriented,intoacopyofit.
Inintegratedarecentanextensionautomaticofprimerr2cat,YvdesignonnestepHertomannaid–theonegapofourclosingbachelorprocessstudentsinthe–
adjacentfinishingcontigsphaseofcanabeusedsequencingtopramplifyoject.thePrimerDNAinpairsbetwattheeen,boranddersofsequenceputativitsub-ely
instancesequently.theForGCthecontent,primerorthedesign,aoccurrvarietyenceofofrelevanthomopolymers,sequenceareefeaturves,aluatedlikeandfor
scorassuredetothatfindtheyhasuitableveprimercomparablecandidates.meltingThetemperaturbestescandidatesandthataretheythenbindpairtoedtheto
targetsequenceinsteadoftoeachother.
Thematchingofcontigstoareferencegenomecanbeconsideredvaluableforgap
closingsynteny.purStill,poses,theresultsassumingofthethatthesimplecorrmappingespondingofthisgenomessectionhavhaevaetohighbedegrhandledeeof
withcare.Figure3.7showsthecontigsofCorynebacteriumurealyticumintheirtrue
ordermatching,matchedthecontigsondifontoferentitsralrefereadyencefinishedgenomes.genome.ThetrueThisorfigurderewasshowsobtainedpossibleby
causeswhyasimplereferencebasedapproachmightnotbesufficient:

40

1.Largescaleinversionsmightsuggestthattwocontigsareadjacentwhilein
facttheyarenot,likeinFigure3.7(a).Here,thebreakpointsoftheinversion,
whicharethepointswheretheinvertedsequencehadbeencut,coincidewith
thebordersofthecontigsandthuscannotbedetected.Inothercases,like
theinversioninthecenterofFigure3.7(b),thebreakpointsaredetectable,
sinceasinglecontighasmatchesontheforwardandbackwardstrandofthe

3.3.r2cat–TheRelatedReferenceContigArrangementTool

Figure3.7:PairwisesyntenyplotsofthecontigsofCorynebacteriumurealyticumand
fourchosencompletegenomesoftheCorynebacteriagenus.Detailsofthese-
quencesaregiveninSection6.1.1.

reference.Insimplecases,thismightbeuntangled;however,theinuenceof
severalinversionslikeinFigure3.7(d)complicatesthistask.

2.Forpensaifgivtheenrdegrefereeence,ofsomesyntenyiscontigslow,ascannotforbeexamplematchedinrFigureasonablye3.7.(c).ThisAlterhap--
nativduringely,thethecoursesequenceofevofaolution,contigor,equivmighthaalentlyve,itsbeensequencedeletedincouldtheharveferebeenence
insertedintothecontigsgenome.Ifacontigisnotmatched,itcannotbecon-
gapstainedinbetwareenefertheencecontigs.basedlayoutFortunatelyand,ifthusthenotreferbeenceconsiderisedcloseforenough,closingonlythe
verysmallcontigscannotbematchedatall.

3.erRepetitivence.eThecontigsactualareplacementmappedtoisonemerofelytheirdonebpossibleychanceoccurrwhenencestheonthehighestref-

41

Chapter3.EfficientMatchingofContigs

numberofmatchesisconsidered.Figure3.7(a)showsthetypicalpatternsof
matches.eepetitivr

Someoftheseobstaclescanbeavoided,oratleastalleviated,ifnotonebutseveral
referencegenomesareconsidered.Whilethesimplemappingapproachdescribed
inthissectionisnotcapableofhandlingseveralreferences,amoresophisticated
approachwillbedescribedinChapter4.There,weshowhowtoincorporatesev-
eralreferencegenomes,aswellastheirphylogeneticrelationship,andalsogivean
algorithmtotreatrepetitivecontigsinaspecialway.

42

4Chapter

AdvancedContigLayoutingusing
GenomeserenceRefMultiple

Iappreciatetheoreticalworkforitselegance,yetfinditsterile
whenitistoodetachedfrompracticalvalue.
(JustinZobel,WritingforComputerScience)

Asimplemappingofcontigstoasinglereferencegenomemightnotbesufficient
todeviseavaluablelayoutofthecontigs.Usingseveralrelatedgenomesasrefer-
encescanimprovethepredictionsofneighboringcontigsformoredistantlyrelated
genomes[109].However,conictinginformationmayarisethatcomplicatestofind
alayoutinwhichthecontigsareuniquelyordered.
isIncapablethistochapteremplo,wyesevdescribeeralraelatedstrategygenomestofindasarlaeferyoutences.ofaOursetofapprcontigsoachusesthat
allsimilaritiesbetweenthecontigsandthereferencegenomestocollecthintsthat
apairofcontigsisadjacent.Thehintsaregatheredinaweightedgraphthatis
ofintraoducedsequencinginSprectionoject.4.1.ToThethisweightsmatter,inSthisectiongraph4.2candescribeshelpinathefastfinishingalgorithmphasefor
estimatingalayoutofthecontigswithrespecttoagivengraph.Afterthis,we
discussenhancementsofthegraphcreationinSection4.3,andpossiblevariations
ofthelayoutingalgorithminSection4.4.

43

Chapter4.AdvancedContigLayoutingusingMultipleReferenceGenomes

4.1TheContigAdjacencyGraph

Inthissection,weprovidetheformalnotationforthecontigadjacencygraphthat
weusetocollectadjacencyinformationofasetofcontigs.Basically,thisgraph
containsedgesforallpossibleadjacencies.Theseareweightedinawaythatthe
edgesbetweenpotentialneighborsreceivehighweights.Theweightingisdoneby
analyzingthematchesofthecontigswithrespecttoasetofreferencegenomes.
Eachreferencegenomecanbeusedindependentlytocomputetheweights,anditis
straightforwardtomergetheinformationofseveralreferencesintoasinglegraph.
Inthefollowing,wewillfirstintroducethenotationforthecontigadjacency
graph,thenmotivatetheedgeweightingfunctionandfinally,inSection4.1.3,give
analgorithmtoconstructthegraph.

Notation4.1.1

RecallthenotationforstringsandmatchesgiveninSection3.1.Inthischapter,let
Σ={A,C,G,T}bethealphabetofnucleotidessuchthatΣ∗isthesetofallpossible
finiteDNAsequences.SupposewearegivenasetofcontigsC={c1,...,cn},ci∈Σ∗,
andasetofalreadyfinishedreferencegenomesR={g1,...,g|R|},gr∈Σ∗.
Asalreadyannounced,weusematchesofthecontigstorelatedreferencege-
nomestoinferinformationaboutthelayoutofthecontigs.Letm=(sb,se),(tb,te)
beamatchofcontigcandreferenceg.Thismeansthats=c[sb,se]isasubstring
ofthecontig,t=g[tb,te]isasubstringofthereferencegenome,andbothsequences
aresimilar.Thelengthofamatch,|m|:=|t|=te−tb+1,isdefinedasthelengthof
thecoveredsubstringinthereferencegenome.Forsb>sewedefinec[sb,se]tobe
thereversecomplementofc[se,sb]andcallmareversematch.Further,weassume
withoutlossofgeneralitythattb<teforallg[tb,te].Otherwisewecanreplaceboth
involvedsubstringsbytheirreversecomplements.
Wegeneratethematcheswiththeq-gramfilterdescribedinSection3.3.1.Thus,
foreachmatchm,thenumberofexactlymatchingq-gramsisprovidedwhichcan
beusedasaqualityestimationforthatmatch.Werefertothisnumberasqhits(m).
Thesetofmatchesbetweenacontigci∈Candreferencegenomegr∈Risinthe
followingdenotedasMir={m1,...,ms}.
Eachmatchm=(sb,se),(tb,te)∈Mircanbeinterpretedasaprojectionofcon-
tigciontothereferencegenomegr.Theprojectedcontigπ(m):=(tb−sb),(te+
|ci|−se)referstotheimpliedpairofindexpositionsongr.Forreversecomple-
mentmatches,theprojectioncanbedefinedsimilarly.Notethatthesizeofthe
projectedcontigcandeviatefromtherealsizeofacontigduetoinsertionsordele-
tionsinthematch.Figure4.1showsanexampleoftwoprojectedcontigsaswellas
theirdistance,whichwedefinenext.

44

4.1.TheContigAdjacencyGraph

Figuremand4.1:m.TheProjectionsdistanceπ(md)andreflectsπ(mthe)ofthedisplacementcontigscofandthecbasedprojectionson.theirmatches

Thedistanceoftwoprojectedcontigsπ(m)=(pb,pe)andπ(m)=(pb,pe)isgiven
bythefollowingfunction:
pb−peifpb<pb
dπ(m),π(m)=pb−peifpb>pb(4.1)
−min{|m|,|m|}ifpb=pb.
isIftheundefined.matchesrNoteefertothatdiftheferentterrmeferencedistanceisgenomes,usedthehereindistancetheofsensetheirofprdisplace-ojections
ment,anddisnotametricinthemathematicalsense.Forexample,disnegativeif
theprojectedcontigsoverlap.
Now,wedefinetheedge-weightedcontigadjacencygraphGC,R=(V,E),whichis
thecentralconceptofourlayoutingapproach.Thecontigadjacencygraphcontains
foreachcontigci∈Ctwovertices:liastheleftconnectorandriastherightconnec-
torofcontigci.ThesetofverticesVisthusdefinedasV={l1,...,ln,r1,...,rn}.
Afunctioncontig(v)referstothecontigforwhichvertexvrepresentstheleftor
.connectorrightThecontigadjacencygraphGisfullyconnected,thatisE=(V),andwe
termA={{v,v}∈E|contig(v)=C,Rcontig(v)}theadjacencyedgesthat2connectthe
contigsamongeachother.Theremainingedgesaretermedtheintracontigedges
I={{l1,r1},...,{ln,rn}}.Eachintracontigedge{li,ri}∈Irepresentsthecontigci.
Toeasetheunderstandingofthisconcept,Figure4.2onthefollowingpageshows
anexemplarygraphwithfourcontigs.
Byusingaleftandarightconnectorpercontig,theadjacencyedgesencodethe
relativeorientationoftwocontigs.Theedge{ra,lb},forexample,statesthatthe
rightendofcontigcaisadjacenttotheleftendofcontigcb,whichmeansthat
bothcontigshavethesamedirection.Weillustratethisadjacencyedgeby−ca→−c→b,or
equivalentlyby←c−b←c−a,sincetheedgeisnotdirected.Theadjacencyedge{la,lb},on
thecontrary,indicatesthatoneofthecontigsisreversedtowardstheother,depicted

45

Chapter4.AdvancedContigLayoutingusingMultipleReferenceGenomes

Figureedges4.2:IareExemplaromitted.ycontigInsteadofadjacencythis,gtheraphcontigscontainingaredrafourwnascontigspictog.rTheams.intracontig

aseither←c−a−c→b,or←c−b−c→a.Thelastpossiblecaseofanadjacencyofthetwocontigs
wouldbe−c→a←c−b,or−c→b←c−arespectively,givenbytheedge{ra,rb}.
Theedgeweightsofthecontigadjacencygrapharecalculatedwithafunction
w:E→R0+thatwillbedefinedinSection4.1.2.Weareprimarilyinterestedinthe
weightsoftheadjacencyedgesA.Theseshallprovideascoreofhowlikelythe
involvedconnectorsareadjacentwithrespecttothereferencegenomesgr∈R.We
callthisthesupportoftwocontigsbeingadjacent.Forconvenience,weassumethat
wecanretrievethesupportvaluesaftertheircomputationfromthesymmetrical
contigadjacencymatrixW,inwhicharow(orcolumn)containsallweightsinvolving
connector:contigparticulara0...w({r1,rn})w({r1,l1})...w({r1,ln})
...−ci→←cj−......−ci→−cj→...
W=w({rn,r1})...0w({rn,l1})...w({rn,ln})
w({l1,r1})...w({l1,rn})0...w({l1,ln})
...←ci−←cj−......←ci−−cj→...
w({ln,r1})...w({ln,rn})w({ln,l1})...0
(4.2)Eachquadrantofthismatrixcontainsthattypeofadjacencyedgesasdepicted
inthecentralboxes.Theweightsinthemaindiagonalaresettozerosincethey
correspondtoself-loopsofthecontigconnectorswhicharenotconsideredinour
graph.adjacencycontigWecallthesumoftheweightsofalledgesincidenttoanodev∈Vthetotalsupport
ofthatnode,denotedbySv=∑v∈Vw({v,v}).InMatrix(4.2),thisisequivalentto

46

4.1.TheContigAdjacencyGraph

therow-(orcolumn-wise)sumofacontigconnector.Toestimatehowsignificant
anadjacencyedgee={v,v}∈Aisforagivencontigconnectorv∈V,weconsider
itsrelativesupport:Svrel(e)=wS(ve).Intuitively,thisfractiontellshowspecificthe
connectionisforthegivencontigconnector.Asinglehighweightededgeresultsin
arelativesupportcloseto100%,whilemanyequallygoodconnectionswilllower
thevalue.NotethatingeneralSvrel({v,v})=Svrel({v,v}).
4.1.2WeightingtheAdjacencyEdges
Foreachintracontigedgee∈I,wesettheweightw(e)=0sincethesedonottellus
abouttherelationshipbetweenthecontigs.Fortheotheredges,lete={vi,vj}∈A
beanadjacencybetweenthecontigsci=contig(vi)andcj=contig(vj).Then,the
totalweightofthisadjacencyedgeisdefinedas
w(e)=∑wr(vi,vj)(4.3)
gR∈rwherethe(symmetric)functionwrdefinesthesupportofthisadjacencywithrespect
toasinglereferencegenomegr.Eachsupportwr(vi,vj)isbasedonthematchesof
theinvolvedcontigsonthatreference:
wr(vi,vj)=∑sdπ(m),π(m)∙qhits(m)∙qhits(m).(4.4)
m∈Mir,m∈Mjr
Here,disthedistancebetweentwoprojectedcontigs,seeEquation(4.1),ands(d)
isasuitablydefinedscoringfactortoweightthematchesbasedonthedistanceof
theirprojections.Notethatinsteadofthenumberofq-hitsasqualitymeasurefor
thematches,alsotheBLASTbit-scorecanbeused.
Thescoringfactors(d)isbasedonthefollowingobservationsconcerningthe
contigs:ojectedprofdistances1.Projectedcontigsthatarenotadjacenthave,ingeneral,ahighdistanceand
shouldobtainalowscore.Adjacentcontigsshouldgainahighscoreforusu-
allyhavingadistanceclosetozero.However,asillustratedinFigure4.3(a),
thedistanceoftwoprojectedcontigscanreachpositivevaluesduetoinser-
tionsintothereferencegenome.Similarly,thedistancescanbenegativeif
theprojectionsoverlap,whichisthecaseifthereareinsertionsinthenewly
sequencedgenome.ThiscaseisshowninFigure4.3(b).Notethataninsertion
inonegenomecorrespondstoadeletionintheother,andviceversa.
Wemodeltheeffectofinsertionsanddeletionsinthereferencegenomewith
arandomvariableX∼N(µX,σX2)thatsatisfiesaGaussiandistribution.The
expectedvalueµXiszero,andthestandarddeviationσXcorrelateswiththe
deletions.andinsertionsofsize

47

Chapter4.AdvancedContigLayoutingusingMultipleReferenceGenomes

Figurewhereas4.3:(b)(a)anAninserinsertiontionininathecontigrefleadserencetoagenomenegativeleadsdistancetoa.positivedistance,

2.Inthefragmentationphaseofasequencingproject,oftenfragmentsdisappear
suchthattherearenoreadsforthisfragment.Additionally,theassembler
softwaremightdiscardunreliablereadsattheendsofthecontigs,whichalso
leadstomissingsequenceinformation.Ifpiecesofthenewlysequencedge-
nomearemissing,thesamesituationarisesasifthereisaninsertionintothe
referencegenome,whichcausespositivedistances.
Theeffectofmissingsequencescanbemodeledaccordingtoarandomvari-
ableY∼N(µY,σY2)thatalsoobeysaGaussiandistribution.Here,theex-
pectedvalueµYmodelstheaveragesizeofthelostfragments,andσY2models
ariation.vtheirTotakethementionedeffectsintoaccount,weuseasuperpositionofbothGaussian
distributions.AssumingthatXandYareindependent,theirsumisalsodistributed
accordingtoaGaussian2distribution2[27,2Ch.11.2]:X+Y∼N(µX+µY,σX2+σY2).
Hence,wesetµ=µYandσ=σX+σY,anduseforourscoringthecombined
GaussiandistributionN(µ,σ2):
2s(d):=√1e−21dσ−µ.(4.5)
π2σThecombinedparametersµandσcanbeestimatedfromalreadyfinishedsequenc-
ingprojects,asdoneintheevaluationinSection6.1.3.Basedonthisestimation,we
canassumethatbothµ>0andσ>0.

4.1.3CreatingaBasicContigAdjacencyGraph
Usingtheabovedefinitions,acontigadjacencygraphcanbecreatedforasetofcon-
tigsandasetofreferencegenomesasdescribedinAlgorithm1.Foreachreference
genome,thecontigsarematchedindependently.Basedontheprojectedcontigs,the
weightsoftheedgesarecalculatedandintegratedintothecontigadjacencygraph.

48

4.1.TheContigAdjacencyGraph

Algorithm1:BasicContigAdjacencyGraphCreation
Input:setofcontigsC,setofrelatedgenomesR
Output:contigadjacencygraphGC,R
1initializecontigadjacencygraphGC,R=(V,E)withw(e)=0foralle∈E
2foreachreferencegenomegr∈Rdo
3foreachcontigci∈Crdo
matchesfind4M5calculateforallimatchesm∈Mirtheprojectedcontigπ(m)
end67foreachpairofcontigconnectorse={vi,vj}∈Ado
8computetheweightwr(vi,vj)andaddittow(e)
end9end10

Thisprocedureisthebasicwaytocreateacontigadjacencygraphandwewill
introduceenhancementstoitinSection4.3.Theseweredesignedtoimprovethe
reliabilityoftheedgesifadditionalinformationisavailable.Forexample,wepro-
videamodifiededgeweightfunctionthatisabletoincorporatethephylogenetic
distancesoftheinvolvedspecies.

PropertiesoftheGraphTheweightsofacontigadjacencygraphcreatedwith
Algorithm1havesomenoteworthyproperties:Strongmatchesoftwocontigsthat
haveahighnumberofq-hitsproduceahighsupportoftheemployedcontigcon-
nectorsiftheirprojectedcontigsoccurcloseonareferencegenome.
Smallscaleevolutionaryevents,likemutationsorsmallerrearrangements,may
resultinafragmentationoflargematchesintoseveralweakerones.However,our
scoringshouldberobusttothis,sincetheprojectedcontigsofthesematchescoin-
cidewiththeprojectionoftheunfragmentedmatch.Thedistancesarethusalike,
andconsequentlytheweightsofallmatchfragmentscontributetoahighweight
intotal.Incontrast,projectedcontigsofweakmatchesthatoccuronlybychance
usuallyhavelargedistancestootherprojectionssuchthatthescorefactor,andthus
alsotheweightcontributedtoanadjacencysupport,islow.
TheapproachofAlgorithm1assumesthatthereferencegenomesarecloselyre-
lated.Infact,largescalerearrangementscanbeproblematicforthesignificanceof
theadjacencysupportvalues:Alargeinsertiononareference,forinstance,causes
anadjacencyoftwoprojectedcontigssurroundingittogainalowweight.Insuch
casesitisadvantageousthatweuseseveralreferencegenomes.Asinglereference
genomethatdoesnothavethisinsertionissufficienttoalsogainanoticeablesup-
portfortheproperadjacency.Ifwelookatlargescaleinversions,thesecancause
highweightedgesforcontigsthatareinfactnotadjacent.Whenusingseveralrefer-

49

Chapter4.AdvancedContigLayoutingusingMultipleReferenceGenomes

Figure4.4:Overviewofourapproachtofindalayoutforthecontigs.First,theedges
ofthecontigadjacencygraphareweightedaccordingtoAlgorithm1,andthenthe
bestadjacenciesareextractedtoalayoutgraph,asdescribedinSection4.2.2.

encegenomes,thiscanleadtoconictinginformationforacontigconnectorsince
wedonotknowwhichofthehighweightadjacenciesiscorrect.
Anotherpropertyofthecontigadjacencygraphconcernsrepetitivecontigs.These
have,intermsofadjacencies,severalneighbors.Intheadjacencymatrix,thisbe-
comesapparentbyseveralhighentriesinarow(orcolumn)ifthecontigsarealso
repetitiveonatleastoneofthereferencegenomes.Thismeansthatarepetitive
contigusuallyhasnotonebutseveralsupportedadjacenciespercontigconnector.
Tosummarize,thecontigadjacencygraphcontainsthecollectedinformation
givenbyallmatchestothereferences.Figure4.4(center)illustratesthisproperty.
Althoughrearrangementsandrepetitivecontigsmaycausehighweightsthathave
tobehandledwithcare,wecanusethecalculatededgeweightstofindalayoutof
thecontigs.Thenextsectionfocusesonhowtoextractthemostpromisingedges
task.thisfor

4.2FindingaLayoutoftheContigs

Givenacontigadjacencygraph,wewanttofindasubgraphofitthatcontainsall
relevantadjacenciesinordertoeasethegapclosurephaseofasequencingproject.
Wecallanysubgraphwiththispropertyalayoutgraphofasetofcontigs.Ideally,
theedgesofalayoutgraphbuildasinglepaththatcontainseachcontigonce.We
callthisalinearlayoutofthecontigs.

4.2.1TravelingSalesmanTourThroughtheGraph
Anaturalapproachtodiscoveralinearlayoutistofindatourofmaximalweight
thatcontainseachcontigexactlyonce,andinaspecifieddirection.Withthefollow-
ingminormodificationsofthegraph,thisbecomesequivalenttofindingashortest
Hamiltoniancycle:Alledgeweightshavetobeconvertedtodistances.Thisisdone
byreplacingeachedgeweightwbyc−wwherecisaconstantthatisnotlowerthan

50

4.2.FindingaLayoutoftheContigs

themaximumweightinthegraph.Further,weaddanintermediatenodebetween
theleftandtherightconnectorofeachcontigtoensurethateachcontigisincorpo-
ratedexactlyonce,andonlyinonedirection.Themodifiedgraphisthendefinedas
GC,R=(V,E)withV=V∪{vi|1≤i≤n}andE=A∪{{li,vi},{vi,ri}|1≤i≤n}.
Thedistanceofalledgesthatleadtoanintermediatenodeviissetto0.
AshortestHamiltoniancycleinthemodifiedgraphGdefinesanorderaswellas
themanrprelativobleme(TSP)orientationcanbeofallusedtocontigs.findaThus,linearanylayoutalgorithmofthetosolvcontigsethethatistravelingoptimalsales-
withrespecttotheweightsoftheunderlyingcontigadjacencygraph.Thenaiveap-
proach,whichistocalculatethelengthsofallpossibletours,isalreadyunfeasible
formorethanafewnodessincetheTSPisNPhard.Besidesthenaiveapproach,
therearemanyotheralgorithmstosolvetheTSP[6].Thesecanbedividedinrun-
timeheuristicsthatfindanoptimalsolutionwhilebeingintheexpectedcasefaster
–thanalthoughthenaivnoteapprnecessarilyoach,andoptimal–exactnesssolution.heuristicsthatveryquicklyfindagood
lemRuntimeinstancesofheuristicsuptolikeseveralnodes.branch-and-boundHowever,inalgorithmsthewcanorstbecase,usedthetoysolvstilleprneedob-
exponentialtime.Forhundredsofcontigs,thetimedemandtocomputeasolution
feasible.notstillthusisIncomparisontoalgorithmsprovidingexactsolutions,manyexactnessheuristics
aremuchfaster.Averyfastgreedyalgorithm,forinstance,isthemulti-fragment
incrheuristiceasing[13]distancethatprandoceedsthenasaddedfollows:inthisFirstordertheintoedgesanoftheinitiallygraphemptyaresetofsortedpathby
orifafragments.pathWhenefragmentverwanouldinvcrolveateedanodecycle,wtheouldedgeexceedistheskipped.maximalTheonlydegreeofexceptiontwo,
prtotheocedurlatterecriseatesthefinalmultiplelowHamiltoniandistancecyclepathoflengthfragmentsn.Thiswhichbestaremerconnectiongedsoonerfirst
.laterorTheAlthoughitprmulti-fragmentoducesnotheuristicnecessarilyiswellansuitedoptimaltofindtoura,itlineargatherslayouthighofthesupportcontigs.ad-
layjacencies.outingbeThismorelocalvaluableoptimizationthanofspendingthemuchadjacenciestimetomightingloballytheoptimizecontextofthecontigtour.

Inbrief,alinearlayoutofthecontigsthatisoptimal,ornearoptimal,withre-
specttotheadjacencyedgeweights,canbecomputedusingasuitableTSPalgo-
rithm.However,wefoundoutthatalinearlayoutofthecontigsisnotnecessarily
arrangedbiologicallyrregions.elevant.AThismethodisthatmainlyproduevidestoaanuniquearbitrarlayyoutplacementwhereofrpossible,epeatedbutoralsore-
pointsoutalternativesolutionswherenecessary,maybemoreusefulinpractice.In
thenextsection,wepresentourapproachtotacklethisproblem.

51

Chapter4.AdvancedContigLayoutingusingMultipleReferenceGenomes

4.2.2FastAdjacencyDiscoveryAlgorithm

Ourapproachtodiscoverrelevantadjacenciesfromacontigadjacencygraphis
basedonthemulti-fragmentheuristicintroducedintheprevioussection.Wechose
thisgreedyheuristicbecauseitseemsnaturaltofirstincorporatethoseadjacencies
intoalayoutthataremostpromisingtobeinvestigatedforgapclosure.
Asalreadyindicated,repeatingorrearrangedregionsmayprohibitanunambigu-
ouslinearlayoutofthecontigs.Repeatingcontigscreatecyclesinapossiblepath,
andrearrangementscanleadtoconictingadjacenciesofacontig.Toaccountfor
this,werelaxtheconstraintsofthemulti-fragmentheuristic:First,weallowcycles
thatcouldappearduetorepetitivecontigs.Secondly,wheninsertinganedge,we
permitoneoftheincidentnodes,butnotboth,toexceedadegreeoftwo.This
allowstoalsoincludeconictinginformationintoalayout.
OurproceduretoextractalayoutgraphisformallydescribedinAlgorithm2.The
inputisacontigadjacencygraph,forinstancecreatedwithAlgorithm1.Westart
withthemostpromisingedges–thosewiththehighestsupport–andintegrate
themonebyoneintotheinitiallyemptylayoutgraph,exceptifbothoftheinvolved
contigconnectorshavealreadybeenintegrated.Whenanadjacencyedgeisinte-
grated,wealsosaytheedgeisrealizedinthelayout.Toavoidthattoofaintedges
arerealizedinthefinallayout,onecancheckiftherelativesupportofanedgeto
beintegratedwithrespecttobothcontigconnectorsisaboveacertainthreshold.
Theresultofourgreedyalgorithmisalayoutgraph,asexemplifiedinFig-
ure4.4(right).There,theadjacency−c→4←c−2,havingthehighestweight,isrealized
first.Theedge{r3,r2}isintroducedlater,althoughconicting,becauseithasa
higherweightthan{l4,r3},andthusr3isnotoccupiedyet.Finally,theadjacency
←c−4←c−3isrealized,sincel4isthelastfreeconnector.
Theresultinglayoutgraphusuallydoesnotdescribealinearlayout,andingen-
eralthegraphisnotnecessarilyconnected.However,itcontainsmanyofthe
stronglysupportedadjacenciesofthecontigsandincludesatleastoneedgeforeach
contigconnector.Thebestedgesarerealizedfirstandthenpaddedwithpossibly
conictinginformationsuchthatallcontigconnectorsareincludedinthelayout.
Allunambiguouslyincorporatedcontigscanbehelpfulinthefinishingprocess
toguidetheprimerdesignforgapclosure.Yet,knowingaboutconictingedges
canalsobecontributivesincetheseindicatepossiblerearrangementsorshowthe
inuenceofrepetitivecontigs.So,insteadofpinningtheresultdowntoasingle,
possiblywrong,linearlayoutofthecontigs,weprefertooutputthebestpossibilities.
Nonethelessitshouldbekeptinmindthatrearrangementscancauseseemingly
goodadjacenciesthatdonotbelongtoacorrectlayout.

ofthisAlgorithmlayouting,2isforourbasicexampleappraoachspecialtotrlayouteatmentcontigs.forrSepetitivectione4.4contigs.showsvariations

52

4.3.EnhancementsoftheGraphCreation

Algorithm2:BasicContigAdjacencyDiscovery
Input:contigadjacencygraphGC,R=(V,I∪A)
Output:layoutgraphLofthecontigs
1createemptylayoutgraphL=(VL,EL)withVL=∅andEL=∅
2foreachadjacencyedgee={v,v}∈A,sortedbydecreasingweightw(e)do
3ifVL∩{v,v}≤1then
4VL=VL∪{v,v}
5EL=EL∪e
end6end78EL=EL∪I

4.3EnhancementsoftheGraphCreation
Sofar,weintroducedthecontigadjacencygraphanditscreation,aswellasalayout-
ingalgorithmtoextracttheinterestingedges.Thissectionexplainshowadditional
informationcanbeusedinthegraphcreationphasetoimprovethereliabilityofthe
calculatedeights.wedge

4.3.1IncludingPhylogeneticDistances
Thereferencegenomesusedforlayoutingaretypicallyrelatedtodifferentdegrees
tothecontigsgenome.Sometimes,aphylogenetictreeofthespeciesisavailable
thatcontainsphylogeneticdistancesofthespeciestowardseachother.Ifnot,such
atreecanbegeneratedevenifsomegenomesarenotcompletelyassembledyet,for
example,fromthehighlyconserved16SribosomalRNA.
Givenaphylogenetictree,wecanuseittoweightthematchesaccordingtothe
relatednessofthereferencegenome.Assumingthatbetweencloserrelatedspecies
lessrearrangementshavetakenplace,thisweightingalsohelpstoavoidcontradict-
ingedgesinthecontigadjacencygraphthatcanbecausedbyrearrangements.
WhenaphylogenetictreeToftheinvolvedspeciesisavailable,wecanusethe
containedevolutionarydistancestochangethescorefactors(d)fromEquation(4.5)
atpage48to:2
sT(d,dT):=1√e−21ddT−∙µσ(4.6)
dT∙σ2π
wheredTisthetreedistancetotheparticularreferencegenome.
AsillustratedinFigure4.5,ahighertreedistancedTallowslargerinsertions
anddeletions,butscoresthereliabilityofthematchestomoredistantlyrelated
genomestoalesserdegree.TousethescorefactorofEquation(4.6)asdisplayed,it

53

Chapter4.AdvancedContigLayoutingusingMultipleReferenceGenomes

Figureetersσ4.5:=10,Influence000andofthµe=ph0arbitrylogeneticarilytreechosenfdistanceorillustrgivenation.inEquation(4.6).Param-

isadvisabletonormalizethephylogeneticdistancessuchthattheclosestreference
specieshasadistanceofoneandlessrelatedgenomeshaveahigherdistance.
Thebasiclayoutingalgorithm,togetherwiththisphylogenetictreeenhancement
hasbeenpublished[41],andtheimplementationisknownastreecat,thephyloge-
netictreebasedcontigarrangementtool.Later,intheevaluationinChapter6,we
willreferbythisnametothecorrespondingalgorithmsandenhancements.

4.3.2IntegratingAdditionalInformation
Insequencingprojects,oftenadditionalinformationoccurswhichcanbehelpfulto
layoutthecontigs.Section2.3.2mentionsforexamplematepairs,fosmidlibraries,
oropticalrestrictionmaps.Thesepiecesofinformationcanbeincludedintoour
approachbymodifyingtheweightsofthecontigadjacencygraphaftertheircompu-
tation,whichtheninuencesthepredictedadjacenciesinalayoutgraph.Ifexpert
informationindicatesthattwocontigsarenotadjacent,itsufficestosettheap-
propriateedgeweighttozero.Thiscontigconnectionwillnotoccurintheresult
afterwards.Onthecontrary,ifforexamplefosmidendsequencingshowsthattwo
contigsareadjacentandquiteclose,theincorporationofthatedgeintothelayout
graphcanbeforcedbysettingthecorrespondingedgeweighttothemaximum
graph.theofeightw

4.4VariationsoftheContigLayouting
rInearrangementsapplicationsofinthetherbasiceferlaenceyoutinggenomeasalgorithmwellastorrealepetitivdata,eitcontigsbecamearetheclearmainthat
reasonformisleadingpredictionsofadjacentcontigsinthelayoutgraph.Inthe
wnextedealsection,withrweepetitivdiscussethecontigsanddetectiongiveandanhandlingalgorithmoftorincludeearrangements.themmoreAfterapprthat,o-
priatelyinSection4.4.2.

54

4.4.VariationsoftheContigLayouting

ementsRearrangHandling4.4.1Rearrangementsthathappenedbetweenthenewlysequencedgenomeandtheref-
erencegenomecancausefalseormissingadjacenciesinacomputedlayoutofthe
contigs.examineWhoewfirsttodealconsiderwithrlargeearrangementsscaleinvdueersions.toinsertionsanddeletions,andthen
Aninsertioninoneofthegenomes,forexamplecausedbyhorizontalgenetrans-
fer,cannotbedistinguishedfromapotentialdeletionofthatsequenceblockinthe
otherlently..IfIntherefersizeenceofarbasedespectivcontigelaysequenceouting,blockbothisrathercasescansmall,thusthebertreatedearrangementequivisa-
sufficientlyhandledbyourscoringfactorforthedistancesofprojectedcontigs,as
describedinSection4.1.3.
theOnawholelargercontigscale,isafitfected,dependsthenwhichthereispartnoofinforthecontigmationwhoaswtoinsertedlayoutorthisdeleted:contig.If
Theonlychanceinthiscaseistousemore,orotherrelatedreferencesinthehope
thattheycontainthenecessaryinformation.Ifonlyaninnerpart,oreithersideof
thethewcontigeightisoftheinsertedadjacencyordeleted,edge.Thethersupportemainingisthenpartsprthatoportionalmatchtodothesizecontributeoftheto
matches.espondingcorrgenomesAnotheraretypeinvofersionslarge[26scale].rInthisearrangementscase,usingthatmoreoccurreferfrenceequentlygenomesinprokarlikelyyoticin-
troducesevenmoreconictingedges.Whilesimpleinversionscenariosmightsuc-
cessfullybeuntangledbyexaminingthematches,aseriesofoverlappinginversions
canbeimpossibletoresolve.
invOneersionsissuggestiontoclustertodetectthewhichmatchesofofathecontigcontigstoshofindwtheirmisleadingmainedgesdiagonals,dueforto
examplewithaclusteringbylinearregression[34].Contigsthatcontainsignificant
forwardaswellasreversecomplementarymatchescanbemarkedtowarnthatthe
correspondingedgesmaybeunreliable.

4.4.2Repeat-awareLayouting
Asalreadystated,ameaningfulbiologicalorderofthecontigsmightdifferfroma
linearlayout.Ourconceptofalayoutgraphwithdrawstherestrictionofalinear
layoutbyallowingalternativeadjacencyedges.Thisisinparticularutilizedby
Algorithm2thatallowscycleswhichcanbecausedbyrepetitivecontigs.Still,this
basicalgorithmgeneratesalayoutgraphinwhicheachcontigisincorporatedonly
once.However,arepetitivecontighastobeincludedseveraltimesintoanadequate
layout,sinceitssequenceoccursseveraltimesinthegenome.
Whileotherapproachesoftentrytoavoidrepetitivecontigscompletely(seeSec-
tion6.1.4),wefoundthatorderingnon-repetitivecontigsfirstandaddingconnec-
tionstorepeatslaterseemstobeagoodstrategy.Unfortunately,theinformation

55

Chapter4.AdvancedContigLayoutingusingMultipleReferenceGenomes

aboutrepetitivecontigsisnotdirectlyaccessiblefromthecontigadjacencygraph.
Thermatchesefore,toawereferdescribeencehogenome.wtoAfterinferwarwhichds,wecontigsintroduceareranepetitivenhancedebasedlayonoutingtheirof
thecontigsthatintegratesrepetitivecontigsasoftenasnecessary.

DetectionRepeatrTherepetitiveareesevsequenceseralwaonystothedetectcontigsrthatepetitivareestorcontigs.edinaOnedatabasepossibilityasisdone,toforfindknoexample,wn
btheyproRepeatMaskervided[86sequence].Howdata.ever,Therweeforaime,atwaede-nochosevotoruseepeatthedetectionmatchesthattoaisreferbasedenceon
genomeinordertodistinguishbetweenrepetitiveandnon-repetitivecontigs.We
callthelatterforthesakeofashorternotationfromnowonregularcontigs.
betwUsingeenthecloselyrmatcheselatedtodetectspecies.rSurepeatsely,thisassumesisavthateryrstrepeatingongrassumption,egionsareandconserwevwilled
discussitssensibilityinmoredetaillaterinSection6.2.3.Inaccordancewiththisas-
onlysumption,asinglewereferconsiderenceforgenometherthatepeat-aiswmostarelaycloselyouting,related.insteadofseveralreferences,
rdeterGivenmineasetwhichofmatchesmatchesMarierofepetitivcontige,ci∈andCfrtoomarthiseferwenceederivgenomeewhethergr∈Rthe,wewholefirst
contigcanbeconsideredasrepetitive.
mW=e(scall,sm),=(t,(tsb),se∈),(Mtbr,te)such∈Mthatirarepetitivematchifthereexistsanothermatch
iebeb(i)sthe≥scontigands≤substrings,andofmisincludedinthesubstringofm:
eebb(ii)thematchpositionsonthereferencearenotoverlapping:
{tb,...,te}∩{tb,...,te}=∅.
Sincetheexactpositionsofthematchesmayvary,dependingonthematchingpro-
cedurdefaultewthateisuseaused,valueweofallo10%wforforρ1.condition(i)aslackofρ1timesthelengthofm.By
matchBasedmonofsufthis,wficientespeaklength.ofarSufepetitiveficientcontigmeansifthattheatcontigleasthasρ2atperleastcentofonertheepetitivcontige
isthecovfolloeredwingbylettheCr⊂epetitivCbeethematch:setofser−sbepetitiv≥ρe2∙|c|.contigs.AsWedefaultconsiderwesetallρ2tocontigs90%.thatIn
RareAnnotadvrepetitivantageeoftothisberappregularoach.isthatthematcheswhichareneededtocreatethe
contigadjacencygraphofSection4.1canalsobeusedfordetectingrepeats.How-
rever,epetitivecontigsonthethataremploeryedepetitivreeferonencethenegenome.wlysequencedgenome,arenotnecessarily

56

4.4.VariationsoftheContigLayouting

Toextend,aswellasverify,thepredictionofrepetitivecontigs,onecanusethe
readcoverageinformationobtainedintheassemblyphasethatisdiscussedinSec-
tion2.3.2.Thisevenallowstoestimatehowoftenarepetitivecontighastobe
included.Anotherpossibilitytoverifytherepetitivecontigsistostudytheminthe
tanglestructureofthedeBruijn(sub-)graphofallreads(orcontigs)ofthegenome
tobeassembled[4].

Repeat-awareLayoutAlgorithm
Havingtherepetitivecontigsidentified,weshowhowtousethisinformationto
computeanappropriatelayoutofthecontigs.Tothisend,weadoptAlgorithm2
frompage53tobeawareofrepetitivecontigsandincludethemappropriately.The
overallstrategyistodistinguishbetweenregularandrepetitivecontigsandtopro-
cessbothsetsoneafteranother.Theabsenceofrepetitivecontigsinthefirstset
impliesthatmostcontigsshouldhaveexactlytwoneighbors.Followingthisobser-
vation,wewillbegintocreateasimplelinearchainingofthecontigs.Afterthat,
weexplorehowtherepetitiveelementscanbeintegratedintothisinitiallayoutina
.yawmeaningful

LayoutingtheNon-repetitivePartsoftheGenomeTocreateaninitiallayout
graphoftheregularcontigs,werealizetheiredgesinatwopassprocedure.The
firstpassisbasicallyavariationofAlgorithm2thatcreateslinearchains.Inthe
secondpasswerealizeadditionaledgestocontigsthatwouldbemissedotherwise.
ThecompletealgorithmtocreateaninitiallayoutgraphislistedinAlgorithm3.
Thesinglepartsareexplainedinthefollowing:Inline1ofAlgorithm3,thecontig
adjacencygraphiscreated.Notethatitisbuiltforrepetitiveandregularcontigs,
thustheprocedurestartswithmatchingallcontigsontothegivenreferencegenome.
LikeinAlgorithm1,allpairwisematchesareusedtocalculatetheadjacencysupport
values.However,weintroduceaslightmodificationthathelpstoreducemisleading
edgesforregularcontigs.Asmentionedbefore,contigsaswellasmatchescanbe
repetitive.Itcanhappenthataregularcontighasrepetitivematchesiftheyare
smallenough,forexamplecontigendsareoftenankedbyrepeats.Suchmatches
areignoredintheweightcalculationinordertoavoidmisleadingedgeswithhigh
supportthatarebasedonunreliablerepetitivematches.Ofcourse,forrepetitive
contigsallmatchesareused.
Afterthecontigadjacencygraphhasbeenbuiltforallcontigs,wecreateaninitial
layoutgraphoftheregularcontigs.Startingatline2ofAlgorithm3,weproceed
similarlytoAlgorithm2.Thedifferenceisthatanedgeisrealizedifbothofthe
contigconnectorsarenotincludedyet.Algorithm2requiresonlythatonecontig
connectorisfree.Thus,wegeneratemultiplefragmentsofgoodadjacenciesthatare
ingeneraljoinedtolargerchainsduringthecourseofthealgorithm.Thismatches
theexpectationthatsincewedonotresolverepetitivecontigsatthisstage,theresult

57

Chapter4.AdvancedContigLayoutingusingMultipleReferenceGenomes

Algorithm3:SimpleChainingandExtension
Input:setofcontigsC,setofrepetitivecontigsCR⊂C,referencegenomeg
Output:initiallayoutgraphGLoftheregularcontigs
1createcontigadjacencygraphGC,g=(V,E),asinAlgorithm1,omitting
repetitivematchesforregularcontigs
2createemptylayoutgraphGL=(VL,EL)withVL=∅andEL=∅
3Eregular={{v,v}|contig(v),contig(v)∈/CRandcontig(v)=contig(v)}
edgesegularrfind//4foreache={v,v}∈Eregular,sortedbydecreasingweightw(e)do
5ifVL∩{v,v}=0then
6VL=VL∪{v,v}
7EL=EL∪{e}
end8end9//findadditionalregularedges
10foreache={v,v}∈Eregular\EL,sortedbydecreasingweightw(e)do
11ifehasexactlyonevertexuinVLandSurel(e)>τ1then
12VL=VL∪{v,v}
13EL=EL∪{e}
end14end1516EL=EL∪{li,ri}fortheleftandrightconnectorsli,riofallci∈/CR

shouldbeasetoflinearchainsofthecontigswhichcanalsobepresentintheform
ofoneorseveralcycles.
ifvUperytosmallline9,wcontigsefindlieapprbetweenopriatetwolarneighborsgecontigs,formostthenrweegularsometimescontigs.Hoobserwevveera,
contigsshadowingcaneffecthav,easahigherillustratedsupportinFigurthate4.6shado:Thewstheadjacencyedgewedgeeightstobetwtheeensmallthelarcon-ge
tig.thattheThus,smallthecontigalgorithmisnotwouldincludednotrinealizethelatheyoutweakergraph.edgesThiswithbehatheviorisconsequencegenerally
unwanted,but,aswewillseeinAlgorithm4,itcanbeadvantageousforsmallrepet-
itivecontigs.Thatiswhywedonotabandontheeffect,forexamplebyignoringthe
sizeofthematchesintheweightfunction.Instead,wecompensatetheshadowing
efgoodfectasforthepossible.affectedStartingregularinlinecontigs10bofyAlgorithmintegrating3,wthemelookintoattheallinitialedgeslaynotoutyaset
rlayealizedout.andAlthoughseeifthetheyshadocanwingappendedgeanstaysinunintegratedthelayout,contiginmostconnectorcasestothethecorrinitialect

58

4.4.VariationsoftheContigLayouting

Figurebetween4.6:theShadolargerwingeffcontigsect:cIf1aandsmallc3,incontigthec2iscontigonarefadjacencyerencegraphgenomethecorrectlocated
edgestoc2canhavealowerweightthantheadjacency−c→1−c→3.

edgesfromthesmallcontigwillalsoberealized,resultinginatriangleshapeof
connectedcontigs.Tocontrolthatonlyveryspecificedgesrelareincorporated,wetest
whethertheadditionaledgehasahighrelativesupportSofatleastτ1.
AddingtheRepetitiveContigsStartingwiththeinitiallayoutgraphconstructed
byKnoAlgorithmwingthe3,latheyouttaskofisthenowrtoegularincludecontigstherhelpsepetitivtoeclosecontigstheintogapstheinlaybetwout.een.
sinceRepetitivprimersecontigsinrare,epetitivinecontrast,sequencesnotwwillellbindsuitedforaunspecificallyprimerto-basedseveralclosingregionsofgapson
theNonetheless,genome.weThisbeliervesultsethatinitisvunspecificeryhelpfulPCRprintheoductsfinishingandshouldphaseofthusabeavsequencingoided.
projectforaresearchertobeinformedwhetherrepetitivecontigsinterruptagapof
tworegularcontigs,ornot.Inthecaseofseveralrepetitivecontigsinagap,their
orderplays,toouropinion,onlyasecondaryrolebecausethisinformationcannot
dirthisectlywillprhelpoduceintheevenfinishingmoreprunprocess:edictableIfbothresults.primersarebasedonrepetitivecontigs,
OurnecessaryideabetwineenAlgorithmthecorr4isespondingthereforretoegularplacecontigseachrintoepetitivtheeinitialcontiglayasoutoftengraph.as
thoseConsequentlywhich,connectthearimportantepetitiveedgescontigthatwwithewaranttoegularintegrateone,seeinlineour1ofinitiallayAlgorithmoutar4e.
Wedemandthattherelativesupportoftheseedgeswithrespecttotherepetitive
contigedges.isThehigheredgesthanbetwathreenresholdepetitivτ2.eThiscontigsavoidsarethenotincorconsiderporationedinofthisarbitrarilyapproach,weakas
e.vaboatedmotivaForsuitabletheintercounterestingpartedges,thatiswealsotrytoconnectedfindfortoeachtheinvotherolvedendrofegularthercontigepetitiveconnectorcontig,
asfolloshowingwninobserlinesv4ation:toAs10forillustratedtheleftinFigurconnectors.e4.7,arThisepetitivpreocedurcontigeisc∈basedCRonusuallythe
hasseveralgoodedgesforitsrightanditsleftconnectorleadingtodifferentregular
contigs.Theproblemistodeterminewhichedgesbelongtoaparticularrepeat
roccurregularencecontigonortherdering,eferencebecomesgenome.hereTheanadvshadowantage.effect,InthewhichwexampleasanofobstacleFigurefor4.7,
cc51btheytheadjacencyoccurrence−→−→ofhastheraelativhighelysupport,smallriftheepeatingcontigscontigc1c.andThus,c5aretheonlyobjectiveseparatedisto

59

Chapter4.AdvancedContigLayoutingusingMultipleReferenceGenomes

Algorithm4:IntegrationofRepetitiveContigs
Input:setofcontigsC,setofrepetitivecontigsCR⊂C,contigadjacency
graphG=(V,E),initiallayoutgraphGL
Output:repeat-awarelayoutgraphGLofthecontigs
1letErep={{v,v}|contig(v)∈CR,contig(v)∈/CRandSvrel({v,v})>τ2}
2foreachedgee∈Erep,sortedbydecreasingweightw(e)do
3ife={v1,l}containstheleftconnectorlofacontigc∈CRthen
4letrbetherightconnectorofcontigc
5ifexistsv2=argmaxv∈Vw({v1,v})|{r,v}∈Erepthen
6duplicatelandrtolandr
7VL=VL∪{v1,l,r,v2}
8EL=EL∪{{v1,l},{l,r},{r,v2}}
9remove{v1,l}and{r,v2}fromErep
end1011else//e={r,v2}containstherightconnectorrofacontigc∈CR
12performlines4to10analogously
end13end14

Figure4.7:Typicalscenariofortheadjacencyedgesofarepetitivecontigc∈CR.The
dashedlinesdepictthebestedgefromacontigconnectorontherighttoacontig
left.theonconnector

searchforanyregularnodethatisconnectedtoeithersideofarepetitivecontig,a
suitablecounterpartthatisconnectedtotheotherside,suchthattheedgefromthe
nodetothecounterparthasthehighestweight.
Thisway,wefindforeachsignificantoccurrenceofarepetitivecontigthetwosur-
roundingregularcontigswithrespecttothereferencegenome.Foralloccurrences,
weaddtwonewconnectorsoftherepetitivecontigandtheappropriateedgesto
graph.outylatheAssumingthatareferencegenomeishighlyrelated,thevariationofthebasic
layoutingalgorithmintroducedinthissectionhelpsintwoways:First,itdetects
contigsthatarerepetitiveandmarksthemintheoutputtoavoidmisinterpretation.
Second,Algorithm4includestherepetitivecontigsintoapreviouslygenerated

60

4.4.VariationsoftheContigLayouting

initiallayoutsuchthatacopywillbeplacedbetweentworegularcontigsforwhich
thecontigadjacencygraphsuggestsarepeatoccurrence.Unfortunately,itisnot
possiblewiththisapproachtoproposeanorderoftherepetitivecontigs,ifseveral
areincludedbetweentworegularones.
Dealingwiththerepetitivecontigsisahardtaskandusuallyotheralgorithms
filtertheminordertonotgetconfused.Inourbasiclayoutingalgorithm,wedonot
filterrepetitivecontigs.However,theyleadtoconictinginformationintheresult-
inglayoutgraph.Withthevariationdescribedinthissection,weaimatresolving
theseconicts.Theproposedalgorithmsincluderepetitivecontigsmoreappropri-
atelyandfindalinearlayoutfortheregularcontigs.Thecontentsofthissection
havebeenpublished[43],andwerefertotheimplementationofthealgorithmsin
theevaluationchapterasrepcat,therepeat-awarecontigarrangementtool.

Wehopethatthepresentedlayoutingalgorithmsofthischapterarehelpfulin
agapclosingprocess,althoughweareawarethattheydonotproduceaperfect
layoutofthecontigs.Thismightinsomecasesevenbeimpossiblewhenrelying
onlyonrelatedreferencegenomestofindthelayout.Thatiswhywewanttoappeal
tothecommonsenseofaresearchertoseetheprocessoflayoutingmorelikean
experimentinalabinsteadofanimpeccablealgorithmthatfindsthetruelayoutof
thecontigs.Thealgorithmsproposelikelyadjacenciesthatarebasedonthedata,
buttheymayfailtogiveauniquelinearlayoutofthecontigs.Nevertheless,when
comparingourimplementationwithotherrelatedprogramsinChapter6,weshow
thatourpredictionsprovidecompetitive,andoftenevenbetterresults.

61

Chapter

62

4.

ancedAdv

Contig

outingyLa

using

Multiple

erenceRef

Genomes

5Chapter

SoftwaretheofRealization

ThesoftwarecontainingtheideasandalgorithmsofChapters3and4evolvedgrad-
uallyuntilitreachedthecurrentstate.Insteadofjustdescribingthisstate,wewant
tosketchthemostimportantstepsofthisdevelopment.Toouropinion,thisex-
plainsbesthowthecurrentfeaturesemergedandwhichdesigndecisionshavebeen
taken.Thesecondpartofthischapterbrieyintroducessomeexternalprograms
andlibrariesthatareemployedwithin,orincombinationwith,oursoftware.

MilestonesImplementational5.1

InthementationofbeginningtheofSWIFTthisPh.algorithmD.project,(seeStheectionidea3.2.3arose)totomatchuseancontigsexistingontoC++arelatedimple-
referencegenometoaidintheprocessofgenomefinishing.Thisimplementation
wtheasneedfastercamethanupBLASTto,andvisualizeitperthemittedmatchestoinalsoordermatchtoveryinspectlargethercontigs.eliabilityofNaturallythe,
referencegenomes,andtoinferinformationabouttheorderandorientationofthe
contigs.Thisneedeventuallyledtothebirthofr2cat.

r2cat5.1.1Fortheimplementationofr2cat,wechosetheprogramminglanguageJava,mainly
becauseoftworeasons:Firstly,Javaisahighlyplatformindependentlanguagesuch
thatoursoftwarerunsonWindows,Linux,MacOS,andalsoonSolariswhichisthe
CeBiTecdefaultenvironment.ThesecondreasonisthatJavafeaturesdefaultgraph-
icalcapabilitiesduetotheincludedgraphicaluserinterface(GUI)librarySwing.
Accordingly,afirstprototypeofr2catusedSwingtoprovideabasicdotplotvi-
sualizationformatchesthatwerepreviouslygeneratedusingtheabovementioned
SWIFTC++implementationthatwasdevelopedbyKimRasmussen.Atthisstage,
wealsodevisedandimplementedarudimentaryorderingofthecontigsbasedon

63

Chapter5.RealizationoftheSoftware

theirnome.Hocenterwevofer,massthiswhichconceptishadinafeprinciplewawstheandmeansuchofwalleadoptedmatchesanonaideareferofenceJochenge-
BlomwhichresultedinthesimplecontigmappingasexplainedinSection3.3.3.
Thenextstepinthedevelopmentwasdrivenbythewishtocreateastand-alone
applicationformatchingandvisualizationthatwasindependentofcallingexternal
programs.Consequently,thematchingideaofSWIFTwasreimplementedinJava.
Atthispointintime,wedecidedtolicensethesoftwareunderthegeneralpublic
Thislicense(decisionGPL)isthatbasedallowsontheuserstobeliefobtainthattheandscientificmodifythesourcommunitycecodeprfrofitseefrofomcharsharge.-
ingandreusingcode.Additionally,providingalsothecodeallowsotherresearchers
tocomprehendandalsotestourapproachesmoredeeplythanwithanexecutable
alone.applicationEarlyusersofthesoftwarerequestedadditionalfeatures,likeforexamplethe
possibilitytoexporttheorderedcontigsintoFASTAfiles,ortosavepicturesofthe
displayedsyntenyplots.Theseandseveralotherusabilityfeatures,liketheones
introducedinSection3.3,havebeenimplementedovertime.Forexportingthe
syntenyplots,weusedexistingcodeoftheopensourceprojectFreeHEPthatwill
beintroducedinSection5.2.1.ExemplaryplotswerealreadyshowninSection3.3.2.
Thecompleteapplicationofr2cat,includingtheFreeHEPgraphicscode,canbe
packedinasingleJavaArchive(JAR)filewithasizeoflessthanonemegabyte.
Usingthisarchive,theprogramcanbestartedwiththeJavaWebStartTechnology
withouttheneedforaninstallation.
In2009,wewroteanapplicationsnote[42]introducingr2cat,andmadetheimple-1
Latermentation,oneaofvourailableonbachelorthestudents,BielefeldYvUnivonneersityHerrBioinformann,maticsextendedServerr2cat(BiBiSwitherv).code
toautomaticallydesignprimerpairs.ThecodeisbasedonaPerlscriptthatwas
developedbyJochenBlomandChristianRückerttoaidthefinishingofin-house
ojects.prsequencing

treecat5.1.2Simultaneouslywiththelatestdevelopmentsofr2cat,wealsoworkedontheideas
describedinChapter4.Here,itwashelpfulthatthealreadyimplementedmatching
routineofr2catcouldbereusedtocomputethematchestothereferencegenomes.
Untilmid2009,weimplementedthebasiccontigadjacencygraphcreationandthe
firstlayoutingalgorithms.Additionallytothelayoutingheuristicproposedinthis
thesis,wealsoimplementedanexactbranch-and-boundmethodwhich,however,
becomestooslowformorethanadozenofcontigs.
Subsequently,theideasoftreecatwerepublished[41],andtheimplementation
wasmadeavailabletobestartedwithJavaWebStartfromtheBiBiServ.2
12http://bibiserhttp://bibiservv.techfak.uni-.techfak.uni-bielefeld.de/trbielefeld.de/r2cateecat

64

MilestonesImplementational5.1.

ThesoftwaretreecatfeaturesabasicGUIwheretheusercanselectaFASTAfileof
thecontigs,andforseveralrelatedreferencegenomes.Additionally,aphylogenetic
trTheeeinmatchesNewicktotheforrmatefercanencebegenomesspecifiedarethatcachedcontainstoavtheoidanotherdistancestimeoftheconsumingspecies.
matchingmatching,ifthethecontigalgorithmadjacencyisrerun,graphforisexampleconstructedwithlikedifferentdescribedinparameters.SectionAfter4.1,
andthelayoutingisperformedusingAlgorithm2onpage53.
Tovisualizethegeneratedlayoutgraphs,weinitiallyappliedtheGraphvizpack-
age(seeSection5.2.2).Tothisend,thecomputedlayoutgraphwasoutputina
textualformat–theGraphvizDOTlanguage–andthenconvertedtoagraphical
representationusinganexternalprogram.
contigsHowevander,theconnectionsGraphvizresultsinvisualizationoverlappinghassomeedgesseverandedranodes,wbacks:whichAnrendersexcesstheof
graphunreadablesincetheoutputisstatic.Consequentlyinteractivevisualizations
havebeendevelopedintwobachelorprojectswiththegoaltomakethetreecatlayout
graphsmoreaccessibleanduserfriendly:
1.ChristianMieledevelopedanintegrationoftheprefusegraphdrawingtoolkit
(seeSection5.2.4)asareplacementoftheexternalGraphvizvisualization.A
layoutgraphcreatedbytreecatcanthusinteractivelybeassessed:Nodescan
bemoved,edgescanbeselected,andthegraphcanbezoomedandpanned.
2.AnnicaSeidelimplementedalocalviewtonavigatethroughthecomplete
contigadjacencygraph.Tothisend,asinglecontigisdisplayedascenter,
andonbothsidestheadjacencyedgestoothercontigsareshown,sortedby
bytheirclickingadjacencyonthesupport.contigs.AuserAdditionallycan,interactivedgeselycantrabeversemarkedthrasoughprtheomising.graph
trBotheecat.Theextensionssuccessorwerehtscatfinished(seeinSearlyection20115.1.4)butincludesareonlythemintentativaelycommonattachedframe-to
ork.w

repcat5.1.3Byanalyzingtheresultsoftreecat,werealizedthatrepetitivecontigsposeaproblem
inlayouting.Consequently,westartedinthebeginningof2010toimplementideas
howtohandlerepetitivecontigs.Tothisend,wemodifiedtheexistingcodeof
treecatwhichresultedinaprototypeofrepcat.Infact,mostcodeisverysimilar,
thechangesaffectmainlythelayoutingandpartiallythecontigadjacencygraph
creation.Theideasofrepcatwerepublished[43],buttodatetheimplementation
isnotofficiallyreleasedasatool.Reasonsforthisdecisioncanbefoundinthe
evaluationofrepcatinSection6.2.3.

65

Chapter5.RealizationoftheSoftware

htscat5.1.4

Inlate2010,westartedtocombinetheexistingtoolsintoauniformapplication:The
softwarehtscat,thehighthroughputsequencingcontigarrangementtoolsuite,is
designedtobeanextensibleframeworkthatincludesdifferentmethodsandalgo-
rithmstolayoutasetofcontigsinordertohelpinthefinishingprocessofprokary-
ojects.prsequencinggenomeoticThemodularframeworksimplifiesapossibleextensionwithfurthercodeof
otherdevelopers.Initialeffortsweretakentointegratearepeatresolvingapproach
(Ph.D.projectofPatrickSchwientek)thatisnotbasedonrelatedreferencesbuton
informationalreadyprovidedintheassemblyphase.Currently,thesourcecodes
ofbothprojectsarejoinedinasinglerepository,3andsomematchingpartsofr2cat
havebeenintegratedintotheotherproject.
Untilnow,htscatcombinesthefunctionalityofr2catandtreecatinasingledesktop
application.TheorganigramshowninFigure5.1givesanoverviewaboutthein-
corporatedcomponents.Ther2catcodebaseprovidesthematchingroutineandthe
visualizationofsyntenyplots.Duetothetreecatsources,weaddthecapabilitiesto
handlemultiplereferencegenomessimultaneously.Thebachelorprojectssupport
thiswithaninteractivevisualizationofthelayoutgraphs.BesidesprefuseandFree-
HEPasexternallibraries,weuseinhtscattheNetBeansPlatformframeworkthat
willbeintroducedinSection5.2.3.Itcontributesgreatlytotheusabilityanduser
friendlinessoftheGUIandalsosupportstowritemodularcodeforasimplified
application.theofextensibilityAscreeenshotoftherunningapplicationinFigure5.2onpage68showsthemost
importantcomponents:Thepaneldisplayedonthelefthelpstoorganizedifferent
sequencingprojects.Inafolderlikeview,foreachcontigset,thereferencegenomes
areshownonwhichthecontigswerematched.Besidescreatingnewprojectsina
wizarddialog,newreferencegenomescanbespecifiedforanyopenedproject.
Thematchestoeachreferencegenomearevisualizedwithther2catcode,asdis-
playedinthecenteroftheapplicationwindow.Oneorseveralreferencesofaproject
canbeselectedtocreateacontigadjacencygraph.Theresultinglayoutgraphpro-
ducedbythetreecatcodeisvisualizedwiththeimplementationsofthetwobachelor
projectsthatwerementionedabove.InthegivenFigure5.2onpage68,thesevisual
componentsareshownontherightandatthebottomofthewindow.
Allcomponentsofthewindowcanbefreelyreorderedtoadapttotheusers
needs.Fordualdisplayuse,eachcomponentcanbeundockedtoanindependent
window.ThankstotheNetBeansPlatform,allchangesarerememberedandre-
storedwhentheapplicationisstarted.

3ge.net/ceforhttp://htscat.svn.sour

66

Figure

5.1:

erOv

view

of

the

htscat

aresoftw

pictedwithdarkblueboxes,orange

dependencies

or

processing.

esxbo

5.1.

.ucturestr

erref

to

Implementational

rProgam

results

or

Milestones

components

output.

wsArro

are

de-

indicate

67

Chapter5.RealizationoftheSoftware

Figure5.2:Screenshotofhtscat.Top,fromlefttoright:Projectmanagement,r2cat
syntenyplotvisualization,layoutgraphcreatedbytreecat.Bottom:Localvisualiza-
tionofthecontigadjacencygraph.

External5.2LibrariesandSoftware

Inthissection,wedescribethelibrariesandexternalsoftwarepackagesthatweuse
withinthehtscatframework.Mostinformationaboutthementionedfeatureswas
gatheredfromthespecifiedprojectwebsites.

tExporGraphicsFreeHEP5.2.1FreeHEP4isaJavalibrarythatislicensedunderthetermsofthelessergeneral
publiclicense(LGPL)suchthatitcanbesharedandreused.Althoughthelibrary
wasactuallydevelopedfortheveryspecificsubjectofhighenergyphysics(HEP),it
alsocontainssourcecodewithabroaderscope.Inparticularusefulforourproject
isthevectorgraphicspackagethatallowstoexportSwingcomponentstoavariety
ofdifferentgraphicsformats.TheseincludethevectorbasedformatsEPS,PDF,PS,
andSVG,butalsobitmapimageslikeGIForPNGareprovided.Thereiseven
anexporttoTEXthatdrawsagivencomponentwithPSTricks.Weintegratedthat
thepartaboofvetheFrmentionedeeHEPforcodemats.intoourEspeciallyprojectthewhichSVGalloforwsmattoisexportusefulsincesyntenyitisplotseasilyto
4http://java.freehep.org/

68

5.2.ExternalSoftwareandLibraries

editableandprovidesanexcellentpossibilitytocreatehighquality,vectorbased
publications.forgraphics

Graphviz5.2.2TheopensourcesoftwarepackageGraphviz5canbeusedtovisualizegraphs[33]
thatareprovidedinatextualrepresentation,thesocalledDOTlanguage.This
languageeasesthedescriptionofgraphstructureswhichcanbespecifiedbysimply
providingallnodesandedges.TheprogramsoftheGraphvizpackagethenperform
alayoutingtocreateavisualrepresentationofthegraph.Notethatinthiscontext,
layoutingreferstomovingthecontentsofthegraphsuchthatitideallyhasno
overlappingedgesornodes.Withthisaim,differentprogramsofthepackageare
specializedfordifferentgraphtypes.Themostprominentisperhapsdotwhich
layoutshierarchicaldatarepresentedindirectedgraphs.Forourpurpose,however,
neatoisbettersuitedthatlayoutsundirectedgraphsaccordingtoaspringmodel.
AnexampleofsuchagraphisshowninFigure6.9onpage89.

ormPlatfNetBeans5.2.3TheNetBeansPlatform6isaframeworkthatwasoriginallyprogrammedforthe
NetBeansIntegratedDevelopmentEnvironment(IDE),butcanbeusedmoregener-
icallyinthedevelopmentofSwingapplications.Thankstoamodulararchitec-
ture,manyfunctionscanbereusedinowndesktopapplicationswhichimproves
theusabilityoftheapplicationwithoutaddingmuchoverhead.Themainreasons
whywedecidedtousetheNetBeansPlatformforhtscatareanadvancedwindow
managementwithintheapplication,apowerfullookupconceptforintraprocess
communication,existingcodeforprojectmanagement,andamodulemanagement
withloosecouplingofthecomponentsthatallowstoeasilydevelopanddistribute
extensionstothesoftware.

VisualizationGraphPrefuse5.2.4Theprefusevisualizationtoolkit7containsalibrarytodisplaygraphlikedatawithin
Javaprograms[39].Inoursoftware,weuseitasaninteractivereplacementofthe
Graphvizvisualization.Itsupportspanningandzoomingintothegraphandis
extensibleenoughtovisualizecustomnodes.Additionally,severaldifferentpro-
cedurescanbeusedtoplacethenodes,likeforexampleaforcebasedlayouting
comparabletothespringmodelinGraphviz.

65http://wwwhttp://netbeans.or.graphviz.org/featurg/es/platform/
7g/efuse.orhttp://pr

69

Chapter

70

5.

Realization

of

the

Software

6Chapter

LayoutingCorynebacteriaContigs

Thisquencingchapterdata.Afterdemonstratesintroducingtheperforthemancedatasetsofandourprdiscussingoposedprmethodseparatoryonrstepsealinse-
Section6.1,weevaluateandcomparethecapabilitiesofseverallayoutingprograms
otherusingdifsingleferrentefercontigencesetsbasedinlaSyectionouting6.2.apprMoreoaches,specificallyandtr,eecatr2catisiscomparcomparededwithwitha
relatedapproachthatalsousesmultiplereferences.Intheend,weassesstheability
ofrepcattocopewithrepetitivecontigs.

6.1BackgroundandPreparatorySteps

Thiserationsectionofargiveferesaenceladetailedyoutfordescriptionthecontigs,ofthedescribesemployedthetestdata,estimationofexplainsthetheparam-gen-
etersforourcontigadjacencygraphcreation,andfinallyintroducesotherrelated
programsforcontiglayouting.

DatasetstheofDescription6.1.1Forourevaluation,weuseddatafromsequencingprojectsconductedattheCeBiTec.
TheassembledcontigsthatweusedforourstudieswerekindlyprovidedbyAn-
dreasTauchandhisgroup.Allgenomicsequencesinvolvedinthisevaluation
belongtothegenusCorynebacteriumthatiscomprisedofGram-positiverod-shaped
eubacteriawhichtypicallyhaveahighGCcontent.
ThoughthehabitatsandpropertiesoftheCorynebacteriaarediverse,theyare
intensivelyinvestigatedfortwomajorreasons:Firstly,thegenuscontainsseveral
pathogenspecies,forexampleCorynebacteriumdiphtheriaewhichcausesthesevere
diseasediphtheriainhumans.Thesecondmotivationthatdrivestheresearchinthis
fieldistheindustrialutilizationofCorynebacteria,forinstanceintheproductionof

71

Chapter6.LayoutingCorynebacteriaContigs

Table6.1:Thethreecontigsetsusedintheevaluationexperiments.Thenumber
ofmatchingrepetitivtheecontigscontigs,asontotheirdefinedinalreadySectionfinished4.4.2,genomeswas.determinedwithr2catby
ContigOrganism(##RepetitivContigse)LengthTotal(bp)N50Size(bp)Contig
C.aurimucosumATCC70097573(15)2,736,23382,833
CC..urealyticumkroppenstedtiiDSMDSM710944385696(1)(15)2,294,7552,434,342546,37686,391

usedaminoinacidsthemassandprnucleotides.oductionofglutamateCorynebacteriumfortheglutamicumfoodisindustray.primeexamplethatis
SeveralspeciesoftheCorynebacteriumgenushavebeencompletelysequenced
bprynoojects.w.CorDuetoynebacteriatheirdiversitygenomes,howareever,thusthereperfectlyarealsosuitedmanytodelivongoingertestsequencingdatasets
rforeferourences,evandaluation:newTheresequencingareenoughdataarecloselyprroducedelatedregularlyfinished.genomesavailableas
Fortheevaluation,wepreparedthreedatasets,eachconsistingofasetofcontigs
tobelayoutedandasetofreferencegenomeswhicharerelatedtothecontigs
genome.Thethreecontigsetsweresequencedandassembledfromthespeciesof
C.aurimucosum[94],C.urealyticum[91],andC.kroppenstedtii[90],respectively.The
completegenomesofthesespecieshavealreadybeenfinishedandareavailable
fromtheNCBIwebsite.
Wediscardedallcontigsoftheoriginalassemblywithasizeoflessthan500base
waspairs,rtakenesultingsinceinsmallthesetscontigsshownthatinTconsistableof6.1.onlyThistwosteporisthraeercommoneadsarepracticenotverandy
thatinforisamativmore,erbutobustcanbeverycharacterizationconfusingforinthethesizemappingdistributionprocess.ofThecontigN50setscontigthansizethe
Tmeanable3.1oronmedian,page36doesnotcontainschangetheduedefinitiontothisoftheaction.N50Asacontigremindersize.,thecaptionof
WhileC.aurimucosumandC.urealyticumconsistofafewdozensofcontigs,and
requiredsomeefforttobeclosed,C.kroppenstedtiiisaspecialcase.Here,theini-
tialsequencingandassemblyresultedinstantlyinfiveverylargecontigs,ranging
from150–850kbp.Inthiscase,thefinishingwasratherstraightforward,andref-
erencebasedmethodswerenotnecessary.Nevertheless,weincludedthisdataset
intheevaluationtoshowthatseeminglysimpletasksforbiologistsmightbemore
complexforcomputerprogramswhenrelyingonthegivendataonly.
ishedAsrefergenomesenceweregenomesusedforandlaywereoutingextendedthebcontigs,ythechoosingthreefouraboveadditionalmentionedpubliclyfin-
aC.vailablejeikeium,Corthatweynebacteriadownloadedgenomes,fromC.thediphtheriaeNCBIw,C.ebsite.efficiensThe,C.completeglutamicumsetof,referand-

72

6.1.BackgroundandPreparatorySteps

Table6.2:Overviewofthereferencegenomesthatweemployedinourevaluation.All
sequencesbelongtotheCorynebacteriumgenus.
ReferenceGenomeRepliconTypeLength(bp)NCBINumber
C.aurimucosumATCC700975chromosome2,790,189NC_012590
C.diphtheriaeNCTC13129chromosome2,488,635NC_002935
C.efficiensYS-314chromosome3,147,090NC_004369
C.glutamicumATCC13032chromosome3,282,708NC_006958
C.jeikeiumK411chromosome2,462,499NC_007164
C.jeikeiumK411plasmidpKW414,323NC_003080
C.kroppenstedtiiDSM44385chromosome2,446,804NC_012704
C.urealyticumDSM7109chromosome2,369,219NC_010545

Figure6.1:PhylogenetictreeoftheinvolvedCorynebacteria.Forallspeciesmarked
withanasterisk(*)theunderlyingcontigdatawereavailable.Thetreewascalcu-
latedwithEDGAR[15]andvisualizedwithTreeVector[73].

theenceDNAgenomes,moleculesincludingofalltheirmentionedaccessionreferencenumbers,sequencesisshownoccurinTasablecir6.2cular.rNoteepliconsthat
bacteria.theinToassesstheevolutionaryrelationshipsoftheinvolvedspecies,wegenerateda
borphylogeneticJoining[81tr]eetoofthedistancesspeciesofawithsettheofcoreEDGARgenes.frameThewrork[esulting15]thattreeisappliesshownNeigh-in
Figure6.1.Foramoredetailedillustrationforthevaryingdegreeofrearrangements
andplotsthatsyntenywerebetwalreeneadythegivenemploinyedChapterspecies3:Figurleteus3.7roneconsiderpage41someofexemplarilytheshosyntenyws
foursyntenyplotsinvolvingtheC.urealyticumcontigs.WhileFigure3.7(a)shows
aFigurhighe3.7degr(d)eeshoofwslosyntenywandsyntenyonlyfecombinedwrwithearrangementsmanymajortotherC.earrangementsjeikeiumingenome,the
C.butdifaurimucosumferinglevelsgenome.ofsyntenyFigureswith3.7r(b)espectandto3.7(c)thershoeferwaencesimilargenomesinvofersionC.patterefficiensn
.diphtheriaeC.and

73

Chapter6.LayoutingCorynebacteriaContigs

6.1.2DeterminingaReferenceLayout
Thecomputeavaailabilityreferenceofthelayoutfinishedwhichcangenomicbeusedsequencesasaofstandarthedofcontigtruthsetswhenenablescomparusto-
ingthelayoutsgeneratedbythedifferentprograms.
Thereferencelayoutforeachsetofcontigswasdevisedbymappingthemwith
r2catontotheircorrespondingfinishedgenome.ThreeoftheC.aurimucosumcontigs,
withasizeof28kbpintotal,didnotmatchonthefinishedgenomeandcould
thussequencesnotbebelongincludedtotheintoC.thereferaurimucosumencelayplasmidout.ThepET44827.explanationforthisisthatthe
Intheprocessofmapping,r2catrevealedthatallcontigsetscontainrepetitivecon-
latigs;youttheircreatedquantitybyissimplegiveninmappingTableis6.1not.Duetonecessarilytheserreliable.epetitiveIncontigs,general,arreferepetitivencee
tualcontigsplacementmapinnon-uniquelythelineartolayoutmultipleofr2catlocationsismeronelythebygenome.chanceHosincewevaerr,theirepetitivac-e
contigisplacedonasingleoccurrencewhereithasthemostmatches.Whetheran
adjacencyoftworegularcontigsisinterspersedwithrepetitivecontigsisthusonly
chance.ofmatteraToaccountforthis,andtoeasetheevaluation,werelabeledthecontigsofeach
dataset.Weprefixedallregularcontigswithacandallrepetitivecontigswith
ansuchr.thatTheadjacregularenciescancontigseasilywerebethenseen.numberRepetitivedecontigsconsecutivwelyereinnumbertheiredtruetoo,orderbut
onlytodistinguishthem.Theirarbitrarynumbersdonotcontainanyinformation
adjacencies.aboutset:ForThethefirstfollofilewingecontainsvaluation,allrwegularecreatedcontigstwofothermultipleeferFenceASTlaAyoutfilesinfortheireachoriginalcontig
orderandorientation,howeverwiththeFASTAidentifiersrenamedasstatedabove.
Thisallowstoexaminewhethertheprogramspredictthecorrectadjacenciesofthe
regularcontigs.Atthesametime,thefiledoesnotcontainanyfurtherhintsdue
topre-orderedoralreadyorientedcontigs.Sincetherepetitivecontigsareremoved,
theforth.prTheogramssecondideallyfilerisecocrvereatedtheliketruewise,layoutbutitwhichalsoisc00contains,c01,thec02r,c03epetitiv,...e,andcontigs.so
Thisfilewasusedintheevaluationofrepcatthatisspecificallydesignedtohandle
contigs.these

6.1.3ParameterEstimationfortheContigAdjacencyGraph
Tocomputealayoutforasetofcontigsbasedonmultiplereferencegenomes,we
utilizethecontigadjacencygraphintroducedinSection4.1.Theweightsinthis
graphdependontwoparametersofthescoringfunctiongiveninEquation(4.5):
Themeanµ,andthestandarddeviationσoftheGaussiandistributionthatmodels
thedistancesofprojectedcontigs.

74

6.1.BackgroundandPreparatorySteps

Table6.3:Contigsandfinishedgenomethatweusedtoestimateparametersforthe
contigadjacencygraphcreation.
#ContigsTotalN50Contig
ContigOrganism(#Repetitive)Length(bp)Size(bp)
C.pseudotuberculosisFRC4187(1)2,315,33750032

GenomeFinishedFRC41pseudotuberculosis.C

(##RepetitivContigse)LengthTotal(bp)N50SizeContig(bp)
500322,315,337(1)87

RepliconTypeLength(bp)NCBINumber
NC_0143292,337,913chromosome

setInthatorisdertoexplicitlyfindsuitableomittedvinaluestheforevthesealuationparameters,experimentswesuchacquirthatedtheanotherestimationcontig
islargelyindependentoftheevaluationcontigs.WeusedcontigsofCorynebacterium
pseudotuberculosis,andagainfilteredthemforaminimumlengthof500bases.Ta-
ble6.3containsmoredetailedinformationabouttheresultingcontigset,andalso
aboutthecompletegenomewhichhasrecentlybeenfinished[95].
Withthementionedgenomicsequences,weestimatedµandσasfollows:First,
wecontigsdevisedasardescribedeferenceinlaytheoutprofetheviouscontigssection.andIncrtheeatednextaFASTstep,Afiletheofcontigstherwegularere
matchedonallgenomesofTable6.2.Then,foreachgenome,thepairwisedistances
ofprojectedcontigswerecalculated.Sinceweknowfromthereferencelayoutwhich
tocontigstrulyareadjacentadjacent,contigs.wecanTheinvestigatehistogramallofdistancesFigureof6.2prshoojectedwsthesecontigsdistances,thatbelongand
wecanobservethat,besidesoutliers,mostprojectedcontigshaveasmalldistance.
Reasonsforoutliersaremainlyrearrangementsinthereferencegenomes,repetitive
substringsinthecontigs,orunspecificmatchesoccurringbychance.
tribution,Whilethethemainoutliersfractionhinderofatheproperdistancesestimationfollowsofapprµandσoximately.aConsequentlyGaussian,wdis-e
trytoexcludethembydiscardingalldistanceslargerthanthethirdquartile(Q3=
20,906bases)ofthedata.Toalsoremoveoutliersontheotherside,whileaccount-
ingfortheasymmetryofthehistogram,weclipthedatasymmetricallywithrespect
toQ2−theQ3=median−16,(Q0982=are4,808discarbases)ded.ofThetheresultdistances.oftheThus,describedallvoutlieraluesrsmallereductionthanis
shownashistograminFigure6.3.
Onthetrimmeddata,weperformedthenamaximum-likelihoodfittingtoa
Gaussiandistribution.Roundedtointegers,thefittingresultedinameanvalue
ofµ=3,000andastandarddeviationofσ=6,926.ThecorrespondingGaussian
functionwiththeestimatedparametersisalsoplottedinFigure6.3.

75

Chapter6.LayoutingCorynebacteriaContigs

76

Figure6.2:Completehistogramoverthedistancesofprojectedcontigsthatareadja-
centaccordingtothereferencelayout.

Figurewithµ6.3:=3,000Rangeandσselected=f6,926.orparameterestimationandresultingGaussianfunction

Figureofthe6.4:rankBoofx-plotstheofcontigsthewithdistancesrespectoftotheprojectedreferencecontigslaygout.roupedbythedifference

6.1.BackgroundandPreparatorySteps

thereforCertainlye,alsotheonchoicetheofedgesµprandσedicted.hasanTheefdiffectonficultythetocontigfindsensibleadjacencygraph,parametersandis
illustratedinFigure6.4.Itshowsbox-plotsofthedistancesofprojectedcontigs
–notonlyfortrulyadjacentcontigsthathaveadifferenceintheirrankinthe
referencelayoutbyone–butalsoforcontigsthathavearankdifferenceoftwo,
three,andfour.Thelattercorrespondtoapproximateadjacenciesofcontigsthathave
one,histogramtwo,orinthrFigureee6.2contigsreprinesentbetwtheeen.sameNotedata,thathothewever,box-plottheforbox-plotrankdisplaoneysandonlythe
thereducedrangeofthedistancesfrom-50to450kilobasestoemphasizetheinter
ranges.quartilefromAnthoseidealofallscoringotherfunctionranks,hasmeaningtheprthatopertythatimmediaterankoneneighborsdistancesreceivearehighseparatedscores
andFigurethe6.4,latterthislowisscoronlyes.Aspossiblecantobeaseencertaininthedegreebox-plotsonthisfordata,rankonesinceandthetwointerof
quartilerangesofbothdistributionsoverlap.
Ingeneral,thegraphcreationandthelayoutingisrelativelyrobusttosmaller
widerchangesrangeoftheofdistancesparameters.getIf,highhowscoreveres.,theThisrwidthesultsσinisaincrstreasedongertooshadomuch,wingefthenfecta
sinceforexampleapartoftheranktwodistanceswillalsogethighscores.Onthe
contrary,averysmallvalueforσconsidersonlyasmallrangeofdistancesandthus
scoresonlyafractionofthecorrectdistancesadequately.
Thechoiceofµdependsontheexpectationshowbiggapsbetweenthecontigs
are.change,If,forandexample,consequentlysmallµcontigsshouldarbeerincremoveased.edfrHoomweaver,contigtoolarset,gethevaluesgapforsizesµ
favoradjacenciesbetweencontigsthatactuallyhaveasmallcontiginbetween.
Weexperimentedwiththeestimatedvaluesofµ=3,000andσ=6,926alittle
andfoundoutthattheyactuallyyieldquitereasonableresults.Therefore,wede-
cidedtousethederivedparametersforallapplicationsoftreecatintheevaluation
experimentsinSection6.2.

6.1.4OtherSoftwareforContigLayouting
Wefoundseveralprogramsintheliteraturethatareabletocomputealayoutofa
setofcontigsbasedonrelatedgenomes.Inthefollowing,weintroducethepro-
gramswhichweusedtocompareourlayoutingperformancewith.First,thesingle
referencebasedapproachesaredescribedinchronologicalorder,followedbythe
onlyotherapproachtohandlemultiplereferences.

MethodsBasederenceRefSingleProjector2ThewebserviceProjector2[40]mapscontigendsontoatemplatege-
nomeusingBLASTorBLAT.FeaturesofProjector2areanoptionalrepeatmasking

77

Chapter6.LayoutingCorynebacteriaContigs

forcontigandtemplatesequences,avisualizationofthemapping,andanauto-
matedprimerdesignstepforgap-closingpurposes.Priortotheautomatedprimer
design,difficultregionsforprimerwalkingareremoved.Theseincludesequences
withanunbalancedGCcontent,orrepetitivesequenceslikephageDNA,ISelements
duplications.geneor

theOSLaycontigsTheandprarogrameferenceOSLay[79sequence]takesorascafsetfold,ofandBLASTcoromputesnucmerfrommatchestheseabetwlayouteen
ofthecontigs.Tothisend,thealgorithmminimizesheightdifferencesofso-called
localdiagonalextensions,whicharebasicallymatchesfromtheborderofacontig
tothereferencesequence.Ifthereferenceisascaffold,thentheprogramoptionally
layoutsbothsetsofsequencessimultaneously.Aresultinglayoutisvisualizedand
canbeimportedintoaConsed[35]projecttoaidgapclosure.

ABACASTheacronymABACAS[7]standsforalgorithm-basedautomaticcon-
tiguationofassembledsequences.Theauthorsreferbycontiguationtotheprocess
ofaligning,ordering,andorientingasetofcontigs.Thesetasksaremainlydone
withtoolsoftheMUMmerpackage:Thealignersnucmerorpromerofthispackage
areusedformatching,andthenthedelta-filterandshow-tilingprogramsperform
theorderingandorientingofthecontigs.Beforematching,thereferenceFASTA
sequenceischeckedthatitcontainsonlyasinglechromosomewhichhindersto
useABACASformulti-chromosomalreferencegenomes.Aftercontiguation,the
ArtemisComparisonTool(ACT)[19]canbeusedtovisualizeandmanuallyreorder
thecontigs.Additionally,thePrimer3[53]programcanberuntofindsuitable
primerpairstoclosethegaps.

MCMTheMauveContigMover(MCM)[80]isintegratedintothegenomealign-
thementcontigsandbasedvisualizationonasystemcompleteorMauvedraft[24r].eferMCMenceprgenome.oposesTtheorthiselativend,eoritderusesof
orthederingMauveprexploitsogressivthateplacingaligner[25contigs]toinaidentifycorrectlocalordercollinearwillmerblocksgeLCBs,(LCBs).thusThethe
programstrivestominimizetheirnumber.Theprocessofmatchingandordering
iscontigs,iterativtheelyrMauveepeateduntilvisualizationtheorcanderbedoesusednottochangeinspectanymorpotentiale.Afterormisassemblies,deringtheor
alsorearrangementsbetweenreferencegenomeandcontigs.

erencesRefMultiplesevPGAeralrZhaoelatedetal.[sequences109]prasesentreferaences.methodFortofindeacharlaeferyoutenceforagenomesetofafitnesscontigsmatrixusing
iscomputedgivingdistancesbetweenthecontigsbasedontheirBLASTmatches.

78

6.2.EvaluationonRealSequencingData

Allmatricesarecombinedintoasinglefitnessmatrixtosearcha(near)optimal
pathalgorithm).ofcontigItisnoteconnectionsworthywiththattheirtheheuristicrandomizedPGA(pheralgorithmomonepotentiallytrail-basedproposesgenetic
differentadjacencieseachtimeitisrun.

6.2EvaluationonRealSequencingData
Intheremainderofthischapter,weaddresstheoutcomeofthreedifferentexper-
iments:Atfirst,weevaluatetheperformanceofsinglereferencebasedcontiglay-
outers,eachtimewiththeclosestgenomeasreferencesequence.Secondly,we
comparePGAandtreecatusingallreferencegenomesexceptfortheonetobefin-
ished.Finally,weassesstheeffectofrepeatsbyapplyingtreecatandrepcatona
contigsetincludingrepetitivecontigs.
Ineachexperiment,someoftheintroducedprograms,aswellasourimplementa-
tions,wereruntodevisealayoutofthecontigs.Theoutputwasthencomparedto
thecorrespondingreferencelayout.Asqualitymeasurement,wetakethefollowing
fourvaluesthatwealsostateintheresulttablesofeachexperiment:
TPWecountallproposedconnectionsthatalsooccurinthereferencelayoutas
edictions.prpositivetrueFPPredictionsthatdonotappearinthereferencelayoutarefalsepositives.
TPRFromthevaluesofTPandFP,wecalculatethetruepositiverate(alsosensitivity):
TP,=:TPRPwithPbeingthetotalnumberofachievablecorrectconnections.
PPVThepositivepredictivevalue(alsoprecision)tellsthepercentageofallpredictions
thatwereactuallycorrect:
TPPPV:=TP+FP
Notethatwhencountingthetrueorfalsepositiveswedonotconsidertheorien-
tationofthecontigssincePGAdoesnotalwaysprovidethisinformation.Incon-
sequence,itispossiblethatwecounttoomanytruepositives.However,wethink
thatthisevaluationprocedureisfairsinceallprogramsaretreatedequallyandwe
presumethatnoprogramgainsaparticularbenefit.
Besidesthetrueandfalsepositivesofapredictedlayout,wealsomeasuredthe
runningtimeofthedifferentapproaches.Allexperimentswereperformedona
Sun-Fire-V440Solarisserverwithfoursparcv9processorsoperatingat1,593MHz.
Duringtheexperiments,weweretheonlyuserloggedinsuchthateffectsbyother
processesontherunningtimeareminimized.Further,wemeasuredtheusertime

79

Chapter6.LayoutingCorynebacteriaContigs

ofations.theprObocessesviously,whichthetimesexcludesofthethewebtimeserforvicePrsystemojectorcalls2arlikeenotforexamplecomparable.I/Ooper-

6.2.1SingleReferenceBasedOrdering
Thisfirstexperimentwasdesignedtocomparethelayoutingofothersingleref-
erencebasedapproacheswithr2cat.Tothisend,weappliedthesinglereference
programsofSection6.1.4onthecontigsetsusingtheclosestphylogeneticneighbor
asreferencesequence.Accordingtothephylogenetictree,showninFigure6.1,this
isC.glutamicumforthecontigsofC.aurimucosum,andC.jeikeiumfortheothertwo
sets.contigAscontigsets,weusedtheFASTAfilescontainingtherenamedregularcontigs.
Foranyadjacencyestimatedbyaprogramwecouldthusdeterminewhetheritis
atruepositiveorafalsepositiveprediction.Besidestheobviousadjacenciesofci
andci+1,weconsideredthefirstandthelastcontigofeachsetasadjacentsincethe
genomesarecircular.Thenumberofpossibletruepositiveadjacencies(P)isthus55
forC.aurimucosum,5forC.kroppenstedtii,and54forC.urealyticum.
Wedescribenowfirst,howthedifferentprogramswereusedandhowtheesti-
matedadjacencieswereextracted,andthendiscusstheresultsgiveninTable6.4on
page82.Exceptotherwisestated,weusedthedefaultparametersofeachprogram.
•TheProjector2resultsweregeneratedusingitswebservice.There,wefirst
selectedtheappropriatereferencegenomeforeachdatasetfromagivenlist.
FortheC.jeikeiumgenomeeitherthechromosome,ortheplasmidpKW4was
selectable,butnotboth,sothatweusedthechromosomewithouttheplasmid.
Thenextstepwastouploadthecontigstothewebserver.Onthefollowing
parameterpage,wekeptthedefaultsettings,exceptthatweselectedtodown-
loadazipfilecontainingthegeneratedfiles.Asdefault,thematchingwas
performedbyrunningBLATontheserver.TherunningtimesforProjector2
wereestimatedfromtheprovidedstartandstoptimesofthecontigordering
jobs.Accordingtothewebsite,theserverrunsonaquadcoreAMDOpteron
with2.0GHz,thus,therunningtimesontheserverarenotdirectlycompara-
bletotheotherprogramsthatweranlocally.Wegivethetimesnevertheless
toprovidearoughimpression.Theadjacenciesofthecontigswereextracted
fromthemapped_sort_1.csvfilesfoundinthedownloadedzipfiles.
•TogeneratetheOSLayresults,wefirsthadtomatchthecontigsontothecor-
respondingreferencegenomesince,unlikeforallotherprograms,thisisnot
donebytheprogram.Thematchesthatwegeneratedwithnucmer(v.3.07),
aswellastheoriginalFASTAfiles,couldthenbeprovidedtoOSLaywhich
computedanoptimalsynteniclayoutofthecontigs.Tothisend,weusedthe
standardparametersoftheimplementationandselectedtoexportthesyntenic
layout.TheadjacencieswereextractedfromthesupercontigsList.x.txt

80

6.2.EvaluationonRealSequencingData

files.TherunningtimeofOSLayinthefollowingresultstableisonlythe
timeformatchingwithnucmer.Thelayoutingtimecouldnotreasonablybe
measuredsincethenecessaryfileshavetobeprovidedmanuallyinthegraph-
icaluserinterface.Asanestimation,thelayoutingtakesbetweentwotofour
seconds.additional•TheprogramABACASrefusedtousereferencegenomeswithmorethan
onesequence.Thus,weprovidedtheC.jeikeiumreferencewithoutitsplas-
midpKW4.Forthematching,theusercanspecifythateithernucmerorpromer
iscalled.Afterarapidmatchingwithnucmer,wediscoveredthattheprogram
hadnoresultsfortwoofthedatasets,andonlyfalsepositivepredictionsfor
thethird.TheexplanationisthatABACASreliesontheshow-tilingprogram
oftheMUMmerpackagewhichproducedanemptyfileforthetwodatasets
whenusingthenucmermatches.Consequently,wedecidedtomatchalsowith
themuchslowerpromerwhichyieldedbetterresults.Theadjacencieswerein
bothcasesextractedfromthegenerated<fasta>.tabfiles,where<fasta>
isthefilenameoftheoriginalcontigfile.
•TheMauveextensionMCMiterativelymatchesandreordersthecontigs.For
eachiteration,anewfolderiscreatedthatcontainsthesequencesandthe
estimatedlayout.WeusedthebatchoptiontostartMCMfromthecommand
line.ThematchingwasperformedwiththeprogressiveMauveprogram,of
theMauvepackageversion2.3.1,whichwecompiledfortheSolarisoperating
system.Theresultingadjacencieswereextractedfromthefoldercreatedinthe
lastiteration.There,thefile<base>_contigs.tabcontainstheproposed
adjacenciesofthecontigs,with<base>beingtheprefixofthecontigsFASTA
filenameuptothefirstdot.
•Finally,weusedr2catasdescribedinSection3.3.Aftersortingthecontigs,the
resultwaswrittentoafile.
Table6.4givestheresultsofapplyingthedifferentprogramsonthethreedatasets.
ItshowsthatthepioneeringapproachesProjector2andOSLay,whichmainlyuse
matchesattheendofthecontigs,clearlyfindlesstruepositiveadjacenciesthan
neweralgorithmsconsideringallmatches.ThemorerecentprogramsABACAS
(withpromermatches)andMCMpredictmorecorrectadjacenciesbutalsomore
falsepositiveconnections.Therunningtimesoftheseprogramsaredominatedby
thematching:ForABACAS,thematchingwithpromertakesseveraldays.Inview
ofthefewsecondsofsomeotherprogramsthisisanunacceptableamountoftime.
AlsothehoursthatMCMneedsfortheiterativematchingarenotdesirable.
Ourimplementationr2catshowsonthesedatasetsareasonableperformance.
Matchingandsimpleorderingareperformedinsecondsandthelayoutresultcon-
tainshighnumbersofcorrectadjacenciesandamoderatenumberoffalsepositives.

81

Chapter6.LayoutingCorynebacteriaContigs

Tablegenetically6.4:Resultsclosestofthegenomesingleasrefreferenceerencebasedsequencelay.outingTherprogrunningamstimesusingforthephOSLaylo-y
containonlythematching,thoseofProjector2arenotcomparabletotheother
timessincetheyweremeasuredonadifferentCPU.
ProgramTPFPTPRPPVTime(s)
C.aurimucosumcontigswithC.glutamicumreference
Projector27140.130.33(26)
OSLay010.000.0016
ABAABACASCASwithwithnpromerucmer2601200.470undef0.68.92,76932
r2catMCM332211330.600.400.750.409,52613
C.kroppenstedtiicontigswithC.jeikeiumreference
Projector2320.600.60(25)
ABAOSLayCASwithnucmer000000.00undefundef..2412
ABACASwithpromer310.600.7579,415
MCMr2cat22330.400.400.400.406848
C.urealyticumcontigswithC.jeikeiumreference
Projector216150.300.52(24)
OSLay620.110.7523
ABAABACASCASwithwithnpromerucmer2301570.430.000.610.00180,04744
r2catMCM262522290.480.460.540.466,32711

LetustakeacloserlookattheC.kroppenstedtiicontigs.Figure6.5showsthesyn-
tenytotheclosestreferencegenomewiththecontigsorderedandorientedaccord-
ingtothereferencelayout.Withthesimplecontigmappingofr2cat,thecontigc03
isplacedbetweenc00andc01,asindicatedinthefigure.Thisobstructsthreetrue
adjacencies.epositiv

Theexampleshowsthatasinglemisplacedcontigcanproduceuptothreefalse
positiveconnections.Themeasurementoffalsepositivesisthusverystringentin
thisevaluation,andonemightarguethatitisbettertoknowaboutapproximate
adjacenciesinwhichmaybeasmallercontigcanbemissingbetweentwolarger
ones.Especiallyinthenextexperiment,theprogramsprovidemorefalsepositives,
andoftenthesearesuchapproximateadjacencies.

82

6.2.EvaluationonRealSequencingData

Figure6.5:SyntenyplotofthecontigsofC.kroppenstedtiimatchedonitsclosest
referencegenomeC.jeikeium.Thecontigsareorderedandorientedaccordingto
thereferencelayout.

6.2.2MultipleReferenceBasedLayouting

Inthisexperiment,wewanttoinvestigatewhethertheuseofmultiplereference
genomescanimprovetheadjacencyprediction.Severalreferencegenomesmost
likelyincreasetheinformationthatcanbecollectedfrommatchesofthecontigs.
Thisinformationcan,however,alsobeconictingandthusleadtofalsepositive
predictions.Approachesthatarenotrestrictedtooutputasinglelinearlayout,as
forexampletreecatwithitslayoutgraph,willevenaccumulatesuchfalsepositives.
Nevertheless,webelievethatshowingseverallikelypossibilitiesisbetterthanto
outputasinglelinearlayout.Thelattercanbemisinterpretedascontainingthe
onlycorrectsolution.Givingalternativesinconictingcasesexpressesbetterthata
resultshouldbehandledwithcaution.
Inthefollowing,wecomparethemultiplereferencebasedlayoutingprograms
PGAandtreecatbyapplyingthemonthecontigsetswithallgenomesofTable6.2
asreferences.Ofcourse,wheneveragenomeistobereconstructedfromitssetof
contigs,thisgenomeisremovedfromthedatasetofreferencegenomes.Likeinthe
previousexperiment,weusedthecontigsetswithoutrepetitivecontigs.Wefirst
describehowbothprogramswereusedandwhichparametershavebeenchosen,
andthenpresentanddiscusstheresultsofTable6.5onpage85.
AlthoughwealreadyintroducedPGAbriey,wewanttoprovidesomefurther
details.Thishelpstobetterunderstandsomeaspectsoftheresultsandofthe
.PGAofapplicationAlayoutcomputationwithPGAcanbedividedintotwoparts:Inthefirstpart,a
fitnessmatrixiscreatedwhichiscomparabletoourcontigadjacencygraph.Tothis
end,thesetofcontigsismatchedontoeachreferencegenomeusingBLAST.The

83

Chapter6.LayoutingCorynebacteriaContigs

correspondingPerlscriptdiscardscontigssmallerthan3.5kbpbeforematchingand
masksrepetitivematches.Additionally,tooshort,aswellasoverlappingmatches
arefiltered.Basedontheremainingmatches,thefitnessmatrixiscreatedsuchthat
itcontainspairwisedistancesbetweenthecontigs.Thedistancesrangefrom0to20,
whereavaluecloseto0indicatesthatthecontigsarepotentialneighbors.Inthe
laststepofthispart,thematricesderivedfromallreferencegenomesarecombined
matrix.singleaintoInthesecondpartofapplyingPGA,ageneticalgorithmisusedtocalculatealin-
earlayoutofthecontigsthataimstominimizethedistancesgiveninthecombined
fitnessmatrix.Fortheapplicationofthealgorithm,weusedthedefaultparameters
thatarealsogiveninthepublication[109]:Wesetthepheromonetrailpersistence
ρ=0.8,therelativeimportanceof[the]pheromonetrailandthevisibilityβ=3,
theprobabilityforpseudo-random-proportionalselectionq0=0.8,andthemaxi-
malnumberofiterationsto10,000.
Theunderlyinggeneticalgorithmisarandomizedalgorithmsuchthatthelay-
outsofeachcomputationmaydiffer.ThisisprobablythereasonwhytheC++im-
plementationofthegeneticalgorithmcomputesfivelinearlayoutseachtimeitis
run,whicharethencombinedintoasingleresult.Inacombinedresult,foreach
predictedadjacencythepercentageisgivenhowoftenitoccurredinoneofthefive
layouts.Despiteofthisattempttoproducemorerobustresults,thecombinedlay-
outsarestillsubjecttovariation.ThatiswhyweappliedthelayoutingpartofPGA
20timesforeachcontigset,resultingin100computedlayouts.Inthecomparison
inTable6.5,wegivethemeanvaluesofthe20runs,andadditionallythebestre-
sultinparentheses.Asbest,wedefinethatcombinedresultyieldingthehighest
truepositiverate.Incaseofseveralequallygoodlayouts,webreaktiesusingthe
positivepredictivevalue.
Next,weappliedtreecatonthedataasdescribedinSection5.1.2.Tocomputethe
adjacencysupportvalues,weusedthephylogenetictreeoftheCorynebacteria,and
setµ=3,000andσ=6,926,asdiscussedinthesectiononparameterestimation.
TheresultsofbothprogramsarelistedinTable6.5.Thecomparisonshowsthat
ourtreecatachievesequalorbetterresultsthanPGAwhileatthesametimebeing
muchfaster.Therunningtimesgivenforthematchingmainlycomparethecorre-
spondingroutinesinr2catandBLAST,asalreadydoneinSection3.3.1.However,
therunningtimesforthelayoutingimpressivelydemonstratethattreecatis200-fold
fasterthanPGA,onaverageforthethreedatasets.
Anotheradvantageoftreecatisthatithasonlytwoparametersthatinuencethe
layouting,andneverthelessproducesquiterobustresults.Incontrast,PGAdepends,
besidesthealreadymentionedfourparameters,onfurtherdefaultvalues:For
instance,aninitialpopulationsize,defaultmutationandcrossoverprobabilities,as
wellasminandmaxvaluesforthepopulationfitnessarestatedinthepublication.

84

6.2.EvaluationonRealSequencingData

Table6.5:ResultsofapplyingPGAandtreecattolayoutthecontigsetswiththehelp
oofverthe20runsremaining,itsbestrgenomesunof(highestTableTPR6.2aswithrefbesterencesPPV).isPGAgivseninresultsareparenthesesaver.aged
(s)TimesProgramTPFPTPRPPVMatchingLayouting
PGA28.1(31)76.0(79)C.aurim0.51ucosum(0.56)contigs0.27(0.28)214.192.3
treecat31360.560.4634.60.3
contigskroppenstedtii.CPGA3.0(3)2.0(2)0.60(0.60)0.60(0.60)98.015.6
treecat340.600.4330.50.2
contigsurealyticum.CPGAtreecat2815.5(18)4075.9(75)0.520.29(0.33)0.410.17(0.19)221.934.280.90.3

Wenowexaminetheresultsforofallthreecontigsetsinmoredetailandprovide
someadditionalobservationsthatcannotbederivedfromthetablealone.

CorynebacteriumaurimucosumForthiscontigset,PGAfoundinthebestrun
thesamenumberof31truepositiveconnectionsastreecat.Butlookingatthebest
resultofPGAalonemightdistortthegeneralimpressionofitslayoutingquality.
Thebox-plotsinFigure6.6clarifythatthisresultisnottypicalforPGA,sinceoften
worseresultsarepredicted.
Stayingatthisdataset,acomparisonoftreecatwithr2catshowsthatusingmore
referencegenomesdoesnotautomaticallyprovidebetterresults.Whiler2catpre-

Figuredataset.6.6:TheBox-plotssteadyofresultthevofaryingtreecatresultsisshoofwn20byaPGAplainrunsbarf.ortheC.aurimucosum

85

Chapter6.LayoutingCorynebacteriaContigs

ofdictsfive33corradditionalectreferadjacencies,encetrgenomes.eecatfindsFourteentwoless,oftreecatalthoughsfalseusingpositivtheeprinforedictionsmation
mightneverthelessbehelpfulinthegapclosingprocess,sincetheyareapproximate
WhenadjacenciestrhaeecatvingisarunrankwithdifC.ferenceglutamicumofoneasinthesinglereferreferenceencelayout.genome,likeinthe
experimentwithr2cat,ityields35truepositivesand25falsepositivepredictions.
Thisadjacencyshowsgraphthatthatadditionalhinderthegenomesextractioncanintroftrueoducehighadjacenciesweightintheedgeslaytooutingthecontigphase.
Theadviceisthereforetofirstassesstherelatednessofthegenomestobeusedfor
labeyouting,chosenfortobeexamplethebasiswithofthesyntenycontigplots.adjacencyThen,thegraph,mostprfollowingomisingtheonesguidelineshould
qualitybeforequantity.Ourhtscatsuitesupportssuchaworkowbyallowingto
investigatethesyntenyplotsofthecontigsandtoselectthosereferencesthatare
mostrelatedtobuildthecontigadjacencygraph.

CorynebacteriumkroppenstedtiiThiscontigsetconsistsonlyoffiveregularcon-
tigs,butitseemstobehardtolayout.Noneofthecontigorderingprogramspre-
dictedmorethanthreeofthefivepossibletruepositiveadjacencies.Toinvestigate
thisfurther,wetakealookatthecontigadjacencymatrixWfrompage46.Thecom-
pletematrixdevisedbytreecatfortheC.kroppenstedtiicontigsisshowninFigure6.7.
Eachrow(orcolumn)containstheintegerroundedadjacencysupportvaluesfora
certaincontigconnector.Theadjacenciesofthereferencelayout
←−c00−−−c01→−−c02→←−c03−←−c04−
areunderlinedinthematrix.Inthecirculargenome,c04andc00arealsoadjacent.
Asareminder,thematrixissymmetricsuchthateachadjacencysupportappears

86

r00r01r02r03r04l00l01l02l03l04
01rr0000002300003200229400006
r0202302037710000
r0300200010000
r303770043940370
W=l0004021140420438
l0120003944200215
ll0302000294000037043020000000
l0460000815000
Figureweights6.7:computedSymmetrbicalywtreecateightformatrtheixCof.Equationkroppenstedtii(4.2)filledcontigs.withTrueintegeradjacenciesrounded
areunderlined,edgespredictedbytreecatareprintedinboldface.

6.2.EvaluationonRealSequencingData

twice.Theadjacenciesextractedbytreecatinitslayoutingprocessareprintedin
boldfaceinthematrix.Consequently,allboldfacedentriesthatareunderlinedare
truepositivepredictions,andthosenotunderlinedarefalsepositives.
Inthegivenmatrix,wecanobservethatthehighestadjacencysupport(394)for
theconnectionoftheleftconnectorofc01totherightconnectorofc04actually
doesnotbelongtoacorrectadjacency.Thehighweightiscausedbyaninversion
thatisalsodocumentedinthesyntenyplotinFigure6.5forthecaseoftheclosest
genome.enceeferrThisexampleshowsthatingeneraltheweightsinthecontigadjacencygraph
justcollecttheinformationgivenbythematches.Ifthematchesarebiasedor
unreliable,thentheweightsareso,too.Consequently,allresultsderivedwiththe
helpofreferencegenomeshavetobejudgedinthecontextoftheirreliability.

CormosttrueynebacteriumpositiveurealadjacenciesyticumOncomparthisedtocontigallset,otherourapprapproaches.oachtrThiseecatfoundsuccesstheis
slightlydiminishedbythe40falsepositiveconnectionsthatarealsocontainedinthe
resultinglayoutgraph.Nevertheless,thisnumberismuchlowerthanthe75.9false
positivesthatPGApredictedonaverage.
Inthefollowing,wewanttovisuallycomparethelayoutsgeneratedbyPGAand
treecat.Tothisend,weexemplarilyshowtheresultinglayoutgraphscomputedby
bothprogramsforthisdatasetinFigures6.8and6.9.Thegraphsarevisualized
withtheprogramneatoofthegraphvisualizationpackageGraphviz,asdescribed
inSection5.2.2.ForPGAweusedthebestlayoutasdefinedabove.
Thenodesinbothgraphsarelabeledwiththerenamedcontignamesthatindicate
thecorrectadjacencies.Anoptimallayoutwouldthereforebeasinglecircleshowing
allcontigsinorderc00,c01,...,c54.InthetreecatgraphinFigure6.9,anodelabel
smalleradditionallythan3,giv500esthebases,sizeasofwellthatasthecontig.Wincidentehaveedges,drawntoinimprgraoyveallthenodesofcomparabilitycontigs
ofbothgraphs,sincePGAfiltersallsuchcontigsbeforematchingandthuscannot
containtheminthelayout.
ItisarguablewhetherthemissingcontigsmightbeadisadvantageforPGAor
not;however,weliketonotethatneitherthesizeofthecontigstoberemoved
canbespecified,northeremovalcanbeswitchedoffcompletelyintheirsoftware.
Further,inthetreecatlayout,thesmallcontigscontributedonlytofourtruepositive
connectionsbutatthesametimeintroducedtenfalsepositives.
TheedgeweightsinPGAslayoutgraphinFigure6.8tellinhowmanyofthe
wfiveeladraywoutsallthisedgesinconnectiongraywthatasprbelongoposed.toTaoincrconnectioneasetheroccurringeadabilityonlyofthisonce.graph,The
edgelabelsinourtreecatgraphgivetherelativesupport,asdefinedinSection4.1.1.
Avaluecloseto100%indicatesthatthisistheonlyrelevantadjacencyoftheincident
.connectorcontig

87

6.Chapter

88

LayoutingCorynebacteriaContigs

Figure6.8:Thebestlayout(TP18,FP75,TPR0.33,andPPV0.19)thatPGAgen-
eratedfortheC.urealyticumcontigsin20runswhenusingallothergenomesas
referencesequences.Thecontignodesarenumberedaccordingtothereference
layout.Theedgelabelstellhowoftenanadjacencyoccurredinoneofthefivelay-
outs.Toimprovethereadability,all50edgesoccurringonlyinasinglelayoutare
drawningray.

Figure

6.9:

.C

urealyticum

erencerefasgenomes

contig

6.2.

adjacencies

aluationEv

predicted

by

on

Real

treecat

Sequencing

when

using

all

Data

other

sequences.Thecontignodesarenumberedaccordingto

thereferencelayout.Contigssmallerthan3.5kbphavegraynodes.The

labelsshowtherelativesupportwithrespecttothecontigconnectornearby.

edge

89

Chapter6.LayoutingCorynebacteriaContigs

Inacomparisonofthedisplayedgraphs,thetreecatresultlookslessclutteredand
itseemsthatourapproachisabletomorerobustlyinterprettheinformationgiven
bythematchestothereferencegenomes.
FurtherinvestigationofthetreecatlayoutgraphinFigure6.9revealsthatforthe
contigsc07toc22analmostuniquepathcanbeobservedwhichordersmostofthe
innercontigscorrectly.However,ourgraphcontainsalsoconnectionslikefromc02
toc43thathaveahigh(relative)supportbutarenotcorrect.Thehighsupportis
duetothebiginversionthatcanbeobservedinthesyntenyplotinFigure3.7(a)on
page41.There,thereferencegenomeisC.jeikeiumwhichisthenextphylogenetic
neighbortoC.urealyticumandthushasahighinuenceonourresult.
Theresultsshowthatusingseveralreferencegenomescanbeadvantageous,but
thisisnotnecessarilythecase.Inparticular,wefoundanexamplewheretheuse
ofasinglereferenceinsteadofseveralwouldimprovetheresultsoftreecat.In
general,thetruepositiverateoftreecatisamongthebestcomparedtoallother
approaches.Theideaoftreecattodisplayalsoconictinginformationinalayout
graph,however,comeswiththedisadvantageofproducingmorefalsepositivesthan
approachespredictingasinglelinearlayout.Yet,incontrasttoPGAtheresulting
positivepredictivevalueismostlybetter.Additionally,wewouldliketoemphasize
thatseveralfalsepositivesoftreecatareapproximateadjacencies.Suchaprediction
canbeconsideredhelpfulforthefinishingprocessinpractice,despiteofbeinga
falsepositiveintheory.

6.2.3LayoutingRepetitiveContigs
Theaiminthelastsectionofthisevaluationistostudyhowgoodrepetitivecontigs
canbeintegratedintoacontiglayout.Inthepreviousexperimentsthesewere
excludedsincenoneoftheinvolvedprogramswasintendedtohandlerepetitive
contigsproperly.Someapproachesevenactivelytrytoremovesuchcontigs.
Here,weexemplifytheintegrationofrepetitivecontigsontheC.urealyticum
dataset.Inthefollowingexperiment,wethusemployedall69contigsaslisted
inTable6.1,includingthe15repetitivecontigs.Wewanttoremindthatwehave
prefixedallregularcontigswithacandnumberedthemintheirtrueorder.The
repetitivecontigsareprefixedwithanrandarenumberedarbitrarily.
Toknowwhereinstancesoftherepetitivecontigscanbeplacedcorrectly,we
manuallyinspectedtheirmatchesonthefinishedC.urealyticumgenomewithr2cat.
Foreachpairofadjacentregularcontigswenotedwhichrepetitivecontigshavean
occurrencebetweenthem.Thelistcontainsforexamplethattherepetitivecontigr05
canbeplacedbetweentheregularcontigsc19andc20butaswellbetweenc50
andc51.Weobservedthatsomerepetitivecontigsoccurseveraltimesintandem
betweenthesamepairofregularcontigs.
Weappliedtwoprogramstofindalayoutofthecontigs.Thefirstprogramwas
treecatthatisamongthebestprogramsinthepreviousexperiments,andthesecond

90

6.2.EvaluationonRealSequencingData

wasrepcatwhichwascreatedtocorrectlyplacetherepetitivecontigsbetweenthe
regularcontigsallowingthemtoappearmorethanonce.
Bothprogramswereruntwotimesusingasinglegenomeasreference:Theal-
readyfinishedgenomeofC.urealyticumasaperfectreference,andthegenomeof
C.jeikeiumasamorerealisticreferencethatisstillverycloselyrelated.Detailsto
thesegenomesaregiveninTable6.2.
Inthecaseoftreecat,weusedthesameparametersasinthepreviousexperiment.
Fortheapplicationofrepcat,however,wetweakedtheparametersofthecontigad-
jacencygraphcreationtobettermatchtheemployedreferences:Weexperimentally
adjustedµto2,000andσto2,000aswell.Thisdecreaseofbothparametersyielded
morereasonableresultsandcanbejustifiedbythehighrelatednesstothereference
genomes.Closerreferenceshavelessinsertionsanddeletionsand,additionally,the
repetitivecontigsareincludedthistimewhichalsodecreasestheexpectedgapsize.
Further,wesettwothresholdsforrepcatthatwereintroducedinSection4.4.2:We
settheadditionaledgesthresholdτ1=90%andthethresholdforincorporating
repetitivecontigstoτ2=0.1%.
Themanuallyannotatedlistofrepetitivecontigsbetweenregularcontigsservesas
referencelayoutintheseexperiments.AsmotivatedinSection4.4.2,theinteresting
connectionsarethosefromarepetitivecontigtoaregularcontig.Therefore,we
extractedthoseconnectionsfromtheoutputofbothprogramsandcomparedthem
withourmanuallyannotatedlist.Ifarepetitivecontigispresentinbetweentwo
adjacentregularcontigs,wecounttheconnectionstothecorrespondingregular
contigsastruepositive.Ifsuchanadjacencyisnotgiveninourlist,wecountthis
asafalsepositiveprediction.Considerthefollowinglistwhereallregularcontigs
areprintedinboldfaceforabetterdistinction:
∙∙∙−−c27→−−c28→−−r09→−−c29→−−r03→−−r13→−−r12→−−c30→−−c31→−−c32→∙∙∙
Inthisexample,apredictedconnectionfromr09toc28orc29isnaturallycounted
astruepositive.Also,wecountaconnectionfromr13toc29orc30astruepositive
sinceitispresentinbetweentheregularcontigs,althoughnotbeingdirectlyadja-
cent.Apredictedconnectionfromr13toc31,however,iscountedasfalsepositive
becausetherepetitivecontigisnotfoundsomewherenexttoc31.
Weranbothprogramsonthedatasetsandcountedthetrueandfalsepositivepre-
dictionsasdescribedabove.Ourmanuallyannotatedlistrevealedthatthe15repet-
itivecontigsoccurredin79instancesonthegenome.Foreachoccurrence,twotrue
positiveconnectionscouldbepredicted,sothesumofallpositivepredictions(P)
is158.Thisnumberisneededtocalculatethetruepositiveratewhichisgivenin
Table6.6,togetherwiththeotherlayoutqualitymeasurements.
Whencomparingtreecatandrepcat,ithastobenotedthatthelatterwasspecifi-
callydesignedtoincluderepetitivecontigsseveraltimesintoalayout.Whiletreecat
willstoptoaddedgesifbothconnectorsofarepetitivecontighavebeenconsidered,

91

Chapter6.LayoutingCorynebacteriaContigs

Table6.6:ResultsforlayoutingtherepetitivecontigsoftheC.urealyticumdataset.Un-
likethepreviousresulttables,nottheadjacenciesofregularcontigswerecounted,
buttheproperplacementofarepetitivecontignexttoaregularone.
PerfectReferenceC.jeikeiumReference
ProgramTPFPTPRPPVTPFPTPRPPV
repcattreecat13929810.880.180.950.9761235070.390.150.550.77

rtivepcatecontig,canpredictstoppingtwowhentruethepositivreelativeconnectionssupportforbecomesmanylessoccurrthanτences2.Itofwaouldrepeti-be
desirabletoverifythenumberofincludedrepetitivecontigs,forexamplewiththe
rnoteadalwcoavyseragegiveninforsothatmationwearediscussednotinusingSitectionhere.4.4.2Ne,vhoweertheless,verthisrepcatinforhasamationclearis
experiment.thisinantageadvAsexpected,repcatrecoversmorerepeatoccurrencesthantreecat.Withtheperfect
rtionseferarence,eduealmosttor90%epetitivofethecontigspossiblethatappearconnectionssevareralefound.timesinMostasinglemissinggap.Herconnec-e,
repcatonlypredictsasingleoccurrenceandthusmissespossibletruepositives.
considerWhileed,inthewershoesultwtabletheonlycompletelaconnectionsyoutprfromedictedrepetitivbyreepcattor,foregularthecontigscaseofwerthee
Figurperfectesr6.8eferandence,6.9in,Figurthougheit6.10.alsoThiscontainsgraphisinstancescomparableofrtoepetitivtheelayoutcontigsgraphswhichof
aredepictedbyrectangularnodes.Arepetitivecontigisfurtherlabeledwithan
repcatoccurrencealgorithmcountininparprincipleentheses.dowork.TheItshoiswnobsergraphvablehoillustrateswtherthategulartheideascontigsofforthem
acircleandtherepetitivecontigsareattachedinbetween.
enceHowdecrever,easesitistheappartrueentinpositivTeablerate,6.6asthatwellusingasthethemorpositivereprealisticedictivC.evjeikeiumalue.referThe-
Attendencythisispoint,observweedforneedbothtoprcomeogramsbacktoalthoughthetreecatassumptionisaffectedstatedinmorSeectiondrastically4.4.2..
rTheelatedrepcatspecies.algorithmItmightassumesbetruethatthatrepeatingsuchrregionsegionsinareoneconsergenomevedarebetweenusuallycloselyalso
rofepetitivtheseeronegionsarinelateddifferentgenome.genomesOurrisesults,veryhowfragile.ever,Theindicateideaofthatrtheepcatseemsplacementto
workonlywiththeperfectreferencethatholdstheverystringentassumptionof
conservedrepeatpatterns.
repetitivFranklye,wecontigs.havetoThisadviseoutcomeagainstisratherusingarelatedunfortunategenomesinceasrarepetitivefereencertoegionslayaroute
aroneeofotherthemethodsbiggesttoobstacleshandleronepeatsthewaythattodoanotfinishedrelyongenome.referenceFortunatelygenomes.,therFore

92

rence

countisgiveninparentheses.

finished

the

using

repcat

yb

atedgener

connections

smallerthan3.5kbparedrawningray.Repetitive

occur-

Their

once.

than

more

appear

can

contigs,depictedwithrectangularnodes,

contig

urealyticum

.C

6.10:

genomeasreference.Contigs

Figure

93

6.2.

Data

Sequencing

Real

on

aluationEv

Chapter6.LayoutingCorynebacteriaContigs

example,Wetzeletal.[103]recentlyproposedatwo-tieredapproachwhereaninitial
sequencingandassemblyissubsequentlyaugmentedwithmate-pairsthatarefine-
tunedtotherepeatstructureofthegenome.

Referencebasedlayoutingofcontigshasbeenimprovedovertheyears.First
approachesusedonlymatchesofthecontigendsonareferencegenometode-
allvisedamatchesrelativtoeaorrdereferofence,them.andourMoreevrecentaluationapprshowsoaches,thattheincludingyproourvider2catbetter,rusee-
sultscomparedtotheinitialapproaches.Equallyimportant,thesinglereference
layotheroutingapprapproachesoachesneeddiftoferincomputetheirarrunningeasonabletimes.layInout,contrastcertainlytosevr2cateralhashourswiththata
fewsecondsatoprankinthisfield.
ship,Ifitonlycanfewbeadvcontigsantageousmatchontoauserefersevenceeralofgenomethem,ifduethistoisamorpossible.edistantContigsrelation-not
matchingcomparisonontoonetherefersingleenceothermightavailablematchonapproachanotherthatgenomehandlesthatismultipleprorvided.eferInencea
genomes,treecatshowsaconvincingperformance:Timedemandforlayoutingis
superiorThe,lastandtheexperimentlayoutofthisqualityischapteratleastshowsequal,theoftenlimitationsevenofbetterreferthanencePGAbaseds.lay-
outing.instancesintoRepetitivapreopercontigslayarout.eprAlthoughoblematicrsinceepetitivetheycontigsshouldoftenbealsoplacedoccurinservepet-eral
itivelyonarelatedgenome,theadjacenciesoftherepetitiveregionsdonotseem
tobeconservedtoowellbetweenspecies,andthusareferencebasedplacementis
successful.betounlikelyAsaconclusionofthiswholeevaluation,wewouldrecommendacombination
ofr2catandtreecat.Ifacloselyrelatedreferenceisavailable–andtherelatedness
caneasilybeassessedwiththesyntenyplotsofr2cat–thenthesimplemapping
prmoreoducesdistantlyrreasonableelated,rthenesults.alaIfyoutthegraphsyntenycanplotsbeindicatecomputedthatwhichthershoeferwstheencesmostare
promisingadjacenciesthatwerecollectablefromthedata.Oursuitehtscatcontains
allnecessarycomponentssupportingsuchaworkow.

94

7Chapter

SummarOutlookandy

Infinishing.thisthesis,Onewreesultintrofoducedoureffortsbioinforisthematicprapprogramoachesr2catto–aidintrtheoducedprocessinofChaptergenome3–
whichcanbeusedtoquicklymatchasetofcontigsontoarelatedreferencegenome,
andtoinspectthematchesinasyntenyplotvisualization.Thematchescanthen
beusedtoarrangethecontigsinalinearlayoutaccordingtothereferencegenome.
Foradjacentcontigs,primerpairscanbedesignedtofacilitatetheamplification
andsequencingofthegapsinbetween.Throughitsavailabilityanditseaseofuse,
thistoolhasalreadybeenengagedinseveralsequencingprojectsconductedatthe
CenterforBiotechnology(CeBiTec)ofBielefeldUniversity.
Whiler2catismainlyasoftwarecontribution,theprogramtreecatalsocontains
novincludeelideastheandinformationconcepts:givInenbyChaptermultiple4,werdeefervisedenceangenomes.approachTotothisend,simultaneouslyweuse
anadvancedscoringfunctiontocollecthintsonwhichofthecontigsareadjacent
lizebasedevonolutionarmatchesytodistancestherofefertheencerefergenomes.encegenomesThegivscoringenbyafunctioncanphylogeneticeventruti-ee.
Allinformationisgatheredinacontigadjacencygraphwhichrepresentsallpossi-
bleadjacenciesofthecontigs.Afterbuildingthecontigadjacencygraph,weuse
amorfasteexibleheuristicandtopowextracterfultheoutputmostprcomparomisingedtoedgestheintolinearalalayyoutoutthatgraph.mostThisotherisa
programsprovide.Inanevaluationwithrealsequencingdata,wefoundoutthat
treecatprovidesequivalentorbetterresultswhilebeingmuchfasterthanotherre-
latedapproachesfromtheliterature.
Theconceptsoftreecatwereextendedtoalsohandlerepetitivecontigswhichare
prideallyomisinginincludedtheorintoy,waeprdiscoopervlaeryedoutinastheoftenevasaluationtheyofoccurtheinrtheesultinggenome.programAlthoughrepcat
thattherepetitivepartsofagenomedonotbehaveasexpectedforrealisticdata.
Thus,unfortunately,adesiredreferencebasedintegrationofrepetitivecontigsap-
pearsnottobereliable.

95

Chapter7.SummaryandOutlook

Allourimplementationsarebundledintheopensourcesoftwareprojecthtscat
thatweinitiated.AsdescribedinChapter5,itprovidesanextensiblemodular
frameworkwhichcurrentlycombinesthefeaturesofr2catandtreecat.Wehopethat
itwillbecomevaluableinmanysequencingprojects.
Toouropinion,awell-balancedcollaborationbetweenbiologyandbioinformatics
istheenginethatdrivestheresearchinourfield.Duetodifferentscientificlan-
guagesthatcomputerscientistsandbiologistsuse,thiscanbecomecomplicated.
Yet,manysuccessfuljointprojectsshowthattheeffortisworthit.Alsoourproject
wouldnothavebeenfruitfulwithoutconstantfeedbackandsuggestions,andwe
aregratefulforthepossibilityofthesecollaborationswithintheCeBiTec.
Thegoalinourresearchwastohelpinthesequencingofprokaryoticgenomes.
Though,thishelpwasnevermeanttobeareplacementofbiologicalexpertiseor
knowledge.Onthecontrary,wewanttogenerallywarnagainsttrustingautomated
analysesofbioinformaticsoftwarewithoutrestrictions.Hence,alsoalayoutcom-
putedbyoneofourprogramsshouldnotbeseenasgroundtruth,butmoreasa
supportforaproperfinishingofthegenome.
Anypredictedlayouthastobecriticallyquestionedanditshouldbeclearthatthe
layoutgraphonlyreectstheinformationgivenbythedata.Ifdifferentreference
genomesprovideconictinginformation,thenadevisedlayoutgraphmostlikely
showsmisleadingadjacencies.Inourevaluation,particularlyinversionshappened
tobeamajorreasonforintroducingconicts.Apossibleimprovementofourmeth-
odswouldthereforecompriseadetectionandpossiblyalsoanuntanglingofthese
inversions,orofrearrangementeventsingeneral.
Toinvestigateconictinginformation,itmightbedesirabletoknowwhichof
thereferencegenomescontributedtowhichproposedadjacencies.Oneideais
thustocreateforeachreferencegenomeanowncontigadjacencygraphinstead
ofcollectingtheinformationinasingleone.Allgraphscouldsubsequentlybe
compared,andifoneadjacencyisfavoredinonegraph,butnotintheothers,then
thisshowsaconicttobestudied.
Anotherpossibilityistotrytountangletheinversions,alreadybeforethecontig
adjacencygraphiscomputedfromthematches.Intherecentliteratureweobserved
anapproachthattriedexactlythis[28]bydetectingsocalledinversionsignatures.
Giventhesesignatures,itisundersomecircumstancespossibletountanglethe
inversionevents;however,thisisnotpossibleingeneral.
Ultimately,wehavetofacethequestionwhetherfurtherendeavorisnecessary
forreferencebasedlayoutingofcontigs.Certainly,approacheshavebeenimproved
overtheyears.Butisfurtherdevelopmentstillrequired?
Theanswerisambivalent.Ontheonehand,thereisstillroomforimprovement
asdemonstratedabove.Ontheotherhand,wehavetoadmitthat,mostlikely,
onedaythelayoutingofcontigswillbegratuitous.Bythetimewhensequencing
approachescaneitherobtainverylargereadsofseveralhundredkilobases,oreven

96

readcompletechromosomesatonce,theassemblyandfinishingofgenomeswill
becomeaminormatter.Consequently,alsoapproachestolayoutcontigswillvanish.
Nevertheless,wethinkthatsomeofourideasandimplementationseventhencan
beconsidereduseful.Verylargereadscanalsobematchedonareferencegenome
withr2cat.Theresultingsyntenyplotsgiveanimpressionaboutrearrangements
andalsoaboutrepetitiveregionsofbothgenomes,andadditionallythereadscanbe
arrangedaccordingtothereferenceusingthesimplemappingprocedure.Moreover,
completegenomescanalsobecomparedvisually,asalreadydiscussedinSection3.3.
Duetoanopensourcelicensing,thecorrespondingpartsofoursoftwaretomatch
genomesortodisplaysyntenymaybeintegratedinfutureprojectsaswell.
Eventhecontigadjacencygraph,whichisspecializedforcontiglayoutingand
wassolelydesignedtodothisinthecontextofmultiplereferencesequences,can
possiblybereusedinthefieldofcomparativeanalysisofgenomes:Ageneadjacency
graphcouldbeemployedtodiscovergenesthatoccurindifferentrelatedspecies
innearproximityandthusareperhapsfunctionallyassociated.Besidesmarginal
changesoftheparameters,webelievethatthecurrentimplementationcanbeused
withaFASTAfilecontaininggenesinsteadofcontigs.Thegeneadjacencygraph
hasmaybeevenadvantagesoverexistingapproachesthatrelyonapriorhomol-
ogyassignment.Whilethesecommonlyonlyacceptsequencesofuniquelylabeled
genesasinput,ourgraphadditionallyprocessesthedistancesbetweenhomologous
regionsandalsoincludesincompletelymatchinggenesimplicitly.
Thislastexampleshowsthatsomeconceptsandmethodsdevisedinthisthesis
aremoregenerallyapplicable.Possiblytheywillcontributetootherimportant
subjectsinbioinformatics,orevenbeyond.Weareveryexcitedwherethismight
leadtointhefuture.

97

Chapter

98

7.

Summar

y

and

Outlook

wledgmentsknoAcIamverygratefultoallthepeoplewhosupportedandencouragedmeinthelast
fiveyears.Thanksforcheeringmeupinhardtimesandsharingnicemomentsin
goodtimes.Thefollowingpeople,Iwouldliketoacknowledgebyname.
Atfirst,IwanttothankJensStoyeforhissupportandforgivingmetheopportunity
toworkinhisgroup.Youwerenotonlyanadviserandboss,butalsoagoodfriend.
ManythanksgoalsotoAndreasTauchforhiswillingnesstoreviewthisthesisand
forsupportingthecollaborationwithhisgroup.
DuringmyPh.D.yearsinBielefeld,IenjoyedbeingamemberoftheGenomeIn-
formaticsgroup.Inparticular,Iremembervariousretreatsandsocialeventswhere
wehadfunandexperiencedscience.Iwanttothankallmembers,pastandpresent,
thatIwasincontactwithforanamicableatmosphere.Especially,RolandWittler,
mygoodfriendandofficemate,madethistimeverypleasantforme.Thanksfor
sharingapassionforphotographyandespressomakingwithme.Furthermore,
Rolandoftengavevaluablehintsandcontributedtomyworkbyhelpingtorefine
concepts.IalsowanttorecognizeoursecretaryHeikeSamuelasthesoulofour
groupwhokeepseverythingtogether.
Ontheprofessionalside,Iamthankfultoallpeoplewhohelpedmewithideasand
discussions,whogavehintsorsupport,especiallyJochenBlom,AlexanderGoes-
mann,andTimNattkemper.ThankstoChristianRückert,SusanneSchneiker-Bekel,
EvaTrost,andDanielWibbergfortheircollaboration,theirbiologicalexpertise,and
forprovidingdatasets.Additionally,Iwanttothankseveralpeopleforproofread-
ingpartsofthisthesis:MysisterMonikaHare,andmycolleaguesPinaKrelland
PatrickSchwientek.RolandWittlerevenreadthewholemanuscript.
Notonlywassocialandprofessionalencouragementimportant;Ialsowanttoac-
knowledgethefinancialsupportIreceived.Thankyou,Jens,forthepositionin
yourgroup.IamalsogratefultotheInternationalNRWGraduateSchoolinBioin-
formaticsandGenomeResearch.Besidesprovidingtravelgrantsforconferences
andworkshops,likeBREW2007,ISMB2008,andWABI2009,thegraduateschool
gavemethepossibilitytomeettheDalaiLamainSeptember2007inMünster.
Endingtheseacknowledgments,Idedicatethelastwordstomyfamilyandmy
lovedones.IamgratefulfortheconstantsupportandcarefrommyparentsEva
andFranzHusemann,andmysistersMonikaHareandUtaSander.Finally,Ithank
myfiancéeSveaStorkforherpatienceandencouragement.

99

100

yliographBib

[1]C.Adessi,G.Matton,G.Ayala,G.Turcatti,J.J.Mermod,P.Mayer,and
E.Kaattachmentwashima.andSolidamplificationphaseDNAmechanisms.amplification:NucleicAcidsRes.characterisation,28(20):e87,ofprimer2000.
[2]S.F.Altschul,W.Gish,W.Miller,E.W.Myers,andD.J.Lipman.Basiclocal
alignmentsearchtool.J.Mol.Biol.,215:403–410,1990.
[3]S.F.Altschul,T.L.Madden,A.A.Schäffer,J.Zhang,Z.Zhang,W.Miller,and
D.J.databaseLipman.searchprGappedograms.BLASTNucleicandAcidsPSI-BLAST:Res.,anew25(17):3389–3402,generation1997.ofprotein
[4]J.A.AmgartenQuitzauandJ.Stoye.Detectingrepeatfamiliesinincompletely
sequencedgenomes.InProc.WABI,vol.5251ofLNBI,342–353.2008.
[5]S.ments.Anderson.NucleicAcidsShotgunRes.,DNA9(13):3015–3027,sequencingusing1981.clonedDNaseI-generatedfrag-
[6]D.L.Applegate,R.E.Bixby,V.Chvátal,andW.J.Cook.TheTravelingSalesman
Problem:Acomputationalstudy.PrincetonUniversityPress,2006.
[7]S.Assefa,T.M.Keane,T.D.Otto,C.Newbold,andM.Berriman.ABACAS:al-
gorithmbasedautomaticcontiguationofassembledsequences.Bioinformatics,
2009.25:1968–1969,[8]O.T.Avery,C.M.MacLeod,andM.McCarty.Studiesonthechemicalnature
ofthesubstanceinducingtransformationofpneumococcaltypes.J.Exp.Med.,
1944.79(2):137–158,[9]D.Bartels,S.Kespohl,S.Albaum,T.Drüke,A.Goesmann,etal.BACCardI—a
toolforthevalidationofgenomicassemblies,assistinggenomefinishingand
intergenomecomparison.Bioinformatics,21(7):853–859,2005.
[10]S.Batzoglou.Themanyfacesofsequencealignment.Brief.Bioinform.,6(1):6–
2005.22,

101

[11]S.Batzoglou,D.B.Jaffe,K.Stanley,J.Butler,S.Gnerre,E.Mauceli,B.Berger,
J.P.Mesirov,andE.S.Lander.ARACHNE:awhole-genomeshotgunassem-
bler.GenomeRes.,12(1):177,2002.
[12]D.A.Benson,I.Karsch-Mizrachi,D.J.Lipman,J.Ostell,B.A.Rapp,andD.L.
Wheeler.Genbank.NucleicAcidsRes.,28(1):15–18,2000.
[13]J.L.Bentley.FastalgorithmsforGeometricTravelingSalesmanProblems.
Informs.J.Comp.,4(4):387–411,1992.
[14]A.Bird.DNAmethylationpatternsandepigeneticmemory.GenesDev.,
2002.16(1):6–21,[15]J.Blom,S.P.Albaum,D.Doppmeier,A.Pühler,F.-J.Vorhölter,andA.Goes-
mann.EDGAR:asoftwareframeworkforthecomparativeanalysisofmicro-
bialgenomes.BMCBioinformatics,10:154,2009.
[16]I.Braslavsky,B.Hebert,E.Kartalov,andS.R.Quake.Sequenceinformation
canbeobtainedfromsingleDNAmolecules.InProc.Natl.Acad.Sci.USA,vol.
2003.3960–3964.100,[17]N.G.deBruijn.Acombinatorialproblem.InProc.Nederl.Akad.Wetensch.,
1946.758–764.49,ol.v[18]J.Butler,I.MacCallum,M.Kleber,I.A.Shlyakhter,M.K.Belmonte,E.S.
Lander,C.Nusbaum,andD.B.Jaffe.ALLPATHS:denovoassemblyofwhole-
genomeshotgunmicroreads.GenomeRes.,18(5):810–820,2008.
[19]T.J.Carver,K.M.Rutherford,M.Berriman,M.-A.Rajandream,B.G.Bar-
rell,andJ.Parkhill.ACT:theArtemisComparisonTool.Bioinformatics,
2005.21(16):3422–3423,[20]M.J.ChaissonandP.A.Pevzner.Shortreadfragmentassemblyofbacterial
genomes.GenomeRes.,18(2):324–330,2008.
[21]E.Chargaff.Chemicalspecificityofnucleicacidsandmechanismoftheir
enzymaticdegradation.Experientia,6(6):201–209,1950.
[22]J.Clarke,H.-C.Wu,L.Jayasinghe,A.Patel,S.Reid,andH.Bayley.Continuous
baseidentificationforsingle-moleculenanoporeDNAsequencing.Nature
2009.4(4):265–270,,Nanotech.[23]S.T.Cole,R.Brosch,J.Parkhill,T.Garnier,C.Churcher,etal.Deciphering
thebiologyofMycobacteriumtuberculosisfromthecompletegenomesequence.
1998.393(6685):537–544,,eNatur

102

[24]A.C.E.Darling,B.Mau,F.R.Blattner,andN.T.Perna.Mauve:multiple
alignmentofconservedgenomicsequencewithrearrangements.GenomeRes.,
2004.14(7):1394–1403,[25]A.E.Darling,B.Mau,andN.T.Perna.progressiveMauve:Multiplegenome
alignmentwithgenegain,lossandrearrangement.PLoSONE,5(6):e11147,
2010.[26]A.E.Darling,I.Miklós,andM.A.Ragan.Dynamicsofgenomerearrange-
mentinbacterialpopulations.PLoSGenet.,4(7):e1000128,2008.
[27]F.Dekking,C.Kraaikamp,H.Lopuhaä,andL.Meester.AModernIntroduction
toProbabilityandStatistics:UnderstandingWhyandHow.Springer,2007.
[28]Z.Dias,U.Dias,andJ.C.Setubal.Usinginversionsignaturestogenerate
draftgenomesequencescaffolds.InProc.ACMConferenceonBioinformatics,
ComputationalBiologyandBiomedicine.2011.
[29]J.C.Dohm,C.Lottaz,T.Borodina,andH.Himmelbauer.SHARCGS,afast
andhighlyaccurateshort-readassemblyalgorithmfordenovogenomicse-
quencing.GenomeRes.,17(11):1697–1706,2007.
[30]J.Eid,A.Fehr,J.Gray,K.Luong,J.Lyle,etal.Real-timeDNAsequencing
fromsinglepolymerasemolecules.Science,323(5910):133–138,2009.
[31]M.Fedurco,A.Romieu,S.Williams,I.Lawrence,andG.Turcatti.BTA,anovel
reagentforDNAattachmentonglassandefficientgenerationofsolid-phase
amplifiedDNAcolonies.NucleicAcidsRes.,34(3):e22,2006.
[32]R.E.FranklinandR.G.Gosling.Molecularconfigurationinsodiumthymonu-
cleate.Nature,171(4356):740–741,1953.
[33]E.R.GansnerandS.C.North.Anopengraphvisualizationsystemandits
applicationstosoftwareengineering.Softw.Pract.Exper.,30:1203–1233,1999.
[34]A.J.GonzalezandL.Liao.Clusteringexactmatchesofpairwisesequence
alignmentsbyweightedlinearregression.BMCBioinformatics,9:102,2008.
[35]D.Gordon,C.Abajian,andP.Green.Consed:agraphicaltoolforsequence
finishing.GenomeRes.,8(3):195–202,1998.
[36]D.Gusfield.AlgorithmsonStrings,TreesandSequences:ComputerScienceand
ComputationalBiology.CambridgeUniversityPress,1997.
[37]R.W.Hamming.Errordetectinganderrorcorrectingcodes.BellSystemTech.
1950.29:147–160,,J.

103

[38]T.D.Harris,P.R.Buzby,H.Babcock,E.Beer,J.Bowers,etal.Single-molecule
DNAsequencingofaviralgenome.Science,320(5872):106–109,2008.
[39]J.Heer,S.K.Card,andJ.A.Landay.prefuse:atoolkitforinteractiveinfor-
mationvisualization.InProc.SIGCHIconferenceonHumanfactorsincomputing
2005.421–430.CHI,,systems[40]S.A.F.T.vanHijum,A.L.Zomer,O.P.Kuipers,andJ.Kok.Projector2:contig
mappingforefficientgap-closureofprokaryoticgenomesequenceassemblies.
NucleicAcidsRes.,33:W560–W566,2005.
[41]P.HusemannandJ.Stoye.Phylogeneticcomparativeassembly.Algorithms
2010.5(1):3,,Biol.Mol.[42]P.HusemannandJ.Stoye.r2cat:syntenyplotsandcomparativeassembly.
2010.26(4):570–571,,Bioinformatics[43]P.HusemannandJ.Stoye.Repeat-awarecomparativegenomeassembly.In
Proc.GCB,vol.P-173ofLNI,61–70.2010.
[44]C.A.Hutchison,III.DNAsequencing:benchtobedsideandbeyond.Nucleic
2007.35(18):6227–6237,,Res.Acids[45]E.D.Hyman.AnewmethodofsequencingDNA.Anal.Biochem.,174(2):423–
1988.436,[46]R.M.IduryandM.S.Waterman.AnewalgorithmforDNAsequenceassem-
bly.J.Comp.Biol.,2:291–306,1995.
[47]M.ImelfortandD.Edwards.Denovosequencingofplantgenomesusing
second-generationtechnologies.Brief.Bioinform.,10(6):609–618,2009.
[48]W.R.Jeck,J.A.Reinhardt,D.A.Baltrus,M.T.Hickenbotham,V.Magrini,
E.R.Mardis,J.L.Dangl,andC.D.Jones.ExtendingassemblyofshortDNA
sequencestohandleerror.Bioinformatics,23(21):2942,2007.
[49]J.H.Jett,R.A.Keller,J.C.Martin,B.L.Marrone,R.K.Moyzis,R.L.Ratliff,
N.K.Seitzinger,E.B.Shera,andC.C.Stewart.High-speedDNAsequencing:
anapproachbaseduponuorescencedetectionofsinglemolecules.J.Biomol.
1989.7(2):301–309,,Dyn.Struct.[50]J.Kalinowski,B.Bathe,D.Bartels,N.Bischoff,M.Bott,etal.Thecomplete
CorynebacteriumglutamicumATCC13032genomesequenceanditsimpacton
theproductionofL-aspartate-derivedaminoacidsandvitamins.J.Biotechnol.,
2003.104(1-3):5–25,

104

[51]W.J.Kent.BLAT–theBLAST-likealignmenttool.GenomeRes.,12(4):656–664,
2002.[52]R.Knippers.MolekulareGenetik.Thieme,Stuttgart,2001.
[53]T.KoressaarandM.Remm.Enhancementsandmodificationsofprimerde-
signprogramPrimer3.Bioinformatics,23(10):1289–1291,2007.
[54]S.Kurtz,A.Phillippy,A.L.Delcher,M.Smoot,M.Shumway,C.Antonescu,
andS.L.Salzberg.Versatileandopensoftwareforcomparinglargegenomes.
2004.5(2):R12,,Biol.Genome[55]E.S.Lander,L.M.Linton,B.Birren,C.Nusbaum,M.C.Zody,etal.Initial
sequencingandanalysisofthehumangenome.Nature,409(6822):860–921,
2001.[56]P.Latreille,S.Norton,B.Goldman,J.Henkhaus,N.Miller,etal.Optical
mappingasaroutinetoolforbacterialgenomesequencefinishing.BMC
2007.8:321,,Genomics[57]V.I.Levenshtein.Binarycodescapableofcorrectingdeletions,insertions,and
reversals.SovietPhysicsDoklady,10(8):707–710,1966.
[58]H.LiandR.Durbin.FastandaccurateshortreadalignmentwithBurrows-
Wheelertransform.Bioinformatics,25(14):1754–1760,2009.
[59]M.Li,B.Ma,D.Kisman,andJ.Tromp.PatternHunterII:highlysensitiveand
fasthomologysearch.J.Bioinform.Comput.Biol.,2(3):417–439,2004.
[60]R.Li,H.Zhu,J.Ruan,W.Qian,X.Fang,etal.Denovoassemblyofhu-
mangenomeswithmassivelyparallelshortreadsequencing.GenomeRes.,
2010.20(2):265–272,[61]B.Ma,J.Tromp,andM.Li.PatternHunter:fasterandmoresensitivehomol-
ogysearch.Bioinformatics,18(3):440–445,2002.
[62]A.Magi,M.Benelli,A.Gozzini,F.Girolami,F.Torricelli,andM.L.Brandi.
Bioinformaticsfornextgenerationsequencingdata.Genes,1(2):294–307,2010.
[63]M.Margulies,M.Egholm,W.E.Altman,S.Attiya,J.S.Bader,etal.Ge-
nomesequencinginmicrofabricatedhigh-densitypicolitrereactors.Nature,
2005.437(7057):376–380,[64]D.W.Meinke,J.M.Cherry,C.Dean,S.D.Rounsley,andM.Koornneef.Ara-
bidopsisthaliana:amodelplantforgenomeanalysis.Science,282(5389):662–682,
1998.

105

[65]J.R.Miller,S.Koren,andG.Sutton.Assemblyalgorithmsfornext-generation
sequencingdata.Genomics,95(6):315–327,2010.
[66]E.W.Myers,G.G.Sutton,A.L.Delcher,I.M.Dew,D.P.Fasulo,etal.A
whole-genomeassemblyofDrosophila.Science,287(5461):2196–2204,2000.
[67]N.Nagarajan,C.Cook,M.DiBonaventura,H.Ge,A.Richards,K.A.Bishop-
Lilly,R.DeSalle,T.D.Read,andM.Pop.Finishinggenomeswithlimited
resources:lessonsfromanensembleofmicrobialgenomes.BMCGenomics,
2010.11(1):242,[68]S.B.NeedlemanandC.D.Wunsch.Ageneralmethodapplicabletothe
searchforsimilaritiesintheaminoacidsequenceoftwoproteins.J.Mol.Biol.,
1970.48(3):443–453,[69]C.B.Nielsen,M.Cantor,I.Dubchak,D.Gordon,andT.Wang.Visualizing
genomes:techniquesandchallenges.NatureMethods,7:S5–S15,2010.
[70]P.NyrénandA.Lundin.Enzymaticmethodforcontinuousmonitoringof
inorganicpyrophosphatesynthesis.Anal.Biochem.,151(2):504–509,1985.
[71]O.OwolabiandD.McGregor.Fastapproximatestringmatching.Softw.Pract.
1988.18(4):387–393,,.Exper[72]W.R.PearsonandD.J.Lipman.Improvedtoolsforbiologicalsequencecom-
parison.InProc.Natl.Acad.Sci.USA,vol.85,2444–2448.1988.
[73]R.Pethica,G.Barker,T.Kovacs,andJ.Gough.TreeVector:Scalable,interactive,
phylogenetictreesfortheweb.PLoSONE,5(1):e8934,2010.
[74]M.Pop,D.S.Kosack,andS.L.Salzberg.HierarchicalscaffoldingwithBam-
bus.GenomeRes.,14(1):149–159,2004.
[75]D.Pushkarev,N.F.Neff,andS.R.Quake.Single-moleculesequencingofan
individualhumangenome.Nat.Biotechnol.,27(9):847–852,2009.
[76]K.R.Rasmussen,J.Stoye,andE.W.Myers.Efficientq-gramfiltersforfinding
all-matchesoveragivenlength.J.Comp.Biol.,13(2):296–308,2006.
[77]T.D.Read,S.N.Peterson,N.Tourasse,L.W.Baillie,I.T.Paulsen,etal.The
genomesequenceofBacillusanthracisAmesandcomparisontocloselyrelated
bacteria.Nature,423(6935):81–86,2003.
[78]D.C.Richter.AlgorithmsandToolsforGenomeAssemblyandMetagenomeAnaly-
sis.Ph.D.thesis,Eberhard-Karls-Universität,Tübingen,2009.
[79]D.C.Richter,S.C.Schuster,andD.H.Huson.OSLay:optimalsynteniclayout
ofunfinishedassemblies.Bioinformatics,23(13):1573–1579,2007.

106

[80]A.I.Rissman,B.Mau,B.S.Biehl,A.E.Darling,J.D.Glasner,andN.T.Perna.
Reorderingcontigsofdraftgenomesusingthemauvealigner.Bioinformatics,
2009.25(16):2071–2073,[81]N.SaitouandM.Nei.Theneighbor-joiningmethod:anewmethodforrecon-
structingphylogenetictrees.Mol.Biol.Evol.,4(4):406–425,1987.
[82]A.Samad,E.Huff,W.Cai,andD.Schwartz.Opticalmapping:anovel,single-
moleculeapproachtogenomicanalysis.GenomeRes.,5:1–4,1995.
[83]F.Sanger,S.Nicklen,andA.R.Coulson.DNAsequencingwithchain-
terminatinginhibitors.InProc.Natl.Acad.Sci.USA,vol.74,5463–5467.1977.
[84]P.H.Sellers.Onthetheoryandcomputationofevolutionarydistances.SIAM
J.Appl.Math.,26(4):787–793,1974.
[85]J.Shendure,G.J.Porreca,N.B.Reppas,X.Lin,J.P.McCutcheon,A.M.
Rosenbaum,M.D.Wang,K.Zhang,R.D.Mitra,andG.M.Church.Accu-
ratemultiplexpolonysequencingofanevolvedbacterialgenome.Science,
2005.309(5741):1728–1732,[86]A.F.A.Smit,R.Hubley,andP.Green.RepeatMaskerOpen-3.0.http://www.
repeatmasker.org,1996–2010.
[87]T.F.SmithandM.S.Waterman.Identificationofcommonmolecularsubse-
quences.J.Mol.Biol.,147(1):195–197,1981.
[88]R.Staden.AstrategyofDNAsequencingemployingcomputerprograms.
NucleicAcidsRes.,6(7):2601–2610,1979.
[89]H.Tang.Genomeassembly,rearrangement,andrepeats.Chem.Rev,
2007.107(8):3391–3406,[90]A.Tauch,J.Schneider,R.Szczepanowski,A.Tilker,P.Viehoever,etal.Ultrafast
pyrosequencingofCorynebacteriumkroppenstedtiiDSM44385revealedinsights
intothephysiologyofalipophiliccorynebacteriumthatlacksmycolicacids.
2008.136(1-2):22–30,,Biotechnol.J.[91]A.Tauch,E.Trost,A.Tilker,U.Ludewig,S.Schneiker,etal.Thelifestyle
ofCorynebacteriumurealyticumderivedfromitscompletegenomesequence
establishedbypyrosequencing.J.Biotechnol.,136(1-2):11–21,2008.
[92]J.ThompsonandP.Milos.Thepropertiesandapplicationsofsingle-molecule
DNAsequencing.GenomeBiol.,12(2):217,2011.
[93]E.Trost,A.Al-Dilaimi,P.Papavasiliou,J.Schneider,P.Viehoever,etal.Com-
parativeanalysisoftwocompleteCorynebacteriumulceransgenomesandde-
tectionofcandidatevirulencefactors.BMCGenomics,12(1):383,2011.

107

[94]E.Trost,S.Götker,J.Schneider,S.Schneiker-Bekel,R.Szczepanowski,etal.
Completegenomesequenceandlifestyleofblack-pigmentedCorynebacterium
aurimucosumATCC700975(formerlyC.nigricansCN-1)isolatedfromavagi-
nalswabofawomanwithspontaneousabortion.BMCGenomics,11(1):91,
2010.[95]E.Trost,L.Ott,J.Schneider,J.Schröder,S.Jaenicke,etal.Thecomplete
genomesequenceofCorynebacteriumpseudotuberculosisFRC41isolatedfrom
a12-year-oldgirlwithnecrotizinglymphadenitisrevealsinsightsintogene-
regulatorynetworkscontributingtovirulence.BMCGenomics,11(1):728,2010.
[96]G.Turcatti,A.Romieu,M.Fedurco,andA.P.Tairi.Anewclassofcleavable
uorescentnucleotides:synthesisandoptimizationasreversibleterminators
forDNAsequencingbysynthesis.NucleicAcidsRes.,36(4):e25,2008.
[97]S.Tweedie,M.Ashburner,K.Falls,P.Leyland,P.McQuilton,etal.FlyBase:en-
hancingDrosophilageneontologyannotations.NucleicAcidsResearch,37:D555–
2009.D559,[98]E.Ukkonen.Approximatestring-matchingwithq-gramsandmaximal
matches.Theor.Comput.Sci.,92(1):191–211,1992.
[99]J.C.Venter,M.D.Adams,E.W.Myers,P.W.Li,R.J.Mural,etal.Thesequence
ofthehumangenome.Science,291(5507):1304–1351,2001.
[100]R.L.Warren,G.G.Sutton,S.J.M.Jones,andR.A.Holt.Assemblingmillions
ofshortDNAsequencesusingSSAKE.Bioinformatics,23(4):500–501,2007.
[101]M.S.WatermanandM.Eggert.Anewalgorithmforbestsubsequencealign-
mentswithapplicationtotRNA-rRNAcomparisons.J.Mol.Biol.,197:723–728,
1987.[102]J.D.WatsonandF.H.C.Crick.Astructurefordeoxyribosenucleicacid.
1953.171(4356):737–738,,eNatur[103]J.Wetzel,C.Kingsford,andM.Pop.Assessingthebenefitsofusingmate-
pairstoresolverepeatsindenovoshort-readprokaryoticassemblies.BMC
2011.12(1):95,,Bioinformatics[104]D.L.Wheeler,C.Chappey,A.E.Lash,D.D.Leipe,T.L.Madden,G.D.
Schuler,T.A.Tatusova,andB.A.Rapp.Databaseresourcesofthenational
centerforbiotechnologyinformation.NucleicAcidsRes.,28(1):10–14,2000.
[105]D.Wibberg,J.Blom,S.Jaenicke,F.Kollin,O.Rupp,etal.Completegenome
sequencingofAgrobacteriumspH13-3,theformerRhizobiumlupiniH13-3,re-
vealsatripartitegenomeconsistingofacircularandalinearchromosomeand

108

[106]

[107]

[108]

[109]

anaccessoryplasmidbutlackingatumor-inducingTi-plasmid.J.Biotechnol.,
2011.

Wikimedia.http://commons.wikimedia.org/wiki/File:Difference_DNA_
RNA-EN.svg.CreativeCommonsLicenseBY-SA3.0.FileaccessedonJuly
2010.13th,

M.H.F.Wilkins,A.R.Stokes,andH.R.Wilson.Molecularstructureof
deoxypentosenucleicacids.Nature,171(4356):738–740,1953.

D.R.ZerbinoandE.Birney.Velvet:algorithmsfordenovoshortreadassem-
blyusingdeBruijngraphs.GenomeRes.,18(5):821–829,2008.

F.Zhao,F.Zhao,T.Li,andD.A.Bryant.Anewpheromonetrail-basedgenetic
algorithmforcomparativegenomeassembly.NucleicAcidsRes.,36(10):3455–
2008.3462,

109

Soyez le premier à déposer un commentaire !

17/1000 caractères maximum.