Protein structure prediction and folding dynamics [electronic resource] / by Katrin Wolff
103 pages
English


Description

Protein structure prediction and folding dynamics (Proteinstrukturvorhersage und Faltungsdynamik)

Dissertation approved for the degree of Doctor of Natural Sciences (Dr. rer. nat.), by Dipl.-Phys. Katrin Wolff from Offenbach. February 2010, Darmstadt, D 17. Fachbereich Physik, Institut für Festkörperphysik.

First reviewer: Prof. Dr. Markus Porto
Second reviewer: Prof. Dr. Barbara Drossel
Date of submission: 05.01.2010
Date of examination: 15.02.2010

Declaration on the dissertation

I hereby declare that I have written this dissertation without the help of third parties and using only the cited sources and aids. All passages taken from sources are marked as such. No previous attempt at a doctorate has been made.

Darmstadt, 26 February 2010 (K. Wolff)

Abstract

The topic of protein folding can be studied from two different points of view. The first is concerned with the question of how the biologically relevant three-dimensional structure can be determined from a given amino acid sequence. This is of great practical interest as experimental determination of protein structure is difficult and costly, whereas sequencing is relatively simple and cheap. The second question is that of the physical process of folding where often knowledge of the biologically active (native) structure is assumed.

Information

Published on 1 January 2010
Language: English
File size: 14 MB

Excerpt


Abstract

The topic of protein folding can be studied from two different points of view. The first is concerned with the question of how the biologically relevant three-dimensional structure can be determined from a given amino acid sequence. This is of great practical interest as experimental determination of protein structure is difficult and costly, whereas sequencing is relatively simple and cheap. The second question is that of the physical process of folding where often knowledge of the biologically active (native) structure is assumed. As protein misfolding is the cause of several diseases, this is of bio-medical importance as well. Moreover, it is also of fundamental interest as a mesoscopic system displaying cooperative effects.

This thesis covers these two aspects of protein folding. To this end so-called structural profiles are defined which may act as a link between protein sequence and structure for the task of structure prediction. On the other hand they contain structure information in a compressed form which, as will be shown, also encodes the folding process.

A severe bottleneck in protein structure prediction is the transition from a coarse-grained to a more detailed structure description and the refinement of structure candidates that are already close to the target structure. As refinement is very computation-intensive it is advisable to concentrate on a selection of promising candidates. This is where structural profiles predicted from sequence prove to be advantageous. Their usefulness in filtering is at least on par with the established methods and they are clearly superior if the structure set is of only moderate quality or if the criterion is especially strict on what is to be considered a good structure.

An important question regarding the folding process is to what extent the structure of the native state dictates the folding pathway, thus allowing one to abstract from the chemical details of the amino acid sequence. One class of such native-centric models are so-called Gō-models which additionally rely on the principle of minimum frustration, meaning that only those amino acids that are in contact in the native state attract each other. In contrast, the model presented here, which is based on structural profiles, allows non-native interactions. Applying adapted sampling schemes to both native-centric models and to three well-studied example proteins shows that experimental results and behaviour observed in detailed all-atom simulations can be better explained in the profile-based model than in the Gō-model. In particular, the profile-based model shows a cooperative folding transition and existence of secondary structure in the unfolded state. Thus, while a simple model based on pairwise contacts cannot adequately describe folding behaviour, the profile-based model, which is native-centric as well, can reproduce fundamental experimental observations.

Summary (Zusammenfassung)

The subject of protein folding can be viewed from two different angles. On the one hand there is the question of how, for a given amino acid sequence, the three-dimensional structure that governs biological function can be determined. This is of enormous practical interest, since experimental structure determination of proteins is time-consuming and expensive, whereas sequencing is comparatively simple and cheap. The second aspect is the physical process of folding. Here the native, biologically active structure is often assumed to be known. This too is of biological and medical importance, since misfolding of proteins can lead to various diseases. In addition, the folding process is of fundamental interest as a mesoscopic system with cooperative effects.

This doctoral thesis examines both of these questions of protein folding. First, so-called structural profiles are defined, which on the one hand can act as a link between sequence and structure in structure prediction, and on the other hand contain structure information and, as will be shown, folding information in compressed form.

A bottleneck in structure prediction is the transition from a coarse-grained to a detailed structure description and the further optimisation of structures whose overall fold already corresponds to the target structure. Since this step requires a great deal of computing power, it is advisable to make a preselection of the structure candidates. This is where predicted structural profiles come into play, which, as shown in this work, are at least equivalent to conventional selection methods and under certain conditions even clearly superior. This is the case in particular when the initial set of structure candidates is of only moderate quality or when an especially strict criterion is applied to the quality of the selected structures.

An important question about the folding process is to what extent the folded structure determines the folding pathway, so that one can abstract from chemical details of the amino acid sequence. One such class of native-centric models are so-called Gō-models, which additionally rely on the principle of minimal frustration and make attractive only those contacts between amino acids that also exist in the native structure. In contrast, non-native interactions can also occur in the model presented here, which is based on structural profiles. Using adapted sampling methods, it is then shown for three well-studied example proteins that the folding behaviour of profile-based models agrees better with experimentally observed behaviour and with that of atomistically resolved simulations than is the case for the Gō-model. This applies in particular to the cooperativity of the folding transition and the presence of secondary structure in the unfolded state. Thus, while a simple model based on pairwise contacts cannot adequately describe the folding behaviour, it is nevertheless possible to reproduce fundamental observations with native-centric models such as the profile-based model.

Contents

1 Introduction
  1.1 Motivation
  1.2 Protein Structure
  1.3 Contact Maps and Structural Profiles

2 Protein Structure Prediction
  2.1 Motivation
  2.2 Structure Selection
    2.2.1 Benchmark Structures
    2.2.2 Prediction of Structural Profiles
    2.2.3 Distribution of Rosetta Candidate Structures
    2.2.4 Filtering of Rosetta Candidates
    2.2.5 Filtering of CASP8 Models
  2.3 Discussion

3 Protein Folding Dynamics
  3.1 Motivation
  3.2 The Protein Model
  3.3 Protein Structure Reconstruction
  3.4 Folding Simulations and Free Energy Landscapes
    3.4.1 Folding Simulations, Time Series and Distributions of Observables
    3.4.2 Constrained Sampling
    3.4.3 Metadynamics
    3.4.4 Constrained Sampling and Metadynamics Combined
    3.4.5 Example Proteins and Comparison of Models
    3.4.6 Contact Maps as Microstates
    3.4.7 Heat Capacities and Folding Transitions
  3.5 Discussion

4 Conclusion and Outlook

A RMSD and TM-Score Distribution of Candidate Structures

List of Figures

1.1 Amino acids and peptide bond
1.2 Protein sequence and protein structure
1.3 Coarse-grained protein structure
1.4 Contact map and structure profiles
2.1 Artificial Neural Network (ANN) to predict structural profile from sequence
2.2 Distribution of RMSD and 1−TM-score for proteins 1pv0 and 1ubq
2.3 Correlation of the Rosetta score and the EC-score for exact profile and predicted profile, to RMSD for proteins 1ubq and 1shg
2.4 RMSD and 1−TM-score distribution of filtered structures for proteins 1c9oA and 1ubq
2.5 RMSD distribution of filtered and clustered structures for proteins 1c9oA and 1btb
2.6 Number of good structures in dependence on the number of selected structures for proteins 1c9oA and 1ubq
2.7 Distribution of filtered structures for interpolated ECs for proteins 1gb1 and 1c9oA
2.8 Relative frequencies of good structures for the CASP8 set
3.1 Tube model
3.2 Move set of Monte Carlo simulation
3.3 Restricted structural profiles
3.4 Geometric criterion of secondary structure
3.5 Statistics of reconstructed proteins
3.6 Examples of reconstructed structures
3.7 Distribution of full contact overlap
3.8 Folding time series for the villin headpiece in the EC-model
3.9 Folding time series for the villin headpiece in the Gō-model
3.10 Definition of parts A and B for the villin headpiece
3.11 Folding of the villin headpiece in partial RMSD coordinates
3.12 Folding of the villin headpiece in helix content and contact number coordinates
3.13 Constrained sampling for the villin headpiece
3.14 Four state system
3.15 Metadynamics sampling for the villin headpiece
3.16 Constrained metadynamics sampling for the villin headpiece
3.17 Combined metadynamics samplings for the villin headpiece
3.18 Villin headpiece target structure
3.19 Free energy landscape for the villin headpiece in the EC- and Gō-model in helix content and number of contacts
3.20 Free energy landscape for the villin headpiece in the EC-model in RMSD_A and RMSD_B
3.21 Free energy landscape for the villin headpiece in the Gō-model in RMSD_A and RMSD_B
3.22 Free energy landscape for the villin headpiece with folding trajectories
3.23 BBL target structure
3.24 Free energy profile for BBL in the EC-model in end-to-end distance
3.25 Free energy profile for BBL in the EC-model in contact overlap
3.26 Free energy profile for BBL in the Gō-model in contact overlap
3.27 WW domain target structure
3.28 Free energy landscape for the WW domain in the EC- and Gō-model in helix content and contact number
3.29 Folding time series for the WW domain in the EC-model for energy and helix content
3.30 Free energy landscape for the WW domain in the EC-model in contact overlap and RMSD
3.31 Free energy landscape for the WW domain in the Gō-model in contact overlap and RMSD
3.32 Frequent contacts and frequent contact maps
3.33 Heat capacity curves for the villin headpiece in the EC- and Gō-model
3.34 Energy distribution for the villin headpiece in the EC-model
3.35 Energy distribution for the villin headpiece in the Gō-model
3.36 Configurations of the villin headpiece in the EC-model at different temperatures
A.1 Distribution of RMSD and 1−TM-score for proteins 1pv0, 1gb1, 1shg and 1jic
A.1 Distribution of RMSD and 1−TM-score for proteins 1r69, 1c9oA, 1mjc and 1fgp
A.1 Distribution of RMSD and 1−TM-score for proteins 1ubq, 1oqp, 1btb and 1p9yA
A.1 Distribution of RMSD and 1−TM-score for proteins 2imf, 1volA, 1ix9A and 1f5x
A.1 Distribution of RMSD and 1−TM-score for proteins 1gk9A and 1by1

List of Tables

1.1 Canonical proteinogenic amino acids
2.1 Protein target structures for which candidates were predicted using Rosetta
2.2 Protein target structures from CASP8
2.3 Quality of different filtering methods
2.4 Filtering by interpolated structural profiles
2.5 Centres of the 10 largest clusters
3.1 Free energy differences for four state system
A.1 Estimated number of candidates to expect one structure below RMSD = 5 Å for large protein target structures

List of Abbreviations

ANN      Artificial neural network
CASP     Critical Assessment of Techniques for Protein Structure Prediction
CASP8    8th Critical Assessment of Techniques for Protein Structure Prediction
CV       Contact vector
EC       Effective connectivity
MC       Monte Carlo
MD       Molecular Dynamics
mEC      Interpolated effective connectivity
PDB      Protein Data Bank
PE       Principal eigenvector
predEC   Predicted effective connectivity
PSSM     Position specific scoring matrix
RMSD     Root mean square deviation
SCOP     Structural Classification of Proteins

1 Introduction

1.1 Motivation

Proteins carry out a variety of vital functions in every living organism, ranging from enzyme proteins regulating biochemical reactions to motor proteins causing the contraction and movement of muscles. For the vast majority of proteins the three-dimensional shape is crucial for biological function [1].

While the amino acid sequence constituting the protein defines the three-dimensional form at physiological conditions (usually uniquely) [2] it is not at all clear how this mapping from sequence to structure should be performed. Experimental structure determination by X-ray or NMR studies on the other hand is very costly compared to the relatively simple sequencing. Therefore, there are a host of protein sequences available while only a much smaller number of structures are known [3, 4]. Thus, prediction of protein structure from sequence poses a formidable challenge with great impact on protein engineering and medical applications.

Not only the final biologically active form, the so-called native state of a protein, is of scientific interest but also the folding process that may take microseconds for the fastest folders or up to several minutes for more complex proteins [1]. Proteins in living organisms occasionally fold into “wrong” shapes that cannot fulfil the biological function and, although these misfolded proteins are usually quickly degraded, they are connected to diseases such as BSE or Alzheimer’s [1].

The two parts, protein structure prediction and folding dynamics, thus approach the problem of how a chain of amino acids folds into a specific structure from two different angles. While the former focuses on the final structure, i.e. the outcome of folding, the latter elucidates the physical process itself. Successful methods of structure prediction usually ignore the physical process, whereas investigations on folding behaviour often incorporate information on the native structure. Although these methods differ, both cases can gain from the use of structural profiles, as will be shown in this thesis.

Protein structures can be represented as so-called structural profiles [5, 6] (see Section 1.3). Such a profile is an array of the same length as the protein’s amino acid sequence and contains information on each position’s connectivity, or propensity to have contacts with other parts of the sequence. As such it is strongly correlated to the hydrophobicity of the amino acid at the position in question [7].

Prediction of the unknown structure of a protein can principally follow one of two routes: Either the sequence is found to be highly similar to that of one or more proteins of known structure, then a template can be created from the known structure(s) and used as a basis for structure prediction. This method is known as homology modelling which today gives very acceptable results [8]. If this is not the case and no sufficiently similar structure can be found, the structure has to be predicted ab initio and only local similarities to known structures can be exploited. A common method then is to proceed in two steps, the first of which consists in creating a large number of coarse-grained candidate structures. The second step is to refine these first guesses to a higher level of detail. As refinement is computationally very costly, it is very important to limit this task to the most promising candidates. The first part of this thesis, Chapter 2, addresses this problem of candidate identification and selection.

In this context of ab initio structure prediction the use of the exact profile derived from the structure will be discussed as a proof of principle for the selection of good candidate structures. As these exact profiles, however, are not available for real predictions, the true challenge is to use profiles in the selection step that have been predicted from sequence. Filtering by either exact or predicted profiles is tested on two different sets of proteins and compared to the more established methods of filtering by low-resolution energy [9] and structure selection by clustering [10, 11]. The influence of profile prediction quality on the performance of filtering is investigated, as are size of proteins and quality of candidate sets.

Simulations of protein folding can give insight into intermediate structures or long-lived metastable states as well as elucidate the folding mechanism. Such simulations range from highly simplified models on two- or three-dimensional lattices [12, 13], beads-on-a-string models with only hydrophobic interactions [14, 15] or Gō-models based on the principle of minimum frustration [12, 16] to molecular dynamics (MD) simulations of varying complexity. For a protein in an aqueous solution the effects of water can be taken into account implicitly via solvation models (implicit water, e.g. Ref. [17]). The more exact approach is to simulate a few layers of molecular water along with the protein itself (explicit water) [18]. This, however, dramatically increases computational costs as it not only increases the system size but at the same time decreases the possible integration time step. Likewise, an all-atom force field [19] or an effective force field with so-called united residues [20, 21] can be used. Of course these methods all have different advantages and shortcomings. While lattice proteins in their discrete conformation space can be exhaustively enumerated, and are especially useful if generic folding behaviour is investigated [13, 22], they are not realistic enough to study specific proteins. Highly sophisticated MD force fields on the other hand can capture a high level of detail close to a protein’s native state but are not well suited to investigate entire folding trajectories. This is due to the high computational cost and, more importantly, to today’s force fields, which are optimised for fully folded structures and small molecules only [23] so that their behaviour is not necessarily realistic for unfolded conformations. There are also a number of purely theoretical models [24–26] that make predictions on heat capacity curves and the like but not on folding pathways.

In the second part of this thesis, Chapter 3, I study the dynamics of proteins by means of Monte Carlo simulations in a coarse-grained model biased towards the native structure. This approach is similar in spirit to Gō-models which also rely on the knowledge of the native state but it differs in so far as it does not assume that only those amino acids that interact in the native state interact during folding (i.e. the principle of minimal frustration). In the context of folding dynamics the structural profile is computed from the known three-dimensional structure and used to create a potential favouring the native structure. This thesis shows that the model based on the structural profile allows the successful reconstruction of three-dimensional protein structures, and investigates the folding behaviour by means of folding trajectories and free energy landscapes. Adapted sampling schemes to create free energy landscapes are presented and discussed. Some specific example proteins, for which experimental data or molecular dynamics simulations are available, are examined in more detail. These results are compared to those obtained by Gō-models and it is shown that by adding only a little more complexity to the model considerably more realistic behaviour can be observed.

1.2 Protein Structure

Naturally occurring proteins mostly consist of 20 standard (so-called canonical) amino acids as building blocks. These amino acids form linear (unbranched) chains or sequences which subsequently fold into highly specific three-dimensional structures. Each amino acid has an amino (N)-terminus and a carboxyl (C)-terminus (see Fig. 1.1(a)) and a side chain R (“residue” or simply “rest”) that determines the amino acid’s identity. The carbon atom to which the side chain is attached is the Cα-atom. A list of canonical amino acids is given in Table 1.1 where they are loosely grouped according to their polarity and charge, following Ref. [27]. These properties are important to account for an amino acid’s hydrophobicity or hydrophilicity and therefore its tendency to be part of the protein core or part of the solvent-exposed surface. Another important property is sheer size: while small side chains (such as the single hydrogen atom of glycine, see Fig. 1.1(b)) can be packed into tight turns, large side chains (e.g. that of tryptophan) are too bulky. There are two amino acids that can be marked as special for their structural properties: Proline’s side chain closes back in on its backbone’s nitrogen atom making proline particularly rigid. As such proline frequently disrupts secondary structural elements. Cysteine on the other hand even has impact on tertiary (or quaternary) structure by forming very stable disulphide bonds with other cysteines that can be far away in the sequence or even located on another chain within a protein complex.

Figure 1.1: Amino acids and peptide bond. Part (a) shows a generic amino acid with R denoting the side chain that distinguishes the amino acid, the red α indicates the position of the α-carbon or Cα. Some specific amino acids are given in (b), glycine (GLY) is the simplest amino acid where the entire side chain consists of a single hydrogen atom, whereas tryptophan (TRP) is particularly bulky and aromatic. Proline (PRO) is special in that its side chain closes back in on the nitrogen atom and cysteine (CYS) can form very stable disulphide bonds with other cysteines in the protein. Part (c) shows the formation of a peptide bond between two amino acids. For those amino acids where the Cα-atom is a chiral centre the naturally abundant enantiomer is given. Drawings of chemical structures are courtesy of C. Wolff.

In protein sequences amino acids are usually abbreviated to either a three- or a one-letter code. These abbreviations are also reported in Table 1.1. There are more abbreviations for unknown or unclear amino acids (e.g. if the amino acid is leucine or isoleucine but the exact kind could not be determined this is indicated by the three-letter code XLE and one-letter code J). One relatively frequent non-standard amino acid is selenomethionine (MSE) which is incorporated by the organism instead of methionine indiscriminately, the sole difference between the two being that selenomethionine contains a selenium atom instead of the sulphur atom. Because of this property it is sometimes substituted for methionine and used in X-ray structure determination as its larger mass helps with the crystallography phase problem [28].

The amino acids in a protein’s sequence, or primary structure, form peptide bonds (see Fig. 1.1(c) and Fig. 1.2(a)): One amino acid’s NH2-group reacts with another’s COOH-group, the two amino acids become linked and one molecule of water is released. In the rigid peptide bond, consecutive Cα-, C-, N- and Cα-atoms are restrained to lie in a plane with a fixed distance between Cα-atoms of 3.8 Å in the predominant trans configuration (2.8 Å in the much rarer cis configuration). This is equivalent to the statement that the dihedral angle ω between planes defined by Cα-C-N and C-N-Cα is restricted to values of 180° (trans) or 0° (cis). The dihedral angle between the plane defined by C-N-Cα and the plane defined by N-Cα-C is called φ and the dihedral angle between the plane defined by N-Cα-C and the plane defined by Cα-C-N is called ψ. These angles, φ and ψ, are not restricted by the peptide bond but make up the backbone’s degrees of freedom.

To form a compact folded structure and squeeze out water from the interior, proteins form secondary structure elements [29] such as α-helices and β-sheets (see Fig. 1.2(b)). This results in values of φ and ψ clustering around typical regions in so-called Ramachandran plots [30] corresponding to α-helices or β-sheets. These secondary structure elements are additionally stabilised by hydrogen bonds between strands of β-sheets or turns in the helices [31]. Chirality of amino acids (see Fig. 1.1(b)) results in a preferred chirality of α-helices but the cause of their homochirality (as well as that of the ribose of RNA) is still puzzling [32]. Secondary structure elements are then assembled into tertiary structures (see Fig. 1.2(c)) often with recurring motifs such as β-α-β (not shown). Quite often the biologically functional unit consists of several proteins forming complexes. These protein assemblies are then referred to
as quaternary structures. They can either contain repeating identical subunits or heterogeneous substructures. Although structure determination proceeds at a much slower pace than sequencing of proteins, the first sequence (of insulin [33], 1955) and structure (of myoglobin [34], 1958) were determined at roughly the same time.

A protein’s sequence thus determines the native three-dimensional structure which, according to Anfinsen’s famous paradigm [2], lies in the free energy minimum and is thus stable at physiological conditions. Specifically, this means that a protein will refold to its native state after denaturation, once physiological conditions are restored. This paradigm holds at least for small proteins. Larger and more complex proteins may require assistance of chaperones for folding [27] and a couple of proteins have been observed for which the biologically active (native) state is not the most stable conformation. These proteins degrade after minutes to hours and become inactive [36–38]. Folding is also influenced by confinement and molecular crowding [39, 40]. Notwithstanding these few (but notable) exceptions, Anfinsen’s paradigm is usually assumed in protein folding which is also the case for this thesis. The structures discussed for folding in this thesis are rather small (up to 45 amino acids) and domains of larger proteins. They fold, however, autonomously if excised from the large protein and independently from other domains if part of the large proteins which themselves fold in a modular fashion [1].

amino acid       three-letter code   one-letter code   side-chain polarity
alanine          ALA                 A                 nonpolar
glycine          GLY                 G                 nonpolar
valine           VAL                 V                 nonpolar
leucine          LEU                 L                 nonpolar
isoleucine       ILE                 I                 nonpolar
phenylalanine    PHE                 F                 nonpolar
proline          PRO                 P                 nonpolar
methionine       MET                 M                 nonpolar
tryptophan       TRP                 W                 nonpolar
aspartic acid    ASP                 D                 charged
glutamic acid    GLU                 E                 charged
lysine           LYS                 K                 charged
arginine         ARG                 R                 charged
serine           SER                 S                 polar
threonine        THR                 T                 polar
tyrosine         TYR                 Y                 polar
histidine        HIS                 H                 polar
cysteine         CYS                 C                 polar
asparagine       ASN                 N                 polar
glutamine        GLN                 Q                 polar

Table 1.1: The 20 canonical proteinogenic amino acids, the first three columns contain full name and three- and one-letter codes. The last column groups amino acids into three classes: Nonpolar, charged and polar, respectively. Source: Ref. [27].
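The three- and one-letter codes of Table 1.1 lend themselves to a simple lookup. The following is a minimal sketch, not part of the thesis: the names `THREE_TO_ONE`, `NON_STANDARD` and `to_one_letter` are hypothetical, mapping selenomethionine (MSE) to M is an assumption that treats it as methionine, and the fallback symbol X for unrecognised residues is likewise an assumed convention.

```python
# Three-letter to one-letter codes of the 20 canonical amino acids (Table 1.1).
THREE_TO_ONE = {
    "ALA": "A", "GLY": "G", "VAL": "V", "LEU": "L", "ILE": "I",
    "PHE": "F", "PRO": "P", "MET": "M", "TRP": "W",
    "ASP": "D", "GLU": "E", "LYS": "K", "ARG": "R",
    "SER": "S", "THR": "T", "TYR": "Y", "HIS": "H",
    "CYS": "C", "ASN": "N", "GLN": "Q",
}

# Codes mentioned in the text for special cases.
# XLE -> J (leucine or isoleucine) is taken from the text;
# MSE -> M is an assumption (selenomethionine treated as methionine).
NON_STANDARD = {"XLE": "J", "MSE": "M"}


def to_one_letter(residues, unknown="X"):
    """Convert a list of three-letter residue codes to a one-letter string.

    Residues that are not recognised are replaced by `unknown`
    (an assumed placeholder symbol).
    """
    codes = {**THREE_TO_ONE, **NON_STANDARD}
    return "".join(codes.get(name.upper(), unknown) for name in residues)


if __name__ == "__main__":
    print(to_one_letter(["MET", "GLN", "ILE", "PHE", "VAL", "LYS", "THR"]))
```

A dictionary keeps the mapping explicit and makes it easy to see which non-standard codes are handled by assumption rather than by Table 1.1.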
Figure 1.2: Protein sequence and protein structure illustrated on ubiquitin (PDB [35] id. 1ubq). (a) Amino acid sequence or primary structure, (b) secondary structure elements (α-helix, anti-parallel and parallel β-sheet) and (c) tertiary structure.

In 1968 Levinthal noted that even a protein of moderate size would require astronomical times for folding by a random search of conformations, whereas in fact proteins may fold very quickly. He resolved what seemed to be a paradox by postulating pathways of folding that were followed by the protein [41, 42]. Today this view has been replaced by the notion of a funnel in energy leading towards the native structure [16, 43], thus no clear-cut pathway or sequence of intermediate conformations has to be adhered to but instead multiple routes are allowed that finally reach the bottom of the funnel. While potential energy is funnelled towards the native states (with some possible roughness), loss of entropy almost compensates for the gain in energy during folding and free energy has to overcome at least one barrier. The possibility of downhill folding without a free energy barrier exists but its observation for a real protein is currently intensely debated [44–46] (see also the discussion of example protein BBL in Section 3.4.5). Levinthal’s paradox, however, is still of importance when designing potentials, mainly for structure prediction. Conformation space is too vast to be exhaustively sampled and energy potentials have to be designed to be funnelled towards the native structure if near-native structures are to be found within reasonable times.

Proteins are classified into folds and families based on their structures [47, 48] and it appears that the number of possible folds is limited. New folds for single-domain proteins are rarely encountered and new territory is mostly due to new assemblies of different domains [49]. From this follows that many sequences map onto very similar structures, meaning that structure information is evolutionarily more strongly conserved than sequence information [50]. There also exist exceptions from this rule: proteins could be designed to more than 80% sequence identity but folding into very different structures [51], and naturally occurring proteins of different folds but 40% sequence identity were observed [52]. Usually, however, sequence similarity implies even stronger structural similarity and a good way to predict structure is by searching for, even remotely, similar sequences of known structure [8].
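The random-search argument attributed to Levinthal earlier in this section can be made concrete with a back-of-the-envelope estimate. The sketch below is illustrative only: the function name `levinthal_time_years` is hypothetical, and the chain length of 100 residues, three conformations per residue and a sampling rate of 10^13 conformations per second are assumed textbook-style numbers, not values taken from the thesis.

```python
import math


def levinthal_time_years(n_residues, states_per_residue=3, rate_per_second=1e13):
    """Estimate the time (in years) to visit every conformation once.

    Assumes each residue independently adopts one of `states_per_residue`
    conformations and that `rate_per_second` conformations are sampled
    per second. Works in log10 space to keep the numbers manageable.
    """
    log10_conformations = n_residues * math.log10(states_per_residue)
    log10_seconds = log10_conformations - math.log10(rate_per_second)
    seconds_per_year = 365.25 * 24 * 3600
    return 10 ** (log10_seconds - math.log10(seconds_per_year))


if __name__ == "__main__":
    years = levinthal_time_years(100)
    # Under the stated assumptions the result is of the order of 10^27 years.
    print(f"exhaustive search: about {years:.1e} years")
```

Even with these generous assumptions the search time exceeds any physically meaningful timescale by many orders of magnitude, which is why a funnelled energy landscape rather than exhaustive sampling is invoked.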

1.3ContactMapsandStructuralProles

ofProteincoarse-grainingstructuresiscantobenottreatsimpliedtoproteinsdifferentinalevelsquantumofchemicalcoarse-grainingdescription,(seeFig.ignore1.3).electronTheveryrstcorrelationsstep
andusedininsteadall-atomemployforceeldsinteratomicsuchaspotentialsCHARMM[27](Fig.[53],1.3AMBER(a)).This[54]orwillresultGROMACinSthe[55],empiricaltonameapotentialsfew.
Thenextlevelofsimplicationusuallyistoomitthesidechainsthatarespecicforeveryaminoacid
andconsideronlythebackboneconsistingofrepeating[NCCO]-units(Fig.1.3(b)).Theinversestep,
topositionsincludefromsideachainssidechainintoalibrarygiven[56]backboneand,oralthoughrecoverthecomplicated,fullinformation,canbeviewedinvolvesasbasicallyoptimisingsolved.rotamer
FurthersimplifyingthebackboneresultsinaC-trace.Becauseoftherigidpeptidebondthisrep-
canberesentationsimilarlyisessentiallyrecovered[57].equivalentThetonextthelevelbackboneofof[simplicationNCCO]-unitswouldandconsiderthefullelementsall-atomofsecondarydescription
ofstructurerepresentativeasbasicCunits-atoms[58](with(seeFig.one1.3exception(c)),inhoweverSection,inthis3.4.7).thesisWweorkingwillwithstayatsecondarytheabstractionstructurelevelele-
mentsstructureasandbuildingprecedeblocksthealsolattermakesintheformation,assumptionwhichthatisonetheyaretheoryofsignicantlyfoldingmoremechanismsstablethanbutnottertiarythe
5

)a(

)b(

)c(

Figure1.3:Dierentlevelsofcoarse-grainingillustratedonubiquitin(PDBid.1ubq),(a)showstheposi-
tionsofallnon-hydrogenatomsandcovalentbondsbetweenthem(carbonisshownincyan,
oxygeninred,nitrogeninblueandsulphurinyellow),(b)givesthebackbonestructurewith
C-atomshighlightedascyanballsandpeptidebondsshowninred,(c)takescoarse-graining
tothelevelofsecondarystructureelements.

[Figure 1.4, panels: (a) three-dimensional structure; (b) contact map, axes "residue number i" and "residue number j"; (c) profiles CV, PE and EC, "vector entry" plotted against "residue number i".]
Figure 1.4: From the three-dimensional structure (a) the contact map (b) can be computed. Possible structure profiles (c) are contact vector (CV), principal eigenvector (PE) and effective connectivity (EC). The example protein shown is ubiquitin (PDB id. 1ubq).

only one. It is therefore more informative to omit this a priori assumption and verify afterwards whether folding simulations follow this route.

The distance between consecutive Cα-atoms is (approximately) fixed, a consequence of the rigid peptide bond, which leaves roughly 2L degrees of freedom where L is the number of amino acids. (The first amino acid can be placed arbitrarily, the second amino acid arbitrarily on a sphere of radius fixed by the interatomic distance. For the next amino acid one angle has to be specified and for the remaining L−3 amino acids two angles.) A huge simplification is to discard the individual amino acid positions and only retain information about contacts, i.e. close proximity, between amino acids which results in an L×L symmetric and binary matrix, the so-called contact map. In general, this representation is not equivalent to the coordinate information: a chain with no contacts at all can still take a large number of conformations which would all be mapped to the empty matrix. However, for compact folded proteins the coordinate representation can be retrieved from the contact map [59, 60] with only a marginal loss in resolution that is comparable to resolution obtained in experiments. Another small issue that has to be considered is that chirality matters in biological molecules and the mapping to contact matrices does not preserve this property. As helices hardly ever occur in the left-handed version, however, it is easy to pick the correct structure.

There are several ways to define a contact between two amino acids, the simplest being by the distance between Cα-atoms. The distance threshold, or contact radius $r_c$, is also somewhat arbitrary and whether a definition is a good choice also depends on the intended application. In this thesis, a contact radius of $r_c = 8.5$ Å was found to work well in the context of both structure prediction and folding dynamics. The contact map is then defined as

\[
C_{ij} =
\begin{cases}
1 & d_{i,j} < r_c \;\wedge\; |i-j| > 2 \\
0 & d_{i,j} \geq r_c \;\vee\; |i-j| \leq 2
\end{cases}
\tag{1.1}
\]
where $d_{i,j}$ is the distance between Cα-atom i and Cα-atom j. Trivial contacts, such as self-contacts and any contact with $|i-j| \leq 2$, are disregarded. In a similar approach, contacts between amino acids could be defined based on distances between "heavy atoms", which means any non-hydrogen atoms. The distance threshold is then usually set to $r_c = 4.5$ Å and $C_{ij} = 1$ if any heavy atom of amino acid i comes closer than 4.5 Å to any heavy atom of amino acid j (with analogous treatment of trivial contacts). In the following, however, the three-dimensional structure will be described at the level of Cα-atoms and contacts are defined accordingly.

For structure prediction purposes contact maps are not very useful as they are themselves very difficult to predict [61]. In particular false positives in the prediction of the contact map severely worsen the resulting three-dimensional structure while missing contacts are not as harmful [62].
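The Cα-based definition of Eq. (1.1) can be sketched in a few lines of Python. This is only an illustration, assuming NumPy and coordinates given as an (L, 3) array in Å; the function name `contact_map` and the parameter `min_sep` are hypothetical and not taken from the thesis.

```python
import numpy as np

def contact_map(ca_coords, r_c=8.5, min_sep=3):
    """Binary contact map in the spirit of Eq. (1.1).

    ca_coords: (L, 3) array of C-alpha positions in Angstrom.
    r_c:       contact radius in Angstrom.
    min_sep:   smallest sequence separation |i - j| that counts as a contact
               (3 means that pairs with |i - j| <= 2 are treated as trivial).
    """
    ca_coords = np.asarray(ca_coords, dtype=float)
    n_res = len(ca_coords)
    # Pairwise Euclidean distances d_ij between all C-alpha atoms
    diff = ca_coords[:, None, :] - ca_coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # Sequence separation |i - j|
    idx = np.arange(n_res)
    sep = np.abs(idx[:, None] - idx[None, :])
    return ((dist < r_c) & (sep >= min_sep)).astype(int)
```

The result is symmetric with an empty diagonal by construction. A fully stretched chain with 3.8 Å between neighbours yields the empty matrix, because residues three or more positions apart are already more than 8.5 Å away from each other.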
In the context of folding simulations, use of the contact map as a bias towards the target structure corresponds to Gō-models where native contacts, i.e. interactions in the native state, are made attractive.

Structure information can be further compressed by making use of structural profiles. There is a host of different definitions [6] of one-dimensional representations that convey information about each amino acid's connectivity. A simple structural profile is the contact vector that gives the number of contacts for each amino acid i,

\[
\tilde{c}_i = N \sum_{j=1}^{L} C_{ij}. \tag{1.2}
\]

The normalising constant N can be chosen such that

\[
\langle \tilde{c} \rangle = \frac{1}{L} \sum_{i=1}^{L} \tilde{c}_i = 1. \tag{1.3}
\]

This choice of profile is useful for structure comparison and alignment since its computation requires very little time [63]. A drawback when it comes to structure prediction or folding simulations is that the contact vector is degenerate when compared to the contact map. Multiple structures that can be quite distinct, in particular only partly folded structures, are mapped onto the same contact vector [64].
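As a small illustration of Eqs. (1.2) and (1.3) and of the degeneracy just mentioned, the following sketch computes the normalised contact vector with NumPy. The two 8×8 matrices are hand-made toy examples (an assumption for illustration only, not checked for being realisable as chain conformations) that differ as contact maps but share one contact vector.

```python
import numpy as np

def contact_vector(cmap):
    """Contact vector of Eq. (1.2), normalised to mean 1 as in Eq. (1.3)."""
    cmap = np.asarray(cmap, dtype=float)
    counts = cmap.sum(axis=1)          # number of contacts per amino acid
    total = counts.sum()
    if total == 0:
        raise ValueError("empty contact map: normalisation is undefined")
    norm = len(counts) / total         # the constant N in Eq. (1.2)
    return norm * counts

def toy_map(n_res, pairs):
    """Symmetric binary matrix with the given contact pairs (toy input)."""
    cmap = np.zeros((n_res, n_res), dtype=int)
    for i, j in pairs:
        cmap[i, j] = cmap[j, i] = 1
    return cmap

map_a = toy_map(8, [(0, 3), (4, 7)])
map_b = toy_map(8, [(0, 4), (3, 7)])
cv_a = contact_vector(map_a)
cv_b = contact_vector(map_b)
# The maps differ, yet every residue has the same number of contacts in both.
```

Here `cv_a` and `cv_b` are identical although `map_a` and `map_b` are not, which is the kind of many-to-one mapping the text describes.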
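Structural profiles that are derived from the contact map's eigensystem are better suited for the task of structure prediction and folding investigations although taking more computing time. The effective connectivity (EC) is the profile of choice in this thesis and contains contributions from all the contact map's eigenvectors weighted according to the corresponding eigenvalues,

\[
\mathbf{c} = \frac{1}{A} \sum_{j=1}^{L} \frac{1}{\Lambda - \lambda^{(j)}} \, \langle v^{(j)} \rangle \, \mathbf{v}^{(j)}. \tag{1.4}
\]

Here $\mathbf{c}$ is the effective connectivity, a vectorial quantity, and the $\mathbf{v}^{(j)}$ ($j = 1, \ldots, L$) are the L eigenvectors of the contact map with their eigenvalues $\lambda^{(j)}$. The quantity $\langle v^{(j)} \rangle$ is the average of entries $v_i^{(j)}$ of eigenvector j, A and $\Lambda$ are parameters used to fix the average of $\mathbf{c}$, $\langle c \rangle = 1$, and the relative variance $\langle c^2 \rangle / \langle c \rangle^2 = \langle \tilde{c}^2 \rangle / \langle \tilde{c} \rangle^2$ to the same value as that of the contact vector $\tilde{c}$ [6]. As it turns out, for a single domain structure $\Lambda$ is often close to the largest eigenvalue, $\Lambda \approx \lambda^{(1)}$, meaning that the contribution of the eigenvector corresponding to the largest eigenvalue will be dominant. Consequently, the eigenvector to the largest eigenvalue, the principal eigenvector (PE), on its own is also a valid choice for a structural profile for single domain structures where $\Lambda \approx \lambda^{(1)}$ and has been used at early stages of this thesis.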
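The two eigensystem-based profiles can be sketched numerically as follows. This is a minimal illustration, not the thesis's implementation: it assumes NumPy, and it assumes that the second parameter of Eq. (1.4) (the one compared with the largest eigenvalue in the text, called `shift` here) lies above the largest eigenvalue. Under that assumption all entries stay positive and the relative second moment decreases monotonically as `shift` grows, so a plain bisection can match it to the value of the contact vector. All function names are hypothetical.

```python
import numpy as np

def rel_second_moment(x):
    """Return <x^2> / <x>^2 for a profile x."""
    x = np.asarray(x, dtype=float)
    return float(np.mean(x ** 2) / np.mean(x) ** 2)

def principal_eigenvector(cmap):
    """PE: eigenvector of the largest eigenvalue, scaled so that <c^2> = 1."""
    vals, vecs = np.linalg.eigh(np.asarray(cmap, dtype=float))
    pe = vecs[:, -1]               # eigh returns eigenvalues in ascending order
    if pe.sum() < 0:               # the overall sign of an eigenvector is arbitrary
        pe = -pe
    return pe / np.sqrt(np.mean(pe ** 2))

def effective_connectivity(cmap, n_bisect=200):
    """EC profile with <c> = 1 and the relative second moment of the contact vector.

    Returns (profile, shift). Assumes shift > largest eigenvalue (see lead-in).
    """
    cmap = np.asarray(cmap, dtype=float)
    if cmap.sum() == 0:
        raise ValueError("empty contact map")
    vals, vecs = np.linalg.eigh(cmap)
    means = vecs.mean(axis=0)      # <v^(j)> for each eigenvector (columns of vecs)
    target = rel_second_moment(cmap.sum(axis=1))

    def profile(shift):
        raw = vecs @ (means / (shift - vals))
        return raw / raw.mean()    # overall prefactor fixed by <c> = 1

    lam_max = vals[-1]
    lo = lam_max + 1e-8 * max(lam_max, 1.0)
    if rel_second_moment(profile(lo)) <= target:
        raise ValueError("no shift above the largest eigenvalue matches the target")
    width = 1.0
    while rel_second_moment(profile(lam_max + width)) >= target:
        width *= 2.0
        if width > 1e12:
            raise ValueError("contact vector is (nearly) uniform; shift diverges")
    hi = lam_max + width
    for _ in range(n_bisect):
        mid = 0.5 * (lo + hi)
        if rel_second_moment(profile(mid)) > target:
            lo = mid               # profile still too uneven: increase the shift
        else:
            hi = mid
    shift = 0.5 * (lo + hi)
    return profile(shift), shift
```

For some matrices no admissible `shift` exists under this assumption (a star-shaped contact graph is one example), in which case the sketch raises an error rather than returning a profile.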
Both profile definitions, EC and PE, can be unified [6] as they maximise the quadratic form

\[
Q = \sum_{ij} C_{ij} \, c_i c_j \tag{1.5}
\]

for a given contact map $C_{ij}$ under different side conditions. The well-known definition of the PE is to maximise Q under the constraint that

\[
\langle c^2 \rangle = \frac{1}{L} \sum_{i=1}^{L} c_i^2 = 1. \tag{1.6}
\]

For the EC an additional condition is introduced, namely

\[
\langle c \rangle = \frac{1}{L} \sum_{i=1}^{L} c_i = 1 \tag{1.7}
\]

and the condition in Eq. (1.6) is changed to $\langle c^2 \rangle / \langle c \rangle^2 = \langle \tilde{c}^2 \rangle / \langle \tilde{c} \rangle^2$ as mentioned above. Maximising Q under these constraints then leads to the expression in Eq. (1.4) where the open parameters A and $\Lambda$ have to be fixed to satisfy the constraints [6].

From this unifying definition the correlation to hydrophobicity becomes evident. Maximising Q in expression (1.5) means that entries $c_i$ will be large for amino acids i with many contacts to other amino acids j, and even more so if these amino acids have many contacts and large entries $c_j$ of their own. In order to have many contacts, an amino acid will be buried in the protein core, as hydrophobic amino acids tend to do. Moreover, they will have contacts predominantly with other buried, i.e. hydrophobic, amino acids, as again is characteristic of amino acids of high hydrophobicity. An entry $c_i$ of either EC or PE thus depends not only on the number of contacts of amino acid i but also on the contacts of those amino acids with which i is in contact (and the amino acids with which those are in contact etc.) and therefore contains information on amino acid connectivity which is more detailed than that in the contact vector.

There is no mathematical proof that either EC or PE are indeed equivalent to the contact map (under appropriate constraints on the contact map such as connectedness or possibly existence of secondary structure motifs) but for the PE there is a reconstruction algorithm [65] that was found to work for all compact single-domain proteins investigated. For the EC no degeneracy (one EC profile corresponding to multiple contact maps) was encountered in any of the simulations run for this thesis. A drawback of the PE is that, for multi-domain proteins where the contact maps decompose into disconnected blocks, it will only give information about the largest and best-connected block. As the EC contains contributions from all the contact map's eigenvectors, and thus information on all protein domains, it does not display this problem and was consequently used as the structure profile of choice. Although the small proteins investigated in the folding dynamics all consist of a single domain in their native state, the conformations encountered during folding can be more complicated.
These vectorial representations of protein structure as profiles are much more amenable to prediction from sequence than contact maps because they are of the same dimension as the sequence, which is a string of amino acids. The profile's correlation to amino acid hydrophobicity can then be exploited to predict it from a given sequence.

In the context of protein structure prediction, in particular in the selection of structure candidates, predicted profiles can be used. The profile predicted for a given sequence serves as a target to which the profiles calculated from the candidates' structures are compared using a score based on the difference between the two profiles. Those conformations that are in good agreement with the predicted profile are retained for further refinement.

Likewise, for protein folding an energy is defined based on a measure of the difference between the target structure's profile and that of the current conformation. The energy is minimal if the correct profile and thus the correct structure has been reached. As is the case with Gō-models the model thus contains an obvious bias towards the native structure but instead of making only native interactions attractive the energy in the profile-based model relates how well all the amino acids' native connectivity or hydrophobicity is satisfied in the current conformation. This results, in effect, in an interaction of all amino acids with all the other amino acids.
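The idea of comparing candidate profiles with a predicted target can be sketched as follows. The actual score and energy used in the thesis are not given in this section, so the mean squared difference below is merely an assumed stand-in, and the names `profile_distance` and `select_candidates` are hypothetical.

```python
import numpy as np

def profile_distance(profile, target):
    """Mean squared difference between two profiles (assumed score, see lead-in)."""
    profile = np.asarray(profile, dtype=float)
    target = np.asarray(target, dtype=float)
    if profile.shape != target.shape:
        raise ValueError("profiles must have the same length")
    return float(np.mean((profile - target) ** 2))

def select_candidates(candidate_profiles, target, keep=0.1):
    """Rank candidates by distance to the target profile.

    Returns the indices of the best-matching fraction `keep` (best first)
    together with all scores.
    """
    scores = np.array([profile_distance(p, target) for p in candidate_profiles])
    n_keep = max(1, int(round(keep * len(scores))))
    order = np.argsort(scores, kind="stable")
    return order[:n_keep], scores
```

The same distance could serve as a folding energy in this toy setting: it is zero exactly when the current conformation reproduces the target profile and grows as any amino acid's entry deviates from it.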


2 Protein Structure Prediction

2.1 Motivation

As experimental determination of protein structures by either X-ray crystallography or NMR is very costly, prediction of structure from sequence is of great interest. This prediction can either be based on already known sequence-structure pairs or be performed ab initio.

If sequences of high similarity (and known structure) can be found, the method of choice is homology modelling where the known structures are used to create a template on which the structure to be predicted is modelled. In biology, the term "homology" is used to state common ancestry of proteins. In the context used here, however, it is not necessary to establish true homology in the above sense for the selection of sequences. Instead, sequences of known structure are selected based on their similarity to the query sequence which makes homology between the two very likely. Still, "homology modelling" is the standard term used in this context although "comparative modelling" may be used equivalently and is the more accurate term [8, 66, 67]. Protein structure is more strongly conserved in evolution than sequence [50] so even a moderate level of similarity over the entire sequence suffices to be confident of high structural similarity, although there are a few notable examples where this does not hold, see e.g. Ref. [51] where two proteins are engineered at 88% sequence identity but with completely different (α as opposed to α/β) folds or Ref. [52] for two naturally occurring proteins of 40% sequence identity and different folds. The general rule for natural proteins though is that sequence similarity means high structural similarity and one of the main challenges is to properly incorporate information from remote homologues [8].

If no such structures with sequences of sufficiently high similarity to the query sequence are detected, structure has to be predicted ab initio. This does not mean that no structural information from other proteins enters the prediction but that information is usually only local and also less reliable. This branch of protein structure prediction is therefore more challenging than template-based homology modelling [68, 69].

The biennial Critical Assessment of Techniques for Protein Structure Prediction (CASP) has witnessed remarkable progress in both these categories over the last 15 years [70–74]. For this assessment experimentalists agree to hold back recently resolved structures and theorists are invited to send in their predictions. One method that repeatedly performed very well in CASP is Rosetta [10, 72, 75]. There, starting from a sequence, the first step is to predict secondary structure and create a library of structure fragments for that particular sequence based on sequence and secondary structure similarity. These fragments are then assembled into complete, folded protein structures, with the inclusion of different fragments or local movements proposed according to a Monte Carlo (MC) scheme.
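To make the notion of an MC scheme concrete, the sketch below shows a generic Metropolis loop in Python. It is not Rosetta's implementation: the acceptance rule, the `energy` and `propose` callables and the temperature parameter are assumptions chosen for illustration, with `propose` standing in for a fragment insertion or local move.

```python
import math
import random

def metropolis_accept(delta_e, temperature, rng=random):
    """Accept a proposed move with probability min(1, exp(-delta_e / T))."""
    if delta_e <= 0.0:
        return True
    return rng.random() < math.exp(-delta_e / temperature)

def monte_carlo(state, energy, propose, n_steps, temperature, seed=0):
    """Generic Metropolis search; returns the lowest-energy state visited."""
    rng = random.Random(seed)
    current_e = energy(state)
    best, best_e = state, current_e
    for _ in range(n_steps):
        trial = propose(state, rng)        # e.g. swap in a different fragment
        trial_e = energy(trial)
        if metropolis_accept(trial_e - current_e, temperature, rng):
            state, current_e = trial, trial_e
            if current_e < best_e:
                best, best_e = state, current_e
    return best, best_e
```

Moves that lower the energy are always taken, while uphill moves are taken only occasionally, which lets the search leave shallow local minima.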
The fragment assembly step results in a set of very many coarse-grained protein structure guesses which is expected to contain a few candidates that are close to the native structure. If computation time were not an issue all these candidates could be fine-grained, i.e. omitted side chains included [56], and optimised again. However, only those guesses that are already close enough to the correct structure will profit from this refinement, the vast rest will remain trapped in their respective (wrong) folds. It is hence wasteful of computer resources to treat all candidates to further optimisation. In this first part of the thesis, I therefore investigate the task to identify the few good candidates contained in the large coarse-grained sets [76].

This chapter is organised as follows: After a short presentation of the research context and background material, Section 2.2 introduces the benchmark proteins used in this study (Subsection 2.2.1) and summarises the methods of profile prediction (Subsection 2.2.2). Subsections 2.2.3 and 2.2.4 present the filtering results when using Rosetta to predict protein structures, Subsection 2.2.5 the results for structure predictions downloaded from the CASP8 website. The chapter ends with a discussion (Section 2.3).


2.2 Structure Selection

Simulation of detailed side-chains in realistic potentials faces two serious drawbacks: For one thing it is a very computation-intensive task that requires immense computing resources. The other, more fundamental, problem is that today's all-atom potentials or force fields are optimised for fully folded native proteins and thus can only faithfully model the vicinity of these structures [23]. The vast space of unfolded or only partially folded structures can thus not be expected to be represented as accurately. This makes such detailed potentials unsuitable for following the entire folding process, and even more so for predictive folding where no parameters can be tuned in favour of the native structure. Successful structure predictors therefore ignore the folding process and concentrate on the final structure. For this reason, the time series of protein conformations encountered in, for example, Rosetta cannot be considered as a folding trajectory. Instead, entire fragments of the protein structure are replaced in a single step which speeds up the sampling of conformation space and increases the chances to hit on a structure that has at least the overall correct fold.

Thousands of candidate structures are produced in the coarse-grained step, which only describes the protein's backbone and some interactions such as steric repulsion. It is necessary to produce very many structures at this stage as the coarse-grained potential may not produce close guesses on every occasion. For single structure predictions that were entered into CASP, the Rosetta group indeed fine-grained all these candidate structures [72], but this is not feasible for high-throughput predictions. So, as has been mentioned before, the task is to select only promising candidate structures for high-resolution refinement [77] in phenomenological [78] or physics-based [79] force fields.
The selection step may involve ranking candidate structures by the energy function used to produce the low-resolution structures. Therefore an option to improve the selection is to define better energy functions for scoring of low-resolution candidate structures [9]. Another approach, which appears to be promising, is to perform a clustering of the structures by their pairwise root mean square distances (RMSDs) and then consider the largest clusters [10] or the clusters of lowest energy [11]. This is based on the notion that while the coarse-grained model may not succeed in discriminating the single best structure by energy, it will still on average create many conformations in the vicinity of the native structure which can be detected by clustering using pairwise similarities. Clustering by distance matrices and identifying the cluster of lowest energy has also proved successful in the reconstruction of protein structures from highly approximate backbone torsion angles [80].
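A clustering by pairwise RMSD of the kind described above might look as follows. This is a sketch under stated assumptions rather than the procedure of Refs. [10, 11]: structures are (L, 3) NumPy arrays of Cα positions, the RMSD is taken after optimal superposition with the Kabsch algorithm, and the "largest cluster" is simply the structure with the most neighbours within a cutoff together with those neighbours. The function names and the cutoff are hypothetical.

```python
import numpy as np

def kabsch_rmsd(coords_p, coords_q):
    """RMSD between two (L, 3) coordinate sets after optimal superposition."""
    p = np.asarray(coords_p, dtype=float)
    q = np.asarray(coords_q, dtype=float)
    p = p - p.mean(axis=0)                 # remove translation
    q = q - q.mean(axis=0)
    u, _, vt = np.linalg.svd(p.T @ q)      # SVD of the 3x3 covariance matrix
    # Restrict to proper rotations so that mirror images are not superposed
    d = 1.0 if np.linalg.det(u @ vt) > 0 else -1.0
    p_rot = p @ u @ np.diag([1.0, 1.0, d]) @ vt
    return float(np.sqrt(np.mean(np.sum((p_rot - q) ** 2, axis=1))))

def largest_cluster(structures, cutoff):
    """Return (centre index, member indices) of the most populated cluster."""
    n = len(structures)
    rmsd = np.zeros((n, n))
    for a in range(n):
        for b in range(a + 1, n):
            rmsd[a, b] = rmsd[b, a] = kabsch_rmsd(structures[a], structures[b])
    neighbours = rmsd <= cutoff            # each structure neighbours itself
    centre = int(np.argmax(neighbours.sum(axis=1)))
    return centre, np.flatnonzero(neighbours[centre])
```

Because only proper rotations are allowed, a structure and its mirror image generally have a non-zero RMSD, so the handedness that a contact map cannot distinguish is resolved at this level.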
Related to the clustering approach is the definition of a meta-scoring function based on the correlation of scoring functions that are weakly funneled towards the native state [81]. This has been applied to the ranking of predicted protein models [82] which is similar in spirit to the selection of candidate structures. The meta-method of detecting similarities and cleverly combining predictions of different methods has also been successful in recent rounds of CASP [83, 84]. If sparse experimental data are known, such as NMR chemical shifts, their inclusion as constraints on protein structure substantially improves prediction [85, 86].

Another promising means to select structure candidates for refinement is the use of structural profiles such as the effective connectivity, Eq. (1.4). As will be discussed in more detail below, filtering by a predicted profile outperformed filtering by Rosetta's low-resolution energy and clustering by RMSDs in most cases [76], irrespective of whether candidate structures were produced using the Rosetta suite or downloaded from the CASP8 (8th Critical Assessment of Techniques for Protein Structure Prediction) server [87] and thus came from various prediction methods. Structural profiles can be determined from known structures and used to efficiently compare them [63] but, most importantly in the context of this chapter, the structural profile of a protein's native state can also be predicted to good accuracy from its amino acid sequence. Prediction of one-dimensional structural profiles is above all much easier than the prediction of residue-residue contacts, i.e. two-dimensional contact maps. The predicted profile can then be used as a target which is compared to every single profile computed from the candidate structures. Only those candidate structures with profiles that are similar to the predicted target are selected as

PDB id   description                                   length   class
1pv0     antikinase Sda                                46
1gb1     immunoglobulin binding domain of protein G    56       α+β
