Fast, sensitive protein sequence searches using iterative pairwise comparison of hidden Markov models [Electronic resource] / Michael Remmert. Supervisor: Patrick Cramer
175 pages
English



Information

Published on 1 January 2011

Extract

Dissertation for the award of the doctoral degree
of the Faculty of Chemistry and Pharmacy
of Ludwig–Maximilians–Universität München
Fast, sensitive protein sequence
searches using iterative pairwise
comparison of hidden Markov models
Michael Remmert
from Cologne
2011
Declaration:
This dissertation was supervised by Professor Dr. Patrick Cramer in accordance with §13 (3) or (4)
of the doctoral degree regulations of 29 January 1998 (as amended by the sixth amending statute
of 16 August 2010).
Affidavit:
This dissertation was written independently and without unauthorized assistance.
Munich, 16 September 2011
Michael Remmert
Dissertation submitted on: 16 September 2011
1st reviewer: Prof. Dr. Patrick Cramer
2nd reviewer: Prof. Dr. Dmitrij Frishman
Oral examination on: 23 November 2011
Acknowledgements
I owe my gratitude to all those people who have made this dissertation possible by supporting
and encouraging me throughout the last years, and I can only try to acknowledge them
here.
First and foremost, my deepest gratitude goes to my advisor Dr. Johannes Söding for giving
me the opportunity to work in his group and to contribute to many fascinating projects.
Furthermore, I would like to thank Johannes for all the fruitful discussions, the constant
support, and for making the last years such a great experience.
I would like to thank Prof. Dr. Patrick Cramer for being my doctoral supervisor, and
Prof. Dr. Dmitrij Frishman for being my second PhD examiner. I am also very grateful
to Prof. Dr. Klaus Förstemann, Prof. Dr. Roland Beckmann, Prof. Dr. Ulrike Gaul and
Prof. Dr. Karl-Peter Hopfner for offering their time as members of my dissertation committee.
Furthermore, I deeply appreciate the critical reading of this thesis by Dr. Johannes Söding
and Theresa Niederberger.
I also would like to thank all members of the Söding and Tresch group for the great
atmosphere in our office, for all their help and discussions, and for all the fun at the social
activities. Thanks to Andreas, Maria and Andy for the help in integrating their tools in
this project.
Last but not least, I am deeply grateful to my parents, my sister Ulrike with Jürgen, my
nieces Christiane and Caroline and all my good friends for all the support, patience, trust
and encouragement.
Summary
Most sequence-based methods for protein structure or function prediction construct a multiple
sequence alignment (MSA) of homologs as a first step. The standard search tool
for generating multiple sequence alignments is PSI-BLAST (> 25 000 citations), an extension
of BLAST to profile-sequence comparison. PSI-BLAST owes its sensitivity to its use of sequence
profiles and to its iterative search scheme. Significant sequence hits are added to
the evolving multiple alignment, from which a sequence profile is generated for the next
search iteration. HMMER3 is similar to PSI-BLAST but uses a profile hidden Markov model
(HMM) instead of a sequence profile to represent the evolving query multiple alignment. The
gain in sensitivity over PSI-BLAST is paid for by a three- to fourfold reduction in speed.
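
The iterative scheme described above can be illustrated with a deliberately simplified sketch. It is not PSI-BLAST or HMMER3: it assumes gap-free sequences of equal length, a uniform background distribution, a fixed score threshold in place of E-values, and no sequence weighting. All function names (`build_profile`, `score`, `iterative_search`) are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(ALPHABET)  # assumption: uniform background frequencies


def build_profile(aligned, pseudocount=1.0):
    """Per-column log-odds scores from equal-length, gap-free sequences."""
    total = len(aligned) + pseudocount * len(ALPHABET)
    profile = []
    for col in range(len(aligned[0])):
        counts = Counter(seq[col] for seq in aligned)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / BACKGROUND)
            for aa in ALPHABET
        })
    return profile


def score(profile, seq):
    """Gapless, full-length profile-sequence score (toy stand-in for alignment)."""
    return sum(column[aa] for column, aa in zip(profile, seq))


def iterative_search(query, database, threshold, iterations=3):
    """Add hits to the alignment and rebuild the profile in every round."""
    alignment = [query]
    for _ in range(iterations):
        profile = build_profile(alignment)
        hits = [s for s in database if score(profile, s) >= threshold]
        new_alignment = [query] + [s for s in hits if s != query]
        if set(new_alignment) == set(alignment):
            break  # converged: no new sequences were recruited
        alignment = new_alignment
    return alignment
```

In this toy setting, a sequence that is too dissimilar to the query alone can still be recruited in a later round, once an intermediate hit has broadened the profile. That is the effect the iterative scheme relies on.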
In my thesis work, I have developed HHblits, the first method for iterative sequence
searching based on the comparison of profile HMMs. It is faster than PSI-BLAST and HMMER3,
more sensitive, and constructs multiple alignments of significantly better quality. The
method builds on the HHsearch algorithm for pairwise comparison of profile HMMs, to
which it owes its high sensitivity and alignment quality. In parallel to HHblits, our group
developed a fast clustering method that can generate a covering set of HMMs for the entire
UniProt database in a few weeks' time.
To speed up the search by a factor of ∼2000, I developed a prefilter based on
a novel algorithm for profile-profile comparison designed for maximum speed. The algorithm
effectively reduces profile-profile alignment to profile-sequence alignment by encoding
the database profiles as “column state sequences”, in which profile columns are represented
by an alphabet of 219 discretized column states. This permits a fast implementation of the
profile-profile comparison that employs the SIMD (single instruction, multiple data) instruction
sets available on modern CPUs. In this way, HHblits performs 16 byte operations in
parallel in a single clock cycle. Several further filtering steps reduce the number of slow,
full-blown HMM-HMM comparisons to a fraction of < 1/1000. Our tests show that the loss
of sensitivity due to the prefilter is negligible.
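
The idea of encoding database profiles as column state sequences can be sketched as follows. This is a toy illustration under stated assumptions, not the HHblits implementation: it uses a 4-letter alphabet and three hand-picked states instead of 219, assigns columns to states by minimum Kullback-Leibler divergence, scores with a log-sum-of-odds term, and runs in plain Python rather than with SIMD byte operations. All names and numbers are hypothetical.

```python
import math

# Assumption: tiny 4-letter alphabet ("A", "C", "D", "E") and three made-up
# column states; each state is a probability distribution over the alphabet.
STATES = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.4, 0.4],
]
BACKGROUND = [0.25, 0.25, 0.25, 0.25]


def encode(profile):
    """Replace each profile column by the index of its closest state (min KL)."""
    def kl(p, q):
        return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return [min(range(len(STATES)), key=lambda k: kl(col, STATES[k]))
            for col in profile]


def state_scores(query_profile):
    """score[i][k] = log2 sum_a q_i(a) * s_k(a) / f(a), computed once per query."""
    return [
        [math.log2(sum(q * s / f for q, s, f in zip(col, state, BACKGROUND)))
         for state in STATES]
        for col in query_profile
    ]


def gapless_local_score(scores, state_seq):
    """Best ungapped local segment score over all diagonals."""
    best = 0.0
    n, m = len(scores), len(state_seq)
    for offset in range(-(n - 1), m):
        running = 0.0
        for i in range(n):
            j = i + offset
            if 0 <= j < m:
                # Table lookup, exactly as in profile-sequence scoring.
                running = max(0.0, running + scores[i][state_seq[j]])
                best = max(best, running)
    return best
```

Once the database profiles are stored as sequences of state indices, the inner loop only needs a table lookup per cell, just as in profile-sequence comparison. This is what makes a byte-wise SIMD implementation possible in principle.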
On a standard SCOP20 ROC benchmark (SCOP1.73 proteins filtered to 20% maximum
sequence identity), HHblits detects twice as many true positives as PSI-BLAST and 54%
more than HMMER3 at a 1% error rate in the first iteration. Two search iterations of HHblits
detect significantly more true positives than five PSI-BLAST iterations. Alignment quality is
likewise improved significantly. Furthermore, we are able to make confident fold predictions
(E-value < 10⁻³) and build structural models for 394 Pfam domains for which no fold
prediction has been possible.
HHblits is a robust, general-purpose protein sequence search tool that has the potential to
replace PSI-BLAST as the state-of-the-art method for the generation of MSAs. Owing to their high
alignment quality, HHblits alignments can improve secondary structure prediction
methods such as PSIPRED. Furthermore, these better alignments facilitate function and
structure prediction for proteins about which nothing is yet known.
In the second part of this thesis we study the evolution of outer membrane β-barrels
(OMBBs). These proteins are the major class of outer membrane proteins (OMPs) from
Gram-negative bacteria, mitochondria and plastids. Their transmembrane domains consist
of 8 to 24 β-strands forming a closed, barrel-shaped β-sheet around a central pore. Despite
their obvious structural regularity, evidence for an origin by duplication or for a common
ancestry had not been found.
We use three complementary approaches to show that all OMBBs from Gram-negative
bacteria evolved from a single, ancestral ββ hairpin. First, we link almost all families
of known single-chain bacterial OMBBs with each other through transitive profile searches.
Second, we identify a clear repeat signature in the sequences of many OMBBs in which the
repeating sequence unit coincides with the structural ββ hairpin repeat. Third, we show that
the observed sequence similarity between OMBB hairpins cannot be explained by structural
or membrane constraints on their sequences. The third approach addresses a longstanding
problem in protein evolution: how to distinguish between a very remotely homologous
relationship and the opposing scenario of “sequence convergence”. The origin of a diverse
group of proteins from a single hairpin module supports the hypothesis that, around the
time of transition from the RNA to the protein world, proteins arose by amplification and
recombination of short peptide modules that had previously evolved as cofactors of RNAs.
This research provides the basis for the identification and classification of outer membrane
β-barrels and explains the evolutionary origin of this important bacterial protein class.
Contents
Acknowledgements
Summary
1. Motivation and overview
I. HHblits - Iterative HMM-based homology searches
2. Introduction to remote homology searches
2.1. Scoring models and gap penalties
2.2. Pairwise sequence alignment
2.3. Profile alignment
2.4. HMM-HMM alignment
2.4.1. Log-sum-of-odds score
2.4.2. Pairwise alignment of HMMs
2.4.3. HHsearch
2.5. Iterative sequence search methods
2.5.1. PSI-BLAST
2.5.2. HMMER3
3. Material and methods
3.1. Workflow of HHblits
3.2. Generation of HHblits databases with kClust
3.3. Fast prefiltering of HHblits
3.3.1. Column state alphabet for sequence profile encoding
3.3.2. Generation of the column state alphabet
3.3.3. Translation of sequence profiles in column state sequences
3.3.4. SSE2
3.3.5. Fast prefilter algorithms
3.3.6. Gapless local alignment with SSE2
3.3.7. Smith-Waterman with SSE2
3.3.8. with backtrace
3.4. Additional filter steps
3.4.1. Early stopping
