Fast, sensitive protein sequence searches using iterative pairwise comparison of hidden Markov models [Electronic resource] / Michael Remmert. Supervisor: Patrick Cramer

175 pages

Published by: Ludwig-Maximilians-Universität München
Published on: 1 January 2011
Language: English


Dissertation for the award of the doctoral degree
of the Faculty of Chemistry and Pharmacy
of the Ludwig-Maximilians-Universität München
Fast, sensitive protein sequence
searches using iterative pairwise
comparison of hidden Markov models
Michael Remmert
from Cologne
2011

Declaration:
This dissertation was supervised by Professor Dr. Patrick Cramer in accordance with
§13 para. 3 or 4 of the doctoral degree regulations of 29 January 1998 (as amended
by the sixth amending statute of 16 August 2010).
Declaration of honour:
This dissertation was prepared independently and without unauthorized assistance.
Munich, 16 September 2011
Michael Remmert
Dissertation submitted on: 16 September 2011
First examiner: Prof. Dr. Patrick Cramer
Second examiner: Prof. Dr. Dmitrij Frishman
Oral examination on: 23 November 2011

Acknowledgements
I owe my gratitude to all those people who have made this dissertation possible by
supporting and encouraging me throughout the last years, and I can only try to
acknowledge them here.
First and foremost, my deepest gratitude is to my advisor Dr. Johannes Söding for giving
me the opportunity to work in his group and to contribute to many fascinating projects.
Furthermore, I would like to thank Johannes for all the fruitful discussions, the constant
support, and for making the last years such a great experience.
I would like to thank Prof. Dr. Patrick Cramer for being my doctoral supervisor, and
Prof. Dr. Dmitrij Frishman for being my second PhD examiner. I am also very grateful
to Prof. Dr. Klaus Förstemann, Prof. Dr. Roland Beckmann, Prof. Dr. Ulrike Gaul and
Prof. Dr. Karl-Peter Hopfner for offering their time as members of my dissertation
committee.
Furthermore, I deeply appreciate the critical reading of this thesis by Dr. Johannes Söding
and Theresa Niederberger.
I would also like to thank all members of the Söding and Tresch groups for the great
atmosphere in our office, for all their help and discussions, and for all the fun at the social
activities. Thanks to Andreas, Maria and Andy for their help in integrating their tools into
this project.
Last but not least, I am deeply grateful to my parents, my sister Ulrike with Jürgen, my
nieces Christiane and Caroline and all my good friends for all the support, patience, trust
and encouragement.

Summary
Most sequence-based methods for protein structure or function prediction construct a
multiple sequence alignment (MSA) of homologs as a first step. The standard search tool
to generate multiple sequence alignments is PSI-BLAST (> 25 000 citations), an extension
of BLAST to profile-sequence comparison. PSI-BLAST owes its sensitivity to its use of
sequence profiles and to its iterative search scheme. Significant sequence hits are added to
the evolving multiple alignment, from which a sequence profile is generated for the next
search iteration. HMMER3 is similar to PSI-BLAST, but uses a profile hidden Markov model
(HMM) instead of a sequence profile to represent the evolving query multiple alignment. The
gain in sensitivity over PSI-BLAST is paid for by a three- to fourfold reduction in speed.
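
The iterative scheme described above can be illustrated with a minimal Python sketch. It is a toy under strong assumptions that are not taken from the thesis: all sequences have the same length and are compared without gaps, the background distribution is uniform, and the names build_profile, score and iterative_search as well as the score threshold are hypothetical. PSI-BLAST and HMMER3 use gapped alignments and E-value thresholds instead.

```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(ALPHABET)  # assumption: uniform background frequencies


def build_profile(alignment, pseudocount=1.0):
    """Column-wise amino acid probabilities with additive pseudocounts."""
    profile = []
    for col in range(len(alignment[0])):
        counts = {aa: pseudocount for aa in ALPHABET}
        for seq in alignment:
            counts[seq[col]] += 1.0
        total = sum(counts.values())
        profile.append({aa: c / total for aa, c in counts.items()})
    return profile


def score(profile, seq):
    """Ungapped log-odds score (in bits) of a sequence against the profile."""
    return sum(math.log2(col[aa] / BACKGROUND) for col, aa in zip(profile, seq))


def iterative_search(query, database, iterations=3, threshold=3.0):
    """Add significant hits to the alignment and rebuild the profile each round."""
    alignment = [query]
    for _ in range(iterations):
        profile = build_profile(alignment)
        hits = [s for s in database if s != query and score(profile, s) >= threshold]
        new_alignment = [query] + hits
        if new_alignment == alignment:  # converged: no change in the hit list
            break
        alignment = new_alignment
    return alignment
```

With the query ACDEF and the toy database ACDEF, ACDEY, ACDWY, WWWWW, a single iteration at the hypothetical threshold of 3 bits finds only ACDEY. The more distant ACDWY is picked up in the second iteration, once ACDEY has been added to the profile.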
In my thesis work, I have developed HHblits, the first method for iterative sequence
searching based on the comparison of profile HMMs. It is faster than PSI-BLAST and HMMER3,
more sensitive, and constructs multiple alignments of significantly better quality. The
method builds on the HHsearch algorithm for pairwise comparison of profile HMMs, to
which it owes its high sensitivity and alignment quality. In parallel to HHblits, our group
developed a fast clustering method that can generate a covering set of HMMs for the entire
UniProt database in a few weeks' time.
To speed up the search by a factor of ∼2000, I developed a prefilter that is based on
a novel algorithm for profile-profile comparison designed for maximum speed. The
algorithm effectively reduces profile-profile alignment to profile-sequence alignment by
coding the database profiles as “column state sequences”, in which profile columns are
represented by an alphabet of 219 discretized column states. This permits a fast
implementation of the profile-profile comparison that employs the SIMD (single instruction,
multiple data) instruction sets available on modern CPUs. In this way, HHblits performs
16 byte operations in parallel in a single clock cycle. Several further filtering steps
reduce the number of slow, full-blown HMM-HMM comparisons to a fraction of < 1/1000.
Our tests show that the loss of sensitivity due to the prefilter is negligible.
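
The idea of coding database profiles as column state sequences can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the HHblits implementation: the nearest state is chosen by squared Euclidean distance, columns are scored with a log-sum-of-odds term, only gapless local alignment is shown, and all function names are hypothetical. The sketch uses plain loops, whereas the thesis describes an SSE2 implementation that processes 16 bytes in parallel.

```python
import math


def nearest_state(column, states):
    """Index of the column state closest to a profile column (assumed metric)."""
    return min(
        range(len(states)),
        key=lambda k: sum((c - s) ** 2 for c, s in zip(column, states[k])),
    )


def encode_profile(profile, states):
    """Turn a database profile into a sequence of column state indices."""
    return [nearest_state(col, states) for col in profile]


def score_table(query_profile, states, background):
    """Precompute the score of every query column against every column state."""
    return [
        [
            math.log2(sum(q * s / f for q, s, f in zip(qcol, state, background)))
            for state in states
        ]
        for qcol in query_profile
    ]


def best_gapless_score(table, state_seq):
    """Best ungapped local alignment score over all diagonals (table lookups only)."""
    n_query, n_db = len(table), len(state_seq)
    best = 0.0
    for offset in range(-(n_query - 1), n_db):
        running = 0.0
        for i in range(n_query):
            j = i + offset
            if 0 <= j < n_db:
                running = max(0.0, running + table[i][state_seq[j]])
                best = max(best, running)
    return best
```

Once the table has been computed for a query, scoring a database entry needs only one lookup per cell, exactly as in profile-sequence comparison. This is what makes the comparison amenable to byte-wise SIMD arithmetic.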
On a standard SCOP20 ROC benchmark (SCOP1.73 proteins filtered to 20% maximum
sequence identity), HHblits detects twice as many true positives as PSI-BLAST and 54%
more than HMMER3 at 1% error rate in the first iteration. Two search iterations of HHblits
detect significantly more true positives than five PSI-BLAST iterations. Alignment quality is
likewise improved significantly. Furthermore, we are able to make confident fold predictions
(E-value < 10⁻³) and build structural models for 394 Pfam domains for which no fold
prediction has been possible.
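
How a figure such as "true positives at 1% error rate" can be read off a ranked hit list is sketched below. The definition used here is an assumption, since the summary does not spell it out: the error rate is taken as the fraction of false positives among all hits above a score cutoff, and the function name and input format are hypothetical.

```python
def true_positives_at_error_rate(hits, max_error_rate=0.01):
    """Count true positives at the deepest cutoff that keeps the error rate in bounds.

    hits: iterable of (score, is_true_positive) pairs; higher scores rank first.
    """
    tp = fp = best_tp = 0
    for _, is_tp in sorted(hits, key=lambda h: h[0], reverse=True):
        if is_tp:
            tp += 1
        else:
            fp += 1
        # Assumed definition: error rate = FP / (TP + FP) above the cutoff.
        if fp / (tp + fp) <= max_error_rate:
            best_tp = tp
    return best_tp
```

Two search methods can then be compared by applying the same function to the ranked hit lists each of them produces for the same benchmark set.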
HHblits is a robust, general-purpose protein sequence search tool that has the potential to
replace PSI-BLAST as the state-of-the-art method for the generation of MSAs. Due to their high
alignment quality, HHblits alignments are able to improve secondary structure prediction
methods such as PSIPRED. Furthermore, these better alignments facilitate function and
structure prediction for proteins about which nothing is yet known.
In the second part of this thesis we study the evolution of outer membrane β-barrels
(OMBBs). These proteins are the major class of outer membrane proteins (OMPs) from
Gram-negative bacteria, mitochondria and plastids. Their transmembrane domains consist
of 8 to 24 β-strands forming a closed, barrel-shaped β-sheet around a central pore. Despite
their obvious structural regularity, evidence for an origin by duplication or for a common
ancestry had not been found.
We use three complementary approaches to show that all OMBBs from Gram-negative
bacteria evolved from a single, ancestral ββ hairpin. First, we link almost all families
of known single-chain bacterial OMBBs with each other through transitive profile searches.
Second, we identify a clear repeat signature in the sequences of many OMBBs in which the
repeating sequence unit coincides with the structural ββ hairpin repeat. Third, we show that
the observed sequence similarity between OMBB hairpins cannot be explained by structural
or membrane constraints on their sequences. The third approach addresses a longstanding
problem in protein evolution: how to distinguish between a very remotely homologous
relationship and the opposing scenario of “sequence convergence”. The origin of a diverse
group of proteins from a single hairpin module supports the hypothesis that, around the
time of transition from the RNA to the protein world, proteins arose by amplification and
recombination of short peptide modules that had previously evolved as cofactors of RNAs.
This research provides the basis for the identification and classification of outer membrane
β-barrels and explains the evolutionary origin of this important bacterial protein class.

Contents
Acknowledgements
Summary
1. Motivation and overview
I. HHblits - Iterative HMM-based homology searches
2. Introduction to remote homology searches
2.1. Scoring models and gap penalties
2.2. Pairwise sequence alignment
2.3. Profile alignment
2.4. HMM-HMM alignment
2.4.1. Log-sum-of-odds score
2.4.2. Pairwise alignment of HMMs
2.4.3. HHsearch
2.5. Iterative sequence search methods
2.5.1. PSI-BLAST
2.5.2. HMMER3
3. Material and methods
3.1. Workflow of HHblits
3.2. Generation of HHblits databases with kClust
3.3. Fast prefiltering of HHblits
3.3.1. Column state alphabet for sequence profile encoding
3.3.2. Generation of the column state alphabet
3.3.3. Translation of sequence profiles in column state sequences
3.3.4. SSE2
3.3.5. Fast prefilter algorithms
3.3.6. Gapless local alignment with SSE2
3.3.7. Smith-Waterman with SSE2
3.3.8. with backtrace
3.4. Additional filter steps
3.4.1. Early stopping