Distributed and parallel algorithms and systems for inference of huge phylogenetic trees based on the maximum likelihood method [Elektronische Ressource] / Alexandros Stamatakis

Documents
146 pages
Obtenez un accès à la bibliothèque pour le consulter en ligne
En savoir plus

Description

Distributed and Parallel Algorithmsand Systems for Inference of HugePhylogenetic Trees based on theMaximum Likelihood MethodAlexandros StamatakisLehrstuhl für Rechnertechnik und RechnerorganisationDistributed and Parallel Algorithms and Systems forInference of Huge Phylogenetic Trees based on theMaximum Likelihood MethodAlexandros StamatakisVollständiger Abdruck der von der Fakultät für Informatik der TechnischenUniversität München zur Erlangung des akademischen Grades einesDoktors der Naturwissenschaften (Dr. rer. nat.)genehmigten Dissertation.Vorsitzender: Univ.-Prof. Dr. Hans Michael GerndtPrüfer der Dissertation:1. Univ.-Prof. Dr. Arndt Bode2. Univ.-Prof. Dr. Christoph Zenger3. Univ.-Prof. Dr. Thomas LudwigRuprecht-Karls-Universität HeidelbergDie Dissertation wurde am 23.06.2004 bei der Technischen UniversitätMünchen eingereicht und durch die Fakultät für Informatik am 20.10.2004angenommen.AbstractThe computation of large phylogenetic (evolutionary) trees from DNA sequencedata based on the maximum likelihood criterion is most probably NP-complete.Furthermore, the computation of the likelihood value for one single potential treetopology is computationally intensive.This thesis introduces a number of algorithmic and technical solutions whichfor the first time enable parallel inference of large phylogenetic trees comprisingup to 10.000 organisms with maximum likelihood.

Sujets

Informations

Publié par
Publié le 01 janvier 2004
Nombre de visites sur la page 39
Langue English
Signaler un problème

Distributed and Parallel Algorithms
and Systems for Inference of Huge
Phylogenetic Trees based on the
Maximum Likelihood Method
Alexandros StamatakisLehrstuhl für Rechnertechnik und Rechnerorganisation
Distributed and Parallel Algorithms and Systems for
Inference of Huge Phylogenetic Trees based on the
Maximum Likelihood Method
Alexandros Stamatakis
Vollständiger Abdruck der von der Fakultät für Informatik der Technischen
Universität München zur Erlangung des akademischen Grades eines
Doktors der Naturwissenschaften (Dr. rer. nat.)
genehmigten Dissertation.
Vorsitzender: Univ.-Prof. Dr. Hans Michael Gerndt
Prüfer der Dissertation:
1. Univ.-Prof. Dr. Arndt Bode
2. Univ.-Prof. Dr. Christoph Zenger
3. Univ.-Prof. Dr. Thomas Ludwig
Ruprecht-Karls-Universität Heidelberg
Die Dissertation wurde am 23.06.2004 bei der Technischen Universität
München eingereicht und durch die Fakultät für Informatik am 20.10.2004
angenommen.Abstract
The computation of large phylogenetic (evolutionary) trees from DNA sequence
data based on the maximum likelihood criterion is most probably NP-complete.
Furthermore, the computation of the likelihood value for one single potential tree
topology is computationally intensive.
This thesis introduces a number of algorithmic and technical solutions which
for the first time enable parallel inference of large phylogenetic trees comprising
up to 10.000 organisms with maximum likelihood.
The algorithmic part includes a technique to accelerate the computation of
likelihood values, as well as novel search-space heuristics which significantly ac-
celerate the tree inference process and yield better final trees at the same time.
The technical part covers technical solutions for the acquisition of the enor-
mous amount of required computational resources such as parallel MPI-based and
distributed seti@home-like implementations of the basic sequential algorithm.
Finally, the program has been used to compute a biologically significant ini-
tial small "tree of life" containing 10.000 representative organisms from the three
domains: Bacteria, Eukarya, and Archaea based on data from the ARB database.Acknowledgements
Many people have contributed to this thesis.
First of all I would like to thank Prof. Bode for the excellent working atmo-
sphere at the Lehrstuhl für Rechnertechnik und Rechnerorganisation (LRR) and
the trust and freedom he granted me for my research. Furthermore, I am grateful
to Prof. Zenger who agreed to evaluate this thesis as 2nd reviewer. I am particu-
larly thankful to Prof. Ludwig who has been accompanying and supporting me for
many years now during my studies and doctoral research. Dr. Harald Meier, the
leader of the high performance bioinformatics group at the LRR, deserves special
gratitude for providing me fundamental biological knowledge.
From my colleagues at the Technische Universität München (TUM) I would
like to thank Markus Lindermeier, Martin Mairandres, Jürgen Jeitner, Edmond
Kereku, and Pögl for their kind support on various issues and their good
company over the last years.
I would also like to mention several colleagues from outside the TUM who
have greatly helped me to accomplish this work: Ralf Ebner from the Leibniz
RechenZentrum (LRZ), Gerd Lanferman from the Max-Planck Institut (MPI)
Potsdam, and the HPC team from the Regionales Rechenzentrum Erlangen
(RRZE).
I am especially grateful to my student Michael Ott who contributed to this
thesis by implementing the distributed versions of RAxML.
Finally, I would like to thank my parents for their ever-lasting support of my
work and ideas.Contents
1 Introduction 1
1.1 Motivation . . . . . .... ... .... ... .... .... ... 1
1.2 Scientific Contribution . . . . . ... ... 5
1.3 Structure of the Thesis . . . . . .... ... .... .... ... 6
2 Phylogenetic Tree Inference 7
2.1 What is a Phylogenetic Tree? . . .... ... .... .... ... 7
2.2 Obtaining new Insights from Phylogenetic Trees . . . ... 8
2.3 Prerequisites for Phylogenetic Tree Inference .... .... ... 10
2.3.1 Computation of Multiple Alignments ... 10
2.3.2 Adequate DNA Portions .... ... .... .... ... 12
2.3.3 The ARB Database . . . ... ... 12
2.4 Problem Complexity . . . . . . .... ... .... .... ... 13
3 Phylogeny Models and Programs 15
3.1 Basic Model Classification . . . .... ... .... .... ... 15
3.2 Distance-based Methods . . . . ... ... 16
3.2.1 UPGMA . .... ... .... ... .... .... ... 16
3.2.2 Neighbor Joining . . . . ... ... 17
3.3 Parsimony Criterion.... ... .... ... .... .... ... 17
3.4 Maximum Likelihood Criterion . ... ... 20
3.4.1 Calculating the Likelihood of a Tree . .... .... ... 21
3.4.2 Optimizing the Branch Lengths of a Tree . . ... 25
3.4.3 Models of Base Substitution . . . . . .... .... ... 26
3.5 Bayesian Phylogenetic Inference .... ... ... 32
3.6 Measures of Confidence . . . . ... .... .... ... 34
3.7 Divide-and-Conquer Approaches .... ... ... 38
3.8 Testing & Comparing Phylogeny Programs . .... .... ... 39
3.9 State of the Art Programs . . . . .... ... ... 41
3.9.1 Algorithms for Tree Building & Sequential Codes . . . . 41
3.9.1.1 Progressive Algorithms . . .... .... ... 42
i3.9.1.2 Global Algorithms . . . . .... .... ... . 44
3.9.1.3 Quartet . . . ... . 47
3.9.2 Performance of Sequential Codes . .... .... ... . 47
3.9.3 Parallel & Distributed Codes . . . . ... . 49
3.9.3.1 parallel fastDNAml . . . .... .... ... . 51
4 Novel Algorithmic Solutions 53
4.1 Novel Algorithmic Optimization: AxML . . .... .... ... . 54
4.1.1 Additional Algorithmic Optimization . . . ... . 59
4.2 New Heuristics: RAxML . . . .... ... .... .... ... . 62
5 Novel Technical Solutions 67
5.1 Parallel and Distributed Solutions for AxML.... .... ... . 67
5.1.1 Parallel AxML . . . . .... ... ... . 67
5.1.2 Distributed Load-managed AxML . .... .... ... . 68
5.1.2.1 The Load Management System . ... . 68
5.1.2.2 Implementation . . . . . .... .... ... . 70
5.1.3 AxML on the Grid . . .... ... ... . 71
5.1.3.1 The Grid Migration Server . . . .... ... . 72
5.1.3.2 Implementation of GAxML . . . ... . 74
5.1.4 PAxML on Supercomputers . . . . .... .... ... . 77
5.2 Parallel and Distributed Solutions for RAxML . . . ... . 79
5.2.1 Parallel RAxML . . . .... ... .... .... ... . 80
5.2.2 Distributed RAxML . ... ... . 82
5.2.2.1 Technical issues . . . . . .... .... ... . 83
6 Evaluation of Technical and Algorithmic Solutions 87
6.1 Test Data . . . . .... ... .... ... .... .... ... . 87
6.2 Test & Production Platforms . ... ... . 89
6.2.1 Adequate Processor Architectures . .... .... ... . 89
6.2.2 Performance of PC Processors . . . ... . 90
6.3 Run Time Improvement by Algorithmic Optimizations . . . . . . 91
6.3.1 Sequential Performance . . . . . . .... .... ... . 91
6.3.2 Parallel . .... ... ... . 93
6.4 Run Time and Qualitative Improvement by Algorithmic Changes . 94
6.4.1 Experimental Setup . . .... ... .... .... ... . 94
6.4.2 Real Data Experiments ... ... . 96
6.4.3 Simulated Data . . . . .... .... ... . 99
6.4.4 Pitfalls & Performance of Bayesian Analysis . . . . . . . 99
6.5 Assessment of Technical Solutions . . . . . .... .... ... .103
6.5.1 Distributed Load-managed AxML . ... .103
ii6.5.2 Parallel RAxML . . . . .... ... .... .... ... 105
6.5.3 RAxML@home . . . . ... ... 108
6.6 Inference of a 10.000-Taxon Phylogeny with RAxML .... ... 109
7 Conclusion and Future Work 113
7.1 Conclusion . . . . .... ... .... ... .... .... ... 113
7.2 Future Work . . . . ... ... ... 114
7.2.1 Algorithmic Issues . . . .... ... .... .... ... 115
7.2.2 Technical Issues . . . . ... ... 116
7.2.3 Organizational Issues . . .... ... .... .... ... 117
Bibliography 119
iiiivList of Figures
1.1 Growth of sequence data in GenBank . . . . .... .... ... 2
1.2 Charles Darwin as seen by a contemporary cartoonist .... ... 3
1.3 Domains of life: Eukarya, Bacteria, and Archaea . . .... ... 4
2.1 Phylogenetic subtree representing the evolutionary relationship
between monkeys and the homo sapiens . . . .... .... ... 9
3.1 Parsimony score computation by example . . .... .... ... 18
3.2 Py score by example: one possible assignment 19
3.3 Parsimony score computation by e another possible as-
signment . . . . . . .... ... .... ... .... .... ... 20
3.4 Rooted example tree with root at node .. .... .... ... 22
3.5 Unrooted example tree with virtual root placement possibilities,
likelihood remains unaffected . . .... ... .... .... ... 25
3.6 Schematic representation of the GTR model parameters . . . . . . 27
3.7 Hierarchy of probabilistic models of nucleotide substitution . . . . 29
3.8 Abstract of a bayesian MC tree inference process
with two Metropolis-Coupled Markov Chains .... .... ... 34
3.9 Outline of the MCMC convergence problem . .... .... ... 35
3.10 Example of an unresolved (multifurcating) consensus tree . . . . . 36
3.11 for stepwise addition . .... ... .... .... ... 43
3.12 Example for with quickadd option . .... ... 44
3.13 Possible rearrangements of subtree ST6 . . . .... .... ... 45
3.14 A possible bisection and some possible reconnections of a tree . . 46
3.15 All nearest neighbor interchanges for one inner branch . . 47
3.16 Schematic difference in likelihood distribution over some model
parameter for a hypothetic final tree topology obtained by
bayesian and maximum likelihood methods . .... .... ... 49
4.1 Heterogeneous and homogeneous column equalities . .... ... 54
4.2 Global compression of equal column . . . . . .... .... ... 55
v

4.3 Example likelihood-, equality- and reference-vector computation
for a subtree rooted at p . . . . .... ... .... .... ... . 57
4.4 Rearrangements traversing one node for subtree ST5, branches
which are optimized are indicated by bold lines . . .... ... . 63
4.5 Example rearrangements traversing two nodes for subtree ST5,
branches which are optimized are indicated by bold lines . . . . . 64
4.6 Example for subsequent application of topological improvements
during one rearrangement step .... ... .... .... ... . 64
5.1 The components of the Load Management System LMC . . . . . 69
5.2 System architecture of DAxML.... ... .... .... ... . 70
5.3 Architecture of GAxML . . . . . . ... . 74
5.4 GAxML tree visualization with 29 taxa inserted . . .... ... . 78
5.5 tree with 127 taxa . ... . 79
5.6 Number of improved topologies per rearrangement step for a
SC_150 random and parsimony starting tree .... .... ... . 81
5.7 Parallel program flow of RAxML . . . . . . ... . 85
5.8 Program flow of distributed RAxML . . . . .... .... ... . 86
6.1 AxML and fastDNAml inference times over topology size for
quickadd enabled and disabled .... ... .... .... ... . 92
6.2 RAxML, PHYML, and MrBayes final likelihood values over tran-
sition/transversion ratios for 150_SC . . . .... .... ... . 96
6.3 RAxML likelihood improvement over time for 500_ZILLA . . . 98
6.4 Topological accuracy of PHYML, RAxML and MrBayes for 50
100-taxon trees . .... ... .... ... .... .... ... .100
6.5 Convergence behavior of MrBayes for 101_SC with user and ran-
dom starting trees over 3.000.000 generations . . . .... ... .101
6.6 150_SC likelihood improvement over time of RAxML and Mr-
Bayes for the same random starting tree . . .... .... ... .102
6.7 150_ARB likelihood improvement over time of RAxML and Mr-
Bayes for the same random starting tree . . .... .... ... .102
6.8 Convergence behavior of MrBayes for 500_ARB with user and
random starting trees . . . . . .... ... .... .... ... .103
6.9 Average evaluation time improvement per topology class: op-
timized (SEV-based) DAxML evaluation function vs. standard
fastDNAml evaluation function.... ... .... .... ... .104
6.10 JNI and CORBA-communication overhead ... .105
6.11 Worker object migration after creation of background load on its
host .... ... .... ... .... ... .... .... ... .106
6.12 Impact of 3 subsequent automatic worker object replications . . . 106
vi