Technische Universit¨at Mu¨nchen
Zentrum Mathematik
Evolving trees:
Models for Speciation and Extinction
in Phylogenetics
Tanja Stadler
Vollst¨andiger Abdruck der von der Fakult¨at fu¨r Mathematik der Technischen
Universit¨at Mu¨nchen zur Erlangung des akademischen Grades eines
Doktors der Naturwissenschaften (Dr.rer.nat.)
genehmigten Dissertation.
Vorsitzender: Univ.-Prof. Dr. Rupert Lasser
Pru¨fer der Dissertation: 1. Univ.-Prof. Dr. Anuschirawan Taraz
2. Prof. Dr. Jotun Hein,
University of Oxford / United Kingdom
3. Prof. Dr. Mike Steel,
University of Canterbury / New Zealand
Die Dissertation wurde am 10.09.2008 bei der Technischen Universit¨at Mu¨nchen
eingereicht und durch die Fakult¨at fu¨r Mathematik am 10.12.2008 angenommen.Nothing in Biology Makes Sense Except in the Light of Evolution
Theodosius DobzhanskyAbstract
Aphylogeny represents the evolutionary relationship between species. In this thesis,
we answer questions arising in the process of reconstructing and analyzing phylo-
genies. We develop and discuss a general class of neutral models for speciation and
extinction.Theseresultsareusedinournovelalgorithmfordatingsupertreesaswell
as for drawing lineages-through-time plots. Further, using the analytic results, we
can calculate p-values of our introduced statistic to test the evolutionary hypothesis
of lineage-specific bursting. We provide accurate simulation tools for models which
cannot or have not been analyzed. Widely used simulation algorithms are shown
to be wrong for these general models. We end the thesis by providing a complex-
ity result for dating phylogenies with reticulation events. In general, a phylogeny
with reticulations does not necessarily have a valid temporal dating. We prove that
adding missing species in an optimal way, such that the altered phylogeny has a
dating, is NP-complete. The main mathematical tools used in this thesis come from
stochastics, statistics, combinatorics and complexity theory.
iZusammenfassung
Eine Phylogenie repr¨asentiert die Verwandtschaftsbeziehungen zwischen Spezies. In
der vorliegenden Arbeit beantworten wir Fragestellungen, die bei der Rekonstruk-
tion und Analyse von Phylogenien auftreten. Wir entwickeln eine allgemeine Klasse
von neutralen Modellen fu¨r Speziation und Aussterben. Die Ergebnisse werden in
unserem neuen Algorithmus zur Datierung von Supertrees sowie zur Erstellung
von so genannten lineages-through-time Plots verwendet. Des Weiteren benutzen
wir unsere theoretischen Ergebnisse, um p-Werte fu¨r die entwickelte Statistik zum
Testen von Phylogenien auf Linien-spezifische Bursts zu berechnen. Fu¨r Modelle,
die nicht analysiert werden k¨onnen oder nicht analysiert wurden, stellen wir Sim-
ulationsalgorithmen zur Verfu¨gung. Wir zeigen, dass weit verbreitete Simulation-
sprogramme selbst unter g¨angigen Modellen fehlerhaft arbeiten. Zum Abschluss der
Arbeit besch¨aftigen wir uns mit der Datierung von Phylogenien, in denen Retiku-
lationsereignisse auftreten. Eine Phylogenie mit Retikulationen besitzt nicht im-
mer eine zul¨assige zeitliche Datierung. Wir beweisen, dass das optimale Hinzufu¨gen
von fehlenden Spezies, so dass die neue Phylogenie eine Datierung besitzt, NP-
vollst¨andigist.DieindieserArbeitverwendetenmathematischenMethodenkommen
haupts¨achlichausderStochastik,Statistik,KombinatorikundKomplexit¨atstheorie.
iiAcknowledgements
Completing my PhD thesis has been a wonderful experience. My advisor Anusch
Tarazhasbeenverysupportivethroughoutthewholetime.Hehelpedmetoestablish
my direction in research and always gave me very valuable advice.
The visits to New Zealand would not have been possible without Mike Steel
providing fantastic support. Talking to Mike was very inspiring and led to creative
new ideas. Mike always encouraged me to continue following my interests and goals.
Erick Matsen was a great office mate in New Zealand and Berkeley. Working
with him on the runs statistic was a lot of fun. And Erick, thanks for fixing my
laptop! Klaas Hartmann is a great collaborator asking the right questions regarding
useful applications. On top, he is the best mountaineer guide.
CollaborationsanddiscussionswithVincentBerry,JamesDegnan,AlexeiDrum-
mond,DanielFord,SimoneLinz,DirkMetzler,NoahRosenberg,RolandSchultheiß,
Charles Semple, Tom Wilke and Dennis Wong led to new exciting projects and
helped me to see my research in a broader scientific perspective.
Further, I would like to thank the people of the group M9 in the Mathematics
departmentattheTechnische Universit¨atMu¨nchen,thepeopleoftheBiomathemat-
ics research center at the University of Canterbury in New Zealand and the people
of the Evolutionary Biology department at the Ludwig-Maximilian-Universit¨at in
Mu¨nchen for providing a great atmosphere to work in.
FinancialsupportbytheDeutscheForschungsgemeinschaftthroughthegraduate
program “Angewandte Algorithmische Mathematik” at the Technische Universit¨at
Mu¨nchen and by the Allan Wilson Center through several summer studentships is
gratefully acknowledged.
Last but not least I want to thank my husband and my family for the fantastic
support during all the ups and downs while working on the thesis.
ivList of Symbols
Symbol Meaning page
i,jA time between i-th and j-th speciation event in tree with n leaves 23n
kA date of k-th speciation event in tree with n leaves 23n
eA length of edge e in treeT 23T
uA date of vertex u in tree T 23T
λ birth rate 6
μ death rate 6
N reticulation network 108
p (i) probability of vertex u having rank i inT 19u
p (i,j) probability of vertex u having rank i and v having rank j inT 20u,v
r rank function of an oriented treeT 7
r(T) set of rank functions onT 7
R(T) number of runs in treeT 77
t time of origin of a reconstructed oriented tree 5or
τ tree shape 8
T,T (reconstructed / ranked) oriented tree on n leaves 6n
T subtree ofT induced by the descendants of vertex v 10v
˚V set of interior vertices of an (oriented) tree 7
BDA constant rate birth-death approach 99
BDP constant rate birth-death process 2, 32
cBDP BDP conditioned on obtaining n leaves today 32
CBP constant rate critical branching process 6, 30
cCBP CBP conditioned on obtaining n leaves today 30
CRP constant relative probability (model) 15
DNA desoxyribonucleic acid 1
ET evolving tree (model) 5
GSA general sampling approach 96
HCV Hepatitis C virus 75
HGT horizontal gene transfer 106
LSB lineage specific bursting 73
LTT lineages-through-time (plot) 2
vvi
MCMC Markov chain Monte Carlo 1
mrca most recent common ancestor 7
PBMSA pure-birth memoryless sampling approach 93
PDA proportional-to-distinguishable-arrangements 6
SSA simple sampling approach 91
UR uniform rank (model) 2, 17