Assessing the stability of a phylogeny without bootstrap: an analytical approach

pefav

Découvre YouScribe en t'inscrivant gratuitement

Je m'inscris

Obtenez un accès à la bibliothèque pour le consulter en ligne
En savoir plus

21 pages

English

Obtenez un accès à la bibliothèque pour le consulter en ligne
En savoir plus

A propos
Informations
Extrait

Description

aligned sequence

aligned over

species sites

methods

distance between

change occurs only

species called

maximum likelihood

sequence after

species

Sujets

Assessing the stability of a phylogeny without
bootstrap: an analytical approach
Mahendra Mariadassou
June 19, 2007
Contents
1 Introduction 2
2 Framework 3
2.1 Notations and deﬁnitions . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 Toy Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.3 Likelihood and stability . . . . . . . . . . . . . . . . . . . . . . . . 5
3 Phylogenetic reconstruction for ﬁnite size samples 7
T3.1 Connection between ‘ and Q . . . . . . . . . . . . . . . . . . . . . 8
3.2 Distance between Q and Q . . . . . . . . . . . . . . . . . . . . . . 8n
3.3 Stability of the inferred tree . . . . . . . . . . . . . . . . . . . . . . 11
4 Comparison with widely used methods 13
4.1 Bootstrap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
4.2 Comparison on the toy example . . . . . . . . . . . . . . . . . . . . 14
A Proofs of the preliminary results 17
A.1 Proof of the ﬁrst result . . . . . . . . . . . . . . . . . . . . . . . . . 17
A.2 Proof of the second result . . . . . . . . . . . . . . . . . . . . . . . 19
References 20
11 INTRODUCTION 2
1 Introduction
Phylogenies, or evolutionary trees, are the basic structures necessary to analyze
diﬀerences between species and to analyze these diﬀerences statistically. Several
methods are available to infer phylogenies, the two most popular being Maxi-
mum Parsimony (MP) and Maximum Likelihood (ML) estimation (see [11] for a
comprehensive review). The ML method [6] provides a statistical framework to
answer this question, whereas the MP method does not. We therefore focus on
ML methods.
We are interested in the stability of the inferred phylogeny. Several booststrap
methods have been developed to speciﬁcally address this issue (see [3], [4], [2],
[14] and [7] for a review). Stability is a desirable, if not necessary, property for
a phylogeny: after inferring a tree, we want to draw some conclusions from it.
Since we draw general conclusions, we want the tree to be as stable as possible:
a small modiﬁcation in the data should not drastically change the phylogeny and
invalidate our conclusions, or at least if it does it should only do so with a small
probability. An inferred phylogeny not holding this property has a very limited
utility: no conclusions drawn from it would be useful.
A most common stability problem, from which bootstrap in phylogeny origi-
nates (see [3], [13]), is the support given to a clade. For example if a phylogeny
classes species A in a diﬀerent clade than species B and C, we want to assess the
signiﬁcance of this classiﬁcation: is the clade supported by a lot of evidence or is
it just here by chance ?
Most bootstrap methods are based on re-sampling with replacement [1] and
account only for the observed variability. Bootstrap methods also discard the
relation between the size of the data, the number of species in the study and the
stability of the phylogeny.
In this paper we propose an analytical approach to this issue. Rather than
working on the phylogeny, we work on their likelihood score: stable likelihood
scores are equivalent to a stable ranking and thus to a stable ML phylogeny. We
obtain bounds on the probability than the empirical likelihood wanders too far
away from its average value. One obvious advantage of doing so is to reduce the
studyofphylogeniesandphylogenetictreestothemuchsimplerstudyoflikelihood
scores taking values inR.
Section 2 is devoted to the framework we use. We also introduce the nota-
tions and illustrate them on a toy example. Then, in Section 3, we present our
main result concerning the stability of the likelihood score. Finally in Section 4,
we compare our method to those used in the literature and discuss our results.
Technical proofs of some results are postponed in appendix A.2 FRAMEWORK 3
2 Framework
We introduce here the statistical framework in which we work and the notations
we use and illustrate them on a toy example.
2.1 Notations and deﬁnitions
The data set in our analysis is as×n matrixX = (X ,...,X ) = (X )1 n ij i=1...n,j=1...s
representing a set of molecular sequences aligned over diﬀerent species. n is the
length of the sequence after the alignment (including gaps) and s the number
of species. X takes value in an alphabet A and codes for the state of the ithij
nucleotideofthealignmentinthejthspecies. ThejthlineofX isthenthealigned
sequence corresponding to species j. When working with DNA sequences, A is
usually{A,C,G,T} but it can take others values, for example when working with
proteinsequences. ThestatisticalunitofinterestisthecolumnX ,as-dimensionali
svector valued inA , which codes for the pattern of nucleotide i over all s species.
We assume that the pattern X at ith site is an observed value of randomi
variable X and that the X s are i.i.d. The probability fonction of X is Q.i
Deﬁnition 1. We deﬁne a phylogenetic model, or phylogeny, T as the union of:
(i) the evolution model: the substitution model and its associated parameters,
(ii) the tree topology: a branching pattern and the associated branch lenghts
Under model T, the log-likelihood of an observed pattern x (or equivalently of
a site presenting this pattern) is:
T‘ (x) := logP(x;T)
T TDeﬁnition 2. The empirical (resp. true) mean log-likelihood ‘ (resp. ‘ ) is then
mean of logP(X;T) under the empirical (resp. true) distribution:
X1T‘ = E [logP(X;T)] = logP(X ;T) (1)Q inn n
i
X
T‘ = E [logP(X;T)] = Q(x)logP(x;T) (2)Q
sx∈A
with the empirical distribution of the patterns deﬁned as:
nX1
Q = δn Xin
i=12 FRAMEWORK 4
The empirical distribution is opposed to the unknown true distribution Q of
the patterns, which is unachievable from the data except for a inﬁnite number of
nucleotides, when n =∞.
2We deﬁne in the same way, the empirical (resp. true) variance σ (T) (resp.n
2σ (T)) by:
X12 T 2σ (T) = V [logP(X;T)] = (logP(X ;T)−‘ )Q in n nn
i
X
2 T 2σ (T) = V [logP(X;T)] = Q(x)(logP(x;T)−‘ )Q
sx∈A
We need the expressions of both the empirical and the true log-likelihood and
variance. The goal, as presented in details in 3.3 is to compare not only the
empirical mean log-likelihood to its true average but also the rankings induced on
phylogenetic models by the both the empirical and the true mean log-likelihood.
2.2 Toy Example
To illustrate these quantities, consider a toy example made of s = 4 species called
S ,S ,S and S and n = 3 nucleotides. Although n = 3 is too low to be realistic,1 2 3 4
this example usefully serves our purpose. The data are a 4×3 matrixX:
Species Sites
S A A A1
X = S G G C2
S C C A3
S C C C4
Note that the ﬁrst two patterns are identical and require at least a transition
and a transversion whereas the third one minimal requirement is only a transver-
sion.
We consider only three phylogenies sharing a Kimura two-parameter (K2P)
evolution model (see [9]) with transversion rate α and transition rate β and the
following unrooted unlabeled topology pattern:
t t1 4
t3
tt2 5
Figure 1: Topology with unequal branch lengths : t > t , t > t and t >1 2 4 5 3
t ,t ,t ,t1 2 4 52 FRAMEWORK 5
We chose a K2P evolution model because it is the simplest realistic model:
although the Jukes-Cantor model (see [8])is even simpler, it is too unrealist to
be interesting. The three diﬀerent phylogenies are obtained when considering the
three possible topologies, each topology corresponding to a diﬀerent labeling, as
shown in Fig. 2.
S S S S S S1 3 1 2 1 3
S S S S S S2 4 3 4 4 2
T T T1 2 3
Figure 2: The three topologies obtained by labeling the nodes
We assume that α,β 1 so that the αt s and βt s are also very small andi i
that in each model only one conﬁguration of the inner nodes contributes to the
log-likelihood. In this special case, parsimony and maximum likelihood, the two
main methods, agree: since for a ﬁxed topology an evolutionary change occurs
only with a very low probability, the most likely conﬁguration is the one needing
the less such changes, namely the most parsimonious. All others are far less likely,
and as such, barely contribute to the log-likelihood of the topology.
Since the X are i.i.d, we have:i
T13‘ = 2logP(AGCC;T )+logP(ACAC;T )1 1n
T23‘ = 2logP(AGCC;T )+logP(ACAC;T )2 2n
T33‘ = 2logP(AGCC;T )+logP(ACAC;T )3 3n
To compute these probabilities, we need to specify an ranking on the branch
lengths: we adopted t > t , t > t and t > t ,t ,t ,t . The central branch is the1 2 4 5 3 1 2 4 5
longest and both upward branches are longer than their downward counterpart.
Some simple computations taking advantage of the previous remark then give:
2.3 Likelihood and stability
The toy example is simple so we can easily calculate probabilities in it. We will
start by calculating both the likelihood score and the variance associated to each
model, as deﬁned in Sec. 2.1. The results are shown in Tab. 2
At this point, we ask ourselves which model has the best likelihood score and
which