Background: Unigenic evolution is a large-scale mutagenesis experiment used to identify residues that are potentially important for protein function. Both currently-used methods for the analysis of unigenic evolution data analyze 'windows' of contiguous sites, a strategy that increases statistical power but incorrectly assumes that functionally-critical sites are contiguous. In addition, both methods require the questionable assumption of asymptotically-large sample size due to the presumption of approximate normality.

Results: We develop a novel approach, termed the Evidence of Selection (EoS), removing the assumption that functionally important sites are adjacent in sequence and explicitly modelling the effects of limited sample size. Precise statistical derivations show that the EoS score can be easily interpreted as an expected log-odds-ratio between two competing hypotheses, namely, the hypothetical presence or absence of functional selection for a given site. Using the EoS score, we then develop selection criteria by which functionally-important yet non-adjacent sites can be identified. An approximate power analysis is also developed to estimate the reliability of inference given the data. We validate and demonstrate the practical utility of our method by analysis of the homing endonuclease I-BmoI, comparing our predictions with the results of existing methods.

Conclusions: Our method is able both to assess the evidence of selection at individual amino acid sites and to estimate the reliability of those inferences. Experimental validation with I-BmoI proves its utility in identifying functionally-important residues of poorly characterized proteins, demonstrating increased sensitivity over previous methods without loss of specificity. With the ability to guide the selection of precise experimental mutagenesis conditions, our method helps make unigenic analysis a more broadly applicable technique with which to probe protein function.

Availability: Software to compute, plot, and summarize EoS data is available as an open-source package called 'unigenic' for the 'R' programming language at http://www.fernandes.org/txp/article/13/an-analytical-framework-for-unigenic-evolution.
Fernandes et al. Algorithms for Molecular Biology 2010, 5:35 http://www.almob.org/content/5/1/35
RESEARCH Open Access
Estimating the evidence of selection and the reliability of inference in unigenic evolution
Andrew D Fernandes^1,2*, Benjamin P Kleinstiver^1, David R Edgell^1, Lindi M Wahl^2, Gregory B Gloor^1
Background
One of the principal reasons for studying molecular evolution is that the function of a novel protein can be deduced, in part, by comparing it with a similar previously-characterized protein. But what recourse do we have if the novel protein does not exhibit significant sequence similarity to other proteins? More problematically, what if it is similar only to proteins of unknown function? In practice, even when the novel protein shares regions of extensive similarity to proteins of known function, it may be difficult to elucidate the importance of individual sites in the novel protein.

* Correspondence: andrew@fernandes.org
1 Department of Biochemistry, The University of Western Ontario, N6A 5C1 Canada
Full list of author information is available at the end of the article
Unigenic Evolution
One innovative experimental approach that can help identify specific domains or residues required for function is unigenic evolution, first described and developed by Deminoff et al. [1]. Unigenic evolution can be applied to any protein where the loss of function can be used as a selectable phenotype [1-5]. The procedure consists of random mutagenesis and amplification of a single wild-type sequence via mutagenic polymerase chain reaction (PCR) with subsequent cloning and functional selection [6]. Functional clones
are isolated and characterized by DNA sequencing. In contrast to traditional structure-based mutagenesis screening, unigenic evolution experiments produce an unbiased estimate of functionally-important residues regardless of putative structural role or conservation.
Deminoff’s Analysis
The selection process ensures that, in functional clones, amino acids essential for function will be conserved relative to non-essential sites. However, differential mutation sensitivity can be caused by more than structural or functional constraints. Mutation rates of residues may differ due to differential transition/transversion rates, codon usage, and genetic code degeneracy. To correct for these confounding factors, Deminoff et al. developed a statistical analysis that compared the expected versus the observed mutation frequency for each codon, where the expected frequencies were derived from a population of clones that had not been subject to selection.

Deminoff et al. clearly demonstrated the importance of accounting for non-uniform transition versus transversion probabilities when computing expected mutational frequencies. To increase the inferential power of their analyses, they also developed a ‘sliding-window’ $\chi^2$ analysis, binning together a ‘window’ of adjacent codons, assuming that residues critical for protein function would be nearby in primary structure. By comparing the probabilities of silent versus missense mutation in these windows, regions of either restrained or excessive mutability were identified as hypo- or hypermutable, respectively.
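The sliding-window idea can be sketched in a few lines of Python. The sketch below is not the statistic of Deminoff et al.; it simply pools hypothetical per-codon observed and expected mutation counts over each window of adjacent codons and reports a single Pearson-style term, (O - E)^2 / E, per window. The function name, the inputs, and the window width are all assumptions made for illustration.

```python
def sliding_window_chi2(observed, expected, window):
    """Pooled (O - E)^2 / E term for every window of adjacent codons.

    observed, expected: per-codon mutation counts (hypothetical inputs;
    expected counts would come from an unselected clone population).
    window: number of adjacent codons binned together (an assumption).
    Returns a list with one value per window start position.
    """
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have equal length")
    if window < 1 or window > len(observed):
        raise ValueError("window must be between 1 and the number of codons")

    stats = []
    for start in range(len(observed) - window + 1):
        # Pool counts over the window to gain statistical power.
        obs = sum(observed[start:start + window])
        exp = sum(expected[start:start + window])
        if exp <= 0:
            raise ValueError("pooled expected count must be positive")
        stats.append((obs - exp) ** 2 / exp)
    return stats


if __name__ == "__main__":
    # Illustrative counts only: five codons, one expected mutation each.
    print(sliding_window_chi2([0, 0, 1, 3, 2], [1.0] * 5, window=2))
```

Large values flag windows whose pooled count departs from expectation in either direction, so the sign of O - E would still be needed to separate hypomutable from hypermutable windows. Pooling is also where the contiguity assumption enters: an isolated critical codon surrounded by tolerant ones is diluted by its neighbours.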
Behrsin’s Analysis
The subsequent analysis of Behrsin et al. [7] advanced the statistical framework of Deminoff et al. by improving three key features: (a) the fixed window size of the $\chi^2$ analysis, (b) the effect of sample size on the codon mutation probability, and (c) the treatment of multiple nucleotide mutations per codon.

First, window size for the $\chi^2$ analysis was addressed by using windows of different sizes and comparing estimated false-discovery rates. The ‘best’ window was selected via a trade-off between the estimated sensitivity and specificity for classifying hypo- or hypermutable residues. Second, nucleotide substitution frequencies were computed using the continuity correction of Yates [8], resulting in more consistent codon mutation frequencies. Third, codon mutation frequencies were computed analytically from nucleotide substitution frequencies without the assumption that only one substitution per codon was likely.
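Two of these refinements can be illustrated with a short Python sketch. It shows the textbook form of the Yates continuity correction applied to a Pearson statistic, and the probability that a codon is mutated when more than one of its nucleotides may change. How Behrsin et al. applied the correction to substitution frequencies is not described here, so the functions below, their names, and the assumption that the three nucleotide positions mutate independently are illustrative only.

```python
def yates_chi2(observed, expected):
    """Pearson chi-square with the textbook Yates continuity correction."""
    total = 0.0
    for o, e in zip(observed, expected):
        # Shrink each |O - E| by 0.5, never below zero (a common convention).
        diff = max(abs(o - e) - 0.5, 0.0)
        total += diff ** 2 / e
    return total


def codon_mutation_probability(site_probs):
    """P(at least one nucleotide in the codon is substituted).

    site_probs: per-nucleotide substitution probabilities, assumed
    independent across the positions of the codon.
    """
    unchanged = 1.0
    for p in site_probs:
        unchanged *= 1.0 - p
    return 1.0 - unchanged


def single_substitution_probability(site_probs):
    """P(exactly one nucleotide is substituted), for comparison."""
    total = 0.0
    for i, p in enumerate(site_probs):
        others = 1.0
        for j, q in enumerate(site_probs):
            if j != i:
                others *= 1.0 - q
        total += p * others
    return total


if __name__ == "__main__":
    print(yates_chi2([3, 7], [5, 5]))
    probs = [0.01, 0.01, 0.01]  # illustrative per-nucleotide probabilities
    print(codon_mutation_probability(probs))
    print(single_substitution_probability(probs))
```

At per-nucleotide probabilities near 0.01 the gap between the two codon probabilities is small, but it grows as mutagenesis becomes more aggressive, which is when ignoring multiple substitutions per codon matters most.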
Further Improvements
The statistical framework of Deminoff et al. and the modifications suggested by Behrsin et al. allow for the
reliable identification of hypomutable regions via unigenic evolution. Nonetheless, these state-of-the-art analyses suffer from some deficiencies, from a statistical perspective, that could result in either erroneous or misleading conclusions. The goal of this work is to develop a statistically rigorous method for the analysis of unigenic evolution data, improving upon existing techniques by
1. relaxing the assumption that sample sizes are large enough such that asymptotic normality necessarily applies,
2. relaxing the assumption that selection-sensitive regions of a protein are contiguous,
3. clarifying the relationship between Fisher-style p-values and Neyman-Pearson Type-I and Type-II error probabilities with regard to testing hypotheses of functional selection,
4. relaxing the assumption that the PCR amplification protocol does not meaningfully affect mutation probabilities, and
5. addressing the ability of unigenic evolution to detect hypermutability.
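The first point in this list can be made concrete with a small calculation. The Python sketch below assumes a simple binomial model in which each of n clones independently carries a mutation at a given site with probability p; the function names and the example values (100 clones, p = 0.005) are illustrative assumptions, not quantities taken from the authors' analysis.

```python
import math


def prob_no_events(n_clones, p_mutation):
    """P(no clone carries the mutation) under a binomial model."""
    return (1.0 - p_mutation) ** n_clones


def relative_frequency_cv(n_clones, p_mutation):
    """Coefficient of variation of the observed relative frequency k / n.

    For k ~ Binomial(n, p): sd(k / n) = sqrt(p * (1 - p) / n) and the
    mean is p, so CV = sqrt((1 - p) / (n * p)).
    """
    return math.sqrt((1.0 - p_mutation) / (n_clones * p_mutation))


if __name__ == "__main__":
    n, p = 100, 0.005  # illustrative values only
    print(f"P(zero mutations observed) = {prob_no_events(n, p):.3f}")
    print(f"CV of observed frequency   = {relative_frequency_cv(n, p):.0%}")
```

With these values the site shows no mutation at all in roughly 61% of experiments, and the observed relative frequency has a coefficient of variation of about 141%. This is the variability of a raw frequency estimate, not of a $\chi^2$ statistic, but it shows why a normal approximation is hard to justify at such sample sizes.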
We expand upon each of these points, in turn, below.

First, both Deminoff et al. and Behrsin et al. equate observed relative event counts with the respective event probabilities. This equivalence is effectively true when either sample sizes are asymptotically large or probabilities are non-extreme (not too close to either zero or one). However, experimentally-feasible sample sizes are typically limited to the order of 50-100 replicates (clones), and even the most mutagenic of PCR conditions result in low probabilities (≈0.001 to 0.01) of point mutation. Therefore it is unlikely that observed counts have a simple relationship with the event frequency, even accounting for the continuity correction of Yates [8]. The difficulty of estimating probability parameters from event counts when the likelihood of the event is very small is a well-known problem from the inference of binomial and multinomial frequency parameters [9]. The most obvious consequence of assuming “counts ≈ probabilities” under these constraints is that the normal approximation, on which the $\chi^2$ statistic is critically dependent, may be invalid enough to yield misleading results. At the very least, the sampling variance of the $\chi^2$ statistic itself is necessarily quite large. The anticipated parameter ranges above, for example, yield a coefficient of variation for $\chi^2$ on the order of 100-300%. An additional problem with equating counts and probabilities is that, in doing so, the analysis of Behrsin et al. implicitly conditions on the total number of mutations as given. We anticipate that the actual number of mutations would be roughly Poisson distributed,