Estimating the evidence of selection and the reliability of inference in unigenic evolution

biomed - Fernandes, Kleinstiver, Edgell, Wahl, Gloor


Description

Background: Unigenic evolution is a large-scale mutagenesis experiment used to identify residues that are potentially important for protein function. Both currently-used methods for the analysis of unigenic evolution data analyze 'windows' of contiguous sites, a strategy that increases statistical power but incorrectly assumes that functionally-critical sites are contiguous. In addition, both methods require the questionable assumption of asymptotically-large sample size due to the presumption of approximate normality.

Results: We develop a novel approach, termed the Evidence of Selection (EoS), removing the assumption that functionally important sites are adjacent in sequence and explicitly modelling the effects of limited sample size. Precise statistical derivations show that the EoS score can be easily interpreted as an expected log-odds-ratio between two competing hypotheses, namely, the hypothetical presence or absence of functional selection for a given site. Using the EoS score, we then develop selection criteria by which functionally-important yet non-adjacent sites can be identified. An approximate power analysis is also developed to estimate the reliability of inference given the data. We validate and demonstrate the practical utility of our method by analysis of the homing endonuclease I-BmoI, comparing our predictions with the results of existing methods.

Conclusions: Our method is able both to assess the evidence of selection at individual amino acid sites and to estimate the reliability of those inferences. Experimental validation with I-BmoI proves its utility in identifying functionally-important residues of poorly characterized proteins, demonstrating increased sensitivity over previous methods without loss of specificity. With the ability to guide the selection of precise experimental mutagenesis conditions, our method helps make unigenic analysis a more broadly applicable technique with which to probe protein function.

Availability: Software to compute, plot, and summarize EoS data is available as an open-source package called 'unigenic' for the 'R' programming language at http://www.fernandes.org/txp/article/13/an-analytical-framework-for-unigenic-evolution .

Information

Published by	biomed
Published on	01 January 2010
Language	English

Excerpt

Fernandes et al. Algorithms for Molecular Biology 2010, 5:35 http://www.almob.org/content/5/1/35

RESEARCH (Open Access)

Estimating the evidence of selection and the reliability of inference in unigenic evolution

Andrew D Fernandes (1,2*), Benjamin P Kleinstiver (1), David R Edgell (1), Lindi M Wahl (2), Gregory B Gloor (1)


Background

One of the principal reasons for studying molecular evolution is that the function of a novel protein can be deduced, in part, by comparing it with a similar previously-characterized protein. But what recourse do we have if the novel protein does not exhibit significant sequence similarity to other proteins? More problematically, what if it is similar only to proteins of unknown function? In practice, even when the novel protein shares regions of extensive similarity to proteins of known function, it may be difficult to elucidate the importance of individual sites in the novel protein.

* Correspondence: andrew@fernandes.org
1 Department of Biochemistry, The University of Western Ontario, N6A 5C1 Canada
Full list of author information is available at the end of the article

Unigenic Evolution

One innovative experimental approach that can help identify specific domains or residues required for function is unigenic evolution, first described and developed by Deminoff et al. [1]. Unigenic evolution can be applied to any protein where the loss of function can be used as a selectable phenotype [1-5]. The procedure consists of random mutagenesis and amplification of a single wild-type sequence via mutagenic polymerase chain reaction (PCR) with subsequent cloning and functional selection [6]. Functional clones are isolated and characterized by DNA sequencing. In contrast to traditional structure-based mutagenesis screening, unigenic evolution experiments produce an unbiased estimate of functionally-important residues regardless of putative structural role or conservation.

© 2010 Fernandes et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Deminoff’s Analysis

The selection process ensures that, in functional clones, amino acids essential for function will be conserved relative to nonessential sites. However, differential mutation sensitivity can be caused by more than structural or functional constraints. Mutation rates of residues may differ due to differential transition/transversion rates, codon usage, and genetic code degeneracy. To correct for these confounding factors, Deminoff et al. developed a statistical analysis that compared the expected versus the observed mutation frequency for each codon, where the expected frequencies were derived from a population of clones that had not been subject to selection.

Deminoff et al. clearly demonstrated the importance of accounting for non-uniform transition versus transversion probabilities when computing expected mutational frequencies. To increase the inferential power of their analyses, they also developed a ‘sliding-window’ χ² analysis, binning together a ‘window’ of adjacent codons, assuming that residues critical for protein function would be nearby in primary structure. By comparing the probabilities of silent versus missense mutation in these windows, regions of either restrained or excessive mutability were identified as hypo- or hypermutable, respectively.
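The windowing idea can be illustrated with a minimal Python sketch. It is not the test of Deminoff et al.: it assumes a simplified one-cell Pearson-type statistic, (O - E)² / E, computed on counts pooled over each window, and the function name, the per-codon counts, and the window size are all hypothetical values chosen for illustration.

```python
def sliding_window_chi2(observed, expected, window):
    """Pool per-codon counts over each window of adjacent codons.

    observed: per-codon mutation counts in the selected clones
    expected: per-codon expected counts (e.g. from unselected clones)
    Returns a list of (start, pooled_observed, pooled_expected, statistic).
    The statistic is a simplified (O - E)^2 / E, assumed for illustration.
    """
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    if window < 1 or window > len(observed):
        raise ValueError("window must be between 1 and the number of codons")
    results = []
    for start in range(len(observed) - window + 1):
        o = sum(observed[start:start + window])
        e = sum(expected[start:start + window])
        stat = (o - e) ** 2 / e if e > 0 else float("nan")
        results.append((start, o, e, stat))
    return results


# Hypothetical data: eight codons, one expected mutation per codon.
observed = [0, 0, 1, 3, 2, 0, 0, 0]
expected = [1.0] * 8

for start, o, e, stat in sliding_window_chi2(observed, expected, window=3):
    label = "hypo" if o < e else "hyper" if o > e else "neutral"
    print(f"codons {start}-{start + 2}: O={o} E={e:.1f} stat={stat:.2f} ({label})")
```

Pooling raises the counts in each cell, which is where the gain in power comes from, but every codon in a window shares the window's verdict. That is the contiguity assumption the text describes.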

Behrsin’s Analysis

The subsequent analysis of Behrsin et al. [7] advanced the statistical framework of Deminoff et al. by improving three key features: (a) the fixed window size of the χ² analysis, (b) the effect of sample size on the codon mutation probability, and (c) accounting for multiple nucleotide mutations per codon.

First, window size for the χ² analysis was addressed by using windows of different sizes and comparing estimated false-discovery rates. The ‘best’ window was selected via a trade-off between the estimated sensitivity and specificity for classifying hypo- or hypermutable residues. Second, nucleotide substitution frequencies were computed using the continuity correction of Yates [8], resulting in more consistent codon mutation frequencies. Third, codon mutation frequencies were computed analytically from nucleotide substitution frequencies without the assumption that only one substitution per codon was likely.
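The third point can be illustrated with a short Python sketch. It is a simplification rather than the calculation of Behrsin et al.: it assumes the three nucleotide positions of a codon mutate independently, it does not distinguish silent from missense changes, and the function names and probabilities are hypothetical.

```python
from math import prod


def codon_mutation_probability(p_sites):
    """Probability that at least one nucleotide in the codon is substituted.

    Assumes independent substitutions at each position (an assumption of
    this sketch), so multiple substitutions per codon are accounted for.
    """
    return 1.0 - prod(1.0 - p for p in p_sites)


def first_order_approximation(p_sites):
    """Sum of per-site probabilities.

    Close to the exact value only when multiple substitutions per codon
    are negligible.
    """
    return sum(p_sites)


# Hypothetical per-nucleotide substitution probabilities for one codon.
for p in (0.01, 0.2):
    sites = [p, p, p]
    exact = codon_mutation_probability(sites)
    approx = first_order_approximation(sites)
    print(f"p={p}: exact={exact:.6f} first-order={approx:.6f}")
```

The two values nearly coincide at low per-site probabilities and diverge as mutagenesis becomes more aggressive, which is when the single-substitution assumption matters.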

Further Improvements

The statistical framework of Deminoff et al. and the modifications suggested by Behrsin et al. allow for the reliable identification of hypomutable regions via unigenic evolution. Nonetheless, these state-of-the-art analyses suffer from some deficiencies, from a statistical perspective, that could result in either erroneous or misleading conclusions. The goal of this work is to develop a statistically rigorous method for the analysis of unigenic evolution data, improving upon existing techniques by

1. relaxing the assumption that sample sizes are large enough such that asymptotic normality necessarily applies,
2. relaxing the assumption that selection-sensitive regions of a protein are contiguous,
3. clarifying the relationship between Fisher-style p-values and Neyman-Pearson Type-I and Type-II error probabilities with regard to testing hypotheses of functional selection,
4. relaxing the assumption that the PCR amplification protocol does not meaningfully affect mutation probabilities, and
5. addressing the ability of unigenic evolution to detect hypermutability.
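Point 1 can be made concrete with a small Python sketch of a binomial model. It is an illustration, not part of the authors' method: the function names are hypothetical, and the sample size of 100 clones and the per-site mutation probabilities of 0.001 and 0.01 are assumed values chosen to represent a small sample with rare events. The coefficient of variation computed here is that of the relative count k/n, not of the χ² statistic.

```python
from math import sqrt


def prob_no_events(n, p):
    """Probability of observing zero mutations at a site in n clones."""
    return (1.0 - p) ** n


def relative_count_cv(n, p):
    """Coefficient of variation of k/n when k ~ Binomial(n, p).

    sd(k/n) / mean(k/n) = sqrt(p(1-p)/n) / p = sqrt((1-p) / (n p)).
    """
    return sqrt((1.0 - p) / (n * p))


n_clones = 100  # assumed, illustrative sample size
for p in (0.001, 0.01):  # assumed, illustrative mutation probabilities
    print(
        f"p={p}: P(no mutation observed)={prob_no_events(n_clones, p):.3f}, "
        f"CV of k/n={relative_count_cv(n_clones, p):.2f}"
    )
```

When most sites are expected to show no mutations at all, the relative count is a very noisy estimate of the underlying probability, and a normal approximation to its distribution is hard to justify.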

We expand upon each of these points, in turn, below.

First, both Deminoff et al. and Behrsin et al. equate observed relative event counts with the respective event probabilities. This equivalence is effectively true when either sample sizes are asymptotically large or probabilities are non-extreme (not too close to either zero or one). However, experimentally-feasible sample sizes are typically limited to the order of 50-100 replicates (clones), and even the most mutagenic of PCR conditions result in low probabilities (≈0.001 to 0.01) of point mutation. Therefore it is unlikely that observed counts have a simple relationship with the event frequency, even accounting for the continuity correction of Yates [8]. The difficulty of estimating probability parameters from event counts when the likelihood of the event is very small is a well-known problem from the inference of binomial and multinomial frequency parameters [9]. The most obvious consequence of assuming “counts ≈ probabilities” under these constraints is that the normal approximation, on which the χ² statistic is critically dependent, may be invalid enough to yield misleading results. At the very least, the sampling variance of the χ² statistic itself is necessarily quite large. The anticipated parameter ranges above, for example, yield a coefficient of variation for χ² on the order of 100-300%.

An additional problem with equating counts and probabilities is that, in doing so, the analysis of Behrsin et al. implicitly conditions on the total number of mutations as given. We anticipate that the actual number of mutations would be roughly Poisson distributed,