An automated stochastic approach to the identification of the protein specificity determinants and functional subfamilies

biomed - Mazin, Gelfand, Mironov, Rakhmaninova, Rubinov, Russell, Kalinina

Published by: biomed
Published on: 1 January 2010
Language: English

Mazin et al. Algorithms for Molecular Biology 2010, 5:29
http://www.almob.org/content/5/1/29
RESEARCH Open Access
An automated stochastic approach to the identification of the protein specificity determinants and functional subfamilies
Pavel V Mazin (1), Mikhail S Gelfand (1,3), Andrey A Mironov (1,3), Aleksandra B Rakhmaninova (1,3), Anatoly R Rubinov (3), Robert B Russell (2), Olga V Kalinina (2,3*)
Abstract
Background: Recent progress in sequencing and 3D structure determination techniques stimulated development of approaches aimed at more precise annotation of proteins, that is, prediction of exact specificity to a ligand or, more broadly, to a binding partner of any kind.
Results: We present a method, SDPclust, for identification of protein functional subfamilies coupled with prediction of specificity-determining positions (SDPs). SDPclust predicts specificity in a phylogeny-independent stochastic manner, which allows for the correct identification of the specificity for proteins that are separated on a phylogenetic tree, but still bind the same ligand. SDPclust is implemented as a Web-server http://bioinf.fbb.msu.ru/SDPfoxWeb/ and a stand-alone Java application available from the website.
Conclusions: SDPclust performs a simultaneous identification of specificity determinants and specificity groups in a statistically robust and phylogeny-independent manner.
* Correspondence: olga.kalinina@bioquant.uni-heidelberg.de
2 Cellnetworks, University of Heidelberg, Heidelberg, Germany
© 2010 Mazin et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background
The current explosion of data on protein sequences and structures has led to the emergence of techniques that go beyond standard annotation approaches, i.e. annotation by close homolog and homology-based family identification. These approaches usually start with a set of related sequences and perform a detailed analysis of each alignment position [1-15]. One of the problems that such analysis can tackle is the analysis of protein specificity. Let us assume that a protein family has undergone an ancient duplication that resulted in proteins that are related but perform different functions in the same organism. It is natural to assume that this functional divergence is mediated by mutation of certain amino acid positions. We call these positions specificity determinants, and this study is focused on their identification. We assume that specificity determinants, after mutations that allow for a new (sub-)function, should be under strong negative selection so that this newly asserted function can persist. This results in a very specific conservation pattern of the position in a multiple sequence alignment of the protein family: it is conserved among proteins that perform exactly the same function and differs between different functional subgroups. In this study, such positions are called SDPs (Specificity-Determining Positions). Another facet of the same problem is the identification of proteins that have a certain specificity, i.e. refined functional annotation.

Most techniques dealing with the stated problem reduce specificity prediction to the identification of alignment positions that may be important for protein specificity. They require the input set of sequences to be divided into groups of proteins having the same specificity (specificity groups) [1,3,4,6,9-15]. A common feature of these methods is that they measure the correlation between the distribution of amino acids in each position of a multiple sequence alignment (MSA) and the pre-defined groups. Those positions that show relatively high correlation are assumed to be important for differences in specificity between groups. Additionally, SDPpred [6] allows for a subsequent prediction of specificity for proteins whose specificity has not been known a priori.

Some methods do not need prior information on protein specificity [2,5,7,12]. They start with an automated division of the MSA into possible specificity groups. A common feature of these methods is that they assign the same specificity only to monophyletic clades of the proteins' phylogenetic tree. This imposes a significant restriction if the distribution of specificities within the protein family does not agree with the phylogeny. This can happen either as a consequence of convergent evolution, or if the phylogeny is not well resolved.

In this paper we address these weaknesses. We present a method, SDPclust, that simultaneously identifies SDPs and divides the alignment into groups of proteins that have the same specificity in a phylogeny-independent manner. Other phylogeny-independent methods to identify specificity-determining sites have been developed by Marttinen and co-workers [16] and Reva and co-workers [17]. We report the benchmarking of the presented method below.
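
The conservation pattern that defines an SDP can be illustrated with a minimal Python sketch. It checks only the idealised case, in which a column is perfectly conserved within each specificity group and differs between groups. The function name `is_ideal_sdp`, the toy alignment and the group labels are hypothetical and are not part of SDPclust, which relies on a statistical score rather than this strict rule.

```python
from collections import defaultdict


def is_ideal_sdp(column, groups):
    """Return True if the column is perfectly conserved within every group
    and no two groups share the same residue (an idealised SDP pattern)."""
    residues_by_group = defaultdict(set)
    for residue, group in zip(column, groups):
        residues_by_group[group].add(residue)
    # Conserved within each group: exactly one residue type per group.
    if any(len(residues) != 1 for residues in residues_by_group.values()):
        return False
    # Different between groups: the per-group residues are all distinct.
    consensus = [next(iter(residues)) for residues in residues_by_group.values()]
    return len(set(consensus)) == len(consensus)


# Hypothetical toy alignment: 6 sequences, 3 columns, two specificity groups.
alignment = ["KDA", "KDA", "KDG", "KEL", "KEL", "KEA"]
groups = ["A", "A", "A", "B", "B", "B"]
columns = ["".join(seq[p] for seq in alignment) for p in range(3)]
flags = [is_ideal_sdp(col, groups) for col in columns]
print(flags)  # [False, True, False]
```

Only the second column qualifies: the first is conserved across the whole family, and the third varies within the groups.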
Methods
Algorithm
Previously, we introduced the concept of specificity-determining positions (SDPs) [6,18]. Briefly, we say that a position of a multiple sequence alignment (MSA) is an SDP if amino acids in the corresponding MSA column are conserved within pre-defined groups of proteins with the same specificity (specificity groups) and differ between such groups. We assume that positions with such a conservation pattern account for differences in the specificity between proteins from different specificity groups.

One can easily note that the definition of SDPs relies on the definition of specificity groups in a protein family. This significantly constrains the applicability of previously developed methods. On the other hand, we previously showed that the identification of specificity groups can be done using SDPs [6,18]. SDPclust is a novel method that identifies SDPs in the absence of prior knowledge of specificity groups and simultaneously predicts these groups. Note that SDPclust does not predict the protein specificity ab initio; it merely states whether proteins have coinciding or different specificities.

SDPclust consists of several components, which are connected as shown in Figure 1.

Figure 1 Blocks and connections in the SDPclust algorithm.

SDPlight is a fast procedure to identify SDPs in an MSA that is divided into specificity groups. The idea of SDPlight is the same as in a previously reported method, SDPpred [6]: it uses the mutual information to measure how close the distribution of amino acids in a given MSA position $p$ is to the distribution of proteins into specificity groups:

$$\mathrm{MI}_p = \sum_{i \,\in\, \text{all specificity groups}} \; \sum_{\alpha \,\in\, \text{all amino acids}} f_p(\alpha, i) \log \frac{f_p(\alpha, i)}{f_p(\alpha)\, f(i)}, \tag{1}$$

where $f_p(\alpha, i)$ is the frequency of amino acid $\alpha$ in group $i$ in position $p$, $f_p(\alpha)$ is the frequency of amino acid $\alpha$ in position $p$ in the whole alignment, and $f(i)$ is the fraction of proteins in group $i$.

The main new feature of SDPlight that makes it much faster than SDPpred is the way the correction for the background distribution of the mutual information is performed. Instead of using shuffling, which is computationally inefficient, we pre-calculate the mean and the variance for any pattern of amino acids in an arbitrary column using an approximation described below. Let us assume that an MSA consists of proteins falling into $k$ specificity groups and that, in a given position, amino acid $\alpha$ appears $i_{\alpha j}$ times in group $j$ ($j = 1,\dots,k$). Then (1) can be rewritten as

$$\mathrm{MI}_p = \frac{1}{N} \sum_{\alpha=1}^{20} \sum_{j=1}^{k} \mathrm{MI}(\alpha, j), \tag{2}$$

where

$$\mathrm{MI}(\alpha, j) = i_{\alpha j} \log \left( \frac{i_{\alpha j}\, N}{i_{\alpha}\, n_j} \right). \tag{3}$$
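
As an illustration of equation (1), the following Python sketch computes the mutual information of a single alignment column from its residues and group labels, and checks that the count-based form of equations (2) and (3) gives the same value. It is a sketch under stated assumptions, not the SDPlight implementation: the function names and toy columns are hypothetical, the natural logarithm is assumed because the base is not stated here, and gap characters are not treated specially.

```python
import math
from collections import Counter


def mutual_information(column, groups):
    """Equation (1): sum over groups and residues of
    f_p(a, i) * log(f_p(a, i) / (f_p(a) * f(i)))."""
    n_total = len(column)
    joint = Counter(zip(column, groups))   # counts of residue a in group j
    residue_totals = Counter(column)       # counts of residue a in the column
    group_sizes = Counter(groups)          # group sizes n_j
    mi = 0.0
    for (residue, group), count in joint.items():
        f_joint = count / n_total
        f_residue = residue_totals[residue] / n_total
        f_group = group_sizes[group] / n_total
        mi += f_joint * math.log(f_joint / (f_residue * f_group))
    return mi


def mutual_information_from_counts(column, groups):
    """Equations (2)-(3): the same quantity written with raw counts."""
    n_total = len(column)
    joint = Counter(zip(column, groups))
    residue_totals = Counter(column)
    group_sizes = Counter(groups)
    total = sum(
        count * math.log(count * n_total / (residue_totals[res] * group_sizes[grp]))
        for (res, grp), count in joint.items()
    )
    return total / n_total


# Hypothetical toy columns for six sequences in two groups of three.
groups = ["A", "A", "A", "B", "B", "B"]
mi_sdp = mutual_information("DDDEEE", groups)        # SDP-like column
mi_conserved = mutual_information("KKKKKK", groups)  # fully conserved column
```

Under these assumptions the SDP-like column reaches the maximum for two equal groups, log 2, while the fully conserved column carries no information about the grouping and scores zero.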
In (3), $n_j$ is the size of group $j$, $\sum_{j=1}^{k} n_j = N$, $N$ is the total number of sequences in the MSA, and $i_{\alpha} = \sum_{j=1}^{k} i_{\alpha j}$ is the total number of occurrences of amino acid $\alpha$ in the position.

The exact formulae for the expectation value and the variance of MI are:

$$M(\mathrm{MI}_p) = \frac{1}{N} \sum_{\alpha=1}^{20} \sum_{j=1}^{k} \sum_{\{i_{\alpha j}\}} p(\{i_{\alpha j}\})\, \mathrm{MI}(\alpha, j), \tag{4}$$

$$D(\mathrm{MI}_p) = \frac{1}{N^2} \left( \sum_{\alpha=1}^{20} \sum_{j=1}^{k} D(\mathrm{MI}(\alpha, j)) + 2 \sum_{(\alpha_1, j_1) > (\alpha_2, j_2)} \mathrm{cov}\bigl(\mathrm{MI}(\alpha_1, j_1), \mathrm{MI}(\alpha_2, j_2)\bigr) \right), \tag{5}$$

where the last sum runs over all unordered pairs of distinct terms, with $\alpha_1, \alpha_2 = 1,\dots,20$ and $j_1, j_2 = 1,\dots,k$.

At this point we make the approximation that different amino acids are distributed independently in the groups. This leads to several simplifications: groups become equivalent, hence $n_j/N = 1/k$. Since all groups become equivalent, instead of taking a sum over all

$$D(\mathrm{MI}_p) = \frac{1}{N^2} \sum_{\alpha=1}^{20} D(\mathrm{MI}_{i_\alpha,k}) = \frac{1}{N^2} \sum_{\alpha=1}^{20} \Biggl[\, k \sum_{i=1}^{i_\alpha} \frac{i_\alpha!}{i!\,(i_\alpha - i)!} \left(\frac{1}{k}\right)^{i} \left(1 - \frac{1}{k}\right)^{i_\alpha - i} \left( i \log \frac{i k}{i_\alpha} \right)^{2} + (k^2 - k) \sum_{i_1=1}^{i_\alpha} \sum_{i_2=1}^{i_\alpha - i_1} \frac{i_\alpha!}{i_1!\, i_2!\, (i_\alpha - i_1 - i_2)!} \left(\frac{1}{k}\right)^{i_1 + i_2} \left(1 - \frac{2}{k}\right)^{i_\alpha - i_1 - i_2} i_1 \log \frac{i_1 k}{i_\alpha}\; i_2 \log \frac{i_2 k}{i_\alpha} - M(\mathrm{MI}_{i_\alpha,k})^{2} \Biggr]. \tag{8}$$

The values of $M(\mathrm{MI}_{i_\alpha,k})$ and $D(\mathrm{MI}_{i_\alpha,k})$ are pre-calculated and tabulated, which requires time $O(i_\alpha)$ and $O(i_\alpha^2)$, respectively. We pre-calculate these values for $k = 2,\dots,200$ and $i_\alpha = 1,\dots,500$ and store them. One run of the method then involves only summing the corresponding pre-calculated values for all 20 amino acids; for a given alignment of ~100 sequences of length ~400 aa it takes approximately 50 ms (AMD Athlon™ 64 Processor 3800+).

Hav