Algorithms in bioinformatics (CSI 5126)=1Please don t print these lecture notes unless you really need

51 pages

English

Algorithms in bioinformatics (CSI 5126)=1Please don't print these lecture notes unless you really need

Affe - Marcel Turcotte (Turcotte@Site.Uottawa.Ca)

Le téléchargement nécessite un accès à la bibliothèque YouScribe
Tout savoir sur nos offres

51 pages

English

Le téléchargement nécessite un accès à la bibliothèque YouScribe
Tout savoir sur nos offres

A propos
Informations
Extrait

Description

1Algorithms in bioinformatics (CSI 5126)Marcel Turcotte(turcotte@site.uottawa.ca)School of Information Technology and EngineeringUniversity of OttawaCanadaOctober 16, 20091Please don’t print these lecture notes unless you really need to!Marcel Turcotte (turcotte@site.uottawa.ca) CSI 5126. Algorithms in bioinformaticsSmith & WatermanBase conditions,v(i; 0) = 0;i2 0::nv(0;j) = 0;j2 0::mGeneral case,80;> v(i;j 1) +d;:v(i 1;j 1) +s(S (i);S (j)):1 2where d is the cost for a deletion/insertion, while s(a;b) is thesubstitution score.Solution,?v = max[v(i;j) : i n;j m]) Smith & Waterman (1981) J. Mol. Biol. 147:195-197.Marcel Turcotte (turcotte@site.uottawa.ca) CSI 5126. Algorithms in bioinformaticsProbabilistic FrameworkI Recall that a sequence alignment should answer the question:\are the two sequences (evolutionary) related?"I In other words, is the observed sequence alignment the resultof:1. an evolutionary process, where both sequences have evolvedindependtly from a common ancestry, or2. can it be attributable to chance alone; randomly selecting twounrelated sequences could produce a similar alignment score.Marcel Turcotte (turcotte@site.uottawa.ca) CSI 5126. Algorithms in bioinformaticsProtein sequence probabilitiesIt’s useful to consider a \simple" probabilistic model of a proteinsequence, given p , the probability of observing the amino acid a,asuch that,p > 0a20Xp = 1aa=1Let’s de ne the ...

Informations

Publié par	Affe
Nombre de lectures	17
Langue	English

Extrait

0a20Xp = 1aa=1Let’s de ne the ..." />

1Prpt’ttnisaelnodeenuresotsehectlellnyruaessoynuelTurcrcelo!Maeedtis@ettocrut(ettoCSa).cwataot.uteibnifnioamroscit12I5Al6.rigomsth

Algorithms in bioinformatics (CSI 5126)

School of Information Technology and Engineering University of Ottawa Canada

October 16, 2009

Marcel Turcotte (turcotte@site.uottawa.ca)

MaelrcrcTuettorut(ttocis@escitamrofnioibni

General case,

v?= max[v(i,j) :i≤n,j≤m]

0,  v(i,j) = maxvv((ii,j−−1,1j+))+dd,, (i−1,j−1) +s(S1(i),S2(j)). v

⇒Smith & Waterman (1981) Biol.J. Mol.147:195-197.

wheredis the cost for a deletion/insertion, whiles(a,b) is the substitution score. Solution,

v(i,0) = 0,i∈0..n

Base conditions,

Smith & Waterman

v(0,j) = 0,j∈0..m

)aSCawc.toatetu.thmsgori6.AlI512

Mcitamrof

IRecall that a sequence alignment should answer the question: “are the two sequences (evolutionary) related?” Ithe observed sequence alignment the resultIn other words, is of: 1.an evolutionary process, where both sequences have evolved independtly from a common ancestry, or 2.can it be attributable to chance alone; randomly selecting two unrelated sequences could produce a similar alignment score.

Probabilistic Framework

sotrc@stee.itttuoecraruTlttocut(elgorithmsinbioinwa.aacC)IS1562A.

rcMa

pa>0 20 Xpa= 1

a=1 Let’s deﬁne the probability of a sequenceS(1)S(2). . .S(n) as,

Protein sequence probabilities

It’s useful to consider a “simple” probabilistic model of a protein sequence, givenpa, the probability of observing the amino acida, such that,

n pS(1)pS(2). . .pS(n)=YpS(i) i=1

timacsfnroibiosmniirhtAlgo126.CSI5.ca)awattou.etis@ettcour(tteotrcTuel

probabilitiesocttTlruraecMite.te@srcote(tu

> GetAaFrequency(DB); Alanine 7.62 % Arginine 5.19 % Asparagine 4.40 % Aspartic acid 5.27 % Cysteine 1.64 % Glutamine 3.94 % Glutamic acid 6.40 % Glycine 6.87 % Histidine 2.24 % Isoleucine 5.84 % Leucine 9.47 % Lysine 5.96 % Methionine 2.38 % Phenylalanine 4.10 % Proline 4.91 % Serine 7.09 % Threonine 5.64 % Tryptophan 1.23 % Tyrosine 3.18 % Valine 6.62 %

rmfoinio

A common practice consists of estimating the amino acid probabilities using the observed frequencies in a large database. Here are the amino acid frequencies observed for the database Swiss-Prot version 39.

sicatglro62A.isbntimhawa.uottSI51ca)CAminoacids

Probabilistic Interpretation of a Sequence Alignment

S1(1)S1(2). . .S1(n) S2(1)S2(2). . .S2(n)

Consider twoalignedsequences,S1andS2 simplicity,. For ungaped alignments are considered.

The interpretation requires weighting two issues. 1.Sequences are related (Match Model – M); 2.Sequences are unrelated (Random Model – R).

formaticsinbioinwa.aacC)ti.eouttlgorithmSI5126.AecraruTlMotrc@stettcotue(

oibnofnitamrsci

In the match model, we have,

P(S1,S2|M) =Yq(S1(i),S2(i)) i

Match model

whereq(a,b) represents the probability that both residuesaandb have both been derived independently from an ancestral residuec.

courlTcearMtis@ettocrut(ett)CSIa.cattawe.uomhisrotiA.gl1562

nformatics

Random model

Whilst the random model is simply, P(S1,S2|R) =YpS1(i)YpS2(j) i j

but since we assumed that|S1|=|S2|, P(S1,S2|R) =YpS1(i)pS2(i) i

lAogirhtsmniibioa).cwata6.12I5CS@ettocrutou.etiselTuMarcte(trcot

ottate.ua)CSwa.ct(rutoet@eisocttMarcTuelrc

The ratio of the two likelihoods is called anodds-ratio(or likelihood-ratio),

s(a,b) = log(qp(aa,bp)) b represents the log-likelihood ratio that the residue pair (a,b) will occur as an aligned pair, as opposed to unaligned. In the case of proteinss(a,b) represents a 20×20 matrix, known asscore matrixorsubstitution matrix.

where each,

taking the logarithm leads to a quantity known as thelog-odds ratio,

S(S1,S2) =Xlog(q(S1(i),S2(i))) ipS1(i)pS2(i)

P(S1,S2|M)Yq(pSS11((ii)),pSS2(2(ii))) = P(S1,S2|R)i

amroscitibnifniorigomsth12I5Al6.

oiniofmrtimhisbn

I I I I I I

atics

In this view, the total score of alignment is the sum of all the terms for the aligned pairs of residues and gaps. The score is interpreted as the logarithm of the relative likelihood that the sequences are related vs not related. Positive terms represent substitutions are more likely than would be expected by chance. Negative terms represent unfavorable substitutions. Finally, when the two hypotheses are equally likely the log-likelihood ratio will be zero. We see that such substitution matrix can be used for calculating local sequence alignments, since likely alignments will have a positive score and unlikely alignment will have a negative score. Additive scoring scheme means that positions along the sequence are considered independent from one another, i.e. mutations at diﬀerent sites have occurred independently. It’s aworking hypothesis.

raecMrcote(tucottlTur.awattou.etis@etorlg.A2651SI)Cca

What about the Substitution Scores?

ICertain amino acids have similar properties (structure, volume, charge, hydrophobicity, etc.); ILooking at the genetic code, you can see that certain pairs of amino acids are such that the minimum number of mutations at the codon level to change the encoding from one amino acid type to another is only one (Ala and Asp, GCC and GAC), there are pairs that need a minimum of two mutations (Ala and Arg, CGA and GCA) or even three (Asn and Trp, AAC or AAU and UGG); IThe substitution score is expected to reﬂect both of these eﬀects.

The substitution scores that we used were rather arbitrary, either the identity matrix or some hand made matrix. Let’s have a look at scoring schemes that are appropriate for protein sequences.

maorcstiMa)aSCawc..6lA5I21thmsgorioinfinbicruTlecrrut(ettosie@ttcotaot.ute

Univers
Ebooks
Livres audio
Presse
Podcasts
BD
Documents

Algorithms in bioinformatics (CSI 5126)=1Please don't print these lecture notes unless you really need

YouScribe

Le catalogue

Le service

Les conditions