Statistical consultation with R
Why is the Muto & Osawa (1987) plot not good?
J.R. Lobry
14 March 2006
Professor Alexander N. Gorban asked me by e-mail on Sun, 23 Oct
2005 14:00:17 +0100 for the following legitimate clarification:
About your third remark: "As pointed by Sueoka (in 1988
IIRC) the Muto & Osawa representation is very bad because
the x-axis and y-axis variables are not independent..."
From my point of view, here we meet one of the standard
misunderstanding of the "independency" notion. Because
we touch this point third time, could you, please, give the
definition of independency that you with Sueoka use here.
In the standard sense all the variables we are talking about
are not independent.
I will try here to make this important point clear in a reproducible
way.
Table of contents
1 Muto & Osawa (1987) plot
2 Sueoka’s (1988) criticism
3 Simulation
3.1 G+C content in coding sequences
3.2 G+C content in intergenic spaces
3.3 CDS versus intergenic spaces
3.4 Genome G+C content
4 The part/whole problem
5 Comments by Professor Noboru Sueoka
6 Comments by Professor Alexander N. Gorban
6.1 First attempt for a definition
6.2 Second attempt for a definition
6.3 a priori information
6.4 External argument for using G+C content
7 Extra e-mail
References
R software version 2.2.0 (2005-10-06), qre, compiled on 2006-03-14
Maintenance: S. Penel, URL: http://pbil.univ-lyon1.fr/R/querep/qre.pdf
1 Muto & Osawa (1987) plot
The following figure is a screen copy of figure 2 from [Muto and Osawa,
1987].
[Figure not reproduced in this extract.]
This is a very famous plot in the field of molecular evolution. According
to the Science Citation Index (31-OCT-2005), the paper itself has been cited
241 times, which is a very high citation rate for this field. Moreover, the figure
has been reproduced, or adapted, on page 222 of [Li and Graur, 1991] and on page
1414 of [Graur and Li, 2000].
Note 1: To be completed; the figure has surely been reproduced in many more places.
2 Sueoka’s (1988) criticism
One year later [Sueoka, 1988], Noboru Sueoka wrote:
[Quoted passage not reproduced in this extract.]
So the question is why total G+C is not an ideal variable.
3 Simulation
First of all, let’s fix the seed of the random number generator used in R,
just to allow for reproducibility:
set.seed(1071966)
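As a small illustration of what fixing the seed buys, the sketch below draws the same three uniform numbers twice. The names a and b are hypothetical and are not part of the original analysis.

```r
# Re-seeding the generator restarts the same pseudo-random sequence
set.seed(1071966)
a <- runif(3)
set.seed(1071966)
b <- runif(3)
# TRUE when both draws are exactly identical
print(identical(a, b))
```

Any reader who runs the commands of this document in the same order should therefore obtain the same simulated values, provided the random number generator behaves as in the R version used here.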
Let n denote the total number of species under study:
(n <- 500)
[1] 500
So, in the following simulations we have 500 species.
3.1 G+C content in coding sequences
Now, suppose that cds denotes the G+C content in coding sequences. We
draw it here at random from a beta distribution:
cds <- rbeta(n = n, shape1 = 2, shape2 = 2)
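For reference, a beta distribution with both shape parameters equal to 2 is symmetric on (0, 1), with mean 1/2 and variance 1/20. The sketch below compares these theoretical moments with a fresh sample; the names a, b, x, beta_mean and beta_var are introduced here for illustration only.

```r
# Shape parameters of the beta distribution used in the simulation
a <- 2
b <- 2
# Theoretical mean and variance of a Beta(a, b) distribution
beta_mean <- a / (a + b)
beta_var <- (a * b) / ((a + b)^2 * (a + b + 1))
# Illustrative sample of 500 values (hypothetical vector x)
set.seed(1071966)
x <- rbeta(n = 500, shape1 = a, shape2 = b)
print(c(theoretical_mean = beta_mean, sample_mean = mean(x)))
print(c(theoretical_var = beta_var, sample_var = var(x)))
```

The histogram call that follows shows the same shape graphically.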
hist(cds, col = grey(0.8), xlab = "G+C content", main = paste("Distribution of G+C content\n",
"in the coding sequence of", n, "species"))
3.2 G+C content in intergenic spaces
Now, suppose that itg denotes the G+C content in intergenic spaces. We
again draw it at random from a beta distribution, independently of the
previous sampling for the coding sequences:
itg <- rbeta(n = n, shape1 = 2, shape2 = 2)
hist(itg, col = grey(0.8), xlab = "G+C content", main = paste("Distribution of G+C content\n",
"in the intergenic spaces of", n, "species"))
3.3 CDS versus intergenic spaces
Let’s check that the G+C content in CDS is independent of the G+C
content in intergenic spaces:
plot(x = cds, y = itg, xlab = "G+C content in coding sequences",
ylab = "G+C content in intergenic spaces", las = 1, main = "G+C content in CDS and intergenic spaces")
This seems to be OK; at least the random number generator is not too bad
here. The G+C content in CDS is independent of the G+C content in intergenic
spaces: knowing one does not help much to predict the other. If
this were genuine data, we would perhaps do something like this:
cor.test(itg, cds)
        Pearson's product-moment correlation
data:  itg and cds
t = 1.9357, df = 498, p-value = 0.05347
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.001283992  0.172797471
sample estimates:
       cor
0.08641633
to conclude that, at a critical level of 5%, the experimental data do not allow
us to reject the null hypothesis that the linear correlation coefficient is equal to zero.
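The test statistic reported above can be recomputed by hand from the correlation coefficient and the sample size, using the usual relation t = r * sqrt(n - 2) / sqrt(1 - r^2) for Pearson's correlation test. A minimal sketch, taking r and n from the output above (the names r, df, t_stat and p_value are introduced here):

```r
# Values copied from the cor.test() output above
r <- 0.08641633
n <- 500
df <- n - 2
# t statistic of the test of zero linear correlation
t_stat <- r * sqrt(df) / sqrt(1 - r^2)
# Two-sided p-value from Student's t distribution with df degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df = df)
print(round(t_stat, 4))
print(signif(p_value, 4))
```

Both values should agree with the reported t = 1.9357 and p-value = 0.05347, up to the rounding of r.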
3.4 Genome G+C content
In bacterial chromosomes, about 80% of space is devoted to coding sequences
and the remaining 20% to intergenic spaces. Let gen denote the genome G+C
content in our simulation:
gen <- 0.8 * cds + 0.2 * itg
hist(gen, col = grey(0.8), xlab = "G+C content", xlim = c(0, 1),
main = paste("Distribution of G+C content\n", "in the genome of",
n, "species"))
Now, let’s use the genome G+C content as a predictive variable, as in Muto
& Osawa plots :
plot(x = gen, y = cds, xlab = "Genome G+C content", ylab = "G+C content in CDS",
las = 1, main = "CDS versus genome G+C content")
What a nice correlation! Let’s quantify this :
cor(gen, cds)
[1] 0.9714057
cor(gen, cds)^2
[1] 0.943629
This means that 94.4% of the variability in CDS G+C content is accounted for by
the variability in genome G+C content. Is it significant?
cor.test(gen, cds)
        Pearson's product-moment correlation
data:  gen and cds
t = 91.3035, df = 498, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.9660022 0.9759609
sample estimates:
      cor
0.9714057
Yes, at a critical level of 5%, experimental data are clearly in contradiction
with the null hypothesis that the linear correlation coefficient is equal to zero.
If you consider the p-value here, we have a very highly significant result.
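The size of this correlation can be anticipated from the way gen is built. Here is a sketch of the calculation, assuming (as the simulation intends) that cds and itg are independent and share a common variance. The symbols X (for cds), Y (for itg), G (for gen), p and sigma are notation introduced here, not taken from the original text.

```latex
\begin{align*}
G &= pX + (1-p)Y, \qquad p = 0.8,\\
\operatorname{Cov}(G, X) &= p\,\sigma^2,\\
\operatorname{Var}(G) &= \bigl(p^2 + (1-p)^2\bigr)\,\sigma^2,\\
\rho(G, X) &= \frac{p}{\sqrt{p^2 + (1-p)^2}} = \frac{0.8}{\sqrt{0.68}} \approx 0.970,\\
\rho(G, X)^2 &= \frac{0.64}{0.68} \approx 0.941 .
\end{align*}
```

These expected values are close to the simulated 0.971 and 0.944. The small gap is consistent with sampling fluctuation in a finite sample of 500 species.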
The result is statistically significant but biologically meaningless: all it
shows is that CDS contribute heavily to the genomic G+C content.
More systematically, we can explore the effect of the proportion of CDS on the
squared linear correlation coefficient:
npoints <- 200
props <- seq(from = 0, to = 1, length.out = npoints)
# Squared correlation between the simulated genome G+C content and cds,
# for each proportion x of CDS in the genome
rs <- sapply(props, function(x) cor(x * cds + (1 - x) * itg, cds)^2)
plot(props, rs, xlab = "Proportion of CDS in genomes", las = 1,
     ylab = expression(r^2), type = "l", lwd = 2,
     main = expression(paste("Influence of the proportion of CDS on ", r^2)))
And, just for fun, here is the effect of the proportion of CDS on the p-value
of the test of the null hypothesis that the linear correlation coefficient is
equal to zero:
props <- seq(from = 0, to = 1, length.out = npoints)
# p-value of the correlation test for each proportion x of CDS in the genome
pvals <- sapply(props, function(x) cor.test(x * cds + (1 - x) * itg, cds)$p.value)
plot(props, pvals, xlab = "Proportion of CDS in genomes", las = 1,
     ylab = "p-value", type = "l", lwd = 2,
     main = "Influence of the proportion of CDS on p-values")
# Reference line at the 5% critical level
abline(h = 0.05, col = "red")
text(0.5, 0.07, expression(alpha == 0.05), col = "red")
This means that, with our simulated dataset, as soon as the proportion
of CDS in genomes is greater than about 10%, we have to reject the null
hypothesis. The two variables are not independent, but only by construction.
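Rather than reading the threshold off the plot, it can be extracted from the grid of p-values. The sketch below is self-contained: it regenerates cds and itg with the same seed and the same sequence of calls as above, which should reproduce the simulated data only if the random number generator behaves as in the R version used for this document. The names alpha, idx and last_not_rejected are introduced here.

```r
# Regenerate the simulated data (same seed and call order as in the text)
set.seed(1071966)
n <- 500
cds <- rbeta(n = n, shape1 = 2, shape2 = 2)
itg <- rbeta(n = n, shape1 = 2, shape2 = 2)
# p-value of the correlation test for each proportion of CDS on a grid
props <- seq(from = 0, to = 1, length.out = 200)
pvals <- sapply(props, function(x) cor.test(x * cds + (1 - x) * itg, cds)$p.value)
# Largest grid proportion at which the null hypothesis is still not rejected
alpha <- 0.05
idx <- which(pvals >= alpha)
last_not_rejected <- if (length(idx) > 0) props[max(idx)] else NA_real_
print(last_not_rejected)
```

The value depends on the particular sample and on the resolution of the grid, so it is only an indication of where rejection starts.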
4 The part/whole problem
Turning back to the initial question, the problem is not related to the
definition of independency, but to what is known as the part/whole problem in
allometric studies. I have taken the following quote from Jim Moore’s site at
http://weber.ucsd.edu/~jmoore/courses/allometry/allometry.html:
Properly speaking, we are wrong to correlate brain weigh