Computational Challenges in Comparative Genomics
A Tutorial
BERNARD M.E. MORET
WITH WEBB C. MILLER, PAVEL A. PEVZNER, AND DAVID SANKOFF
1 Introduction
Comparative approaches have long been a mainstay of biology and medicine. In part, this is due to necessity: many organisms and systems are difficult to procure or to maintain in the laboratory, while ethical concerns have prevented experimentation with humans and are reducing experimentation with higher mammals. Thus, in particular, much of what we know about humans has been learned through animal models. More importantly, comparative approaches embody an evolutionary approach to biology, and that, ever since Darwin, is what has enabled biologists to make sense out of the extremely complex systems they study. The great pioneer Theodosius Dobzhansky famously wrote an essay entitled "Nothing in Biology Makes Sense Except in the Light of Evolution," in which he argues that the large amounts of data collected by field and bench biologists reveal their structure only through an analysis based on evolution. Since evolutionary processes can only be understood through the comparison of various products of these processes, comparative approaches must form the foundation of any biological research method. The case is even more compelling today than when Dobzhansky wrote his essay: with the advent of high-throughput instruments for molecular biology, and now for other aspects of biology and the life sciences, the amount of data collected has exploded and continues to grow at an exponential rate. The kind of meticulous craft used in the early study of, e.g., genomic sequences simply cannot keep up with the rate of data collection, nor have experimental validation methods kept up with high-throughput instruments. We are thus faced with the necessity of using computational methods to make sense of the massive amounts of data accumulating in genomic, proteomic, metabolomic, morphological, physiological, neurological, clinical, and other databases. These computational methods, be they data mining, machine learning, or combinatorial optimization, all rely on basic models derived from our knowledge of a few well-studied organisms, and thus all remain comparative, even when the comparison is not explicit: changing model parameters is tantamount, in most cases, to a quantification of differences between the system under study and the one used to derive the original model.
Comparative genomics is faced with what is, for now, the most daunting of these avalanches of data: genomic data, in the form of sequence data, accumulates at an exponential rate, doubling approximately every year and a half. (That rate, incidentally, even exceeds Moore's law, the observation made by Intel's co-founder G.E. Moore that the density of transistors on a commodity semiconductor chip doubles every two years, so that hardware capabilities are, in effect, falling farther and farther behind what may be needed to process the new data.) Whole-genome sequences are now routinely produced for bacteria and are becoming increasingly easy to produce for vertebrate genomes, but a full understanding of even a single genome, that is, of its structure, of how its parts interact and are controlled, and of the nature of the evolutionary processes that shape genomes, remains far away. In the case of the heavily studied human genome, for instance, we have some understanding of the coding genes (a bit under 2% of the entire genome by length) and a fair start on noncoding genes and other conserved elements (a bit under 3% of the entire genome), but we remain nearly clueless about the other 95%. Yet even reaching that apparently modest level has only been possible through comparative approaches.
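
To see how quickly those two doubling rates diverge, the short sketch below compares an exponential curve that doubles every 1.5 years (sequence data) with one that doubles every 2 years (transistor density). It is an illustration only: the doubling times are the ones quoted above, the starting quantities are normalized to 1, and only the ratio between the two curves is meaningful.

    # Sketch: data doubling every 1.5 years versus hardware doubling every 2 years.
    # Starting values are normalized to 1; only the relative gap is meaningful.

    DATA_DOUBLING_YEARS = 1.5      # doubling time quoted above for genomic sequence data
    HARDWARE_DOUBLING_YEARS = 2.0  # doubling time in Moore's observation

    def growth(doubling_time_years, years):
        """Relative growth factor after `years`, given a fixed doubling time."""
        return 2.0 ** (years / doubling_time_years)

    for years in (5, 10, 20):
        data = growth(DATA_DOUBLING_YEARS, years)
        hardware = growth(HARDWARE_DOUBLING_YEARS, years)
        # data / hardware is how far processing capacity lags behind the data.
        print(f"after {years:2d} years: data x{data:8.1f}, "
              f"hardware x{hardware:7.1f}, gap x{data / hardware:5.1f}")

After 20 years, the data have grown by a factor of roughly ten thousand while hardware has grown by a factor of about a thousand, so the gap itself has widened about tenfold.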
Comparative approaches are based on the identification of conserved patterns; the study of evolution is indeed just as much the study of conservation. In the case of comparative genomics, positive or negative selection helps conserve regions of the genome, while areas under neutral selection are free to vary and thus expected to diverge more rapidly. Thus the basis of comparative genomics is the identification and mapping of conserved regions. Since selection pressure for conservation is linked to function, the identification of regions conserved across a range of genomes leads naturally to a conjecture that these conserved regions play similar functional roles. (Such conjectures, at least for now, need to be verified experimentally.) The conservation of certain groups of genes forming similar pathways in several organisms leads us to conjecture that the pathways themselves may be conserved, in which case other related organisms should possess a similar group of genes; hence, if some, but not all, of these genes have been identified in a related organism, it is reasonable to conclude that the other genes are also present and to conduct a search targeted for these specific "missing" genes. This principle is widely used in the identification of genes in related species and can enrich both ends of a pairwise comparison. On the other hand, finding genes in one subgroup of organisms that appear to have no similarity to any genes in other related organisms can lead us to conjecture the occurrence of a lateral gene transfer (especially in bacteria) or gene duplication at some past time; finding these genes in some unrelated group of organisms can strengthen the conjecture of a lateral gene transfer. Since lateral gene transfer is thought to play a major role in the acquisition of drug resistance or virulence in pathogenic bacteria, identifying the event and the group of genes thus transferred is of crucial importance to human health; and since transferring useful groups of genes through artificial means can lead to improved crops, the same tools are very important in genetic engineering. Finally, the sequencing of eukaryotic genomes led to the discovery that these genomes include numerous duplicated regions, some very large and some apparently made of nested duplications. These duplications represent both a serious obstacle for genome sequencing and, more importantly, a chance to witness evolution in action: because copies may escape selection pressure, it is thought that gene duplication is the key to the development of novel gene functions.
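
Since the paragraph above presents the identification and mapping of conserved regions as the basis of comparative genomics, a minimal illustration may help. The sketch below is not the authors' method; it simply scans a pairwise alignment with a fixed-size window and reports windows whose percent identity reaches a threshold, with the window size, threshold, and toy sequences chosen arbitrarily for the example.

    # Minimal sketch: flag putatively conserved windows in a pairwise alignment.
    # Both aligned sequences must have equal length, with gaps written as '-'.
    # Window size and identity threshold are arbitrary illustration values.

    def conserved_windows(aln_a, aln_b, window=10, min_identity=0.8):
        """Yield (start, end, identity) for windows at or above the identity threshold."""
        assert len(aln_a) == len(aln_b), "aligned sequences must have equal length"
        for start in range(len(aln_a) - window + 1):
            a = aln_a[start:start + window]
            b = aln_b[start:start + window]
            # Count matching, non-gap columns in this window.
            matches = sum(1 for x, y in zip(a, b) if x == y and x != '-')
            identity = matches / window
            if identity >= min_identity:
                yield start, start + window, identity

    # Toy example: two short fragments aligned by hand.
    a = "ATGCCGTTAGC--GATTACAGGT"
    b = "ATGCCGTTAGCTTGATCACAGGT"
    for start, end, ident in conserved_windows(a, b, window=8, min_identity=0.9):
        print(f"conserved window {start}-{end}: {ident:.0%} identity")

Real analyses rely on alignment tools and statistical models rather than a raw identity cutoff, but the underlying idea of scoring local similarity is the same.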
Now, DNA sequences have been compared ever since the beginning of DNA sequencing, and genetic maps have been compared since the beginning of the 20th century, yet neither constitutes what is generally viewed today as comparative genomics. It is the availability of whole-genome sequences that has given rise to this area of research, which is distinguished both by its potential (since it addresses entire genomes, it can in principle identify patterns not present at local scales and thus elucidate complex mechanisms that involve many areas of the genome) and by its scale and complexity. The latter mandate the use of computational methods; the former provides the impetus that has caused this area to grow enormously in the last few years. For those researchers working in comparative genomics, the much-quoted phrase "postgenomic era" simply means that, now that we can get our hands on complete genomes for a variety of organisms and even, for simpler genomes, a variety of individuals, the task of understanding genomes can finally begin. Comparative genomics can be used even when the whole-genome sequence is not known in detail for each organism under study: for instance, in human genetics, one can use the "generic" whole-genome sequence of the human along with dense SNP array data for specific individuals to study patterns of change within the human species and how these patterns relate to phenotypic traits, especially those linked to inherited disorders or predispositions to certain diseases.
The primary goal of comparative genomics is thus to delimit regions of a genome and to tag each region with a label such as "under positive selection," "exon," "promoter region," etc., and to do so by using comparisons with other genomes. This is similar to working with several very large magazines describing closely related actions, written in the same alphabet but in somewhat different languages, languages that are for the most part not understood. These magazines also contain a very large amount of advertising and other nonspecific content that can vary quite a bit from one magazine to the other; we do not understand any of this content (the so-called "junk" DNA) and often have serious trouble telling it apart from the "text." From some limited understanding of a few text passages in each magazine, we attempt to identify punctuation marks, words, sentences, and eventually larger motifs (these pages have to do with processing sugars, those with controlling replication, etc.), all by comparing the texts back and forth, building models using statistics and combinatorics, and running optimization and machine-learning algorithms on the data. The models attempt to characterize the structure of words and the syntax of sentences (e.g., how a gene can be formed of exons and introns and surrounded by control elements), but also how these constructs change from one magazine to the other.
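
One way to picture the "delimit and label" goal described above is as a set of labeled intervals over genome coordinates. The sketch below is a hypothetical, deliberately simplified representation; real annotation formats carry many more fields, and the Region class, the labels_at helper, and the toy coordinates are all invented for illustration.

    # Hypothetical, simplified representation of labeled genomic regions:
    # each annotation is an interval on a sequence plus a label such as "exon".

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Region:
        seq_id: str   # chromosome or contig name
        start: int    # 0-based, inclusive
        end: int      # 0-based, exclusive
        label: str    # e.g. "exon", "promoter region", "under positive selection"

    def labels_at(regions: List[Region], seq_id: str, position: int) -> List[str]:
        """Return the labels of all regions covering the given position."""
        return [r.label for r in regions
                if r.seq_id == seq_id and r.start <= position < r.end]

    # Toy annotation of a made-up contig.
    annotations = [
        Region("contig1", 100, 180, "promoter region"),
        Region("contig1", 180, 420, "exon"),
        Region("contig1", 300, 900, "under positive selection"),
    ]
    print(labels_at(annotations, "contig1", 350))  # ['exon', 'under positive selection']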
In this tutorial, we focus on the computational challenges, that is, on the development of models and algorithms for the analysis ...
