UNIVERSITA' DEGLI STUDI DI ROMA 'TOR VERGATA'
FACOLTA' DI SCIENZE MATEMATICHE FISICHE E NATURALI

Statistical Mechanics of Unzipping:
Bayesian Inference of DNA Sequence

PhD Thesis in Physics

Candidate: Valentina Baldazzi
Supervisor: Prof. Luca Biferale
Co-supervisors: Simona Cocco (ENS, Paris), Hugues Dreysse (ULP, Strasbourg)
PhD Coordinator: Prof. Piergiorgio Picozza
Academic Year 2004-2005

Published: Wednesday 20 June 2012
Source: scd-theses.u-strasbg.fr
Number of pages: 111

Contents
Introduction
1 Introduction to DNA
1.1 Chemical Properties
1.1.1 Double helix
1.2 DNA mechanics
1.2.1 Single molecule experiments
1.2.2 Stretching DNA
1.3 Strand Separation
1.3.1 DNA denaturation
1.3.2 DNA replication
1.3.3 DNA sequencing
1.3.4 Single Molecule Studies
2 Theoretical models for DNA elasticity and unzipping
2.1 Modelling DNA elasticity
2.1.1 Freely Jointed Chain
2.1.2 The Kratky-Porod model
2.1.3 Worm Like Chain
2.2 Models of DNA unzipping
2.2.1 Static model
2.2.2 Dynamical models
2.3 Monte Carlo procedure
2.3.1 Constant force MC
2.3.2 Constant velocity MC
3 Sequence reconstruction
3.1 Bayesian Inference
3.2 Sequence Inference: the ideal case
3.2.1 Constructing P(x|S)
3.2.2 Normalisation check
3.2.3 Optimisation
3.3 Numerical results
3.3.1 Reconstruction program
3.3.2 Quality indicators
3.3.3 Single unzipping
3.3.4 Repeated unzippings
3.3.5 Finite temperature analysis
4 Analytical study of inference performances
4.1 High force theory
4.1.1 A simple approximation: no stacking interaction
4.1.2 The case of stacking interactions
4.2 Finite force theory
4.3 Numerical check
4.3.1 High force
4.3.2 Finite force
5 Towards 'real' data
5.1 Spatial and temporal resolution limits
5.2 Multi-steps inference
5.3 Numerical implementation
5.3.1 Finite temporal resolution generator
5.3.2 Multi-steps reconstruction algorithm
5.4 Numerical results
5.4.1 Preliminary study: jump probability distribution
5.4.2 Single unzipping
5.4.3 Repeated unzippings
List of Figures
List of Tables
Bibliography
Introduction
DNA molecules are the support of genetic information. Specific sequences, called genes,
code for proteins that perform most life functions and even make up the majority of cellular
structures.
When genes are altered, the encoded proteins may be unable to carry out their normal functions
and genetic disorders can result. It has been shown that almost all diseases have a genetic
component, whether inherited or resulting from the body's response to environmental stresses
such as viruses or toxins. In some cases, such as cystic fibrosis or haemophilia, the disease results from
the mutation of a single gene, whereas in other cases, as with cholesterol, small genetic
variations develop into an actual disease only in combination with external stimuli.
Knowledge of the DNA sequence therefore becomes of central importance, both as a diagnostic
and as a therapeutic tool. Over the last fifteen years, large efforts have been made to sequence
genomes, in particular the human one.
The ambitious goal proposed by the Human Genome Project has attracted the attention
of several groups all around the world, thus providing the right incentive for large improvements
in the understanding of biological processes and in the design of better technical devices. Renewed interest
has been devoted to understanding the function of each gene and the role played by
faulty ones in disease causation.
Currently, gene tests are available to detect mutated sequences. Some tests are used to clarify a
diagnosis and direct a physician towards appropriate treatments, while others allow families to
identify people at high risk for conditions that may be preventable.
Genes themselves can be used to treat diseases. Specific DNA sequences, coding for known
genes, can be introduced into cells in order to replace or supplement a defective gene, or to induce
the secretion of a protein that has a putative therapeutic function.
In principle, any disease may be a candidate for gene therapy. For the moment, research and
clinical trials have mainly addressed inherited diseases, such as cystic fibrosis [2] or haemophilia
[3], and cancer, with the aim of destroying cancerous cells or stopping tumor growth by suppressing
their proliferation [4]. Genes can also induce the regeneration of damaged tissues, reduce
rejection in organ transplantation [5], or offer new treatments for AIDS [6], alone or in conjunction
with conventional drugs.
The large interest raised by genomics has been accompanied by a parallel improvement
in methods for DNA sequencing and gene expression analysis. The Human Genome Project
itself would not have been possible without huge technological efforts. It was clear, in fact, that
its ambitious targets could not be achieved manually, and that large improvements in DNA
sequencing performance and automated procedures were necessary. For the first time, experts
in engineering, physics, chemistry and computer science were brought into close contact: the
existing DNA sequencing technology was improved and alternative approaches were conceived.
The traditional strategy is based on the so-called Sanger method: the DNA molecule is divided
into fragments (about 500 base pairs) and, for each one, a set of copies of different sizes is
synthesised. Each replica has a common extremity and a base-specific fluorescent label on the other
end. The entire population is separated by length using gel electrophoresis and the sequence is
reconstructed. This method is now fully automated and correctly predicts 99.9% of the bases.
Nevertheless, the quest for alternative (faster or cheaper) sequencing methods is still an active
field of research.
Recently, various single molecule experiments have been carried out, allowing a direct
investigation of DNA mechanics and protein-DNA interaction. In contrast to more traditional ones,
these new experiments give access to dynamical information usually hidden by ensemble
averaging, such as intermediate metastable states or fluctuations at the scale of the individual
molecule. Remarkably, sequence content strongly affects kinetics. Signatures of sequence
dependence have been found in several biological processes, among which are the digestion of a DNA
molecule by an exonuclease [7, 8], translocation through nanopores [9, 11], DNA
polymerization [12] and mechanical unzipping [13, 14].
Whether such experiments can be used as an alternative sequencing method is the subject
of an open debate, to which this work contributes.
If successfully implemented, single molecule approaches could bring enormous advantages
over current sequencing methods. Compared with standard strategies, the possibility of working
on a single molecule would eliminate laborious cloning and replication steps. Moreover,
electrophoresis would be avoided, and longer read lengths and faster analysis could be achieved.
This work focuses on mechanical unzipping experiments, in which the two complementary
strands of a DNA molecule are pulled apart by the application of a force. The first experiment
was performed by Bockelmann et al. in 1997, using a glass microneedle. The molecule was
opened at constant velocity while the force necessary for the opening was measured.
With a more recent experimental apparatus, the same group showed that the force signal is correlated
with the average sequence on the scale of ten base pairs. Remarkably, the force signal is affected
by the substitution of a single base pair, when suitably located along the sequence [13]. In
2003 Danilowicz et al. performed an analogous experiment using a constant force setup [14].
The distance between the extremities of the two strands as a function of time is characterised
by rapid increments followed by long pauses, during which the unzipped length remains constant.
Several repetitions have shown that the position and duration of these metastable states are largely
reproducible, thus providing a sort of 'fingerprint' of the sequence.
The theoretical description of DNA mechanical unzipping, both at constant opening
velocity and at constant force, has been extensively developed in previous works [15, 16, 17, 18].
Models have been able to reproduce the experimental signal given the DNA sequence. It is
natural to ask whether one could, inversely, obtain information on the sequence from experimental
data. This is the question we address in this work.
We propose a method based on Bayesian inference: the conditional probability of a sequence,
given an observed experimental signal, is computed and optimised. The sequence that maximises
this probability represents our prediction. The confidence of the reconstruction is studied as a
function of different parameters, such as the applied force, the number of opening experiments
performed and the temporal resolution. The 'experimental' data are generated numerically by a
Monte Carlo procedure in a constant force setup and then passed to a second algorithm that, with
no knowledge of the DNA sequence, tries to recover it.
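The inference step described above can be written compactly. The following is a minimal sketch in generic notation, not necessarily the notation used later in the thesis: S denotes a candidate sequence, x the observed signal, P_0(S) an assumed prior over sequences and S^* the predicted sequence.

```latex
P(S \mid x) \;=\; \frac{P(x \mid S)\, P_0(S)}{\sum_{S'} P(x \mid S')\, P_0(S')},
\qquad
S^{*} \;=\; \operatorname*{arg\,max}_{S}\; P(S \mid x)
```

If the prior is taken to be uniform over sequences, maximising the posterior P(S | x) is equivalent to maximising the likelihood P(x | S).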
The first part of the thesis is devoted to a detailed introduction to the chemical and
mechanical properties of DNA, in both its single-stranded and double-stranded forms. Unzipping
experiments are then introduced, and different setups are carefully described and discussed.
In the second chapter, we introduce some standard polymer models that are commonly used to
describe the elasticity of DNA molecules and to interpret results from single molecule
experiments. Mechanical unzipping can be understood by combining the effects of DNA elasticity and
base pairing interaction. The second half of the chapter therefore deals with the theoretical
description of unzipping, both static, i.e. the average opening signal, and dynamical. We
finally present a numerical implementation of the dynamical model via a Monte Carlo
procedure, for both a constant velocity and a constant force device.
The last three chapters constitute the original part of the thesis: they are devoted to the
description of the reconstruction model, its implementation and the analysis of numerical results.
We first assume infinite spatial and temporal resolution and discuss the effects of thermal
noise on the reliability of the prediction. In this case the opening dynamics is perfectly
detected, but two different experiments do not give the same result because of thermal
fluctuations. Kinetics and sequence are not uniquely related, and this uncertainty intrinsically affects
the confidence of the prediction. A theoretical study of the reconstruction reliability is also carried
out.
In chapter 5, a more realistic perspective is adopted: actual unzipping experiments obviously
do not have perfect resolution, and the information available for sequencing is reduced. The
main limitations on the temporal and spatial sensitivity are discussed and some promising
advances are reported. We show that, from our point of view, the temporal resolution represents
the major difficulty, and we propose a modified reconstruction model that tries to take it into
account. The quality of the prediction is studied as a function of different parameters, in order
to find out whether, and under which conditions, the original quality can be recovered.
Chapter 1
Introduction to DNA
This chapter is devoted to an introduction to DNA, at different levels of description. In the
first part the molecule is described in atomic detail, starting from its elementary units, the
nucleotides. The chemistry behind the double helix structure is explained, and the stabilising factors,
the role of the solution and the observed alternative structures are discussed. General references for this
section can be found in [19, 20].
In the second section, a more physical point of view is adopted: DNA is regarded as a
long polymer and its mechanical behaviour under stretching and torque is studied. Thanks
to revolutionary new techniques, scientists have been able to manipulate single DNA
molecules directly, avoiding bulk averages. The elastic properties and mechanical parameters of an
individual polymer have been measured, shedding light on mechanisms of clear biological
importance, such as protein-DNA interaction or gene regulation.
The third and last part is devoted to the study of strand separation, a central process in DNA
replication and repair. Biochemical studies have revealed important details of this mechanism,
and recently a single molecule approach has also been applied. Several experiments have shown
a marked sequence dependence, and the typical forces necessary to separate the two strands have been
estimated to be in the piconewton (pN) range.
1.1 Chemical Properties
1.1.1 Double helix
Watson-Crick structure
DNA is a very long macromolecule made up of deoxyribonucleotides, in which the sugar and the
phosphate groups play a structural role, whereas the bases carry the genetic information.
The backbone of DNA is constant throughout the molecule: the 3'-hydroxyl of a sugar is
joined to the 5'-hydroxyl of the adjacent one through a phosphate group, via successive
dehydration synthesis reactions.
The variable part of DNA consists instead of the bases, bonded to the sugars. There are
four different types of bases: two purines, adenine (A) and guanine (G), and two pyrimidines,
thymine (T) and cytosine (C). The order of bases along the polynucleotide is not restricted in
any way, and the precise sequence contains the genetic information of the cell.
Figure 1.1: Chemical and geometrical structure of a DNA molecule: the backbone of the double
helix is made up of sugar and phosphate groups whereas the bases carry the genetic information.
As a consequence of the backbone's structure, a DNA chain has a polarity. One end of the chain has
a 5'-OH group and the other a 3'-OH group not linked to another nucleotide. By convention the
base sequence is written in the 5' → 3' direction, so, for example, ACG means that the 5'-OH
group is on the adenine and the 3'-OH group is on the guanine.
The three-dimensional structure of DNA is peculiar and was first deduced in 1953 by
James Watson and Francis Crick, from the analysis of X-ray diffraction photographs of DNA
fibers. DNA is a double helix: two polynucleotide chains, running in opposite directions, are
wound around a common axis in a right-handed way. The bases occupy the core of the helix
whereas the backbone winds around the outside, forming major and minor grooves that give
access to the base pairs for interactions. The diameter of the helix is 20 Å and the planes of the
bases are nearly perpendicular to the helix axis. Adjacent bases are separated by 3.4 Å along
the helix, so the helical structure repeats after 10 residues.
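As a quick consistency check on the figures quoted above, the rise per base and the repeat of 10 residues fix the helical pitch and the rotation per base pair. The symbols p and θ are introduced here only for illustration.

```latex
p \;=\; 10 \times 3.4\ \text{\AA} \;=\; 34\ \text{\AA},
\qquad
\theta \;=\; \frac{360^{\circ}}{10} \;=\; 36^{\circ}\ \text{per base pair}
```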
The two chains are held together by hydrogen bonds between pairs of bases. The regular helical
structure imposes rigid steric conditions on the pairing of bases, fixing the distance between
the glycosidic bonds on the two strands at 10.85 Å. Watson and Crick found that a
purine-pyrimidine pair fitted perfectly in this space, whereas the space is too small for two purines and too
large for two pyrimidines. Moreover, the formation of hydrogen bonds requires hydrogen atoms
to be in defined positions. Only two base pairings can satisfy all constraints: adenine
with thymine and guanine with cytosine. The orientations of these bases and the distances between them
permit a strong interaction without any loss in the symmetry of the structure. The two
base pairs are not energetically equivalent: cytosine interacts with guanine via three hydrogen
bonds, while adenine and thymine form just two hydrogen bonds.
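The pairing rules and the 5' → 3' writing convention described above can be illustrated with a short Python sketch. The function names are hypothetical and chosen only for this example. The code builds the complementary strand, also written 5' → 3', and counts the hydrogen bonds holding the two strands together, using only the two facts stated in the text (A pairs with T via two bonds, G pairs with C via three).

```python
# Watson-Crick pairing as stated in the text: A-T (2 H-bonds), G-C (3 H-bonds).
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}
HBONDS_PER_BASE = {"A": 2, "T": 2, "G": 3, "C": 3}


def complementary_strand(seq: str) -> str:
    """Return the complementary strand, written 5'->3' like the input.

    The two strands run in opposite directions, so the complement is read
    in reverse order.
    """
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def hydrogen_bonds(seq: str) -> int:
    """Count the hydrogen bonds between seq and its complementary strand."""
    return sum(HBONDS_PER_BASE[base] for base in seq)


if __name__ == "__main__":
    print(complementary_strand("ACG"))  # CGT
    print(hydrogen_bonds("ACG"))        # 2 + 3 + 3 = 8
```

The bond count is only a tally of pairing bonds; it is not a measure of the stability of the duplex.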
Despite its strength, base pairing alone cannot account for the observed DNA stability. In
water, in fact, a broken hydrogen bond between two bases could always be replaced by a
compensating H-bond with the solvent, which, at no energetic cost, would soon disrupt the
helical structure. Other factors must therefore contribute to DNA stabilisation.
A first important contribution comes from stacking interactions between adjacent bases on
the same strand. Investigations have shown that bases have an intrinsic tendency to stack
together, which is enhanced by an aqueous solvent. It is now thought that stacking interactions are
a form of Van der Waals interaction, driven by hydrophobic forces. It is important to remark
that stacking interactions are sequence-dependent: different pairs of adjacent bases have distinct stacking
energies, so that, for example, AT stacks differently from TA. A complete overview of stacking
energy values was recently compiled by SantaLucia [21], collecting results from different works and
techniques. The complete stacking matrix is shown in table 1.1; the complementarity between
strands leads to 10 independent values.
i \ i+1      A       T       C       G
A          -1.00   -0.88   -1.44   -1.28
T          -0.58   -1.00   -1.30   -1.45
C          -1.45   -1.28   -1.84   -2.17
G          -1.30   -1.44   -2.24   -1.84

Table 1.1: SantaLucia's unified stacking matrix. Rows represent the i-th base and columns the
(i+1)-th one. Values, expressed in kcal/mol, refer to a temperature of 37 °C and to a sodium
concentration of 1 M.
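The structure of table 1.1 can be checked numerically. The sketch below stores the matrix as a Python dictionary, verifies that each entry equals the entry for the same step read on the opposite strand, and counts the independent values. The function names are hypothetical. Summing the entries over consecutive bases is an illustrative assumption, not the full free-energy model used in the thesis.

```python
# Stacking energies in kcal/mol, copied from table 1.1
# (outer key: base i, inner key: base i+1, both read 5'->3').
STACKING = {
    "A": {"A": -1.00, "T": -0.88, "C": -1.44, "G": -1.28},
    "T": {"A": -0.58, "T": -1.00, "C": -1.30, "G": -1.45},
    "C": {"A": -1.45, "T": -1.28, "C": -1.84, "G": -2.17},
    "G": {"A": -1.30, "T": -1.44, "C": -2.24, "G": -1.84},
}
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}
BASES = "ATCG"


def stacking_energy(seq: str) -> float:
    """Sum the table entries over consecutive bases (illustrative assumption)."""
    return sum(STACKING[a][b] for a, b in zip(seq, seq[1:]))


def is_strand_symmetric() -> bool:
    """Check that step xy has the same value as its reverse complement."""
    return all(
        STACKING[x][y] == STACKING[COMPLEMENT[y]][COMPLEMENT[x]]
        for x in BASES
        for y in BASES
    )


def independent_steps() -> set:
    """Keep one representative per pair (step, reverse-complement step)."""
    representatives = set()
    for x in BASES:
        for y in BASES:
            partner = COMPLEMENT[y] + COMPLEMENT[x]
            representatives.add(min(x + y, partner))
    return representatives


if __name__ == "__main__":
    print(round(stacking_energy("ACG"), 2))  # AC + CG = -1.44 - 2.17 = -3.61
    print(is_strand_symmetric())             # True
    print(len(independent_steps()))          # 10
```

The four self-complementary steps (AT, TA, CG, GC) each stand alone, while the remaining twelve entries form six equal pairs, which gives the 10 independent values quoted in the text.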
Helix stabilisation is hindered by the electrostatic repulsion between the charged phosphate
groups in the backbone. In this context, solvent conditions can play an important role: the
presence of positive ions in solution can shield the electrostatic field, increasing the stability
of the structure. Experiments have shown that DNA stability increases with the Na⁺
concentration. Moreover, divalent ions such as Mg²⁺, Mn²⁺ and Co²⁺ bind specifically
to phosphate groups and are more effective shielding agents than monovalent ones. A
single Mg²⁺ ion has an effect comparable to that of 100 to 1000 sodium ions.
Alternative DNA structures
Many different helical geometries can be built around Watson-Crick base pairing: DNA can assume
different forms depending on solvent composition and base sequence. The structure proposed
by Watson and Crick (usually called the B form) is the standard conformation adopted by
DNA at low salt concentration. It corresponds to the form present in normal aqueous solution