La lecture à portée de main
Découvre YouScribe en t'inscrivant gratuitement
Je m'inscrisDécouvre YouScribe en t'inscrivant gratuitement
Je m'inscrisDescription
Sujets
Informations
Publié par | Thesee |
Nombre de lectures | 88 |
Langue | English |
Extrait
AVERTISSEMENT
Ce document est le fruit d'un long travail approuvé par le
jury de soutenance et mis à disposition de l'ensemble de la
communauté universitaire élargie.
Il est soumis à la propriété intellectuelle de l'auteur. Ceci
implique une obligation de citation et de référencement lors
de l’utilisation de ce document.
Toute contrefaçon, plagiat, reproduction illicite encourt une
poursuite pénale.
➢ Contact SCD Nancy 1 : theses.sciences@scd.uhp-nancy.fr
LIENS
Code de la Propriété Intellectuelle. articles L 122. 4
Code de la Propriété Intellectuelle. articles L 335.2- L 335.10
http://www.cfcopies.com/V2/leg/leg_droi.php
http://www.culture.gouv.fr/culture/infos-pratiques/droits/protection.htm S
S
A
A
T
R
I
A
S
V
R
´Ecole doctorale IAEM Lorraine
Lagrangian Relaxation
SolvingNP-hard Problems in Computational
Biology via Combinatorial Optimization
Stefan Canzar
Deutsch-Franzo¨sische Doppelpromotion
`THESE EN CO-TUTELLE
zur Erlangung des Grades
pour l’obtention du
des Doktors der Ingenieurswissenschaften (Dr.-Ing.)
der Naturwissenschaftlich-Technischen Fakult¨aten
der Universita¨t des Saarlandes
Doctorat de l’Universit´e Henri Poincar´e – Nancy 1
D´epartement de formation doctorale en informatique
UFR STMIA (sp´ecialit´e informatique)
Max-Planck-Institut fu¨r Informatik
Laboratoire Lorrain de Recherche en Informatique et ses Applications — UMR 7503
I
E
E
V
N
I
S
N
I
U
SDate de Soutenance : 14 Novembre 2008
Composition du Jury
Pr´esident : Prof. Ren´e Schott
Rapporteurs : Prof. Dr. Knut Reinert
Dr. Gregory Kucherov
Prof. Dr. Ernst Althaus
Directeurs : Prof. Dr. Ernst Althaus
Dr. habil. Jens Gustedt
Prof. Dr. Kurt MehlhornAbstract
This thesis is devoted to twoNP-complete combinatorial optimization problems arising
in computational biology, the well-studied multiple sequence alignment problem and the
new formulated interval constraint coloring problem. It shows that advanced mathe-
matical programming techniques are capable of solving large scale real-world instances
from biology to optimality. Furthermore, it reveals alternative methods that provide
approximate solutions.
In the first part of the thesis, we present a Lagrangian relaxation approach for the
multiple sequence alignment (MSA) problem. The multiple alignment is one common
mathematical abstraction of the comparison of multiple biological sequences, like DNA,
RNA, orproteinsequences. Iftheweight ofamultiplealignment ismeasuredbythesum
of the projected pairwise weights of all pairs of sequences in the alignment, then finding
a multiple alignment of maximum weight isNP-complete if the number of sequences is
not fixed. The majority of the available tools for aligning multiple sequences implement
heuristicalgorithms;nocurrentexactmethodisabletosolvemoderatelylargeinstances
or instances involving sequences exhibiting a lower degree of similarity.
Wepresentabranch-and-bound(B&B)algorithmfortheMSAproblem.Weapproximate
theoptimalintegersolutioninthenodesoftheB&BtreebyaLagrangianrelaxationofan
ILPformulationforMSArelativetoanexponentiallargeclassofinequalities,thatensure
that all pairwise alignments can be incorporated to a multiple alignment. By lifting
these constraints prior to dualization the Lagrangian subproblem becomes an extended
pairwise alignment (EPA) problem : Compute the longest path in an acyclic graph,that
is penalized a charge for entering “obstacles”. We describe an efficient algorithm that
solves the EPA problem repetitively to determine near-optimal Lagrangian multipliers
viasubgradient optimization. Thereformulation ofthe dualized constraints with respect
to additionally introduced variables improves the convergence rate dramatically. We
account for the exponential number of dualized constraints by starting with an empty
constraint pool inthefirstiteration towhichweaddcutsineach iteration, that aremost
violated by theconvex combination of asmall numberofprecedingLagrangian solutions
(including the current solution). In this relax-and-cut scheme, only inequalities from the
constraint pool are dualized.
i6
ii Abstract
The interval constraint coloring problem appears in the interpretation of experimental
data in biochemistry. Monitoring hydrogen-deuterium exchange rates via mass spec-
troscopy is a method used to obtain information about protein tertiary structure. The
outputof these experiments providesaggregate dataabout the exchange rate of residues
inoverlappingfragmentsoftheproteinbackbone.Thesefragmentsmustbere-assembled
in order to obtain a global picture of the protein structure. The interval constraint col-
oring problem is the mathematical abstraction of this re-assembly process.
The objective of the interval constraint coloring problem is to assign a color (exchange
rate) to a set of integers (protein residues) such that a set of constraints is satisfied.
Each constraint is made up of a closed interval (protein fragment) and requirements on
the number of elements in the interval that belong to each color class (exchange rates
observed in the experiments).
We introduce apolyhedral description of the interval constraint coloring problem, which
serves as a basis to attack the problem by integer linear programming (ILP) methods
and tools, which perform well in practice. Since the goal is to provide biochemists with
all possible candidate solutions, we combine related solutions to equivalence classes in
an improved ILP formulation in order to reduce the running time of our enumeration
algorithm. Moreover, we establish the polynomial-time solvability of the two-color case
by the integrality of the linear programming relaxation polytopeP, and also present
a combinatorial polynomial-time algorithm for this case. We apply this algorithm as
a subroutine to approximate solutions to instances with arbitrary but fixed number of
colors and achieve an order of magnitude improvement in runningtime over the (exact)
ILP approach.
We show that the problem isNP-complete for arbitrary number of colors, and we
provide algorithms that, given an instance withP =∅, find a coloring that satisfies
all the coloring requirements within±1 of the prescribed value. In light of ourNP-
completeness result, this is essentially the best one can hope for. Our approach is based
on polyhedral theory and randomized rounding techniques.
In practice, data emanating from the experiments are noisy, which normally causes the
instancetobeinfeasible,and,insomecases,evenforcesP tobeempty. Todealwiththis
problem, the objective of the ILP is to minimize the total sum of absolute deviations
from the coloring requirements over all intervals. The combinatorial approach for the
two-color case optimizes the same objective function. Furthermore, we use this combi-
natorial method to compute, in a Lagrangian way, a boundon the minimum total error,
which is exploited in a branch-and-bound manner to determine all optimal colorings.
Alternatively, we study a variant of the problem in which we want to maximize the
number of requirements that are satisfied. We prove that this variant isAPX-hard even
in the two-color case and thus does not admit a polynomial time approximation scheme
(PTAS)unlessP =NP.Therefore,weslightly(byafactorof(1+ǫ))relaxthecondition
on when a requirement is satisfied and propose a quasi-polynomial time approximation
scheme (QPTAS) which finds a coloring that “satisfies” the requirements of as many
intervals as possible.Zusammenfassung
Die vorliegende Dissertation widmet sich zweiNP-vollst¨andigen kombinatorischen Op-
timierungsproblemen aus der Bioinformatik, dem intensiv erforschten Problem des Mul-
tiplen Sequenzalignments (englisch: multiple sequence alignment) sowie dem neu for-
mulierten Intervallinduzierten Farbungsproblem (englisch: interval constraint coloring).¨
Sie zeigt, dass mithilfe von fortgeschrittenen Methoden der mathematischen Program-
mierung durchaus auch hoherdimensionale Probleminstanzen der Biologie exakt gelost¨ ¨
werden ko¨nnen. Daru¨berhinaus beschreibt sie alternative Ans¨atze, die es erlauben, ap-
proximativeLosungen(mitundohneGarantiebzgl.Approximationsgute)zubestimmen.¨ ¨
ImerstenTeil dieserDissertation stellen wireinenAlgorithmus fu¨rdasMultiple Sequen-
zalignment (MSA)vor,derdemKonzepteiner Lagrange-Relaxierung folgt. DasMultiple
Alignment ist eine gebrauchliche mathematische Abstraktion des Vergleichs von meh-¨
reren biologischen Sequenzen, wie etwa DNA, RNA oder Proteinsequenzen. Berechnet
sichdasGewichteinesMultiplenAlignmentsalsdieSummederGewichte allerprojezier-
ten Sequenzpaare in dem Alignment, so ist die Bestimmung eines Multiplen Alignments
maximalen GewichtesNP-vollstandig, falls die Anzahl der Sequenzen nicht als fest vor-¨
gegeben betrachtet wird. Die Mehrzahl der verfu¨gbaren Programme verfolgt deshalb
einen heuristischen Ansatz; keine bisher vorgestellte Methode ist in der Lage, moderat
¨große Instanzen oder Instanzen, die Sequenzen von einem niedrigen A