Relaxation Lagrangienne : resoudre des problèmes NP-dur en bioinformatique par des méthodes d'optimisation combinatoire, Lagrangian Relaxation : solving NP-hard Problems in Computational Biology via Combinatorial Optimization

Thesee - Stefan Canzar

Découvre YouScribe en t'inscrivant gratuitement

Je m'inscris

Obtenez un accès à la bibliothèque pour le consulter en ligne
En savoir plus

139 pages

English

Obtenez un accès à la bibliothèque pour le consulter en ligne
En savoir plus

A propos
Informations
Extrait

Description

Sous la direction de Ernst Althaus, Jens Gustedt, Kurt Mehlhorn
Thèse soutenue le 14 novembre 2008: Universität des Saarlandes, Nancy 1
Cette thèse est dédiée à la résolution de deux problèmes d'optimisation combinatoire NP-complets surgissant en bioinformatique, à savoir le problème classique d'alignement de séquences, ainsi qu'un problème nouvellement formalisé, le problème de coloration par contraintes d'intervalles ou Interval Constraint Coloring Problem (ICP). Nous montrons dans cette thèse qu'il est possible de résoudre des instances réelles et de grande taille de ces problèmes apparaissant en biologie, et ceci par des méthodes de programmation mathématiques avancées. Nous démontrons également l'existence de méthodes plus efficaces, permettant d'obtenir des solutions approchées pour ces mêmes problèmes. Dans la première partie de la thèse, nous présentons un algorithme pour la solution du problème classique de l'alignement de séquences, basé sur la relaxation lagrangienne. Le problème de l'alignement de séquences est une abstraction mathématique courante du problème de comparaison de séquences biologiques, comme l'ADN, l'ARN ou les séquences de protéines. Si le poids d'un alignement séquentiel multiple est calculé comme la somme des poids des paires de séquence projetées de l'alignement considéré, alors le problème de déterminer un alignement de poids maximal est NP-complet, tant que le nombre de séquences n'est pas fixé. La plupart des logiciels disponibles pour la résolution du problème d'alignement de séquences se focalisent donc sur des approches heuristiques. Aucune méthode n'est actuellement capable de résoudre efficacement des instances de taille moyenne, ou des instances comportant des séquences d'un faible taux de similitude, de ce problème de manière exacte. Nous présentons un nouvel algorithme pour la résolution du problème de l'alignement de séquences, basé sur la technique de séparation et évaluation (branch-and-bound ou B&B). Nous approchons la solution optimale en nombres entiers dans les nœuds de l'arbre B&B par une relaxation lagrangienne de la formulation en tant que PLNE du problème d'alignement de séquences multiples. Un nombre exponentiel d'inégalités supplémentaires doit alors être vérifié afin de garantir que l'alignement de séquences multiples peut être reconstruit sans conflits à partir des alignements individuels. En renforcant ces inégalités avant de les dualiser, le sous-problème lagrangien devient le problème d'alignement par paires étendu: il s'agit alors de trouver le plus long chemin dans un graphe acyclique, auquel sont ajoutés des pénalités lors du passage à travers certaines zones ``obstacles''. Nous introduisons un algorithme efficace permettant de résoudre ce problème de manière répétitive, afin de trouver une bonne approximation des multiplicateurs de Lagrange via optimization du sous-gradient. La reformulation des inégalités dualisées par rapport à des variables supplémentaires améliore de manière significative le taux de convergence de l'algorithme. Nous adressons le problème du nombre exponentiel d'inégalités par une approche itérative. L'ensemble des contraintes est vide au début. Après chaque itération, nous rajoutons à cet ensemble ces inégalités qui sont le plus violées par la combinaison convexe d'un petit nombre des solutions lagrangiennes précédentes (y compris la solution courante). Conformément au schéma relax-and-cut, nous dualisons exclusivement les inégalités présents dans l'ensemble des contraintes décrit précédemment. L'ICP est un problème qui se pose lors de l'analyse et l'interprétation de données expérimentales en biochimie. L'analyse du taux d'échange hydrogène/deutérium par spectrométrie de masse est une des méthodes utilisées pour obtenir des informations sur la structure tertiaire des protéines. Les résultats de ces expériences se présentent sous forme de taux d'échange des résidus dans les segments chevauchés de la protéine. Ces segments doivent être recollés afin d'obtenir une vision globale de la structure de la protéine. L'ICP est l'abstraction mathématique de ce process de recombinaison. L'objectif de l'ICP est d'attribuer une couleur (taux d'échange) à un ensemble d'entiers (résidus de protéines) de telle manière à ce qu'un ensemble de contraintes est vérifié. Chaque contrainte représente un intervalle fermé (segment d'une protéine) ainsi qu'un ensemble d'exigences supplémentaires concernant le nombre d'éléments qui doivent appartenir à chacune des catégories de couleur (taux d'échanges observés lors des expériences). Nous montrons que le problème peut être résolu par des méthodes de programmation linéraire en nombres entiers (PLNE), et nous utilisons un algorithme d'énumération implicite qui s'avère efficace pour la plupart des problèmes qu'on rencontre dans la pratique. Puisque notre motivation est de fournir aux biochimistes une liste exhaustive des solutions potentielles, une version améliorée de notre approche PLNE consiste à regrouper des solutions similaires dans des classes d'équivalence, ceci afin d'établir une version améliorée et plus performante de la procédure d'énumération. Nous démontrons ensuite la solvabilité du cas particulier à deux couleurs par la contrainte de solution en nombres entiers du polytope P, défini à travers la relaxation linéaire, tout en proposant un algorithme de résolution de complexité polynomielle pour ce cas précis. Cet algorithme sert ensuite de base pour l'établissement de solutions approchés d'instances de dimension quelconque mais fixe (pour le moment sans garantie sur la qualité de la solution obtenue). Nous obtenons ainsi une amélioration d'un ordre de grandeur en termes de performance par rapport à la solution exacte, basée sur l'approche PLNE. Nous démontrons que ce problème est NP-complet pour un nombre arbitraire de couleurs. Nous établissons ensuite un algorithme qui, étant donné P?Ø, est capable de déterminer une coloration satisfaisante toutes les contraintes données avec un écart maximal de ±1 des valeurs cible. Vue la complexité en NP du problème, il ne semble pas possible d'obtenir des solutions d'une qualité sensiblement supérieure. Notre approche est essentiellement basée sur la théorie des polyèdres et des techniques d'arrondissage randomisés. Les données obtenues lors des expériences biologiques réelles sont souvent bruitées, ce qui entraine le plus souvent l'insolvabilité du problème, voire même P=Ø. Afin d'adresser ce problème, l'objectif de la PLNE est de minimiser la somme des déviations absolues des contraintes de coloration sur l'intégralité des intervalles. L'approche particulière à deux couleurs optimise en effet cette même fonction. En outre, nous utilisons cette approche combinatoire pour déterminer, d'une façon lagrangienne, une borne sur l'erreur minimale, qui sera ensuite utilisée dans un algorithme de type branch-and-bound pour déterminer toutes les colorations optimales. Nous proposons une variante du problème précédent, dans laquelle nous essayons de maximiser le nombre contraintes qui peuvent être satisfaits en même temps. Nous démontrons que ce problème est APX-dur et qu'il ne permet donc pas de schéma d'approximation polynomial sauf si P=NP. C'est pourquoi nous relaxons légèrement le critère de satisfiabilité (par un facteur (1+e) et décrivons par la suite un schéma d'approximation d'une complexité quasi-polynomiale, permettant de ``satisfaire'' le plus grand nombre de contraintes possible.
-Relaxation Lagrangienne
This thesis is devoted to two NP-complete combinatorial optimization problems arising in computational biology, the well-studied multiple sequence alignment problem and the new formulated interval constraint coloring problem. It shows that advanced mathematical programming techniques are capable of solving large scale real-world instances from biology to optimality. Furthermore, it reveals alternative methods that provide approximate solutions. In the first part of the thesis, we present a Lagrangian relaxation approach for the multiple sequence alignment (MSA) problem. The multiple alignment is one common mathematical abstraction of the comparison of multiple biological sequences, like DNA, RNA, or protein sequences. If the weight of a multiple alignment is measured by the sum of the projected pairwise weights of all pairs of sequences in the alignment, then finding a multiple alignment of maximum weight is NP-complete if the number of sequences is not fixed. The majority of the available tools for aligning multiple sequences implement heuristic algorithms; no current exact method is able to solve moderately large instances or instances involving sequences exhibiting a lower degree of similarity. We present a branch-and-bound (B&B) algorithm for the MSA problem. We approximate the optimal integer solution in the nodes of the B&B tree by a Lagrangian relaxation of an ILP formulation for MSA relative to an exponential large class of inequalities, that ensure that all pairwise alignments can be incorporated to a multiple alignment. By lifting these constraints prior to dualization the Lagrangian subproblem becomes an extended pairwise alignment (EPA) problem: Compute the longest path in an acyclic graph, that is penalized a charge for entering ``obstacles''. We describe an efficient algorithm that solves the EPA problem repetitively to determine near-optimal Lagrangian multipliers via subgradient optimization. The reformulation of the dualized constraints with respect to additionally introduced variables improves the convergence rate dramatically. We account for the exponential number of dualized constraints by starting with an empty constraint pool in the first iteration to which we add cuts in each iteration, that are most violated by the convex combination of a small number of preceding Lagrangian solutions (including the current solution). In this relax-and-cut scheme, only inequalities from the constraint pool are dualized. The interval constraint coloring problem appears in the interpretation of experimental data in biochemistry. Monitoring hydrogen-deuterium exchange rates via mass spectroscopy is a method used to obtain information about protein tertiary structure. The output of these experiments provides aggregate data about the exchange rate of residues in overlapping fragments of the protein backbone. These fragments must be re-assembled in order to obtain a global picture of the protein structure. The interval constraint coloring problem is the mathematical abstraction of this re-assembly process. The objective of the interval constraint coloring problem is to assign a color (exchange rate) to a set of integers (protein residues) such that a set of constraints is satisfied. Each constraint is made up of a closed interval (protein fragment) and requirements on the number of elements in the interval that belong to each color class (exchange rates observed in the experiments). We introduce a polyhedral description of the interval constraint coloring problem, which serves as a basis to attack the problem by integer linear programming (ILP) methods and tools, which perform well in practice. Since the goal is to provide biochemists with all possible candidate solutions, we combine related solutions to equivalence classes in an improved ILP formulation in order to reduce the running time of our enumeration algorithm. Moreover, we establish the polynomial-time solvability of the two-color case by the integrality of the linear programming relaxation polytope P, and also present a combinatorial polynomial-time algorithm for this case. We apply this algorithm as a subroutine to approximate solutions to instances with arbitrary but fixed number of colors and achieve an order of magnitude improvement in running time over the (exact) ILP approach. We show that the problem is NP-complete for arbitrary number of colors, and we provide algorithms that, given an instance with P?Ø, find a coloring that satisfies all the coloring requirements within ±1 of the prescribed value. In light of our NP-completeness result, this is essentially the best one can hope for. Our approach is based on polyhedral theory and randomized rounding techniques. In practice, data emanating from the experiments are noisy, which normally causes the instance to be infeasible, and, in some cases, even forces P to be empty. To deal with this problem, the objective of the ILP is to minimize the total sum of absolute deviations from the coloring requirements over all intervals. The combinatorial approach for the two-color case optimizes the same objective function. Furthermore, we use this combinatorial method to compute, in a Lagrangian way, a bound on the minimum total error, which is exploited in a branch-and-bound manner to determine all optimal colorings. Alternatively, we study a variant of the problem in which we want to maximize the number of requirements that are satisfied. We prove that this variant is APX-hard even in the two-color case and thus does not admit a polynomial time approximation scheme (PTAS) unless P=NP. Therefore, we slightly (by a factor of (1+e)) relax the condition on when a requirement is satisfied and propose a quasi-polynomial time approximation scheme (QPTAS) which finds a coloring that ``satisfies'' the requirements of as many intervals as possible.
Source: http://www.theses.fr/2008NAN10093/document

Sujets

Relaxation lagrangienne

Informations

Publié par	Thesee
Nombre de lectures	88
Langue	English

Extrait

AVERTISSEMENT

Ce document est le fruit d'un long travail approuvé par le
jury de soutenance et mis à disposition de l'ensemble de la
communauté universitaire élargie.

Il est soumis à la propriété intellectuelle de l'auteur. Ceci
implique une obligation de citation et de référencement lors
de l’utilisation de ce document.

Toute contrefaçon, plagiat, reproduction illicite encourt une
poursuite pénale.

➢ Contact SCD Nancy 1 : theses.sciences@scd.uhp-nancy.fr

LIENS

Code de la Propriété Intellectuelle. articles L 122. 4
Code de la Propriété Intellectuelle. articles L 335.2- L 335.10
http://www.cfcopies.com/V2/leg/leg_droi.php
http://www.culture.gouv.fr/culture/infos-pratiques/droits/protection.htm S
S
A
A
T
R
I
A
S
V
R
´Ecole doctorale IAEM Lorraine
Lagrangian Relaxation
SolvingNP-hard Problems in Computational
Biology via Combinatorial Optimization
Stefan Canzar
Deutsch-Franzo¨sische Doppelpromotion
`THESE EN CO-TUTELLE
zur Erlangung des Grades
pour l’obtention du
des Doktors der Ingenieurswissenschaften (Dr.-Ing.)
der Naturwissenschaftlich-Technischen Fakult¨aten
der Universita¨t des Saarlandes
Doctorat de l’Universit´e Henri Poincar´e – Nancy 1
D´epartement de formation doctorale en informatique
UFR STMIA (sp´ecialit´e informatique)
Max-Planck-Institut fu¨r Informatik
Laboratoire Lorrain de Recherche en Informatique et ses Applications — UMR 7503
I
E
E
V
N
I
S
N
I
U
SDate de Soutenance : 14 Novembre 2008
Composition du Jury
Pr´esident : Prof. Ren´e Schott
Rapporteurs : Prof. Dr. Knut Reinert
Dr. Gregory Kucherov
Prof. Dr. Ernst Althaus
Directeurs : Prof. Dr. Ernst Althaus
Dr. habil. Jens Gustedt
Prof. Dr. Kurt MehlhornAbstract
This thesis is devoted to twoNP-complete combinatorial optimization problems arising
in computational biology, the well-studied multiple sequence alignment problem and the
new formulated interval constraint coloring problem. It shows that advanced mathe-
matical programming techniques are capable of solving large scale real-world instances
from biology to optimality. Furthermore, it reveals alternative methods that provide
approximate solutions.
In the ﬁrst part of the thesis, we present a Lagrangian relaxation approach for the
multiple sequence alignment (MSA) problem. The multiple alignment is one common
mathematical abstraction of the comparison of multiple biological sequences, like DNA,
RNA, orproteinsequences. Iftheweight ofamultiplealignment ismeasuredbythesum
of the projected pairwise weights of all pairs of sequences in the alignment, then ﬁnding
a multiple alignment of maximum weight isNP-complete if the number of sequences is
not ﬁxed. The majority of the available tools for aligning multiple sequences implement
heuristicalgorithms;nocurrentexactmethodisabletosolvemoderatelylargeinstances
or instances involving sequences exhibiting a lower degree of similarity.
Wepresentabranch-and-bound(B&B)algorithmfortheMSAproblem.Weapproximate
theoptimalintegersolutioninthenodesoftheB&BtreebyaLagrangianrelaxationofan
ILPformulationforMSArelativetoanexponentiallargeclassofinequalities,thatensure
that all pairwise alignments can be incorporated to a multiple alignment. By lifting
these constraints prior to dualization the Lagrangian subproblem becomes an extended
pairwise alignment (EPA) problem : Compute the longest path in an acyclic graph,that
is penalized a charge for entering “obstacles”. We describe an eﬃcient algorithm that
solves the EPA problem repetitively to determine near-optimal Lagrangian multipliers
viasubgradient optimization. Thereformulation ofthe dualized constraints with respect
to additionally introduced variables improves the convergence rate dramatically. We
account for the exponential number of dualized constraints by starting with an empty
constraint pool intheﬁrstiteration towhichweaddcutsineach iteration, that aremost
violated by theconvex combination of asmall numberofprecedingLagrangian solutions
(including the current solution). In this relax-and-cut scheme, only inequalities from the
constraint pool are dualized.
i6
ii Abstract
The interval constraint coloring problem appears in the interpretation of experimental
data in biochemistry. Monitoring hydrogen-deuterium exchange rates via mass spec-
troscopy is a method used to obtain information about protein tertiary structure. The
outputof these experiments providesaggregate dataabout the exchange rate of residues
inoverlappingfragmentsoftheproteinbackbone.Thesefragmentsmustbere-assembled
in order to obtain a global picture of the protein structure. The interval constraint col-
oring problem is the mathematical abstraction of this re-assembly process.
The objective of the interval constraint coloring problem is to assign a color (exchange
rate) to a set of integers (protein residues) such that a set of constraints is satisﬁed.
Each constraint is made up of a closed interval (protein fragment) and requirements on
the number of elements in the interval that belong to each color class (exchange rates
observed in the experiments).
We introduce apolyhedral description of the interval constraint coloring problem, which
serves as a basis to attack the problem by integer linear programming (ILP) methods
and tools, which perform well in practice. Since the goal is to provide biochemists with
all possible candidate solutions, we combine related solutions to equivalence classes in
an improved ILP formulation in order to reduce the running time of our enumeration
algorithm. Moreover, we establish the polynomial-time solvability of the two-color case
by the integrality of the linear programming relaxation polytopeP, and also present
a combinatorial polynomial-time algorithm for this case. We apply this algorithm as
a subroutine to approximate solutions to instances with arbitrary but ﬁxed number of
colors and achieve an order of magnitude improvement in runningtime over the (exact)
ILP approach.
We show that the problem isNP-complete for arbitrary number of colors, and we
provide algorithms that, given an instance withP =∅, ﬁnd a coloring that satisﬁes
all the coloring requirements within±1 of the prescribed value. In light of ourNP-
completeness result, this is essentially the best one can hope for. Our approach is based
on polyhedral theory and randomized rounding techniques.
In practice, data emanating from the experiments are noisy, which normally causes the
instancetobeinfeasible,and,insomecases,evenforcesP tobeempty. Todealwiththis
problem, the objective of the ILP is to minimize the total sum of absolute deviations
from the coloring requirements over all intervals. The combinatorial approach for the
two-color case optimizes the same objective function. Furthermore, we use this combi-
natorial method to compute, in a Lagrangian way, a boundon the minimum total error,
which is exploited in a branch-and-bound manner to determine all optimal colorings.
Alternatively, we study a variant of the problem in which we want to maximize the
number of requirements that are satisﬁed. We prove that this variant isAPX-hard even
in the two-color case and thus does not admit a polynomial time approximation scheme
(PTAS)unlessP =NP.Therefore,weslightly(byafactorof(1+ǫ))relaxthecondition
on when a requirement is satisﬁed and propose a quasi-polynomial time approximation
scheme (QPTAS) which ﬁnds a coloring that “satisﬁes” the requirements of as many
intervals as possible.Zusammenfassung
Die vorliegende Dissertation widmet sich zweiNP-vollst¨andigen kombinatorischen Op-
timierungsproblemen aus der Bioinformatik, dem intensiv erforschten Problem des Mul-
tiplen Sequenzalignments (englisch: multiple sequence alignment) sowie dem neu for-
mulierten Intervallinduzierten Farbungsproblem (englisch: interval constraint coloring).¨
Sie zeigt, dass mithilfe von fortgeschrittenen Methoden der mathematischen Program-
mierung durchaus auch hoherdimensionale Probleminstanzen der Biologie exakt gelost¨ ¨
werden ko¨nnen. Daru¨berhinaus beschreibt sie alternative Ans¨atze, die es erlauben, ap-
proximativeLosungen(mitundohneGarantiebzgl.Approximationsgute)zubestimmen.¨ ¨
ImerstenTeil dieserDissertation stellen wireinenAlgorithmus fu¨rdasMultiple Sequen-
zalignment (MSA)vor,derdemKonzepteiner Lagrange-Relaxierung folgt. DasMultiple
Alignment ist eine gebrauchliche mathematische Abstraktion des Vergleichs von meh-¨
reren biologischen Sequenzen, wie etwa DNA, RNA oder Proteinsequenzen. Berechnet
sichdasGewichteinesMultiplenAlignmentsalsdieSummederGewichte allerprojezier-
ten Sequenzpaare in dem Alignment, so ist die Bestimmung eines Multiplen Alignments
maximalen GewichtesNP-vollstandig, falls die Anzahl der Sequenzen nicht als fest vor-¨
gegeben betrachtet wird. Die Mehrzahl der verfu¨gbaren Programme verfolgt deshalb
einen heuristischen Ansatz; keine bisher vorgestellte Methode ist in der Lage, moderat
¨große Instanzen oder Instanzen, die Sequenzen von einem niedrigen A