Partition - Biologie

Puher - Bernard Bellon

Le téléchargement nécessite un accès à la bibliothèque YouScribe
Tout savoir sur nos offres

7 pages

English

Le téléchargement nécessite un accès à la bibliothèque YouScribe
Tout savoir sur nos offres

A propos
Informations
Extrait

Description

¨˘££˙£˛˛å CABIOS Vol. 4 no. 1. 1988 Pages 111-115___________________________________________________________________________Construction of restriction mapsBernard BellonABSTRACTA computer program is described, which constructs maps of restriction endonucleasecleavage sites in linear or circular DNA molecules, given the fragment lengths in single anddouble digestions with two enzymes. The algorithm is based upon a partition method and avery simple rule to chain fragments. The program is written in Prolog II.INTRODUCTIONConstructing restriction maps from gel electrophoresis data is a common and timeconsuming activity for molecular biologists. Several programs and their associatedalgorithms, which are intended to perform this task have been described (Parker et al, 1977;Stefik, 1978; Pearson, 1982; Fitch et al, 1983; Durand and Bregegere, 1984; Nolan et al,1984; Polner et al, 1984). All the algorithms described so far, except for two (Fitch et al,1983; Nolan et al, 1984) are based upon permutations of the set of double or single digestionfragment lengths.We present herein an algorithm based upon properties of double digestion fragments withrespect to single digestion fragments which uses set partitions and rules to chain the elementsof partitions to construct maps.PROPERTIESSymbols denoting terms of set theory are usual mathematical abbreviations (Comtet, 1970).Let N be an ordered set of d integers {c }; a m-partition of N, (m d), is ...

Informations

Publié par	Puher
Nombre de lectures	23
Langue	English

Extrait

CABIOSVol. 4 no. 1.1988 Pages111-115 ___________________________________________________________________________ Construction of restriction maps

Bernard Bellon ___________________________________________________________________________

ABSTRACT A computer program is described, which constructs maps of restriction endonuclease cleavage sites in linear or circular DNA molecules, given the fragment lengths in single and double digestions with two enzymes. The algorithm is based upon a partition method and a very simple rule to chain fragments. The program is written in Prolog II.

INTRODUCTION Constructing restriction maps from gel electrophoresis data is a common and time consuming activity for molecular biologists. Several programs and their associated algorithms, which are intended to perform this task have been described (Parker et al, 1977; Stefik, 1978; Pearson, 1982; Fitch et al, 1983; Durand and Bregegere, 1984; Nolan et al, 1984; Polner et al, 1984). All the algorithms described so far, except for two (Fitch et al, 1983; Nolan et al, 1984) are based upon permutations of the set of double or single digestion fragment lengths. We present herein an algorithm based upon properties of double digestion fragments with respect to single digestion fragments which uses set partitions and rules to chain the elements of partitions to construct maps.

PROPERTIES Symbols denoting terms of set theory are usual mathematical abbreviations (Comtet, 1970). Let N be an ordered set of d integers {c }; a m-partition of N, (m£d), is a set of m subsets or i blocks {a } such as: i Èa ={c }and"i,j aÇa =Æ i ii j The number of m-partitions of N is the Stirling number of second kind (Comtet, 1970): 1 j jd S(d,m)=å(m-j) for0(-1) C£j£m m m! j m! where Cdenotes the binomial coefficient. m j!(m-j)! Consider a restriction map with two enzymes A and B: { a} fragmentssize of enzyme A cleavageiÎ [1..m] i { b} fragmentssize of enzyme B cleavagejÎ[1..n] j { c} fragmentssize of double enzyme AB cleavagekÎ[1..d] k We can state that a m-partition {c} and a n-partition {c} of {c } exist with the i,m j,nk following properties;

I-"i a4 cand"j b4 c i i,mj j,n Let be min(a )= a- eand max(a)= a+ e, where e is the experimental error. The sign4 i ii i indicates thata andc arecompatible within the experimental errors. It is the same i i,m property if we order {a }, {c}, {b } and {c}. The definition of compatible is : i i,mj j,n a4{c }Û i j,i,m Smin(c )£ min(a)£Smax(c ) j j,i,mi jj,i,m orSmin(c )£ max(a)£Smax(c ) j j,i,mi jj,i,m or min(a)£S min(c)£ max(a) i jj,i,m i or min(a)£S) max(c£) max(a i jj,i,m i

II-Each singlet of a partition is either an internal element of a t-uplet of the other partition or the first or the last element of an extreme t-uplet of the other partition and therefore a terminal element. Construct the reduced partitions, r-m-partition and r-n-partition in the following manner: remove all the singlets from the m-partition, and remove the singlets of the n-partition from the t-uplets of the the m-partition; it is the r-m-partition. Build symetrically the r-n-partition. II-a-In the case of a linear molecule, r-m-partition and r-n-partition have an overall total of only two singlets, the other elements being doublets. Bothsinglets can be either in r-m-partition or in r-n-partition, or each singlet can be in each reduced partition.The singlets are the elements that link the terminal fragments and a doublet of one reduced partition has its two elements in two different doublets of the other reduced partition: all the elements of r-m-partition and r-n-partition are different. "i1, i2c¹c and"i,j c¹c and"j1, j2c¹c i1,m i2,mi,m j,nj1,n j2,n In the case of a circular molecule there are only doublets that have the same property. Note that if the element number is equal to one for the two reduced partitions then the two singlets in the case of a linear molecule are equal; in the case of a circular molecule the two doublets are also equal. II-b- Theelements of the singlets or doublets of the reduced partitions give the fragment order. The doublets of the reduced partitions are such that a doublet of one partition has its two elements, one in a doublet and the other in the next doublet of the other reduced partition. In the case of linear molecule a terminal element is a singlet and not a doublet. The doublets contain two elements which link the t-uplets of the partitions, one to the precedent and the other to the next. Hence a method to chain the fragments is available.

ALGORITHM The algorithm is divided in two main parts. We assume that all the elements of the double digestion are different and ordered, and the sets of single digest fragment sizes are ordered. Part one:Construction of all the compatible m-partitions and n-partitions. An exhaustive survey of all the partitions will prove an inefficient algorithm: hence the partitions are constructed ordered subset by ordered subset with a bactracking method (Floyd,

1967) in two major steps. Every potential found subset for the m-partition is tested with the elements of {a }: in the case where one element of {a } is compatible, the search is persued i i further by constructing a new subset; on the contrary, no compatibility is found and therefore we add an element of {c } greater than the last one added and not already utilized; we k backtrack on this subset if there is a failure after exhaustion of elements; if a complete failure is obtained for this subset, this one is discarded and a backtracking is then performed on the previous subset. When the m-partition is complete, it is ordered and the property one is verified. The same procedure is applied for all the compatible n-partitions. It is worth noting that with the backtracking method and the compatibility condition, the exploration of many unsuccessful paths stops at the beginning. Part two:Take a m-partition and a n-partition, reduce them, and if property II-A of reduced partitions is satisfied, find the fragments order and so on for all the possible pairs. Take a pair of m-partition and n-partition, construct the reduced partitions: verify the property (II-a) for these r-m-partition and r-n-partition (two singlets and the other elements are doublets in the case of linear molecule and each doublet different from one another). If the property is verified construct the double digestion fragments list by utilizing the property (II-b): Case of a linear molecule: Assume that the r-m-partition contains a singlet s, take the t-uplet u of the m-partition in which s is contained and displace s at the end of the t-uplet: insert between each element of u the name of the second enzyme B and add at the end the name of the first enzyme A and put this in the list. Remove from the r-m-partition the element s and find the doublet (s,s )in 1 2 the r-n-partition in which s is contained, assume that s=s: find in the n-partition the t-uplet u 1 in which (s ,s) is contained, remove sand displace sat the end; insert between each 1 21 2 element of u the name of the first enzyme A, add at the end the name of the second enzyme B and put this at the end of the list, remove (s,s )from the r-n-partition. The element which 1 2 links is now s. Continue alterning r-k-partitions until one reduced partition is empty and the 2 other contain one element: a singlet. Proceed in the same manner except that the name of the enzyme is not added to the element, put this last element in the list. The list isone restriction map. If the reduced partitions cannot be reduced to the empty set, this pair of partitions is not a map. If there is no singlet in the r-m-partition begin with the r-n-partition and exchange the parts of r-m-partition and r-n-partition. Case of a circular molecule: All the elements of reduced partition are doublets: take the first doublet (s,s )of the r-m-1 2 partition, assume that the element which links the next doublet is s: find the t-uplet u of the 1 m-partition in which the doublet (s,s )is contained, displace sat the end of u and sat the 1 21 2 beginning, insert the name of the second enzyme between each element of u and add at the end the name of the first enzyme, put this element in the list, remove (s,s )from the r-m-1 2 partition and continue in the same manner as for a linear molecule until one reduced partition

is empty and the other contains one doublet. Proceed in the same way except remove the last element which is sfirst element of the beginning doublet, and do not add the name of an 2, enzyme at the end.

SYSTEM AND METHODS Minimal hardware requirements - Apple Macintosh 512K, Macintosh Plus, Macintosh SE, Macintosh II or Lisa emulated with Macworks. - An Imagewriter or Laserwriter printer is needed to obtain printouts of the data. - A second disk drive or a hard disk is suggested. Software The program text in Prolog II is available on request either on listing paper or on disk for the Apple Macintosh Microcomputer; please send a 3"1/2 disk and an international stamped adressed envelope in this latter case. You must have the Prolog II language installed on your computer. The algorithm text to produce the compatible partitions is also available in Pascal.

IMPLEMENTATION The program is written in Prolog II (Colmerauer et al, 1983; Giannesini et al, 1986). This language is interpreted: facilities of adding new rules is possible. The text can be easily carried on any other computer with only few modifications which are commands for output on screen or printer. The program allows you to provide the enzyme names, the size of the fragments of single and double digestions, the percentage of error by intervals, an eventual correction for these: then the program performs ordered sets, calculates compatible partitions and gives the double digest lists that are the minimum number of solutions. Every element of a partition is an ordered subset such that a solution which permutes internal digest fragments is not given; for example if we have (...B-100-A-200-A-400-A-300-B...) the possibility of (...B-100-A-400-A-200-A-300 B...) is ignored. After the values of the sizes of the single and double digests were entered just as the percentages of the errors, the data and results are written on the screen of the microcomputer (figure 1). Comments The above example is only pedagogic and aims to pay special attention to the definition of compatibility between a subset of double digest fragments and a single digest fragment, and to the application of the rule to chain the partitions (figure 2). A pascal version of the program, which will be included in a software for molecular biologists is in preparation. With this version, the example given by Fitchet al(1983) takes the following execution time: - in the case of an error equal to 1%, one solution which minimizes the mean deviation between the double digest and each single digest is found (the same as the first given by Fitch

et al) in 20 seconds on the Apple Macintosh Microcomputer . - in the case of two different and largest errors; the first equal to 5% for length lower than 100, the second equal to 3% for length greater than 100, four solutions are found in 1' 05". In the case of a real biological example, the digestion of phage lambda by EcoR1 and Hind III (very often used as a size marker on gels) it takes 5' 30" to propose solutions assembling the twelve double digest fragments with an error value set at 3% for all fragments.

DISCUSSION Comparison The basic idea of this algorithm of constructing a subset of the elements of the double digest and finding a single digest fragment compatible with this one, is similar to the method explained by Fitchet alHerein is described a general algorithm to construct (1983). partitions of the double digest fragment set where is inserted a rule to select a subset compatible with an element of a single digest; an another rule is applied to select compatible partitions: each found partition does not necessarily give a solution. In the algorithm of Fitch et al(1983) the application of more selection rules to find consistent subsets, which reduces the number of examined paths, leads to a lowest number of compatible partitions. These rules increase the time of testing each subset, but I think that the algorithm of Fitchet al(1983) is more performant in the case of large errors. In the Pascal version of the present program hybridization data between single digests and double digest can be taken into account; these constraints were added to the compatibility rule and the execution time of an example with hybridization data was always reduced just as the number of compatible partitions. The possibility of adding the rule C.3 of Fitchet al(1983), which defines a consistent subset of more than two elements as a subset which must contain n-2 potential singlets in the alternate single digest, will make the algorithm better and more performant in the case of many fragments or large errors. In comparison with algorithms based upon permutations, two advantages can be taken into account: the treatment of the errors applied to each pair of compatible subset of double digest and an element of single digest seems more realistic than a global cumulation of errors.The theoretical maximum number of paths to be examined is reduced: indeedåS = S(n,k)+ S(n,(n-k+1)) is always lower than n! or (n-k)!(k+1)!: for instance, with a double digest consisting of ten fragments and a single digest of five fragments these different values are: S5!*6!=86400.S=65352 10!=3628800 Note that we have to add the combinatorial steps to align the double digest with the single digests at the value of5!*6! if the method uses permutations of the single digests fragments (Pearson, 1982). In practice the number of explored paths is drastically reduced by using the compatibility rule and employing backtracking method. The rules to construct a map from a pair of compatible partitions with the help of the reduced partitions is novel, very simple, fast, and can be easily programmed.

Concluding remarks The Prolog version of the program could easily be modified in order to add rules describing additional information and only select solutions that satisfy these constraints; according to this information some rules can be added during the construction of the partitions and therefore the exploration number to construct k-partitions, which are compatible and satisfy this added information, is more decreased. Note that the program constructs maps from restriction cleavage fragment size for only two enzymes. The case of restriction digest with more than two enzymes (the usual case) can be taken into account: with three enzymes A, B, C for instance, one has just to run the program with A, B and A-B restriction fragments sizes, then with A, C and A-C data, search for solutions which give the same result for enzyme A map, and if necessary verify with B, C and B-C data. Prolog II is at the moment an interpreted language and execution time of a program is not very fast, however it is a very good language to build models of algorithms. As already mentioned a Pascal version is in preparation: with this version time execution is at least reduced by an hundredfold factor.

Acknowledgements The author would like to thank Dr Bernard Jacq for critical reading of the manuscript. The Prolog II language is a trademark of PrologIA 430 Avenue Marechal De Lattre de Tassigny13009 MARSEILLEFrance References Colmerauer A., Kanoui H., Van Caneghem M.Technique et Sciences Informatiques(1983) Vol. 2, N∞ 4271-311 Comtet L.Analyse combinatoire (1970) Presses Universitaires de France - Paris Durand R., Bregegere F.(1984)Nucleic Acids Research12 703-716 Fitch W.M., Smith T.F., Ralph W.W.(1983)Gene22 19-29 Floyd R.W. (1967)Journal of the Association for Computing Machinery Vol. 14, N∞ 4636-644

Giannesini F., Kanoui H., Pasero R., Van Caneghem M. Prolog(1986)International computer science seriesAddison Wesley Publishing Company

Nolan G.P, Maina C.V., Szalay A.A.(1984)Nucleic Acids Research12 717-729

Parker R.C., Watson R.M and Vinograd J. (1977)Proc. Natl. Acad. Sci. USA 74851-855

Pearson W.P.(1982)Nucleic Acids Research10 217-227

Polner G., Dorgai L., Orosz L.(1984)Nucleic Acids Research12 227-236 Stefik M. (1978)Artificial Intelligence11 85-114

Name of the clone AssayLINEAR RESTRICTION MAP of Alu1 and EcoR1 Restriction fragment size of Alu1 510 690 700 Restriction fragment size of EcoR1 100 290 590 1100 Restriction fragment size of the TWO enzymes 100 200 300 400 500 600 List of errors Size greater than1000 1% Size from1000 to 5007% Size smaller than500 10% 100-EcoR1-400-Alu1-200-EcoR1-600-Alu1-500-EcoR1-300 100-EcoR1-500-Alu1-600-EcoR1-200-Alu1-400-EcoR1-300 100-EcoR1-300-EcoR1-400-Alu1-500-Alu1-200-EcoR1-600 100-EcoR1-300-EcoR1-400-Alu1-200-EcoR1-600-Alu1-500 Minimum number of solutions is4

Fig.1: Example of screen output on the Apple Macintosh microcomputer. ___________________________________________________________________________ Construct one solution by hand: the partitions compatible with the experimental errors are: Alu1 3-partition (200.400) (100.600) (300.500)1 (100.500) (300.400) (200.600)2 (100.400) (200.600) (300.500)3 (500) (100.300.400) (200.600)4 EcoR1 4-partition (100) (300) (500) (200.400.600)5 (100) (300) (600) (200.400.500)6 (100) (300) (200.400) (500.600)7 Take the pair (1,5): r-3-partition1 = (200.400) (600) r-4-partition4 = (200.400.600) The singlet and doublet property is not verified,try another pair. Take the pair (4,7): r-3-partition = (400) (200 600)<- (500)(100.300.400) (200.600) r-4-partition = (200.400) (600)<- (100)(300) (200.400) (500.600) The singlet and doublet property is verified. Construct the list: 400 -> (100.300.400)-> 100-B-300-B-400-A-(400)(200.600) 400 -> (200.400)-> 200-B-(200.400)(600) 200 -> (200.600)-> 600-A-(200 600) 600 -> (500.600)-> 500(600) list: 100-B-300-B-400-A-200-B-600-A-500

Fig.2: Construction of one solution by hand.