◦ ◦ ◦ TECHNISCHE UNIVERSITÄT MÜNCHEN◦ ◦ ◦ ◦
◦ ◦
◦ ◦ ◦
◦ ◦ ◦ ◦ FAKULTÄT FÜR INFORMATIK
◦ ◦ ◦
Lehrstuhl für Effiziente Algorithmen
FastStructureSearchingforComputationalProteomics
Hanjo Täubig
Vollständiger Abdruck der von der Fakultät für Informatik der Technischen Universität
München zur Erlangung des akademischen Grades eines
Doktors der Naturwissenschaften (Dr. rer. nat.)
genehmigten Dissertation.
Vorsitzender: Univ. Prof. Dr. Arndt Bode
Prüfer der Dissertation:
1. Univ. Prof. Dr. ErnstW. Mayr
2. Univ. Prof. Dr. Dr. h.c. mult. Wilfried Brauer, em.
Die Dissertation wurde am 27.02.2006 bei der Technischen Universität München
eingereicht und durch die Fakultät für Informatik am 17.04.2007 angenommen.metho
General
Problem
Information
Applications]:
[Analysis
P
Categories
Storage
umerical
and
pro
tors:
SciencesBiolog
CCS
De
Algorithms
IndexingDictionaries,
according
H.3.3
y]:
Retriev
sub
Searc
ject
al
Searc
p
J.3
CM
and
F.2.2
and
Classication
erms:
of
Measuremen
(1998)
and
and
Indexing
cumen
ds
Complexit
[Information
and
and
Nonn
al]:
to
h
Do
t
A
descri
Information
Algorithms
Retriev
and
Clustering,
ProblemsP
ltering,
attern
h
Matc
cess
hing
[Computer
H.3.1
Life
[Information
Medical
Storage
y
and
Genetics
Retriev
T
al]:
Algorithms,
Con
sign,
ten
t,
t
erformance
Analysis
iipattern
it
ost-translational
pro
and
v
for
onsible
lik
of
qu
protein
ectiv
v
mostly
t
structural
aims
e
In
b
for
the
s,
e
sux
of
h
queries
of
query
jorit
ti
kno
size
e
um
v
h
d
text
the
genomics
for
problems
h
tribution
end
lik
t
in
sev
approac
etter
is
(de-
a
ets.
osited
that
ts
as
information
to
me
ts
In
On
metho
our
these
o
h
30.000
c
exceeds
predisp
These
ucle
the
sheer
allo
,
existing
diseases,
adv
e
ds
a
solving
a
ati
pro
apply
v
databases
w
proteins.
t
v
re
in
hemical
While
as
ar
ts.
of
ey
en
structures
to
mak
seconds.
and
gene
functional
is
x-ra
ap
ecia
y)
rely
structure
v
Data
e
y
The
primary
y
ipating
and
da
in
sequencing
hand,
pro
t
results
results
e
sev
t
en
of
are
w
rates.
ctions
y
Cele
more
e
and
to
the
macromolecules
w
of
gigab
or
uge
to
ers
acids.
t
genome.
metho
ersit
to
y
database
t
This
n
exploiting
y
tages
the
ng
and
used
Genome
hing
dications,
in
did
c
t
pro-
amino
w
erforming
trees
m
to
n
biop
ject
RNA
of
main
on
a
that
approac
n
searc
the
uge
reason
the
ecicit
metho
celebrated
only
reac
h
The
e
in
u
w
or
ties,
ys,
a
allo
p
erform
great
ries
i
data
of
on
ucleic
a
signican
whic
b
b
m
d
crystallograph
x-
of
sp
adapted
e
alphab
ere
the
curing
and
the
t
the
t
(PDB),
of
b
kb
s
d
are
aluated
for
st
The
the
in
the
the
to
molecular
area.
and
the
e
exp
the
a
cesses.
of
apparen
while
th
pro-
and
other
ma
structures
the
b
y
b
ds
h
cases,
onen
that
al
rea
T
as
da
in
,
Human
holds
olv
than
restricted
structures,
bio
the
wledge
of
hemical
database
detecting
t
lik
en
uman
y
proteins
ytes.
osition
h
n
n
the
b
ic
emphasize
a
urgen
Their
need
r
fast
div
ds
ariet
wing
y
searc
seque
the
crea
of
of
structures.
e
thesis
Although
at
b
the
but
an
alternativ
of
c
indexi
splicing
metho
resp
commonly
p
in
2003
matc
mo
for
gene
problems
is
structural
of
and
prerequisi
omput
not
onal
e
teomics.
pro
particular,
p
e
vide
sux
the
to
acids
related
ast
structural
uc
of
um
olymers
Genomics
e
er
and
information
The
dieren
con
ould
is
functions,
no
the
el
dep
h
w
fast
on
hing
al
h
c
databases
o
e
sp
PDB.
of
existing
y
ds
Pro
vide
the
se
disfunction.
c
tan
times
allo
th
Starting
order
k
min
the
te
a
hours,
en
ev
to
da
molecular
our
to
h
of
ws
b
p
olyp
standard
the
e
pt
within
understanding
The
des
structure
e
based
n
a
the
r
acids
lized
success,
tree
termined
h
relations
extended
y
y
t
metho
y
for
ust
pro
y
imate
hing
NMR
sp
b
l
ectroscop
alphab
in
These
w
ets
the
on
dep
discretization
had
translation-
in
rotation-in
of
arian
Protein
measures
an
represen
Bank
th
agen
structure
whic
the
of
bac
i
one.
that
metho
the
w
illness.
ev
source
b
partic
applying
structure
ructural
eco
to
used
PDB
in
comparing
to
results
usefulness
established
y's
ols
resp
this
biology
On
one
In
the
ject
completion
it
progress
therefore
and
matc
structural
the
genomics
erimen
rese
demonstrate
arc
signican
h.
reduction
Since
the
its
time
rst
comparable
da
are
ys,
duced.
th
the
e
hand,
PDB
eral
has
ha
wit-
e
nessed
e
a
found
rapid
y
gro
approac
wth
that
at
iii
exp
Abstracth
yields
is
used
of
heuristics.
ral
ecause
trast
for
ds,
on
y
y
ols
a
con
y
list
in
metho
curren
are
iden
ltered
automatic.
is
b
fast
to
(sparse)
sequence-based
whic
result
partition
Another
ec
b
al
tribution
whic
in
is,
a
con
they
to
d
tly
of
metho
fully
missing
the
other
out
approac
tifying
The
frequen
h
t
based
motifs
the
and
computation
for
the
partitioning
similarit
a
matrix
database
h
of
a
protein
using
structures
sp
according
t
to
clustering
structural
gorithm.
similarit
ivelieving
hard
curren
Mel
to
supp
forw
for
regarding
Ec
hers
me
Griebsc
pro
of
help
u
an
ermanen
ou
the
f
-
t
olk
W
lter
eagues
and
to
excellen
w
able
to
am
and
y
ideas.
y
friends.
for
ould
are
mem
lo
aluable
graphics,
Group,
y
a
urthermore,
Alexander
y
y
oth
Holzapfel,
him
ÿler,
hic
t
of
of
ne
Also,
I
and
sp
on
and
the
not
ort
ossible
writi
grateful
y
m
brother
ak
i
advisor
v
for
not
indebted
thank
o
t
T
the
and
nshine
in
h
ers
supp
help
creation
t
am
ort
to
particular
friend
v
h.
Jens
w
the
thank
l,
te
of
hael
Heun,
,
I
Hans-
en
Bernd
Karl-Heinz
Thomas
W
a
hreiter
My
ac
of
fundamen
Buc
science,
t
hemistry
and
am
absolutely
e
deserv
Kusser
computer
w
thanks,
protein
an
F
w
b
a
great
without
the
thesis
ys
I
g,
to
to
Moritz
.
m
m
No
and
o
Klaus
fruitful
pushed
the
and
ons
the
am
Last,
eing
I
Ma
to
t
Fisc
yr,
eing
the
for
kno
me
t
Y
the
y
former
Thank
t
m
b
y
v
e.
of
ort
ha
the
Ecien
of
supp
I
Algorithms
thank
of
ul
in
m
through
dear
Stefan
Alexander
o
lic
khardt,
F
out
I
Ernst,
an
couple
to
Hal
m
last
former
V
ac
e
Mic
er
F
ears.
e
Klaus
Udo
p
eitz,
Sv
Joac
am
Brenner,
K
Lic
t,
and
Nie
for
and
Sc
a
First
Sc
kinger.
for
lot
e
coll
hing
een
the
Arno
tals
insigh
computer
h
mathematics,
all,
c
r
.
in
I
Jan
grateful
this
Uw
h
Römers
theoretical
Anselm
e
for
w
t
ecial
ork
science
the
this
structure
imp
ject.
ork
or
for
un
ould
eliev
t
and
exist
supp
lot
during
their
last
ts
da
ort.
of
freedom
n
am
I
thank
indebted
to
m
realize
family
Maaÿ
The
without
of
Johannes
y
y
Holger
w
m
y
father
for
Täub
wn
g
disc
me
w
ard
ssi
sa
I
ed
and
da
Ernst
.
b
but
deeply
least,
true
w
p
t
Moreo
Anja
c
wledgmen
write
b
eople.
thankful
osub,
h
v
her
er,
b
I
patien
am
and
thankful
b
to
in
Er
all
nst
time.
Ba
ou
y
m
er
su
for
.
generous
y
tec
so
hnical
uc
supp
for
ort.
our
F
v
or
v
great
Avi.
.
.
n
.
.
.
c
.
.
.
.
.
.
Computational
.
.
.
.
.
.
and
Determines
.
.
.
.
.
.
.
.
s
.
1.1.2
.
F
.
.
.
Structure
.
.
.
.
.
.
.
.
.
ject
.
.
.
.
.
.
.
.
.
.
.
F
cids
.
.
Nucleic
.
.
.
and
.
.
.
2.3.2
.
.
.
DNA
.
.
.
Bank
.
.
.
F
.
.
.
.
5
-T
.
.
.
31
The
.
.
Sup
Sequenc
.
.
.
.
.
.
.
i
.
2.2
.
.
F
.
.
.
.
.
.
2.2.9
.
.
.
.
.
.
.
.
2.2.1
.
.
.
dules
.
.
.
.
the
.
.
.
.
.
.
.
.
1.2
.
.
.
.
.
.
.
.
.
.
.
of
.
Bac
.
.
.
.
.
.
2.4
.
.
.
.
.
.
.
.
.
P
.
.
.
.
Motiv
.
.
.
.
A
Biop
.
1
.
.
.
.
.
.
.
.
.
.
.
2
16
.
Structure,
Genome
.
.
.
Determines
.
.
2.2.7
.
.
.
.
unc
.
.
.
.
.
.
.
.
.
.
.
.
20
.
Reading
.
.
.
.
.
.
.
.
2
.
.
.
.
.
F
of
.
.
.
.
.
.
.
.
.
.
.
.
.
.
2.2.10
.
.
no
.
.
.
Basic
.
.
.
Proteins
.
.
.
.
20
.
cids
.
tral
.
.
.
.
.
.
2.2.2
.
.
.
Bac
2.3.1
.
.
.
.
.
.
.
.
Bioinformatics
.
.
.
1
.
.
.
.
Biosyn
.
.
Imp
.
.
.
.
.
6
.
.
.
of
22
.
RNA
one
.
Proteome
.
.
.
.
.
.
.
.
.
.
Protein
.
.
.
.
.
.
.
.
.
.
2.2.4
.
.
.
.
3
.
29
.
Matc
.
.
.
.
2
.
.
.
1
.
.
.
F
.
oundation
.
.
29
2.1
.
.
(T
olymers
.
.
.
.
.
.
.
.
.
1.1.1
.
.
.
.
.
.
.
ation
.
.
.
.
.
.
.
Human
.
.
2.2.6
1.1.3
ersecondary
.
Motifs,
.
Domains
.
.
e
.
.
.
Pro
.
.
.
Structure
17
.
Comme
In
t
.
.
F
.
.
.
.
.
.
.
t
.
.
.
.
.
.
.
on
.
5
.
.
.
Proteins
.
.
.
.
.
.
.
.
2.2.8
.
urther
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
1.1.4
.
.
.
.
.
20
termination
Protein
.
olding
.
.
.
.
Structure
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
20
5
Protein
.
unctions
Ami
.
.
.
A
.
.
.
as
.
.
.
Mo
.
.
.
of
.
.
.
.
.
duction
.
.
.
.
.
.
2.3
.
A
.
and
.
Cen
.
Dogma
.
.
.
.
.
.
.
.
.
.
5
.
.
.
Hi
.
.
.
toric
.
.
20
kground
DNA
.
RNA
.
.
2
.
.
.
.
.
.
.
Structural
.
.
.
1
.
.
.
and
.
.
.
xix
.
.
.
Proteomics
.
.
20
The
Protein
.
thesis
.
.
.
.
.
.
.
.
.
.
.
.
ortance
.
.
.
.
.
2.2.3
.
.
.
ormation
.
.
.
the
2.3.3
the
and
kb
Structure
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
1.1
.
.
23
.
The
.
Data
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
8
.
.
.
Protein
.
.
.
.
.
.
23
.
Algorithmic
.
oundation
.
3.1
.
attern
.
hing
.
.
.
.
3
.
.
.
.
.
.
.
Bio
.
.
.
ts
.
.
.
hemical
.
.
.
.
.
.
ten
Preface
tro
.
.
.
De
.
.
.
.
.
.
.
3.2
.
tomic
.
.
.
rees
.
ries)
11
.
2.2.5
.
Sec
.
ond
.
ary
.
Structure
.
Elemen
.
ts
.
.
.
.
.
.
.
.
.
.
.
.
vii
.
Con
+Σ.
ore:
.
hing
.
.
.
.
.
.
.
s
.
hing
.
.
.
.
.
Insertions
.
.
.
Fi
.
tication
erage
.
.
T
.
rees
.
4.1
of
Issues
.
.
.
Represen
.
.
.
.
.
of
.
.
.
ks
.
.
.
The
.
.
.
.
anc
.
Construction
.
.
113
Searc
.
.
.
.
.
.
77
and
.
.
ree
.
P
.
.
.
.
.
.
Sequences
the
.
.
.
.
Computation
.
.
.
.
.
.
.
of
.
.
.
.
.
.
.
.
.
.
.
Similarit
.
.
.
.
100
rees
.
Measures
.
.
.
.
.
.
.
of
.
.
4.6.3
.
.
.
.
.
Sux
Arithmetic
.
.
.
.
Ge
73
ein
Bac
.
.
.
.
.
s
.
et
.
.
.
eptide
p
.
on
Prop
.
.
.
.
.
.
.
4.7.4
.
.
hing
.
Structure
.
.
.
.
.
.
.
.
oleran
T
.
.
.
.
.
.
4.7.7
.
.
.
.
42
4.8
.
.
.
.
ormal
.
y
.
.
nc
37
.
4.5.3
.
.
.
.
.
.
the
.
.
.
.
.
Searc
Metho
.
Measures
.
.
.
.
Applications
.
.
.
.
Angl
.
.
Classication
.
.
4.6.1
.
.
.
.
.
ork
.
.
.
.
67
Computation
.
.
Measures
.
.
.
.
quen
.
.
.
.
T
Protein
y
.
34
.
.
.
40
Database
.
.
.
.
.
D
Searc
.
.
.
in
.
43
.
.
.
Sux
Structure
rie
Enco
.
ones
.
.
.
.
Databases
.
.
.
.
Le
.
is
A
Structure
.
.
.
.
.
4.7.2
W
P
4.4.3
Sux
.
.
emen
.
T
78
.
of
.
.
.
.
Searc
.
.
.
.
.
.
.
.
.
.
Searc
4.2.2
.
.
.
.
.
50
.
A
.
F
.
.
.
.
4.7.5
.
Common
.
.
.
.
.
.
.
4.7.6
.
Searc
.
.
.
.
.
.
55
.
.
.
Bond
.
.
.
.
hing
.
Deletions
.
.
.
.
.
.
.
.
.
.
Ev
.
.
.
.
.
4.5.2
.
Quali
.
and
.
.
.
orsion
.
.
4.8.1
.
.
.
.
.
.
.
.
Distributions
.
Dra
.
.
.
.
.
.
4.8.2
.
Fingers
.
.
.
yp
.
.
Curren
.
.
.
.
90
.
Classic
.
ge
56
.
.
.
Protei
.
.
.
.
.
42
4.8.4
.
.
.
.
.
.
olyp
.
.
.
.
.
.
.
Sux
Motifs
.
5.1
.
Motifs
.
.
.
.
Similarit
.
.
.
.
.
.
.
.
.
Previous
.
Motif
.
.
.
.
.
.
.
.
.
105
Di
A
.
ructures
Similarit
.
.
.
.
.
.
.
.
5.1.3
43
r
.
Substructur
33
.
.
.
the
.
.
.
.
110
-Based
Classication
.
.
.
.
3.4
.
.
.
.
.
.
.
4
The
.
.
4.2.1
.
.
.
(P
.
.
.
ext
String
.
.
hing
.
.
.
hing
.
.
.
T
.
.
.
.
.
4.4.2
.
Prot
.
neralized
.
.
4.7
T
Searc
.
via
.
ded
.
kb
.
.
Structure
.
.
.
.
.
.
.
s
.
.
.
.
.
.
.
37
4.7.1
.
s
.
(Information)
.
M
.
The
.
Alphab
.
.
.
.
Previous
.
.
.
.
77
.
Construction
Sux
the
49
olyp
.
Angles
Im
T
ork
.
l
.
.
.
tation
.
.
.
.
4.7.3
.
erties
.
the
Structure
AST
.
.
.
.
.
.
rees
.
.
.
.
.
.
.
hing
.
.
.
.
.
.
.
TRICIA
.
.
80
41
Exact
.
hing
.
.
.
.
Searc
.
.
.
.
.
.
.
Amino
.
4.5
.
.
.
and
.
cid
.
eature
.
.
.
tations
.
.
81
.
Finding
.
Longest
.
Substructure
.
.
.
.
.
.
.
.
.
.
.
.
.
84
.
T
.
t
.
hing
3.3
.
.
.
.
.
.
.
.
.
.
.
.
.
4.5.1
.
.
.
of
.
.
.
Angles
.
.
.
.
86
.
Searc
.
with
.
and
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
88
4.3
Applications
.
.
.
.
.
.
aluation
.
.
.
rees)
.
55
.
Service
.
F
.
.
.
Denition
.
t
.
Computation
.
.
.
T
.
.
.
Angles
.
.
90
.
Zi
.
Fingers
.
.
.
.
.
.
.
.
55
.
4.2
.
Angle
.
.
.
.
.
.
.
.
.
.
.
.
.
wbac
.
.
.
.
90
.
Searc
-T
Zinc
.
of
.
.
.
.
of
-T
.
e
.
.
.
.
.
.
.
.
.
.
.
.
t
.
.
4.8.3
.
hing
.
Zinc
.
n
.
rs
.
.
4.6
.
ds
.
of
.
.
.
n
.
.
.
y
.
.
.
.
.
.
92
.
Other
4.4
.
.
.
.
.
.
.
P
.
.
.
.
.
.
.
eptide
.
.
.
.
.
.
.
es
.
.
.
.
5
.
and
T
105
.
Iden
.
of
.
.
.
.
67
.
.
.
String-Based
.
.
.
y
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
5.1.1
.
W
.
on
.
Detection
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
4.6.2
5.1.2
.
of
st
v
.
St
e-Based
.
.
.
y
.
.
.
.
.
.
.
.
.
.
.
.
.
.
106
.
Extraction
.
F
.
e
4.4.1
t
.
e
.
.
.
.
of
.
.
.
.
.
.
.
Sux
.
67
.
.
.
Angle
5.2
ree
Structure
Similarit
.
.
.
CONTENTS
Compact
rees
.
.
.
.
.
.
105
Measures
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
5.2.1
.
SCOP
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
71
.
4.6.4
.
The
113
istance
viii

CCHCy
En
.
.
6
.
of
.
Results
.
.
.
Con
Query
.
En
.
of
.
.
.
.
.
.
.
.
.
.
.
Remarks
List
y
List
.
C
.
s
.
tries
.
cron
TH
Index
.
115
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
114
.
115
.
.
117
and
P
.
119
.
PR
.
tries
.
of
.
i
.
D
.
TH
.
E
.
/
.
133
5.2.2
Glossary
.
.
.
.
.
5.4
.
.
.
.
.
.
.
.
Database
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
5.2.3
.
5.3
116
.
Conclusion
tainedness
A
.
of
Similarit
AST
.
Results
.
B
.
of
.
OSITE
CA
127
ix
The
.
.
.
.
.
.
List
.
SCOP
.
tr
.
e
.
129
.
List
.
CA
.
En
.
131
.
List
.
Abbreviations
.
A
.
yms
.
Bibliograph
.
135
.
159
.
161
.
CONTENTSCONTENTS
x