Efficient use of a protein structure annotation database [Electronic resource] : application to packing analysis / by Kristian Rother
146 pages
English



Information

Published on 1 January 2007
File size: 2 MB

Excerpt

Efficient use of a Protein Structure Annotation
Database
Application to packing analysis
DISSERTATION
for the attainment of the academic degree
doctor rerum naturalium
(Dr. rer. nat.)
in the subject of Biology
submitted to the
Faculty of Mathematics and Natural Sciences I
Humboldt-Universität zu Berlin
by
Dipl.-Biochem. Kristian Rother
born on 11 April 1977 in Berlin
President of Humboldt-Universität zu Berlin:
Prof. Dr. Christoph Markschies
Dean of the Faculty of Mathematics and Natural Sciences I:
Prof. Thomas Buckhout, PhD
Reviewers:
1. Prof. Dr. Cornelius Frömmel
2. Prof. Dr. Ulf Leser
3. Janusz M. Bujnicki, PhD, DHabil
submitted on: 21 May 2006
Date of the oral examination: 20 September 2006
Abstract
In this work, a multitude of data on structure and function of proteins is compiled and
subsequently applied to the analysis of atomic packing. Structural analyses often require
specific protein datasets, based on certain properties of the proteins, such as sequence
features, folds, or resolution. Compiling such sets using current web resources is
tedious because the necessary data are spread over many different databases. To facilitate
this task, Columba, an integrated database containing annotation of protein structures,
was created. Columba integrates sixteen databases, including PDB, KEGG, Swiss-Prot,
CATH, SCOP, the Gene Ontology, and ENZYME.
The data in Columba revealed that two thirds of the structures in the PDB database
are annotated by many other databases. The remaining third is poorly annotated, partly
because the corresponding structures have only recently been published, and partly
because they are non-protein structures.
The Columba database can be searched by a data source-specific web interface at
www.columba-db.de. Users can thus quickly select PDB entries of proteins that match
the desired criteria. Rules for creating datasets of proteins efficiently have been derived.
These rules were applied to create datasets for analyzing the packing of proteins.
Packing analysis measures how much space there is between atoms. This indicates regions
where a high local mobility of the structure is required, as well as errors in the structure. In
a reference dataset, a high number of atom-sized cavities was found in a region near the
protein surface. In a transmembrane protein dataset, these cavities are frequently located in
channels and transporters that undergo conformational changes. A dataset of ligands
and coenzymes bound to proteins was packed at least as tightly as the reference data.
These results resolve several contradictions in the literature.
Keywords:
protein structure, databases, data integration, data quality, annotation, protein packing
Summary
In this work, a large body of data on the structure and function of proteins is collected.
Subsequently, the atomic packing density is examined in structural data. Studies of
structures often require tailor-made datasets of proteins. Criteria for selecting individual
proteins include, for example, properties of the sequences, the fold, or the resolution of a
structure. Creating such datasets with the means available on the web is laborious, because
the necessary data are spread over many databases. To simplify this task, Columba, an
integrated database for the annotation of protein structures, was created. Columba
integrates a total of sixteen databases, among them the PDB, KEGG, Swiss-Prot, CATH,
SCOP, the Gene Ontology, and ENZYME.
Of the PDB structures contained in Columba, two thirds are annotated by many other
databases. For the remaining third there is only little additional information, partly
because the corresponding structures have only recently entered the PDB, and partly
because they are not actual proteins.
The database can be searched through a web interface at www.columba-db.de, specifically
for individual source databases. In this way, a user can quickly assemble a dataset of
structures from the PDB that meet the chosen requirements. Rules were established with
which datasets can be created efficiently.
These rules were applied to create datasets for analyzing the packing density of proteins.
Packing analysis quantifies the space between atoms and can find regions in which there is
high local mobility or which contain errors in the structure. In a reference dataset, a large
number of atom-sized cavities was thus found just below the protein surface. In
transmembrane domains, these cavities occur particularly often in channel and transport
proteins that undergo conformational changes. In protein-bound ligands and coenzymes, a
packing density similar to that of the reference data was observed. With these results,
several contradictions in the literature could be resolved.
Keywords:
protein structure, databases, data integration, data quality, annotation,
packing density
Acknowledgements
I would like to thank the following people for their support during my graduate studies:
Ulf Leser for continuous supervision of the Columba project, Robert Preißner for
much helpful advice, Silke Trissl for her commitment to the Columba Web interface and
countless bug fixes in Columba, Raphael Bauer for writing and unscrambling all
BioPerl-related stuff, Heiko Müller for advice in creating the data model, Stefan Günther for
writing the CATH module and analyzing DNA-protein interfaces, Elke Michalsky for
mathematical advice and providing data on ligands from the PDB, Philipp Hussels for
implementing the XML interface, Thomas Steinke for maintaining the Columba
webserver, Patrick May and Ina Koch for providing PTGL data, Eike Staub and Antje
Krause for providing SYSTERS data, Bingyu Zhu for inspiring the Columba database
name, Lindy-Lynn Bright for advice on my English, Peter Hildebrand for his patience
and martial arts entertainment, Pico for remaining calm most of the time, Björn Peters
for advice on general aspects of scientific research, the whole AG Proteinstrukturtheorie,
the Open-Source community for most of the software I used, and Nils Goldmann and my
family for all their love and trust. Special acknowledgements go to Prof. Frömmel,
who never hesitated to provide me with the most challenging (and thereby interesting)
questions.
This project was supported by the German Ministry of Education and Research
(BMBF), grant no. 0312705B.
Contents
1 Preface: knowledge, its conservation and mining
I Introduction
2 Basic processes in life can be explored through protein structures
2.1 Proteins are ubiquitous to life
2.2 Protein structure data can be used to address important questions
3 Data on protein structures is organized in databases
3.1 The Protein Data Bank PDB
3.2 Fold and family classification databases
3.3 Protein sequence databases
3.4 Databases describing enzymatic and metabolic function of proteins
3.5 Non-redundant subsets of protein databases
3.6 Small molecular compound databases
3.7 Other useful resources on protein structures
4 Data integration gathers biological data by technical means to use it efficiently
4.1 Challenges in integration of biological data and possible solutions
4.2 Relational databases are a key technology for data integration
4.3 Existing integrated databases on protein structures
5 Application: Packing analysis of protein structure datasets gives clues about their quality and function
5.1 Packing density of protein atoms
5.2 Internal cavities in structures
5.3 Sites where packing has functional consequences
6 Tasks addressed in this work
II Material and Methods
7 Annotation on protein structures from 16 databases is integrated in the Columba data warehouse
7.1 Data sources integrated in Columba
7.2 Constructing a star shaped data model around PDB entries
7.3 Modular annotation workflow filling the database
7.4 Analyzing completeness and redundancy of data in Columba
7.5 Ways to access data in the Columba database
8 Packing of distinct regions in protein structures is quantified by an improved Voronoi procedure
8.1 Datasets used for atom packing analysis
8.2 Definition of atomic subsets in protein molecules
8.3 Calculation of local atomic packing densities
8.4 Identification and localization of atom-sized cavities
III Results and Discussion
9 In the Columba database, two thirds of the entries in the Protein Data Bank are well-annotated
9.1 Quantity and quality of the data in Columba
9.2 Completeness of secondary annotation on the PDB
9.3 Redundancy within the data quantified by Shannon entropy and maximum redundancy
9.4 Discussion of the implementation of the Columba integrated database
9.5 The Columba database is made available via a web interface, dump files, and third-party software
9.6 To create a protein structure dataset, seven questions need to be an-
s
