Automatic validation of glycan sequences in distributed databases [Elektronische Ressource] / presented by Hiren Joshi

ruprecht-karls-universitat_heidelberg - Hiren Joshi


133 pages

English


Subjects

Biology

Information

Published by	ruprecht-karls-universitat_heidelberg
Published on	1 January 2008
Number of reads	28
Language	English
File size	3 MB

Excerpt

Dissertation
submitted to the
Combined Faculties for the Natural Sciences and for Mathematics
of the Ruperto-Carola University of Heidelberg, Germany
for the degree of
Doctor of Natural Sciences
presented by
Masters in Biomedical Engineering, Hiren Joshi
born in London, United Kingdom
Oral examination: TBA

Automatic validation of glycan sequences
in distributed databases

Referees:
Prof. Dr. Sabine Strahl
Prof. Dr. Roland Eils

Automatic validation of glycan sequences in
distributed databases

Hiren Joshi

First supervisor: Prof. Sabine Strahl

The study of glycosylation is an emerging field that aims to understand the structure, synthesis and
function of the commonly found molecules known as glycans. This integrative area of study covers
many sub-domains such as chemistry, biology and informatics.

Structurally, glycans are aggregate molecules, composed of a set of monosaccharides linked together
through glycosidic linkages to form oligo- or polysaccharides. Glycans are commonly found conjugated
to other molecules such as proteins and lipids, or as free molecules in their own right.

The biosynthesis of glycans is not template-based — a simple reading of the genome or the proteome
will not yield any information as to the total set of glycans in a system (also known as the glycome). In
fact, the biosynthesis of glycans follows a complex pathway that depends on many factors, resulting in
unique glycosylation profiles for sub-systems and tissues.

In order to overcome the inability to predict the complete glycome, efforts have been made to develop
glycan databases that collect the observed glycans for a diverse set of systems. These databases are
critically important to the development of the field, as they represent the knowledge base; the ability to
query and search this information is an integral part of the toolset that researchers use to make further
discoveries.

The open model for database curation is attractive for the low-maintenance, decentralised nature of
data collection. However, the use of open curation has implications for the quality of the data, which
must be maintained through the implementation of validation procedures. This thesis is an investigation
into the use of automatic validation techniques within openly curated databases.

One possible application for this technology is within the EUROCarbDB project, a Europe-wide
effort to build an openly curated database that allows for easy deposition of glycan structures as well as
the associated primary data used to establish each structure.

The main aims of this thesis are to establish a method for automatically validating structures deposited
into a glycomic database in order to maintain its quality; to ascertain whether it is possible to use these
validation methodologies to obtain an estimate of the size of the human glycome; and to promote the
distributed nature of the EUROCarbDB project, so as to provide a testing ground for the validation
techniques.

This thesis has several main achievements. First and foremost, I developed an algorithm for
validating structures against pathway data. Also, I investigated the ability of the pathway and
enzymatic data to explain the synthesis of human structures. Anomalous enzymatic data was identified
for further examination, and the size of the human glycome was successfully estimated. Finally, a
networking layer was established within the EUROCarbDB, necessitating the future use of the
validation algorithms.

Automatische Validierung von Glykanstrukturen in verteilten Datenbanken
(Automatic validation of glycan sequences in distributed databases)
Hiren Joshi

Supervisor: Prof. Sabine Strahl

Glycobiology integrates scientific findings from chemistry, biology and informatics with the aim of
investigating the structure, synthesis and functions of glycans.

Glycans consist of monosaccharides that are joined through glycosidic linkages to form oligo- or
polysaccharides. They are mostly found bound to proteins or lipids, or as free molecules. The structure
and synthesis of the totality of all glycans (the glycome) is not directly encoded in the genome or
proteome; instead, synthesis proceeds through an enzyme cascade and is influenced by factors such as
the expression profile of the enzymes and the availability of monosaccharide substrates.

The result is a heterogeneous glycosylation profile across the different tissue types and anatomical
structures.

Databases serve to archive functional and structural information about glycans and enable its analysis
by computational approaches. These databases, and particularly the quality of the data they contain,
are of central importance to the research community.

For collecting research results produced in a decentralised manner into databases, an open curation
model is attractive because the producers of the data enter it into the database themselves. Ensuring
high quality and conformity of the data requires the development of automated methods.

The objective of this work is the development of methods for quality control in databases using
automatic approaches to the validation of glycan structures. The methods developed are applied to
derive the size of the human glycome, and the technologies developed are integrated within the
EUROCarbDB project. This project is concerned with establishing an openly curated database for
archiving glycan structures and the associated measurement data used for structure elucidation.

The results comprise the development of an algorithm for validating glycan structures using
information about their biosynthesis. Unverifiable enzyme data and glycan structures were identified,
and the size of the human glycome was determined. Furthermore, EUROCarbDB was equipped with
networking functionality that applies the methods developed in this work.
Contents
List of Figures
List of Tables
Abbreviations
Symbols
Nomenclature
Preface
1 Glycobiology and glycobioinformatics
1.1
1.1.1 History
1.1.2 Structure
1.1.3 Function
1.1.4 Disorders and disease
1.1.5 Analysis
1.2 Glycobioinformatics
1.2.1 Databases
1.2.2 The need for glycan databases
1.2.3 Centralised curation
1.2.4 Open curation
1.3 Summary
2 Biology of structural verification
2.1 Biosynthesis of glycans
2.2 Bioinformatic analyses of pathway information
2.3 Bias in existing structural databases
2.4 Human system as a model system
2.5 Summary
3 Methods
3.1 Enzyme data
3.1.1 Data collection
3.1.2 Data verification
3.2 Pathway data
3.2.1 Data collection
3.2.2 Data verification
3.2.3 Mathematical operators for glycans
3.2.4 Glycosidase modelling
3.3 Validation of sequences
3.4 Software models and implementation
3.4.1 Model for data storage
3.4.2 Software model for analysis
3.4.3 ...e implementation for analysis
3.5 Summary