Thesis submitted for the degree of

Doctor of the Université Louis Pasteur Strasbourg 1
Discipline: Life Sciences
Specialty: Bioinformatics

by

Ravi Kiran Reddy KALATHUR

An integrated systematic approach for storage, analysis and visualization of
gene expression data from neuronal tissues acquired through high-throughput
techniques
(Approche systématique et intégrative pour le stockage, l’analyse et la
visualisation des données d’expression génique acquises par des techniques
à haut débit, dans des tissus neuronaux)

Publicly defended on 15 January 2008 before the jury:

Thesis supervisor: Olivier POCH, IGBMC, Illkirch
Internal reviewer: Brigitte KIEFFER, IGBMC, Illkirch
External reviewer: Christian GRIMM, UZH, Zurich
External reviewer: Pascal BARBRY, IPMC, Valbonne
Examiner: Jean-Marie WURTZ, IGBMC, Illkirch
Invited member: Thierry LÉVEILLARD, INSERM unit 592, Paris

Dedicated to my Mother

Acknowledgements

I would like to thank the members of my jury, Pascal Barbry, Christian Grimm,
Jean-Marie Wurtz, Brigitte Kieffer and Thierry Léveillard, for taking time out of
their schedules to review my thesis.
I would also like to thank the director of the IGBMC – Dino Moras and Jean-Claude
Thierry for their support.
At the outset, I would like to thank my guide and mentor Olivier Poch for
giving me the opportunity to work in his laboratory over the past three years. He has
been a source of inspiration and has been very patient with me, especially over the past few
months. Working with Olivier has been a positive and enriching experience for me
and will remain with me for a long time to come.
I am immensely grateful to Julie for patiently proofreading our articles and
chapters of my thesis and I am indebted to her for this.
All the members of my laboratory have been a great source of strength and support to
me. A special thanks to Wolfgang Raffelsberger for his constant guidance. Nicolas
Wicker and I worked together towards two of my projects and he has helped me with
my thesis to a great extent.
I would like to thank Nicolas and Guillaume for their extensive help with
respect to the RETINOBASE project. I would also like to thank Raymond Ripp for
helping me with RETINOBASE and guiding me throughout the writing of this
chapter.
I would like to thank Radhouene for helping me to translate my resume into
French and Anne for helping me with thesis formatting and for asking me “ça va!”
every morning.
Laetitia and Naomi were the first end users of the database that I designed
during my thesis; I appreciate their comments and suggestions.
My stay in the lab would not be complete if I did not mention the help from
my other colleagues including Luc, Vero, Annaick, Yann, Laurent-Philippe, Yannick-
Noel, Fred, Laurent, Odile, Ngoc-Hoan, David, Sophie and Emmanuel.
Once again, I would like to sincerely thank Olivier Poch and his team; without
their help this thesis would not have been possible.
I would like to thank Thierry Léveillard for his valuable suggestions regarding
RETINOBASE.
Thanks to Serge for maintaining our servers. I would like to thank Laetitia for
helping me out with my documents for the university and travel arrangements.
This entire project has been supported by RETNET, funded through the European
Union research programme MRTN-CT-2003-504003. I would like to thank all the
members of RETNET for their valuable suggestions and guidance.
On a personal front, I would like to thank Raj and Ullas and their families for
creating for me a home away from home and for always being there for me. To all my
friends here at the IGBMC, who have been a stronghold and a great support to me: I
am glad I have friends like you. Murugan, Beena, Jai, Amit and Harshal, thanks for
being around!
Last but not least, my heartfelt gratitude goes out to my dear parents, my
brother and sister-in-law for always being there for me. This would not have been
possible without you, and so I dedicate this work to you.
List of abbreviations
Abbreviations (Computer and Statistics)
AIC: Akaike Information Criterion
BIC: Bayesian Information Criterion
CEM: Competitive Expectation Maximisation
CGED: Cancer Gene Expression Database
CSS: Cascading Style Sheets
DBMS: DataBase Management System
EM: Expectation Maximisation
FTP: File Transfer Protocol
HTML: Hypertext Markup Language
LDAP (netscape): Lightweight Directory Access Protocol
MLA: Maximum Likelihood Approximation
ODBMS: Object Database Management System
PHP: Hypertext Preprocessor
RDBMS: Relational Database Management System
SQL: Structured Query Language


Abbreviations (Biology and Bioinformatics)
CEL: Affymetrix file format containing information about each probe on the chip,
extracted from the image data by the Affymetrix image analysis software.
dChip: DNA-Chip Analyzer
DDBJ: DNA Data Bank of Japan
DNA: Deoxyribonucleic acid
EMBL: The European Molecular Biology Laboratory
EST: Expressed Sequence Tag
FASABI: Functional And Statistical Analysis of Biological Data
GEO: Gene Expression Omnibus
GO: Gene Ontology
GOLD: Genomes OnLine Database
KEGG: Kyoto Encyclopedia of Genes and Genomes
MAS 5.0: Affymetrix Microarray Suite 5.0
MIAME: Minimum Information About a Microarray Experiment
OMIM: Online Mendelian Inheritance in Man
PCR: Polymerase Chain Reaction
RD: Retinal Disease
RETNET: European Retinal Research Training Network
RISC: RNA-induced silencing complex
RMA: Robust Multi-array Analysis
RNA: Ribonucleic acid
SAGE: Serial Analysis of Gene Expression
SIEGE: Smoke Induced Epithelial Gene Expression
SNP: Single Nucleotide Polymorphism
SPR/SPRR: Small Proline-Rich Proteins
SR: Serine/Arginine
UTR: UnTranslated Region

List of contents

Chapter 1. Foreword
Introduction
Chapter 2. Biology and Bioinformatics
2.1 Central dogma of molecular biology
2.2 Timeline of major events in informatics, molecular biology and bioinformatics
2.3 Genome sequencing
2.4 Interdependency between computational and experimental techniques
Chapter 3. Biological Databases
3.1 General introduction to databases
3.1.1 Object Database Management System (ODBMS)
3.1.2 Relational Database Management System (RDBMS)
3.1.1.1 Advantages of relational model
3.2 Biological databases
3.2.1 Specific features of biological databases
3.2.2 Database challenges
3.3 Classification of biological databases
3.4 Important databases for molecular biology
3.4.1 Nucleic acid sequence databases
3.4.2 Protein sequence databases: the UniProt databases
3.4.3 Protein structure database
3.4.4 Pathway database: KEGG database
3.4.5 Gene Ontology database
3.4.5.1 DAVID bioinformatic resources
3.5 Sequence Retrieval Software (SRS)
3.5.1 Entrez - National Center for Biotechnology Information (NCBI)
3.6 UCSC genome browser for gene localization
3.7 Microarray databases
3.7.1 Types of Microarray databases
3.7.1.1 General repositories
3.7.1.1.1 Gene Expression Omnibus (GEO)
3.7.1.1.2 ArrayExpress
3.7.1.2 Microarray database system
3.7.1.2.1 BioArray Software Environment (BASE)
3.7.1.2.2 Mediante
Chapter 4. Transcriptome
4.1 Regulation of gene expression
4.1.1 Transcriptional control in eukaryotic cells
4.1.2 Posttranscriptional controls
4.1.2.1 Capping
4.1.2.2 Splicing
4.1.2.3 RNA transport and localization
4.1.2.4 Proteins that bind to the 5’ and 3’ untranslated regions of mRNAs mediate negative translational control
4.1.2.5 Gene expression can be controlled by a change in mRNA stability
4.1.2.6 RNA interference is used by cells to silence gene expression
4.2 Tools for mining the transcriptome
4.2.1 Gene-by-Gene methods
4.2.2 Global methods
4.2.2.1 Expressed Sequence Tag (EST) sequencing
4.2.2.2 Serial Analysis of Gene Expression (SAGE)
4.2.2.3 Microarray based methods
4.3 Applications of microarrays
Chapter 5. Microarray Data Analysis
5.1 Experimental design
5.1.1 Replicates
5.2 Data preprocessing and normalization
5.2.1 Different algorithms used in preprocessing of oligonucleotide array data
5.3 Biological analysis through differential expression
5.3.1 Two-conditional setting and independent multi-conditional setting
5.3.2 Cut-off and multiple testing
5.4 Data clustering
5.4.1 Hierarchical clustering
5.4.2 Partitioning clustering: K-means clustering
5.4.3 Mixture Model clustering
5.5 Introduction to meta-analysis
5.5.1 Meta analysis in gene expression studies
Chapter 6. Retina
6.1 Retinal development
6.2 Overview of retinal anatomy and physiology
6.3 Common retinal diseases
6.3.1 Retinitis pigmentosa (RP)
6.3.2 Glaucoma
6.3.3 Age related macular degeneration (AMD)
6.3.4 Leber's congenital amaurosis (LCA)
6.3.5 Retinoblastoma (Rb)
6.3.6 Diabetic Retinopathy (DR)
6.4 Models for the congenital retinal disorders
6.5 Retinal transcriptome
Materials and Methods
Chapter 7. Informatic and Bioinformatic resources
7.1 Informatics resources
7.1.1 Calculation and data storage options
7.2 Data sources: Gene Expression Omnibus database
7.3 Programming languages
7.3.1 Normalization software
7.3.2 R and Bioconductor
7.4 DAVID software for gene ontology analysis
Results and Discussion
Chapter 8. A Maximum Likelihood Approximation method for Dirichlet's parameter estimation (Publication 1)
8.1 Scientific context
8.2 Datasets used and identification of clusters
8.2.1 Biological significance of the clusters
8.3 Discussion
Chapter 9. Multi-Dimensional Fitting for transcriptomic data analysis (Publication 2)
9.1 Scientific context
9.2 Datasets
9.2.1 Pre-processing of the datasets
9.3 Analysis of the results obtained after MDF transformation
9.4 Bio-analysis of the probesets exhibiting 0 to 6 shifts
9.4.1 Functional analysis of the 0-6 shifts group
9.4.1.1 Enriched GO terms in neuronal target matrix
9.4.1.2 Enriched GO terms in non-neuronal target matrix
9.4.1.3 Enriched GO terms in common criteria
9.5 Cross-checking results with simple classical method
9.6 Conclusion
9.7 Discussion
Chapter 10. Architecture, data query system and data visualization aspects of gene expression database: RETINOBASE (Publication 3)
10.1 Scientific context
10.2 Overview of a microarray experiment
10.3 RETINOBASE architecture
10.4 Experiments available in RETINOBASE
10.5 Data analysis of the experiments
10.6 Querying the RETINOBASE
10.6.1 Gene Information
10.6.2 Experiment information
10.6.3 Signal intensity system analysis
10.6.4 Fold change System Analysis
10.7 Signal Intensity visualization system
10.8 Downloading results and User manual
10.9 Future developments
10.10 Conclusions
10.11 Example applications of RETINOBASE
10.11.1 RetChip
Conclusions and perspectives
References
Annexe
List of Figures
Figure 1. The Central Dogma of Molecular Biology.
Figure 2. Illustration of the rapid growth in the number of sequenced genomes.
Figure 3. The interplay and co-dependency of experimental and computational.
Figure 4. Relational database model for Gene Subsystem catalogue.
Figure 5. Classical and systems biology roles of life-science databases.
Figure 6. Yearly growth of protein structures in PDB.
Figure 7. Representation of retinol metabolism in animals in KEGG PATHWAY database.
Figure 8. Example of gene annotation with GO.
Figure 9. Screenshot of SRS web server at IGBMC.
Figure 10. Entrez nodes.
Figure 11. UCSC genome browser interface.
Figure 12. Screenshot of a typical DataSet record.
Figure 13. Architecture of ArrayExpress database.
Figure 14. Simplified schematic overview of software structure of BASE.
Figure 15. Steps at which eukaryotic gene expression can be controlled.
Figure 16. Different hierarchies in DNA packaging.
Figure 17. Gene control region of a typical eukaryotic gene.
Figure 18. Possible post-transcriptional controls on gene expression.
Figure 19. Serial analysis of gene expression (SAGE) library construction.
Figure 20. An example of a cDNA Microarray Experiment.
Figure 21. The Use of Oligonucleotide Arrays.
Figure 22. Illustrates the relationship between perfect and mismatch probe sequences.
Figure 23. Origin of different types of replication.
Figure 24. Methods for quantification of differential gene expression in replicated experiments.
Figure 25. A drawing of a section through the human eye with a schematic enlargement of the retina.
Figure 26. Development of the eye from the neural tube through the optic vesicles and the inverted optic cup forming the retina.
Figure 27. Schema of the layers of the retina.
Figure 28. Graphical representation of the number of retinal disease genes that are mapped (blue) and identified (red).
Figure 29. Three clusters found using the MLA method and not found with the moments method.
Figure 30. Additional cluster obtained by only MLA method.
Figure 31. Illustration of the general principle of the MDF in the context of transcriptomics target and reference matrices.
Figure 32. Number of probesets that have at least one shift to zero after MDF.
Figure 33. The cumulative curves of the number of probesets.
Figure 34. After MDF, the number of probesets per number of shifts is plotted as a histogram for the neuronal and the non-neuronal target matrices.
Figure 35. Venn diagram representing the number of probesets present in 0-6 shifts group in both the target matrices and common probesets.
Figure 36. Multi-dimensional fitting (MDF) probesets characterization using GO analysis.
Figure 37. RETINOBASE home page.