Characterizing association parameters in genetic family-based association studies [electronic resource] / by Stefan Böhringer
193 pages
English


Published 1 January 2009

Excerpt

Characterizing Association Parameters in Genetic
Family-based Association Studies
Dissertation by
Stefan Böhringer
Institut für Humangenetik, Universität Duisburg-Essen,
Hufelandstr. 55, 45122 Essen, Germany
correspondence@s-boehringer.org
Submitted to the Department of Statistics
of the University of Dortmund
in Fulfillment of the Requirements for the Degree of Doktor der Naturwissenschaften
February, 2009
”Ich bin zwar nur ein Droschkengaul, -
doch philosophisch regsam;
der Fress-Sack hängt mir kaum ums Maul,
so werd ich überlegsam.
Ich schwenk ihn her, ich schwenk ihn hin,
und bei dem trauten Schwenken
geht mir so manches durch den Sinn,
woran nur Weise denken.
Ich bin zwar nur ein Droschkengaul, -
doch sann ich oft voll Sorgen,
wie ich den Hafer brächt’ ins Maul,
der tief im Grund verborgen.
Ich schwenkte hoch, ich schwenkte tief,
bis mir die Ohren klangen.
Was dort in Nacht verschleiert schlief,
ich konnt’ es nicht erlangen.
Ich bin zwar nur ein Droschkengaul, -
doch mag ich Trost nicht missen
und sage mir: So steht es faul
mit allem Erdenwissen;
es frisst im Weisheitsfuttersack
wohl jeglich Maul ein Weilchen,
doch nie erreicht’s - oh Schabernack -
die letzten Bodenteilchen.”
– Christian Morgenstern
– Dedicated to my nuclear family
Contents
1. Introduction
2. Principles of Genetic Association Studies
2.1. Genotypes
2.2. Shaping of genotype distributions
2.3. Phenotypes
2.4. Further aspects
3. Methods in statistical genetics
3.1. Linkage analysis
3.2. Association studies
3.3. The family based association test
3.4. Segregation analysis
3.5. Notes on predictive sampling strategies
3.6. Genotype-based methods
3.7. Practical aspects in association mapping
3.8. Wrapup
4. Notation and assumptions
4.1. Data
4.2. Assumptions used in the likelihood framework in this thesis
5. Likelihood framework
5.1. The Penetrance Model
5.2. Likelihood for a candidate region
5.3. Extensions
5.4. Parameter estimation
5.5. Statistical testing
6. Properties of the likelihood
6.1. Identifiability of conditionally decomposable likelihoods
6.2. Identifiability of likelihood L
6.3. Checking identifiability conditions for a concrete example
6.4. Consistency of MLEs
6.5. Identifiability of the random effects models
7. Bayesian approach
7.1. Data augmentation
7.2. Collapsing
7.3. Sampling strategy
7.4. Prior distributions and deterministic relationships
7.5. Densities for sampling distributions
7.6. Updating H
7.7. Updating of β
8. Simulation study
8.1. Parameter estimation and comparison of family structures
8.2. P under the null hypothesis
8.3. Ascertainment
8.4. Family structure
8.5. Haplotype analysis
8.6. Asymptotic normality of parameter estimates
8.7. Power comparison with the FBAT statistic
8.8. Random effects model
8.9. MCMC simulations
8.10. Computational issues
9. Alzheimer’s disease
9.1. The ApoE locus
9.2. The ApoE/Alzheimer’s data set
10. Discussion
10.1. Robustness
10.2. Guidance of experimental design
10.3. Ascertainment
10.4. Haplotype effects
10.5. Limitations
10.6. Biological relevance
10.7. Genome wide association scans
10.8. Future work
11. Acknowledgments
Appendix A. Abbreviations and Glossary
Appendix B. Haplotype reconstruction
B.1. C code for construction of H(G) in nuclear families
Appendix C. Appendix simulations
C.1. Implementation of the likelihood
C.2. Estimation of Fisher-Information
C.3. Data simulations for a single observed locus
C.4. Simulations for the two locus case
C.5. MCMC convergence plots
Appendix D. Alzheimer’s data set
References
1. Introduction
Scope of this Thesis
Human genetics tries to elucidate how genetic variation explains variation in observable
human traits. Recent developments have led to a change of paradigm in the analysis of
complex traits, i.e. traits that do not follow Mendelian inheritance and occur commonly in
the population (e.g. hypertension, diabetes, obesity, dementia). First, the whole genome
is investigated to identify genetic regions that may be associated with disease outcome
(genome wide step). In follow-up studies these regions are characterized more finely and
often genetic models for a given region based on the given disease are built (fine mapping
step) [41, 14, 65]. This thesis aims to improve the fine mapping of a genetic region based
on family data.
Background
The sequencing of the human genome [53, 106] paved the way for genome wide analyses
in complex disorders by making it possible to characterize common genetic variation. Further
developments were driven by the observation that many diseases are defined as the tail of
normal phenotype distributions (e.g. blood pressure - hypertension, body weight - obesity,
IQ - dementia, etc.). This supported the idea that the same common genetic variation
influencing normal traits should also be causal in common disease [65] (common gene -
common disease hypothesis).
Recent technological and conceptual developments make it now possible to analyze the
whole genome in individuals with respect to common genetic variation [60, 75, 41, 14]
(genotyping). In order to optimize the amount of genotyping, representative, so called
tagging markers (or tagging SNPs, see below; see e.g. [9, 14]; tagging stage) are chosen to
each represent a small genetic region. The first question is: Do the data support a genetic
contribution to the disease? In typical studies, ca. 500,000 tagging SNPs are investigated
and analyzed one by one. This imposes a multiple testing problem, which requires low
p-values ($p \approx 10^{-7}$; genome wide significance) for individual tests to be deemed significant.
General methods like Bonferroni-Holm correction or the false discovery rate are employed
as well as methods exploiting information about the genetic setting such as correlations
between tagging SNPs (see e.g. [89, 44]).
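As a rough illustration of the multiple testing problem described above, the following sketch computes a Bonferroni per-test threshold and applies the Holm step-down procedure to a list of p-values. The family-wise error rate of 0.05, the function names and the example p-values are assumptions made for this illustration and are not taken from the thesis.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test level that bounds the family-wise error rate by alpha."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


def holm_reject(p_values, alpha=0.05):
    """Holm step-down procedure; returns one rejection flag per input p-value."""
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is compared to alpha / (m - k + 1).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # stop at the first non-rejection
    return reject


n_snps = 500_000   # number of tagging SNPs quoted in the text
alpha = 0.05       # assumed family-wise error rate (not stated in the text)
threshold = bonferroni_threshold(alpha, n_snps)
print(f"Bonferroni per-test threshold: {threshold:.1e}")

example_p = [0.01, 0.04, 0.03]   # hypothetical p-values
print(holm_reject(example_p, alpha))
```

With 500,000 tests and an assumed level of 0.05, the Bonferroni threshold is of the order of 10^-7, which matches the genome wide significance level quoted above.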
Another goal is to understand the underlying biology that causes disease. Therefore, the
genome wide analysis is usually followed by a fine mapping step that investigates regions
identified in the first step more closely. Fine mapping might include adding markers and
replication of findings in an independent sample, sometimes based on a different study
design (family based vs. case control). Often, the statistical focus is still on testing rather
than on estimation or prediction. For example, the transmission disequilibrium test (TDT)
and its extensions test for genetic association [91] (a recent review [104] lists close to 200
extensions). As these tests are robust against population stratification (see below), they
sacrifice information that may otherwise improve genetic inference. This thesis introduces
a genetic model that allows direct biologic interpretation. The frequency of a causal genetic
variant at a marker (disease allele) that might be unobserved is estimated as well as its
penetrance on disease, modeled by logistic regression. These parameter estimates can
guide follow-up experiments in a direct manner. A full likelihood framework is presented,
which separates this work from earlier related methods that also use latent genotypes
(see [108, 107, 2]). The model can be applied to random samples of families as well as to
families sampled on the basis of multiple affected members.
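For orientation, the classical TDT mentioned above can be sketched in its McNemar form. This is a minimal sketch assuming a biallelic marker and transmission counts taken from heterozygous parents only; the names b and c and the example counts are hypothetical, and none of the extensions listed in [104] are covered.

```python
import math


def tdt_statistic(b: int, c: int) -> float:
    """McNemar-type TDT statistic.

    b: transmissions of the candidate allele from heterozygous parents
    c: transmissions of the other allele from heterozygous parents
    """
    if b + c == 0:
        raise ValueError("no informative transmissions")
    return (b - c) ** 2 / (b + c)


def chi2_1df_pvalue(x: float) -> float:
    """Upper tail probability of a chi-square distribution with 1 degree of freedom."""
    # P(chi2_1 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)) for standard normal Z.
    return math.erfc(math.sqrt(x / 2.0))


# Hypothetical counts: 60 transmissions of the candidate allele, 40 of the other.
stat = tdt_statistic(60, 40)
print(stat, chi2_1df_pvalue(stat))
```

The statistic uses only the transmission counts from heterozygous parents, which is the source of its robustness against population stratification and also the reason it discards other information in the data.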
Approach
In this thesis, a latent class model for family based association studies is proposed. The
model parameters are the joint distribution of observed markers and an unobserved true
disease locus in a genomic region and a penetrance parameter measuring the impact of
the putative disease allele on disease risk. An extension accounts for markers that are
not linked to the current region (i.e. marker observations are independent) by modeling
them via a random effect. First, a full likelihood setting of the model is presented and
asymptotic properties are studied. Additionally, a Bayesian framework for the model is
presented. In the Bayesian setting, a-priori knowledge that a given region is associated
with a disease outcome can be incorporated. Model properties are assessed in simulations
and the model is applied to an Alzheimer’s data set.
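The idea of a latent disease locus can be illustrated for a single individual. The sketch below is a strong simplification and not the likelihood of this thesis: it ignores family structure and ascertainment, and it assumes one biallelic marker, one biallelic disease locus, random pairing of haplotypes, and a logistic penetrance that is additive in the number of disease alleles. The haplotype frequencies, parameter values and function names are hypothetical.

```python
import math
from itertools import product


def penetrance(n_disease_alleles: int, beta0: float, beta1: float) -> float:
    """Logistic penetrance, assumed additive in the number of disease alleles."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * n_disease_alleles)))


def risk_given_marker(m1, m2, hap_freq, beta0, beta1):
    """Disease risk given the two observed marker alleles (m1, m2).

    hap_freq maps (marker allele, disease allele) to a joint haplotype
    frequency. The unobserved disease alleles are summed out.
    """
    risk = 0.0
    for d1, d2 in product((0, 1), repeat=2):
        # Conditional probability of each latent disease allele given its marker allele.
        p1 = hap_freq[(m1, d1)] / (hap_freq[(m1, 0)] + hap_freq[(m1, 1)])
        p2 = hap_freq[(m2, d2)] / (hap_freq[(m2, 0)] + hap_freq[(m2, 1)])
        risk += p1 * p2 * penetrance(d1 + d2, beta0, beta1)
    return risk


# Hypothetical joint frequencies: marker allele 1 is correlated with disease allele 1.
example_hap_freq = {(0, 0): 0.6, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.2}

for genotype in [(0, 0), (0, 1), (1, 1)]:
    print(genotype, risk_given_marker(*genotype, example_hap_freq, -2.0, 1.5))
```

In this toy setting the marker is informative about disease only through its correlation with the latent disease allele. Setting beta1 to zero makes the risk identical for all marker genotypes.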
Results
For the likelihood framework, identifiability is shown as well as consistency of parameter
estimates. Results of the simulations show that parameters of interest can be precisely
estimated with practically relevant sample sizes. Comparisons of different family structures
show that the model is robust to variations in family structure, a result that is relevant for
study design. To investigate how much power is sacrificed in a robust testing framework,
a comparison study was conducted, showing large power advantages for the latent genotype
model. This suggests that the model considered in this thesis can also contribute towards
gene identification in terms of power after the validity of assumptions has been checked.
An application to an Alzheimer’s dementia data set applies the methods to a real world
problem. Results from this data set agree with prior findings. This is reassuring as the
Alzheimer’s genetics exceeds t
