Fine scale mapping

FINE SCALE MAPPING. ANDREW MORRIS Wellcome Trust Centre for Human Genetics March 7, 2003. Outline. Introduction: fine scale mapping using high-density SNP haplotype data. Bayesian framework. Gene trees and the coalescent process. Genetic heterogeneity and shattered gene trees.

FINE SCALE MAPPING

ANDREW MORRIS

Wellcome Trust Centre for Human Genetics

March 7, 2003


Outline

  • Introduction: fine scale mapping using high-density SNP haplotype data.

  • Bayesian framework.

  • Gene trees and the coalescent process.

  • Genetic heterogeneity and shattered gene trees.

  • Markov chain Monte Carlo (MCMC) algorithm.

  • SNP genotype data.

  • Example: cystic fibrosis.


Introduction

  • Candidate region of the order of 1Mb in length.

  • Refine location of putative disease locus within region.

  • Make use of high-density maps of single nucleotide polymorphisms (SNPs).

  • Type sample of affected cases and unaffected controls.


Once upon a time…

  • Disease predisposition determined by single locus in candidate region.

  • Each case chromosome carries a copy of a disease allele, resulting from a single recent mutation event at disease locus.

  • Each control chromosome carries a copy of the ancient normal allele at the disease locus.


In an ideal world…

  • Excess sharing of SNP haplotypes in the vicinity of the disease locus, among cases and not among controls.

  • Decreased probability of sharing as distance from disease locus increases.

  • Approximate location of disease locus inferred.
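
The decay of sharing with distance can be illustrated with a minimal sketch. It assumes that the recombination fraction equals the genetic distance (reasonable only for short distances) and that meioses are independent across generations. The function name, the number of generations and the distances are hypothetical; the default rate of 0.5cM per Mb is the one assumed in the cystic fibrosis analysis later in the talk.

```python
def prob_no_recombination(distance_mb, generations, cm_per_mb=0.5):
    """Probability that a marker stays linked to the disease locus.

    Assumes recombination fraction ~ genetic distance (short distances only)
    and independent meioses in each generation.
    """
    theta = distance_mb * cm_per_mb / 100.0  # convert cM to Morgans
    return (1.0 - theta) ** generations

# Hypothetical example: 200 generations since the founding mutation.
for d in (0.05, 0.2, 0.8):
    print(d, round(prob_no_recombination(d, generations=200), 3))
```

The probability falls as the marker moves away from the disease locus, which is the pattern of haplotype sharing the method tries to exploit.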


Problems…

  • Gene tree and ancestral haplotypes are unknown.

  • Marker mutations lead to mismatch of alleles within preserved regions.

  • Multiple disease genes, multiple mutations, and dominance.


Example: Cystic fibrosis (CF)

  • Fully penetrant recessive disorder, incidence ~1/2500 live births in white populations, less common in other populations.

  • Preliminary linkage analysis suggested 1.8Mb candidate region for a single CF gene on chromosome 7q31.

  • More recently, a 3bp deletion, ΔF508, has been identified in the CFTR gene at ~0.88Mb into the candidate region.

  • Now known that ΔF508 accounts for ~66% of all chromosomal mutations in individuals with CF.

  • Remainder of CF chromosomes carry copies of many other rare mutations in the same gene.

  • 23 RFLPs used to identify haplotypes in 92 control chromosomes and 94 case chromosomes, 62 of which have been confirmed to carry ΔF508.


Challenges…

  • The ΔF508 locus does not lie at the centre of the region of high LD.

  • Non-ΔF508 case chromosomes are not expected to share the same founder marker haplotype.

  • Useful test-data set for fine-scale mapping methods…



Bayesian framework (1)

  • Assume disease locus exists in candidate region: aim is then to estimate its location.

  • Approximate the posterior distribution of location.

  • Allows assignment of probabilities that disease locus lies in any particular area of the candidate region.


Bayesian framework (2)

  • Aim is to approximate the posterior density of location of the disease locus, given SNP haplotypes in cases A and controls U, denoted f(x|A,U).

  • Depends on other model parameters M, including gene tree, population haplotype frequencies, etc…

  • Recover marginal posterior density by integration over these nuisance parameters,

    f(x|A,U) = ∫f(x,M|A,U)dM
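
In practice the integral is rarely evaluated directly. A minimal sketch, assuming joint draws of (x, M) from the posterior are already available (for instance from a sampler): ignoring the M component of each draw marginalises over it, and the posterior probability that x lies in an interval is estimated by the fraction of draws falling there. The function name and the draws are hypothetical.

```python
def interval_probability(draws, lower, upper):
    """Fraction of posterior draws whose location x lies in [lower, upper]."""
    xs = [x for x, _nuisance in draws]  # dropping M marginalises over it
    return sum(lower <= x <= upper for x in xs) / len(xs)

# Hypothetical joint draws: x in Mb, M as an opaque placeholder.
draws = [(0.82, "M1"), (0.91, "M2"), (0.47, "M3"), (0.88, "M4")]
print(interval_probability(draws, 0.8, 1.0))
```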


Bayesian framework (3)

  • By Bayes’ Theorem…

    f(x,M|A,U) = C f(A,U|x,M) f(x,M)

  • C: normalising constant.

  • f(A,U|x,M): likelihood of haplotype data given model parameters M and location x.

  • f(x,M): prior density of M and x.



Control chromosomes

  • Assumed to carry an ancient normal allele at the disease locus.

  • Effects of recent shared ancestry of less importance, so simple model assumed:

    f(A,U|x,M) = f(A|x,M) f(U|h)

  • The likelihood, f(U|h), depends only on population SNP haplotype frequencies, h.

  • For many SNPs, the number of possible haplotypes is large, so frequencies are parameterised in terms of allele frequencies and first-order LD between pairs of adjacent loci.
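
For n SNPs this replaces 2^n - 1 free haplotype frequencies with n allele frequencies and n - 1 adjacent-pair LD terms. The sketch below is one possible reading of such a parameterisation, not necessarily the one used in the talk: it assumes a first-order Markov structure along the chromosome and the usual pairwise LD coefficient D. All names and numbers are hypothetical.

```python
from itertools import product

def pair_freq(p_a, p_b, d, a, b):
    """Two-locus haplotype frequency from allele frequencies and LD coefficient d."""
    fa = p_a if a == 1 else 1.0 - p_a
    fb = p_b if b == 1 else 1.0 - p_b
    return fa * fb + (d if a == b else -d)

def haplotype_freq(hap, p, d):
    """First-order approximation: P(h_1) times the product of P(h_{i+1} | h_i)."""
    freq = p[0] if hap[0] == 1 else 1.0 - p[0]
    for i in range(len(hap) - 1):
        marginal = p[i] if hap[i] == 1 else 1.0 - p[i]
        freq *= pair_freq(p[i], p[i + 1], d[i], hap[i], hap[i + 1]) / marginal
    return freq

p = [0.3, 0.6, 0.5]   # hypothetical frequencies of allele 1 at three SNPs
d = [0.05, -0.02]     # hypothetical LD coefficients for adjacent pairs
total = sum(haplotype_freq(h, p, d) for h in product((0, 1), repeat=3))
print(round(total, 10))  # frequencies over all 8 haplotypes should sum to 1
```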


Gene trees

  • Representation of the recent shared ancestry of case chromosomes at the disease locus.

  • Star shaped tree: each case chromosome descends independently from founder. Counts shared ancestral events separately for each chromosome, so assumes there is too much information in sample about ancestral recombination and mutation events.

  • Bifurcating tree: shared ancestral recombination and mutation events between chromosomes appear only once in their shared ancestry.



Tree specification

  • Topology T: the branching pattern of the tree.

  • Branch lengths, τ, determined by the waiting times, w, between merging events in the gene tree.

  • Scaled in units of 2N generations, where N is effective population size.

[Figure: example gene tree, with the root and the leaf nodes labelled.]


Prior probability model

  • Uniform prior probability model for population haplotype frequencies, the location of disease locus, and the effective population size.

  • Each gene tree topology has equal prior probability.

  • Prior probability model reduces to:

    f(x,M) = C f(w)

  • Need prior probability model for waiting times between merging events.


The coalescent process (1)

  • Waiting time for the merging event that takes the tree from k to k-1 lineages.

  • Scaled in units of 2N generations.

  • Exponential distribution with rate k(k-1)/2.
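
The waiting-time model can be sketched in a few lines. This is an illustration of the standard constant-size coalescent only, with hypothetical function names and an arbitrary seed; times are in units of 2N generations.

```python
import random

def coalescent_rate(k):
    """Rate of the merging event that takes k lineages to k-1."""
    return k * (k - 1) / 2

def simulate_waiting_times(n, rng):
    """Draw waiting times w_n, ..., w_2 in units of 2N generations."""
    return {k: rng.expovariate(coalescent_rate(k)) for k in range(n, 1, -1)}

rng = random.Random(1)
times = simulate_waiting_times(8, rng)
tree_height = sum(times.values())  # time back to the common ancestor
expected_height = sum(1 / coalescent_rate(k) for k in range(2, 9))
print(tree_height, expected_height)
```

Most of the expected tree height comes from the final merging events, when few lineages remain and the rate is low.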


Worked examples of the expected waiting time, 1/rate:

  • k = 8 lineages: exponential with rate 8×7/2 = 28; expected time 0.0357.

  • k = 7 lineages: exponential with rate 7×6/2 = 21; expected time 0.0476.

  • k = 2 lineages: exponential with rate 2×1/2 = 1; expected time 1.


The coalescent process (2)

  • Assumes constant effective population size, N.

  • Flexible: can allow for exponential population growth and population sub-structure.

  • Assumes sample is ascertained at random from the population. Problem: case chromosomes ascertained because they carry a copy of the disease mutation.

  • Assumes sample has single common ancestor. Problem: genetic heterogeneity.


The shattered coalescent model

  • Generalisation of the coalescent process to allow branches of the gene tree to be removed.

  • Introduce indicator variable, z_b, for each node, b, taking the value 1 if b has a parent in the gene tree and 0 otherwise.

  • Allows for singleton leaf nodes, corresponding to sporadic case chromosomes, and disconnected sub-trees, corresponding to independent mutation events at the same disease locus.

  • Assume number of branches of gene tree not removed in the shattered coalescent process is given by a binomial distribution, with shattering parameter ρ.
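
A minimal sketch of shattering, assuming that each branch is kept independently with probability ρ (which makes the number of kept branches binomial) and that ρ is the retention rather than the removal probability. The tree, the node names and both helper functions are hypothetical.

```python
import random

def shatter(parent, rho, rng):
    """Keep each branch (child to parent link) independently with probability rho.

    `parent` maps every non-root node to its parent. Returns the indicator z
    (1 = branch kept, 0 = branch removed) for each node that has a branch.
    """
    return {b: 1 if rng.random() < rho else 0 for b in parent}

def subtree_roots(leaves, parent, z):
    """Map each leaf to the top node of its sub-tree after shattering."""
    roots = {}
    for leaf in leaves:
        node = leaf
        while node in parent and z[node] == 1:
            node = parent[node]
        roots[leaf] = node
    return roots

# Hypothetical gene tree on case chromosomes a-d with internal nodes e, f, g.
parent = {"a": "e", "b": "e", "c": "f", "d": "f", "e": "g", "f": "g"}
z = shatter(parent, rho=0.8, rng=random.Random(0))
print(subtree_roots("abcd", parent, z))
```

A leaf that maps to itself is a singleton (a sporadic case chromosome); leaves sharing a top node belong to the same sub-tree and so descend from the same mutation event.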


Ancestral haplotypes

  • Haplotypes, I, carried by internal nodes of the gene tree are unknown.

  • To calculate posterior probability, need to integrate over distribution of possible ancestral haplotypes, which depends on gene tree and other model parameters.

  • Treated as augmented data in Bayesian framework: enters posterior probability through likelihood…

    f(x|A,U) = ∫ ∫ f(x,M,I|A,U)dMdI

    and…

    f(x,M,I|A,U) = C f(A,U,I|x,M) f(x,M)


Likelihood calculations

  • If node has no parent in shattered gene tree, treat as a random chromosome from the population (sporadic or founder for mutation).

  • If node has parent in genealogy, depends on marker haplotype carried by the parental node, and the occurrence of recombination and mutation events along the connecting branch.



MCMC algorithm (1)

  • Need to calculate joint posterior distribution f(x,h,T,w,z,N,ρ,I|A,U).

  • Parameter space extremely complex, so cannot be calculated analytically.

  • Markov chain Monte Carlo (MCMC) algorithm approximates the posterior distribution by sampling from f(x,h,T,w,z,N,ρ,I|A,U).

  • Computationally intensive, but becoming more practical with improvements in computing power.

  • Can handle missing SNP data: treat as augmented data in the same way as ancestral haplotypes.


MCMC algorithm (2)

  • Let S denote current set of model parameters {x,h,T,w,z,N,ρ,I}.

  • Propose “small” change to model parameters, S*.

  • Accept S* in place of S with probability min{1, f(S*|A,U)/f(S|A,U)}, assuming a symmetric proposal.

  • If S* is not accepted, the current parameter set S is retained.

  • Initial burn-in to allow the chain to converge to f(S|A,U) from a random starting parameter set.

  • Subsequent sampling period, parameter set recorded every rth step of the algorithm: each recorded output represents a random draw from f(S|A,U).
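
The steps above can be sketched as a random-walk Metropolis sampler. This is a toy illustration, not the sampler used in the talk: the parameter set is reduced to a single number, the proposal is a symmetric uniform step, and the target is a made-up normal density. All names and settings are hypothetical.

```python
import math
import random

def metropolis(log_post, start, step, n_burn, n_sample, thin, rng):
    """Random-walk Metropolis with burn-in and thinning (symmetric proposal)."""
    s, lp = start, log_post(start)
    out = []
    for i in range(n_burn + n_sample):
        s_new = s + rng.uniform(-step, step)  # "small" change to S
        lp_new = log_post(s_new)
        # Accept with probability min(1, f(S*|data) / f(S|data)).
        if rng.random() < math.exp(min(0.0, lp_new - lp)):
            s, lp = s_new, lp_new
        # After burn-in, record every thin-th step (current S if rejected).
        if i >= n_burn and (i - n_burn + 1) % thin == 0:
            out.append(s)
    return out

def log_post(x):
    # Toy stand-in for log f(S|A,U): a normal density centred at 0.88.
    return -0.5 * ((x - 0.88) / 0.1) ** 2

draws = metropolis(log_post, start=0.0, step=0.2, n_burn=2000,
                   n_sample=20000, thin=50, rng=random.Random(42))
print(len(draws), sum(draws) / len(draws))
```

Working with log posterior values avoids numerical underflow, which matters when the log posterior probability is very negative.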


MCMC algorithm (3)

Sample of MCMC output, one recorded output per row. Columns annotated on the slide: tree height, location, ρ, N and log posterior probability.

101  0.47374  2557.62766  4.24189612  10849.19083  0.78104  -1769.51173
102  0.40629  2112.19993  4.16846454   8804.63049  0.79777  -1788.66623
103  0.46534  1679.71719  4.30423786   7229.90233  0.75364  -1854.19049
104  0.48211  2229.24788  4.33740414   9669.14899  0.78009  -1763.70173
105  0.43808  2402.10599  4.29011844  10305.31919  0.82178  -1760.56671
106  0.44607  2275.33453  4.03331587   9177.14285  0.82601  -1775.90300
107  0.41822  3016.70273  4.39000994  13243.35496  0.77768  -1844.20629
108  0.40934  2534.50113  4.07270615  10322.27832  0.81590  -1861.97411
109  0.41032  3122.91416  4.25386813  13284.46504  0.82479  -1814.27448
110  0.45020  3209.14218  4.34316471  13937.83307  0.78422  -1801.44160



Cystic fibrosis: revisited

  • Assume a fixed recombination rate of 0.5cM per Mb and a marker mutation rate of 2.5 × 10^-5 per locus, per generation.

  • Each run of MCMC algorithm begins with a 20,000 step burn-in period, which is discarded.

  • Subsequent 200,000 step sampling period, output recorded every 50th step of the algorithm: 4000 outputs.

  • Two analyses of CF data performed: control chromosomes (92) and (i) ΔF508 case chromosomes (62) only; (ii) all case chromosomes (94).



Cystic fibrosis: genetic heterogeneity

  • Structure of shattered gene tree provides information about genetic heterogeneity at disease locus.

  • For each output of MCMC algorithm, record shattered gene tree.

  • For each pair of chromosomes, record whether they appear in the same sub-tree.

  • Over all outputs, estimate probability that each pair of chromosomes carry the same allele at the disease locus.

  • Cluster chromosomes according to these probabilities: cladogram to represent genetic heterogeneity.
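
The pairwise step can be sketched as follows, assuming each recorded shattered gene tree has already been reduced to a sub-tree label per chromosome (a singleton simply gets a label of its own). The data and the function name are hypothetical.

```python
from itertools import combinations

def same_subtree_probabilities(outputs):
    """For each pair of chromosomes, the fraction of MCMC outputs in which
    both fall in the same sub-tree of the shattered gene tree."""
    names = sorted(outputs[0])
    counts = {pair: 0 for pair in combinations(names, 2)}
    for labels in outputs:
        for a, b in counts:
            counts[(a, b)] += labels[a] == labels[b]
    return {pair: c / len(outputs) for pair, c in counts.items()}

# Hypothetical outputs: chromosome -> sub-tree label in each recorded tree.
outputs = [
    {"c1": 0, "c2": 0, "c3": 1},
    {"c1": 0, "c2": 0, "c3": 0},
    {"c1": 2, "c2": 5, "c3": 5},
    {"c1": 1, "c2": 1, "c3": 3},
]
probs = same_subtree_probabilities(outputs)
print(probs)
```

One way to build the cladogram would be to pass one minus these probabilities to a standard hierarchical clustering routine as a distance matrix.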


SNP genotype data

  • SNP haplotypes rarely available directly.

  • Could infer haplotypes from SNP genotype data: PHASE, SNPHAP, HAPLOTYPER algorithms.

  • Better to treat haplotypes as augmented data in Bayesian framework…

    f(x|G) = ∫ ∫ ∫ ∫ f(x,M,I,A,U|G)dMdIdAdU

    and…

    f(x,M,I,A,U|G) = C f(A,U,I|x,M) f(x,M)


Cystic fibrosis: revisited – again!

  • Create genotype data from original CF haplotype data.

  • Pair together case chromosomes at random.

  • Pair together control chromosomes at random.

  • Total sample: 46 controls and 47 cases.
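
The construction of genotype data can be sketched as below, assuming biallelic markers with alleles coded 0/1 so that an unphased genotype is the count of allele 1 at each marker. The function name and the small example are hypothetical.

```python
import random

def haplotypes_to_genotypes(haplotypes, rng):
    """Pair haplotypes at random and return unphased genotypes.

    Each genotype is the per-marker count of allele 1 (0, 1 or 2), which
    discards the phase information held in the two haplotypes.
    """
    shuffled = list(haplotypes)
    rng.shuffle(shuffled)
    pairs = zip(shuffled[0::2], shuffled[1::2])
    return [tuple(a + b for a, b in zip(h1, h2)) for h1, h2 in pairs]

# Hypothetical case haplotypes over three markers.
cases = [(0, 1, 1), (1, 1, 0), (0, 0, 1), (1, 0, 1)]
genotypes = haplotypes_to_genotypes(cases, random.Random(0))
print(genotypes)
```

Applied separately to the 94 case and 92 control chromosomes, this pairing gives the 47 cases and 46 controls quoted above.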



Limitations

  • Computationally intensive – limited to sample sizes ~100 cases and controls with up to 20 SNPs.

  • Alternative approach: do not model gene tree explicitly – estimate shattered gene tree using standard clustering methods.


Summary

  • High density SNP map of the human genome now available.

  • Fine scale mapping of disease loci requires effective modelling of shared ancestry of sample of case and control chromosomes.

  • Methods exist for haplotype and genotype data: MCMC algorithms are very computationally intensive and are currently limited to relatively small sample sizes.

  • Further development is necessary…