1 / 37

Coalescent Module- Faro July 26th-28th 04 www.coalescent .dk

Coalescent Module- Faro July 26th-28th 04 www.coalescent .dk. Monday H: The Basic Coalescent W: Forest Fire W : The Coalescent + History, Geography & Selection H : The Coalescent with Recombination Tuesday H: Recombination cont. W: The Coalescent & Combinatorics

nevan
Download Presentation

Coalescent Module- Faro July 26th-28th 04 www.coalescent .dk

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Coalescent Module- Faro July 26th-28th 04 www.coalescent.dk Monday H: The Basic Coalescent W: Forest Fire W: The Coalescent + History, Geography & Selection H: The Coalescent with Recombination Tuesday H: Recombination cont. W: The Coalescent & Combinatorics HW: Computer Session H: The Coalescent & Human Evolution Wednesday H: The Coalescent & Statistics HW:  Linkage Disequilibrium Mapping

  2. Zooming in!(from Harding + Sanger) 3*109 bp *5.000 b-globin (chromosome 11) 6*104 bp *20 Exon 3 Exon 1 Exon 2 3*103 bp 5’ flanking 3’ flanking *103 ATTGCCATGTCGATAATTGGACTATTTTTTTTTT 30 bp

  3. Human Migrations From Cavalli-Sforza,2001

  4. Data: b-globin from sampled humans. From Griffiths, 2001 Assume: 1. At most 1 substitution per position. 2.No recombination Reducing nucleotide columns to bi-partitions gives a bijection between data & unrooted gene trees. C G

  5. Simplified model of human sequence evolution. Past 0.2 Rate of common ancestry: 1 Wait to common ancestry: 2Ne Mutation rate: 2.5 Present Africa Non-Africa

  6. From Griffiths, 2001

  7. Models and their benefits. • Models+ Data • probability of data (statistics...) • probability of individual histories • hypothesis testing • parameter estimation

  8. Coalescent Theory in Biology www. coalescent.dk Fixed Parameters: Population Structure, Mutation, Selection, Recombination,... Reproductive Structure Genealogies of non-sequenced data Genealogies of sequenced data CATAGT CGTTAT TGTTGT Parameter Estimation Model Testing

  9. Wright-Fisher Model of Population Reproduction Haploid Model i. Individuals are made by sampling with replacement in the previous generation. ii. The probability that 2 alleles have same ancestor in previous generation is 1/2N Assumptions Constant population size No geography No Selection No recombination Diploid Model Individuals are made by sampling a chromosome from the female and one from the male previous generation with replacement

  10. 10 Alleles’ Ancestry for 15 generations

  11. Waiting for most recent common ancestor - MRCA Distribution until 2 alleles had a common ancestor, X2?: P(X2 > j) = (1-(1/2N))j P(X2 > 1) = (2N-1)/2N = 1-(1/2N) P(X2 = j) = (1-(1/2N))j-1 (1/2N) j j 2 2 1 1 1 1 1 1 2N 2N 2N Mean, E(X2) = 2N. Ex.: 2N = 20.000, Generation time 30 years, E(X2) = 600000 years.

  12. P(k):=P{k alleles had k distinct parents} 1 1 2N Ancestor choices: k -> k k -> any k -> k-1 k -> j (2N)k 2N *(2N-1) *..* (2N-(k-1)) =: (2N)[k] Sk,j - the number of ways to group k labelled objects into j groups.(Stirling Numbers of second kind. For k << 2N:

  13. Geometric/Exponential Distributions The Geometric Distribution: {1,..} Geo(p): P{Z=j)=pj(1-p) P{Z>j)=pj E(Z)=1/p. The Exponential Distribution: R+ Exp (a) Density: f(t) = ae-at, P(X>t)= e-at Properties: X~Exp(a) Y~Exp(b) independent i. P(X>t2|X>t1) = P(X>t2-t1) (t2 > t1) ii. E(X) = 1/a. iii. P(Z>t)=(≈)P(X>t) small a (p=e-a). iv. P(X < Y) = a/(a + b). v. min(X,Y) ~ Exp (a + b).

  14. Discrete  Continuous Time tc:=td/2Ne 6 6/2Ne 0 2N 0 1 4 1.0 corresponds to 2N generations 1.0 0.0 2 6 5 3

  15. Adding Mutations m mutation pr. nucleotide pr.generation. L: seq. length µ = m*L Mutation pr. allele pr.generation. 2Ne - allele number. Q := 4N*µ -- Mutation intensity in scaled process. Continuous time Continuous sequence Discrete time Discrete sequence 1/L time 1/(2Ne) time sequence sequence mutation mutation coalescence Probability for two genes being identical: P(Coalescence < Mutation) = 1/(1+Q). 1 Q/2 Q/2 Note: Mutation rate and population size usually appear together as a product, making separate estimation difficult.

  16. 1 2 3 4 5 The Standard Coalescent Two independent Processes Continuous: Exponential Waiting Times Discrete: Choosing Pairs to Coalesce. Waiting Coalescing {1,2,3,4,5} (1,2)--(3,(4,5)) {1,2}{3,4,5} 1--2 {1}{2}{3,4,5} 3--(4,5) {1}{2}{3}{4,5} 4--5 {1}{2}{3}{4}{5}

  17. Expected Height and Total Branch Length Branch Lengths Time Epoch 1 2 1 2 1 1/3 3 2/(k-1) k Expected Total height of tree: Hk= 2(1-1/k) i.Infinitely many alleles finds 1 allele in finite time. ii. In takes less than twice as long for k alleles to find 1 ancestors as it does for 2 alleles. Expected Total branch length in tree, Lk: 2*(1 + 1/2 + 1/3 +..+ 1/(k-1)) ca= 2*ln(k-1)

  18. Kingman (Stoch.Proc. & Appl. 13.235-248 + 2 other articles,1982) A. Stochastic Processes on Equivalence Relations. D ={(i,i);i= 1,..n} Q ={(i,j);i,j=1,..n} 1 if s < t qs,t = 0otherwise This defines a process, Rt , going from to through equivalence relations on {1,..,n}. B. The Paint Box & exchangable distributions on Partitions. C. All coalescents are restrictions of “The Coalescent” – a process with entrance boundary infinity. D.Robustness of “The Coalescent”: If offspring distribution is exchangeable and Var(n1) --> s2 & E(n1m) < Mm for all m, then genealogies follows ”The Coalescent” in distribution. E. A series of combinatorial results.

  19. Effective Populations Size, Ne. In an idealised Wright-Fisher model: i. loss of variation per generation is 1-1/(2N). ii. Waiting time for random alleles to find a common ancestor is 2N. Factors that influences Ne: i.Variance in offspring. WF: 1. If variance is higher, then effective population size is smaller. ii.Population size variation - example k cycle: N1, N2,..,Nk. k/Ne= 1/N1+..+ 1/Nk. N1 = 10 N2= 1000 => Ne= 50.5 iii.Two sexesNe = 4NfNm/(Nf+Nm)I.e. Nf-10Nm -1000 Ne - 40

  20. 6 Realisations with 25 leaves Observations: Variation great close to root. Trees are unbalanced.

  21. Sampling more sequences The probability that the ancestor of the sample of size n is in a sub-sample of size k is Letting n go to infinity gives (k-1)/(k+1), i.e. even for quite small samples it is quite large.

  22. Three Models of Alleles and Mutations. Finite Site Infinite Allele Infinite Site acgtgctt acgtgcgt acctgcat tcctgcat tcctgcat Q Q Q acgtgctt acgtgcgt acctgcat tcctggct tcctgcat i. Allele is represented by a sequence. ii. A mutation changes nucleotide at chosen position. i. Only identity, non-identity is determinable ii. A mutation creates a new type. i. Allele is represented by a line. ii. A mutation always hits a new position.

  23. Infinite Allele Model 4 5 1 2 3

  24. Infinite Site Model Final Aligned Data Set:

  25. 0 1 1 1 4 2 3 5 4 5 5 5 6 3 7 2 8 1

  26. Number of paths: 0 1 1 1 4 2 2 2 2 2 3 5 4 5 6 2 4 3 4 7 5 7 5 8 14 2 22 28 10 6 3 32 7 50 2 8 1 82

  27. Labelling and unlabelling:positions and sequences 1 2 3 4 5 Ignoring mutation position Ignoring sequence label 1 2 3 5 4 Ignoring mutation position Ignoring sequence label { , , } The forward-backward argument 4 classes of mutation events incompatible with data 9 coalescence events incompatible with data

  28. Infinite Site Model: An example Theta=2.12 2 3 2 3 4 5 5 9 5 10 14 19 33

  29. Impossible Ancestral States

  30. Finite Site Model acgtgctt acgtgcgt acctgcat tcctgcat tcctgcat s s s Final Aligned Data Set:

  31. Simplifying assumptions 1) Only substitutions. s1 TCGGTA s1 TCGGA s2 TGGT-T s2 TGGTT 2) Processes in different positions of the molecule are independent. 3) A nucleotide follows a continuous time Markov Chain. 4) Time reversibility: I.e. πi Pi,j(t) = πj Pj,i(t), where πi is the stationary distribution of i. This implies that a l2+l1 = l1 l2 N1 N2 N2 N1 5) The rate matrix, Q, for the continuous time Markov Chain is the same at all times.

  32. Evolutionary Substitution Process t1 e A t2 C C Pi,j(t) = probability of going from i to j in time t.

  33. Jukes-Cantor 69: Total Symmetry. TO A C G T FROM -3*  -3*  -3*  -3* A C G T • Stationary Distribution: (.25,.25,.25,.25) • B. Expected number of substitutions: 3at t 0 ATTGTGTATATAT….CAG ATTGCGTATCTAT….CCG Chimp Mouse E.coli Higher Cells Fish

  34. History of Coalescent Approach to Data Analysis 1930-40s: Genealogical arguments well known to Wright & Fisher. 1964: Crow & Kimura: Infinite Allele Model 1968: Motoo Kimura proposes neutral explanation of molecular evolution & population variation. So does King & Jukes 1971: Kimura & Otha proposes infinite sites model. 1972: Ewens’ Formula: Probability of data under infinite allele model. 1975: Watterson makes explicit use of “The Coalescent” 1982: Kingman introduces “The Coalescent”. 1983: Hudson introduces “The Coalescent with Recombination” 1983: Kreitman publishes first major population sequences.

  35. History of Coalescent Approach to Data Analysis 1987-95: Griffiths, Ethier & Tavare calculates site data probability under infinite site model. 1994-: Griffiths-Tavaré + Kuhner-Yamoto-Felsenstein introduces highly computer intensitive simulation techniquees to estimate parameters in population models. 1996- Krone-Neuhauser introduces selection in Coalescent 1998- Donnelly, Stephens, Fearnhead et al.: Major accelerations in coalescent based data analysis. 2000-: Several groups combines Coalescent Theory & Gene Mapping. 2002: HapMap project is started.

  36. Basic Coalescent Summary i. Genealogical approach to population genetics. ii. ”The Coalescent” - generic probability distribution on allele trees. iii. Combining ”The Coalescent” with Allele/Mutation Models allows the calculation the probability of data.

More Related