Advanced Algorithms and Models for Computational Biology -- a machine learning approach

Advanced Algorithms and Models for Computational Biology-- a machine learning approach Population Genetics: More on halpltypes --- coalescence, multi-population phasing, blocks, etc. Eric Xing Lecture 19, March 27, 2006

Clustering • How to label them ? • inference • How many clusters ??? • model selection ? • or inference ?

Genetic Demography • Are there genetic prototypes among them ? • What are they ? • How many ? (how many ancestors do we have ?)

Multi-population Genetic Demography • Inference done separately, or jointly?

Clustering as Mixture Modeling

Model Selection vs. Posterior Inference • Model selection • "intelligent" guess:??? • cross validation:data-hungry  • information theoretic: • AIC • TIC • MDL : • Posterior inference: we want to handle uncertainty of model complexity explicitly • we favor a distribution that does not constrain M in a "closed" space! Parsimony, Ockam's Razor

T Cp G A C sequencing Cm T ATGC A Heterozygous diploid individual TC TG AA T Genotype g pairs of alleles, whose associations to chromosomes are unknown G A C T A ??? T T A C G A haplotype hº(h1, h2) possible associations of alleles to chromosomes Haplotype Ambiguity This is a mixture modeling problem!

A Finite (Mixture of ) Allele Model • The probability of a genotype g: • Standard settings: • H| = K << 2Jfixed-sized population haplotype pool • p(h1,h2)= p(h1)p(h2)=f1f2Hardy-Weinberg equilibrium • Problem: K ? H ? Hn1 Hn2 Gn Genotyping model Population haplotype pool Haplotype model

Present 22 individuals • The coalescent process Time

The coalescent process Present 22 individuals 18 ancestors Time

Present 22 individuals 18 ancestors 16 ancestors • The coalescent process Time

Present 22 individuals 18 ancestors 16 ancestors 14 ancestors • The coalescent process Time

Present 22 individuals 18 ancestors 16 ancestors 14 ancestors 12 ancestors 9 ancestors 8 ancestors 8 ancestors 7 ancestors 7 ancestors 5 ancestors 5 ancestors 3 ancestors 3 ancestors 3 ancestors 2 ancestors 2 ancestors 1 ancestor Time

Present Time Most recent common ancestor (MRCA)

Mutational events can now be added to the genealogical tree, resulting in polymorphic sites. If these sites are typed in the modern sample, they can be used to split the sample into sub-clades (represented by different colours)

Present TCGAGGTATTAAC TCTAGGTATTAAC TCGAGGCATTAAC TCTAGGTGTTAAC TCGAGGTATTAGC TCTAGGTATCAAC * ** * * Time Most recent common ancestor (MRCA)

Present TCGAGGTATTAAC TCTAGGTATTAAC TCGAGGCATTAAC TCTAGGTGTTAAC TCGAGGTATTAGC TCTAGGTATCAAC * ** * * • Assuming there are presently k active lineages: • The probability of coalescence: • The probability of mutation (killing a lineage): Time

¥ N Population Genetic Basis of an Infinite Allele model • Kingman coalescent process with binary lineage merging • New population haplotype alleles emerge along all branches of the coalescence tree at rate a/2 per unit length  The Ewens Sampling Formula: an exchangeable random partition of individuals  Dirichlet process mixture Natural genealogy Coalescent with mutation Infinite mixtures Þ Þ

t G0 G ¥ A Hierarchical Bayesian Infinite Allele model • Assume anindividual haplotypeh is stochastically derived from apopulation haplotypeakwithnucleotide-substitution frequencyqk: Ak qk • Not knowing the correspondences between individual and population haplotypes, each individual haplotype is a mixture of population haplotypes. Hn1 Hn2 Gn • The number and identity of the population haplotypes are unknown • use aDirichlet Process to construct a priordistributionGon H´×RJ.

Dirichlet Process Priors • A c.d.f., G onQfollows a Dirichlet Process if for any measurable finite partition of (B1,B2, .., Bm), of Q, the joint distribution of the random variables (G(B1), G(B2), …, G(Bm)) ~ Dirichlet(aG0(B1), …., aG0(Bm)), where, G0 is a the base distribution and a is the precision parameter

0 0.4 0.4 0.6 0.5 0.3 Location 0.3 0.8 0.24 Mass G0 Stick-breaking Process

.... Chinese Restaurant Process CRP defines an exchangeable distribution on partitions over an (infinite) sequence of integers

1 8 9 4 {A,q} {A,q} {A,q} {A,q} {A,q} {A,q} 2 … 7 6 3 5 The DP Mixture of Ancestral Haplotypes • The customers around a table form a cluster • associate a mixture component (i.e., a population haplotype) with a table • sample{a,q}at each table from a base measureG0to obtain the population haplotype and nucleotide substitution frequency for that component • With p(h|{A,q})and p(g|h1,h2),the CRP yields a posterior distribution on the number of population haplotypes (and on the haplotype configurations and the nucleotide substitution frequencies)

a G0 G K A q Hn1 Hn2 Gn N DP-haplotyper • Inference: Markov Chain Monte Carlo (MCMC) • Gibbs sampling • Metropolis Hasting DP infinite mixture components (for population haplotypes) Likelihood model (for individual haplotypes and genotypes)

Model components • Choice of base measure: • Nucleotide-substitution model: • Noisy genotyping model:

Gibbs sampling • Starting from some initial haplotype reconstruction H(0) , pick a first table with an arbitrary a1(0), and form initial population-hap pool A(0) ={a1 (0)}: • Choose an individual i and one of his/her two haplytopes t, uniformly and at random, from all ambiguous individuals; • Sample from , update ; • Sample , where , from ; • update A(t+1); • Sample from , update H(t+1).

Convergence of Ancestral Inference

Haplotyping Error The Gabriel data

Multi-population Genetic Gemography • Pool everything together and solve 1 hap problem? • --- ignore population structures • Solve 4 hap problems separately? • --- data fragmentation • Co-clustering … solve 4 coupled hap problems jointly

Population bottleneck: a small population that interbred reduced the genetic variation Out of Africa ~ 100,000 years ago Why humans are so similar and polymorphic patterns are regional Out of Africa

Population Specific DPs • Each population can be associated with a unique DP capturing population-specific genetic demography • Different population may have unique haplotypes • Different population may share common haplotypes • Thus Population specific DPs are marginally dependent G2 G3 G1 G4

.... Hierarchical DP Mixture g H 2 3 G2 G3 G1 G4 4 1

Simulated Data • As 1 HapTyping problem • As 5 HapTyping problems

HapMap Data

The block structure of haplotypes • The Daly et al (2001) data set • This consists of 103 common SNPs (>5% minor allele frequency) in a 500 kb region implicated in Crohn disease, genotyped in 129 trios (mom, pop, kid) from a European derived population, giving 258 transmitted and 258 untransmitted chromosomes. • The haplotype blocks span up to 100kb and contain 5 or more common SNPs. For example, one 84 kb block of 8 SNPs shows just two distinct haplotypes accounting for 95% of the observed chromosomes.

The haplotype patterns for 20 independent globally diverse chromosomes defined by 147 common human chr 21 SNPs spanning 106 kb of genomic sequence. Each row represents an SNP. Blue box = major, yellow = minor allele. Each column represents a single chromosome. The 147 SNPs are divided into 18 blocks defined by black lines. The expanded box on the right is an SNP block of 26 SNPs over 19kb of genomic DNA. The 4 most common of 7 different haplotypes include 80% of the chromosomes, and can be distinguished with 2 SNPs. Another study: the Patil et al data Figure 2 of Patil et al

Haplotype Blocks

Markov Dependence Between Blocks • On any chromosome, the haplotype carried at the kth block is drawn from A(k) according to unknown probabilities that depend on the haplotype at block k − 1.

Block Partitioning Algorithms • Greedy algorithm (Patil et al 2001). • Begin by considering all possible blocks of ≥1 consecutive SNPs. • Next, exclude all blocks in which < 80% of the chromosomes in the data are defined by haplotypes represented more than once in the block (80% coverage). • Considering the remaining overlapping blocks simultaneously, select the one which maximizes the ratio of total SNPsin the block to the number required to uniquely discriminate haplotypes represented more than once in the block. Any of the remaining blocks that physically overlap with the selected block are discarded, and the process repeated until we have selected a set of contiguous, non-overlapping blocks that cover the 32.4 Mb of chr 21 2ith no gaps and with every SNP assigned to a block. • Hidden Markov Models • Maximum a posterior inference (i.e., viterbi) (Daly et al. 2001) • Minimum description length (Anderson et al.2003) • Dynamic programming (Sun et al. 2002)

.... Inheritance of unknown generation Haplotypes are Shaped by Recombination

Hidden Markov DP for Recombination

Reference • E.P. Xing, R. Sharan and M.I Jordan, Bayesian Haplotype Inference via the Dirichlet Process. Proceedings of the 21st International Conference on Machine Learning (ICML2004), • N Patil et al . Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21 Science 294 2001:1719-1723. • M J Daly et al . High-resolution haplotype structure in the human genome Nat. Genet. 29 2001: 229-232 • Anderson, E.C., Novembre, J. (2003) "Finding haplotype block boundaries using the minimum description length principle." American Journal of Human Genetics 73(2):336-354.

Advanced Algorithms and Models for Computational Biology -- a machine learning approach