- 63 Views
- Uploaded on

Download Presentation
## PowerPoint Slideshow about 'Equilibria in populations' - gloria

Download Now**An Image/Link below is provided (as is) to download presentation**

Download Now

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript

### Linkage Disequilibrium

Equilibria in populations

Vineet Bafna

Population data

- Recall that we often study a population in the form of a SNP matrix
- Rows correspond to individuals (or individual chromosomes), columns correspond to SNPs
- The matrix is binary (why?)
- The underlying genealogy is hidden. If the span is large, the genealogy is not a tree any more. Why?

0100

0011

0011

0011

1000

1000

1000

Basic Principles

- In a ‘stable’ population, the distribution of alleles obeys certain laws
- Not really, and the deviations are interesting
- HW Equilibrium
- (due to mixing in a population)
- Linkage (dis)-equilibrium
- Due to recombination

Vineet Bafna

Hardy Weinberg equilibrium

- Assume a population of diploid individuals;
- Consider a locus with 2 alleles, A, a
- Define p(respectively, q): the frequency of A (resp. a) in the population
- 3 Genotypes: AA, Aa, aa
- Q: What is the frequency of each genotypes

- If various assumptions are satisfied, (such as random mating, no natural selection), Then
- PAA=p2
- PAa=2pq
- Paa=q2

Vineet Bafna

Hardy Weinberg: why?

- Assumptions:
- Diploid
- Sexual reproduction
- Random mating
- Bi-allelic sites
- Large population size, …
- Why? Each individual randomly picks his two parental chromosomes. Therefore, Prob. (Aa) = pq+qp = 2pq, and so on.

0100

0011

0011

0011

1000

1000

1000

0100

1000

Vineet Bafna

Hardy Weinberg: Implications

- The allele frequency does not change from generation to generation. True or false?
- It is observed that 1 in 10,000 caucasians have the disease phenylketonuria. The disease mutation(s) are all recessive. What fraction of the population carries the mutation?
- Males are 100 times more likely to have the ‘red’ type of color blindness than females. Why?
- Individuals homozygous for S have the sickle-cell disease. In an experiment, the ratios A/A:A/S: S/S were 9365:2993:29. Is HWE violated? Is there a reason for this violation?
- A group of individuals was chosen in NYC. Would you be surprised if HWE was violated?
- Conclusion: While the HW assumptions are rarely satisfied, the principle is still important as a baseline assumption, and significant deviations are interesting.

Vineet Bafna

A modern example of HW application

- SNP-chips can give us the allelic value at each polymorphic site based on hybridization.
- Typically, the sites sampled have common SNPs
- Minor allele frequency > 10%
- We simply plot the 3 genotypes at each locus on 3 separate horizontal lines.
- What is peculiar in the picture?
- What is your conclusion?

Vineet Bafna

Linkage Equilibrium

- HW equilibrium is the equilibrium in allele frequencies at a single locus.
- Linkage equilibrium refers to the equilibrium between allele occurrences at two loci.
- LE is due to recombination events
- First, we will consider the case where no recombination occurs.

HWE

What if there were no recombinations?

- Life would be simpler
- Each individual sequence would have a single parent (even for higher ploidy)
- True for mtDNA (maternal) , and y-chromosome (paternal)
- The genealogical relationship is expressed as a tree. This principle can be used to track ancestry of an individual

Vineet Bafna

Phylogeny

- Recall that we often study a population in the form of a SNP matrix
- Rows correspond to individuals (or individual chromosomes), columns correspond to SNPs
- The matrix is binary (why?)
- The underlying genealogy is hidden. If the span is large, the genealogy is not a tree any more. Why?

0100

0011

0011

0011

1000

1000

1000

Reconstructing Perfect Phylogeny

- Input: SNP matrix M (n rows, m columns/sites/mutations)
- Output: a tree with the following properties
- Rows correspond to leaf nodes
- We add mutations to edges
- each edge labeled with i splits the individuals into two subsets.
- Individuals with a 1 in column i
- Individuals with a 0 in column i

i

1 in position i

0 in position i

Vineet Bafna

Reconstructing perfect phylogeny

12345678

01000000

00110110

00110100

00110000

10000000

10001000

10001001

- Each mutation can be labeled by the column number
- Goal is to reconstruct the phylogeny
- genographic atlas

2

7

6

3

4

5

1

8

An algorithm for constructing a perfect phylogeny

- We will consider the case where 0 is the ancestral state, and 1 is the mutated state. This will be fixed later.
- In any tree, each node (except the root) has a single parent.
- It is sufficient to construct a parent for every node.
- In each step, we add a column and refine some of the nodes containing multiple children.
- Stop if all columns have been considered.

Vineet Bafna

Columns

- Define i0: taxa (individuals) with a 0 at the i’th column
- Define i1: taxa (individuals) with a 1 at the i’th column

Vineet Bafna

Inclusion Property

- For any pair of columns i,j, one of the folowing holds
- i1 j1
- j1 i1
- i1 j1 =
- For any pair of columns i,j
- i < j if and only if i1 j1
- Note that if i<j then the edge containing i is an ancestor of the edge containingj

i

j

Vineet Bafna

A

B

C

D

E

Perfect Phylogeny via Example1 2 3 4 5

A 1 1 0 0 0

B 0 0 1 0 0

C 1 1 0 1 0

D 0 0 1 0 1

E 1 0 0 0 0

Initially, there is a single clade r, and each node has r as its parent

Vineet Bafna

Sort columns

- Sort columns according to the inclusion property (note that the columns are already sorted here).
- This can be achieved by considering the columns as binary representations of numbers (most significant bit in row 1) and sorting in decreasing order

1 2 3 4 5

A 1 1 0 0 0

B 0 0 1 0 0

C 1 1 0 1 0

D 0 0 1 0 1

E 1 0 0 0 0

Vineet Bafna

Add first column

1 2 3 4 5

A 1 1 0 0 0

B 0 0 1 0 0

C 1 1 0 1 0

D 0 0 1 0 1

E 1 0 0 0 0

- In adding column i
- Check each individual and decide which side you belong.
- Finally add a node if you can resolve a clade

r

u

B

D

A

C

E

Vineet Bafna

Adding other columns

1 2 3 4 5

A 1 1 0 0 0

B 0 0 1 0 0

C 1 1 0 1 0

D 0 0 1 0 1

E 1 0 0 0 0

- Add other columns on edges using the ordering property

r

1

3

E

2

B

5

4

D

A

C

Vineet Bafna

Unrooted case

- Important point is that the perfect phylogeny condition does not change when you interchange 1s and 0s at a column.
- Alg (Unrooted)
- Switch the values in each column, so that 0 is the majority element.
- Apply the algorithm for the rooted case.
- Relabel columns and individuals.
- Show that this is a correct algorithm.

Vineet Bafna

Unrooted perfect phylogeny

- We transform matrix M to a 0-major matrix M0.
- if M0 has a directed perfect phylogeny, M has a perfect phylogeny.
- If M has a perfect phylogeny, does M0 have a directed perfect phylogeny?

Unrooted case

- Theorem: If M has a perfect phylogeny, there exists a relabeling, and a perfect phylogeny s.t.
- Root is all 0s
- For any SNP (column), #1s <= #0s
- All edges are mutated 01

Vineet Bafna

Proof

- Consider the perfect phylogeny of M.
- Find the center: none of the clades has greater than n/2 nodes.
- Is this always possible?
- Root at one of the 3 edges of the center, and direct all mutations from 01 away from the root. QED
- If the theorem is correct, then simply relabeling all columns so that the majority element is 0 is sufficient.

Vineet Bafna

‘Homework’ Problems

- What if there is missing data? (An entry that can be 0 or 1)?
- What if there are recurrent mutations?

Vineet Bafna

Vineet Bafna

Quiz

- Recall that a SNP data-set is a ‘binary’ matrix.
- Rows are individual (chromosomes)
- Columns are alleles at a specific locus
- Suppose you have 2 SNP datasets of a contiguous genomic region but no other information
- One from an African population, and one from a European Population.
- Can you tell which is which?
- How long does the genomic region have to be?

Vineet Bafna

Linkage (Dis)-equilibrium (LD)

- Consider sites A &B
- Case 1: No recombination
- Each new individual chromosome chooses a parent from the existing ‘haplotype’

A B

0 1

0 1

0 0

0 0

1 0

1 0

1 0

1 0

1 0

Vineet Bafna

Linkage (Dis)-equilibrium (LD)

- Consider sites A &B
- Case 2: diploidy and recombination
- Each new individual chooses a parent from the existing alleles

A B

0 1

0 1

0 0

0 0

1 0

1 0

1 0

1 0

1 1

Vineet Bafna

Linkage (Dis)-equilibrium (LD)

- Consider sites A &B
- Case 1: No recombination
- Each new individual chooses a parent from the existing ‘haplotype’
- Pr[A,B=0,1] = 0.25
- Linkage disequilibrium
- Case 2: Extensive recombination
- Each new individual simply chooses and allele from either site
- Pr[A,B=(0,1)]=0.125
- Linkage equilibrium

A B

0 1

0 1

0 0

0 0

1 0

1 0

1 0

1 0

Vineet Bafna

LD

- In the absence of recombination,
- Correlation between columns
- The joint probability Pr[A=a,B=b] is different from P(a)P(b)
- With extensive recombination
- Pr(a,b)=P(a)P(b)

Vineet Bafna

Measures of LD

- Consider two bi-allelic sites with alleles marked with 0 and 1
- Define
- P00 = Pr[Allele 0 in locus 1, and 0 in locus 2]
- P0* = Pr[Allele 0 in locus 1]
- Linkage equilibrium if P00 = P0* P*0
- D = (P00 - P0* P*0) = -(P01 - P0* P*1) = …

Vineet Bafna

Other measures of LD

- D’ is obtained by dividing D by the largest possible value
- Suppose D = (P00 - P0* P*0) >0.
- Then the maximum value of Dmax= min{P0* P*1, P1* P*0}
- If D<0, then maximum value is max{-P0* P*0, -P1* P*1}
- D’ = D/ Dmax

0

1

-D

D

0

Site 1

-D

D

1

Site 2

Vineet Bafna

Other measures of LD

- D’ is obtained by dividing D by the largest possible value
- Ex: D’ = abs(P11- P1* P*1)/ Dmax
- = D/(P1* P0* P*1 P*0)1/2
- Let N be the number of individuals
- Show that 2N is the 2 statistic between the two sites

0

1

P00N

0

P0*N

Site 1

1

Site 2

Vineet Bafna

Digression: The χ2 test

0

1

- The statistic behaves like a χ2 distribution (sum of squares of normal variables).
- A p-value can be computed directly

O1

O2

0

O3

O4

1

Observed and expected

0

1

0

1

P0*P*1N

P00N

0

P0*P*0N

P01N

Site 1

1

P10N

P11N

P1*P*0N

P1*P*1N

- = D/(P1* P0* P*1 P*0)1/2
- Verify that 2N is the 2 statistic between the two sites

LD over time and distance

- The number of recombination events between two sites, can be assumed to be Poisson distributed.
- Let r denote the recombination rate between two adjacent sites
- r = # crossovers per bp per generation
- The recombination rate between two sites l apart is rl

LD over time

- Decay in LD
- Let D(t) = LD at time t between two sites
- r’=lr
- P(t)00 = (1-r’) P(t-1)00 + r’ P(t-1)0* P(t-1)*0
- D(t) =P(t)00 - P(t)0* P(t)*0 = P(t)00 - P(t-1)0* P(t-1)*0 (Why?)
- D(t) =(1-r’) D(t-1) =(1-r’)t D(0)

Vineet Bafna

LD over distance

- Assumption
- Recombination rate increases linearly with distance and time
- LD decays exponentially.
- The assumption is reasonable, but recombination rates vary from region to region, adding to complexity
- This simple fact is the basis of disease association mapping.

Vineet Bafna

LD and disease mapping

- Consider a mutation that is causal for a disease.
- The goal of disease gene mapping is to discover which gene (locus) carries the mutation.
- Consider every polymorphism, and check:
- There might be too many polymorphisms
- Multiple mutations (even at a single locus) that lead to the same disease
- Instead, consider a dense sample of polymorphisms that span the genome

Vineet Bafna

LD can be used to map disease genes

LD

- LD decays with distance from the disease allele.
- By plotting LD, one can short list the region containing the disease gene.

D

N

N

D

D

N

0

1

1

0

0

1

Vineet Bafna

Haplotype blocks

- It was found that recombination rates vary across the genome
- How can the recombination rate be measured?
- In regions with low recombination, you expect to see long haplotypes that are conserved. Why?
- Typically, haplotype blocks do not span recombination hot-spots

Vineet Bafna

Long haplotypes

- Chr 2 region with high r2 value (implies little/no recombination)
- History/Genealogy can be explained by a tree ( a perfect phylogeny)
- Large haplotypes with high frequency

Vineet Bafna

19q13

Vineet Bafna

LD variation across populations

Reich et al.

Nature 411, 199-204(10 May 2001)

- LD is maintained upto 60kb in swedish population, 6kb in Yoruban population

Vineet Bafna

Understanding Population Data

- We described various population genetic concepts (HW, LD), and their applicability
- The values of these parameters depend critically upon the population assumptions.
- What if we do not have infinite populations
- No random mating (Ex: geographic isolation)
- Sudden growth
- Bottlenecks
- Ad-mixture
- It would be nice to have a simulation of such a population to test various ideas. How would you do this simulation?

Vineet Bafna

Download Presentation

Connecting to Server..