population genetics basics n.
Download
Skip this Video
Loading SlideShow in 5 Seconds..
Population Genetics Basics PowerPoint Presentation
Download Presentation
Population Genetics Basics

Loading in 2 Seconds...

play fullscreen
1 / 74

Population Genetics Basics - PowerPoint PPT Presentation


  • 114 Views
  • Uploaded on

Population Genetics Basics. Terminology review. Allele Locus Diploid SNP. Single Nucleotide Polymorphisms. Infinite Sites Assumption: Each site mutates at most once. 00000101011 10001101001 01000101010 01000000011 00011110000 00101100110. What causes variation in a population?.

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'Population Genetics Basics' - julie


Download Now An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
terminology review
Terminology review
  • Allele
  • Locus
  • Diploid
  • SNP
single nucleotide polymorphisms
Single Nucleotide Polymorphisms

Infinite Sites Assumption:

Each site mutates at most once

00000101011

10001101001

01000101010

01000000011

00011110000

00101100110

what causes variation in a population
What causes variation in a population?
  • Mutations (may lead to SNPs)
  • Recombinations
  • Other genetic events (gene conversion)
  • Structural Polymorphisms
recombination
Recombination

00000000

11111111

00011111

gene conversion
Gene Conversion
  • Gene Conversion versus crossover
    • Hard to distinguish in a population
structural polymorphisms
Structural polymorphisms
  • Large scale structural changes (deletions/insertions/inversions) may occur in a population.
topic 1 basic principles
Topic 1: Basic Principles
  • In a ‘stable’ population, the distribution of alleles obeys certain laws
    • Not really, and the deviations are interesting
  • HW Equilibrium
    • (due to mixing in a population)
  • Linkage (dis)-equilibrium
    • Due to recombination
hardy weinberg equilibrium
Hardy Weinberg equilibrium
  • Consider a locus with 2 alleles, A, a
  • p(respectively, q) is the frequency of A (resp. a) in the population
  • 3 Genotypes: AA, Aa, aa
  • Q: What is the frequency of each genotype
  • If various assumptions are satisfied, (such as random mating, no natural selection), Then
  • PAA=p2
  • PAa=2pq
  • Paa=q2
hardy weinberg why
Hardy Weinberg: why?
  • Assumptions:
    • Diploid
    • Sexual reproduction
    • Random mating
    • Bi-allelic sites
    • Large population size, …
  • Why? Each individual randomly picks his two chromosomes. Therefore, Prob. (Aa) = pq+qp = 2pq, and so on.
hardy weinberg generalizations
Hardy Weinberg: Generalizations
  • Multiple alleles with frequencies
    • By HW,
  • Multiple loci?
hardy weinberg implications
Hardy Weinberg: Implications
  • The allele frequency does not change from generation to generation. Why?
  • It is observed that 1 in 10,000 caucasians have the disease phenylketonuria. The disease mutation(s) are all recessive. What fraction of the population carries the mutation?
  • Males are 100 times more likely to have the “red’ type of color blindness than females. Why?
  • Conclusion: While the HW assumptions are rarely satisfied, the principle is still important as a baseline assumption, and significant deviations are interesting.
recombination1
Recombination

00000000

11111111

00011111

what if there were no recombinations
What if there were no recombinations?
  • Life would be simpler
  • Each individual sequence would have a single parent (even for higher ploidy)
  • The relationship is expressed as a tree.
the infinite sites assumption
The Infinite Sites Assumption

0 0 0 0 0 0 0 0

3

0 0 1 0 0 0 0 0

5

8

0 0 1 0 1 0 0 0

0 0 1 0 0 0 0 1

  • The different sites are linked. A 1 in position 8 implies 0 in position 5, and vice versa.
  • Some phenotypes could be linked to the polymorphisms
  • Some of the linkage is “destroyed” by recombination
infinite sites assumption and perfect phylogeny
Infinite sites assumption and Perfect Phylogeny
  • Each site is mutated at most once in the history.
  • All descendants must carry the mutated value, and all others must carry the ancestral value

i

1 in position i

0 in position i

perfect phylogeny
Perfect Phylogeny
  • Assume an evolutionary model in which no recombination takes place, only mutation.
  • The evolutionary history is explained by a tree in which every mutation is on an edge of the tree. All the species in one sub-tree contain a 0, and all species in the other contain a 1. Such a tree is called a perfect phylogeny.
the 4 gamete condition
The 4-gamete condition
  • A column i partitions the set of species into two sets i0, and i1
  • A column is homogeneous w.r.t a set of species, if it has the same value for all species. Otherwise, it is heterogenous.
  • EX: i is heterogenous w.r.t {A,D,E}

i

A 0

B 0

C 0

D 1

E 1

F 1

i0

i1

4 gamete condition
4 Gamete Condition
  • 4 Gamete Condition
    • There exists a perfect phylogeny if and only if for all pair of columns (i,j), either j is not heterogenous w.r.t i0, or i1.
    • Equivalent to
    • There exists a perfect phylogeny if and only if for all pairs of columns (i,j), the following 4 rows do not exist

(0,0), (0,1), (1,0), (1,1)

4 gamete condition proof
4-gamete condition: proof

i

i0

i1

  • Depending on which edge the mutation j occurs, either i0, or i1 should be homogenous.
  • (only if) Every perfect phylogeny satisfies the 4-gamete condition
  • (if) If the 4-gamete condition is satisfied, does a prefect phylogeny exist?
an algorithm for constructing a perfect phylogeny
An algorithm for constructing a perfect phylogeny
  • We will consider the case where 0 is the ancestral state, and 1 is the mutated state. This will be fixed later.
  • In any tree, each node (except the root) has a single parent.
    • It is sufficient to construct a parent for every node.
  • In each step, we add a column and refine some of the nodes containing multiple children.
  • Stop if all columns have been considered.
inclusion property
Inclusion Property
  • For any pair of columns i,j
    • i < j if and only if i1 j1
  • Note that if i<j then the edge containing i is an ancestor of the edge containing i

i

j

example
Example

r

A

B

C

D

E

1 2 3 4 5

A 1 1 0 0 0

B 0 0 1 0 0

C 1 1 0 1 0

D 0 0 1 0 1

E 1 0 0 0 0

Initially, there is a single clade r, and each node has r as its parent

sort columns
Sort columns
  • Sort columns according to the inclusion property (note that the columns are already sorted here).
  • This can be achieved by considering the columns as binary representations of numbers (most significant bit in row 1) and sorting in decreasing order

1 2 3 4 5

A 1 1 0 0 0

B 0 0 1 0 0

C 1 1 0 1 0

D 0 0 1 0 1

E 1 0 0 0 0

add first column
Add first column
  • In adding column i
    • Check each edge and decide which side you belong.
    • Finally add a node if you can resolve a clade

1 2 3 4 5

A 1 1 0 0 0

B 0 0 1 0 0

C 1 1 0 1 0

D 0 0 1 0 1

E 1 0 0 0 0

r

u

B

D

A

C

E

adding other columns
Adding other columns
  • Add other columns on edges using the ordering property

1 2 3 4 5

A 1 1 0 0 0

B 0 0 1 0 0

C 1 1 0 1 0

D 0 0 1 0 1

E 1 0 0 0 0

r

1

3

E

2

B

5

4

D

A

C

unrooted case
Unrooted case
  • Switch the values in each column, so that 0 is the majority element.
  • Apply the algorithm for the rooted case
handling recombination
Handling recombination
  • A tree is not sufficient as a sequence may have 2 parents
  • Recombination leads to loss of correlation between columns
linkage dis equilibrium ld
Linkage (Dis)-equilibrium (LD)
  • Consider sites A &B
  • Case 1: No recombination
    • Pr[A,B=0,1] = 0.25
      • Linkage disequilibrium
  • Case 2:Extensive recombination
    • Pr[A,B=(0,1)=0.125
      • Linkage equilibrium

A B

0 1

0 1

0 0

0 0

1 0

1 0

1 0

1 0

handling recombination1
Handling recombination
  • A tree is not sufficient as a sequence may have 2 parents
  • Recombination leads to loss of correlation between columns
recombination and populations
Recombination, and populations
  • Think of a population of N individual chromosomes.
  • The population remains stable from generation to generation.
  • Without recombination, each individual has exactly one parent chromosome from the previous generation.
  • With recombinations, each individual is derived from one or two parents.
  • We will formalize this notion later in the context of coalescent theory.
linkage dis equilibrium ld1
Linkage (Dis)-equilibrium (LD)
  • Consider sites A &B
  • Case 1: No recombination
  • Each new individual chromosome chooses a parent from the existing ‘haplotype’

A B

0 1

0 1

0 0

0 0

1 0

1 0

1 0

1 0

1 0

linkage dis equilibrium ld2
Linkage (Dis)-equilibrium (LD)
  • Consider sites A &B
  • Case 2: diploidy and recombination
  • Each new individual chooses a parent from the existing alleles

A B

0 1

0 1

0 0

0 0

1 0

1 0

1 0

1 0

1 1

linkage dis equilibrium ld3
Linkage (Dis)-equilibrium (LD)
  • Consider sites A &B
  • Case 1: No recombination
  • Each new individual chooses a parent from the existing ‘haplotype’
    • Pr[A,B=0,1] = 0.25
      • Linkage disequilibrium
  • Case 2: Extensive recombination
  • Each new individual simply chooses and allele from either site
    • Pr[A,B=(0,1)=0.125
      • Linkage equilibrium

A B

0 1

0 1

0 0

0 0

1 0

1 0

1 0

1 0

slide35
LD
  • In the absence of recombination,
    • Correlation between columns
    • The joint probability Pr[A=a,B=b] is different from P(a)P(b)
  • With extensive recombination
    • Pr(a,b)=P(a)P(b)
measures of ld
Measures of LD
  • Consider two bi-allelic sites with alleles marked with 0 and 1
  • Define
    • P00 = Pr[Allele 0 in locus 1, and 0 in locus 2]
    • P0* = Pr[Allele 0 in locus 1]
  • Linkage equilibrium if P00 = P0* P*0
  • D = abs(P00 - P0* P*0) = abs(P01 - P0* P*1) = …
ld over time
LD over time
  • With random mating, and fixed recombination rate r between the sites, Linkage Disequilibrium will disappear
    • Let D(t) = LD at time t
    • P(t)00 = (1-r) P(t-1)00 + r P(t-1)0* P(t-1)*0
    • D(t) =P(t)00 - P(t)0* P(t)*0 = P(t)00 - P(t-1)0* P(t-1)*0
    • D(t) =(1-r) D(t-1) =(1-r)t D(0)
ld over distance
LD over distance
  • Assumption
    • Recombination rate increases linearly with distance
    • LD decays exponentially with distance.
  • The assumption is reasonable, but recombination rates vary from region to region, adding to complexity
  • This simple fact is the basis of disease association mapping.
ld and disease mapping
LD and disease mapping
  • Consider a mutation that is causal for a disease.
  • The goal of disease gene mapping is to discover which gene (locus) carries the mutation.
  • Consider every polymorphism, and check:
    • There might be too many polymorphisms
    • Multiple mutations (even at a single locus) that lead to the same disease
  • Instead, consider a dense sample of polymorphisms that span the genome
ld can be used to map disease genes
LD can be used to map disease genes
  • LD decays with distance from the disease allele.
  • By plotting LD, one can short list the region containing the disease gene.

LD

D

N

N

D

D

N

0

1

1

0

0

1

ld and disease gene mapping problems
LD and disease gene mapping problems
  • Marker density?
  • Complex diseases
  • Population sub-structure
human samples
Human Samples
  • We look at data from human samples
  • Gabriel et al. Science 2002.
    • 3 populations were sampled at multiple regions spanning the genome
      • 54 regions (Average size 250Kb)
      • SNP density 1 over 2Kb
      • 90 Individuals from Nigeria (Yoruban)
      • 93 Europeans
      • 42 Asian
      • 50 African American
population specific recombination
Population specific recombination
  • D’ was used as the measure between SNP pairs.
  • SNP pairs were classified in one of the following
    • Strong LD
    • Strong evidence for recombination
    • Others (13% of cases)
  • This roughly favors out-of-africa. A Coalescent simulation can help give confidence values on this.

Gabriel et al., Science 2002

haplotype blocks
Haplotype Blocks
  • A haplotype block is a region of low recombination.
    • Define a region as a block if less than 5% of the pairs show strong recombination
  • Much of the genome is in blocks.
  • Distribution of block sizes vary across populations.
testing out of africa
Testing Out-of-Africa
  • Generate simulations with and without migration.
  • Check size of haplotype blocks.
    • Does it vary when migrations are allowed?
    • When the ‘new’ population has a bottleneck?
  • If there was a bottleneck that created European and Asian populations, can we say anything about frequency of alleles that are ‘African specific’?
    • Should they be high frequency, or low frequency in African populations?
haplotype block implications
Haplotype Block: implications
  • The genome is mostly partitioned into haplotype blocks.
  • Within a block, there is extensive LD.
    • Is this good, or bad, for association mapping?
coalescent reconstruction
Coalescent reconstruction
  • Reconstructing likely coalescents
an algorithm for constructing a perfect phylogeny1
An algorithm for constructing a perfect phylogeny
  • We will consider the case where 0 is the ancestral state, and 1 is the mutated state. This will be fixed later.
  • In any tree, each node (except the root) has a single parent.
    • It is sufficient to construct a parent for every node.
  • In each step, we add a column and refine some of the nodes containing multiple children.
  • Stop if all columns have been considered.
inclusion property1
Inclusion Property
  • For any pair of columns i,j
    • i < j if and only if i1 j1
  • Note that if i<j then the edge containing i is an ancestor of the edge containing i

i

j

example1
Example

r

A

B

C

D

E

1 2 3 4 5

A 1 1 0 0 0

B 0 0 1 0 0

C 1 1 0 1 0

D 0 0 1 0 1

E 1 0 0 0 0

Initially, there is a single clade r, and each node has r as its parent

sort columns1
Sort columns
  • Sort columns according to the inclusion property (note that the columns are already sorted here).
  • This can be achieved by considering the columns as binary representations of numbers (most significant bit in row 1) and sorting in decreasing order

1 2 3 4 5

A 1 1 0 0 0

B 0 0 1 0 0

C 1 1 0 1 0

D 0 0 1 0 1

E 1 0 0 0 0

add first column1
Add first column
  • In adding column i
    • Check each edge and decide which side you belong.
    • Finally add a node if you can resolve a clade

1 2 3 4 5

A 1 1 0 0 0

B 0 0 1 0 0

C 1 1 0 1 0

D 0 0 1 0 1

E 1 0 0 0 0

r

u

B

D

A

C

E

adding other columns1
Adding other columns
  • Add other columns on edges using the ordering property

1 2 3 4 5

A 1 1 0 0 0

B 0 0 1 0 0

C 1 1 0 1 0

D 0 0 1 0 1

E 1 0 0 0 0

r

1

3

E

2

B

5

4

D

A

C

unrooted case1
Unrooted case
  • Important point is that the perfect phylogeny condition does not change when you interchange 1s and 0s at a column.
  • Switch the values in each column, so that 0 is the majority element.
  • Apply the algorithm for the rooted case.
  • Homework: show that this is a correct algorithm
population sub structure can increase ld
Population sub-structure can increase LD

Pop. A

0 .. 1

0 .. 1

0 .. 0

1 .. 1

0 .. 1

0 .. 1

0 .. 1

0 .. 1

0 .. 1

p1=0.1

q1=0.9

P11=0.1

D=0.01

Pop. B

1 .. 0

1 .. 0

0 .. 0

1 .. 1

1 .. 0

1 .. 0

1 .. 0

1 .. 0

1 .. 0

p1=0.9

q1=0.1

P11=0.1

D=0.01

  • Consider two populations that were isolated and evolving independently.
  • They might have different allele frequencies in some regions.
  • Pick two regions that are far apart (LD is very low, close to 0)
recent ad mixing of population
Recent ad-mixing of population
  • If the populations came together recently (Ex: African and European population), artificial LD might be created.
  • D = 0.15 (instead of 0.01), increases 10-fold
  • This spurious LD might lead to false associations
  • Other genetic events can cause LD to arise, and one needs to be careful

0 .. 1

0 .. 1

0 .. 0

1 .. 1

0 .. 1

0 .. 1

0 .. 1

0 .. 1

0 .. 1

Pop. A+B

p1=0.5

q1=0.5

P11=0.1

D=0.1-0.25=0.15

1 .. 0

1 .. 0

0 .. 0

1 .. 1

1 .. 0

1 .. 0

1 .. 0

1 .. 0

1 .. 0

determining population sub structure
Determining population sub-structure
  • Given a mix of people, can you sub-divide them into ethnic populations.
  • Turn the ‘problem’ of spurious LD into a clue.
    • Find markers that are too far apart to show LD
    • If they do show LD (correlation), that shows the existence of multiple populations.
    • Sub-divide them into populations so that LD disappears.
determining population sub structure1
Determining Population sub-structure

0 .. 1

0 .. 1

0 .. 0

1 .. 1

0 .. 1

0 .. 1

0 .. 1

0 .. 1

0 .. 1

1 .. 0

1 .. 0

0 .. 0

1 .. 1

1 .. 0

1 .. 0

1 .. 0

1 .. 0

1 .. 0

  • Same example as before:
  • The two markers are too similar to show any LD, yet they do show LD.
  • However, if you split them so that all 0..1 are in one population and all 1..0 are in another, LD disappears
iterative algorithm for population sub structure
Iterative algorithm for population sub-structure
  • Define
  • N = number of individuals (each has a single chromosome)
  • k = number of sub-populations.
  • Z  {1..k}N is a vector giving the sub-population.
    • Zi=k’ => individual i is assigned to population k’
  • Xi,j = allelic value for individual i in position j
  • Pk,j,l = frequency of allele l at position j in population k
example2
Example
  • Ex: consider the following assignment
  • P1,1,0 = 0.9
  • P2,1,0 = 0.1

1

1

1

1

1

1

1

1

1

1

2

2

2

2

2

2

2

2

2

2

0 .. 1

0 .. 1

0 .. 0

1 .. 1

0 .. 1

0 .. 1

0 .. 1

0 .. 1

0 .. 1

0 .. 1

1 .. 0

1 .. 0

0 .. 0

1 .. 1

1 .. 0

1 .. 0

1 .. 0

1 .. 0

1 .. 0

1 .. 0

slide63
Goal
  • X is known.
  • P, Z are unknown.
  • The goal is to estimate Pr(P,Z|X)
  • Various learning techniques can be employed.
    • maxP,Z Pr(X|P,Z) (Max likelihood estimate)
    • maxP,Z Pr(X|P,Z) Pr(P,Z) (MAP)
    • Sample P,Z from Pr(P,Z|X)
  • Here a Bayesian (MCMC) scheme is employed to sample from Pr(P,Z|X). We will only consider a simplified version
algorithm structure
Algorithm:Structure
  • Iteratively estimate
    • (Z(0),P(0)), (Z(1),P(1)),.., (Z(m),P(m))
  • After ‘convergence’, Z(m) is the answer.
  • Iteration
    • Guess Z(0)
    • For m = 1,2,..
      • Sample P(m) from Pr(P | X, Z(m-1))
      • Sample Z(m) from Pr(Z | X, P(m))
  • How is this sampling done?
example3
Example
  • Choose Z at random, so each individual is assigned to be in one of 2 populations. See example.
  • Now, we need to sample P(1) from Pr(P | X, Z(0))
  • Simply count
  • Nk,j,l = number of people in pouplation k which have allele l in position j
  • pk,j,l = Nk,j,l / N

1

2

2

1

1

2

1

2

1

2

1

2

2

1

1

2

1

2

2

1

0 .. 1

0 .. 1

0 .. 0

1 .. 1

0 .. 1

0 .. 1

0 .. 1

0 .. 1

0 .. 1

0 .. 1

1 .. 0

1 .. 0

0 .. 0

1 .. 1

1 .. 0

1 .. 0

1 .. 0

1 .. 0

1 .. 0

1 .. 0

example4
Example
  • Nk,j,l = number of people in population k which have allele l in position j
  • pk,j,l = Nk,j,l / Nk,j,*
  • N1,1,0 = 4
  • N1,1,1 = 6
  • p1,1,0 = 4/10
  • p1,2,0 = 4/10
  • Thus, we can sample P(m)

1

2

2

1

1

2

1

2

1

2

1

2

2

1

1

2

1

2

2

1

0 .. 1

0 .. 1

0 .. 0

1 .. 1

0 .. 1

0 .. 1

0 .. 1

0 .. 1

0 .. 1

0 .. 1

1 .. 0

1 .. 0

0 .. 0

1 .. 1

1 .. 0

1 .. 0

1 .. 0

1 .. 0

1 .. 0

1 .. 0

sampling z
Sampling Z
  • Pr[Z1 = 1] = Pr[”01” belongs to population 1]?
  • We know that each position should be in linkage equilibrium and independent.
  • Pr[”01” |Population 1] = p1,1,0 * p1,2,1 =(4/10)*(6/10)=(0.24)
  • Pr[”01” |Population 2] = p2,1,0 * p2,2,1 = (6/10)*(4/10)=0.24
  • Pr [Z1 = 1] = 0.24/(0.24+0.24) = 0.5

Assuming, HWE, and LE

sampling
Sampling
  • Suppose, during the iteration, there is a bias.
  • Then, in the next step of sampling Z, we will do the right thing
  • Pr[“01”| pop. 1] = p1,1,0 * p1,2,1 = 0.7*0.7 = 0.49
  • Pr[“01”| pop. 2] = p2,1,0 * p2,2,1 =0.3*0.3 = 0.09
  • Pr[Z1 = 1] = 0.49/(0.49+0.09) = 0.85
  • Pr[Z6 = 1] = 0.49/(0.49+0.09) = 0.85
  • Eventually all “01” will become 1 population, and all “10” will become a second population

1

1

1

2

1

2

1

2

1

1

2

2

2

1

2

2

1

2

2

1

0 .. 1

0 .. 1

0 .. 0

1 .. 1

0 .. 1

0 .. 1

0 .. 1

0 .. 1

0 .. 1

0 .. 1

1 .. 0

1 .. 0

0 .. 0

1 .. 1

1 .. 0

1 .. 0

1 .. 0

1 .. 0

1 .. 0

1 .. 0

allowing for admixture
Allowing for admixture
  • Define qi,k as the fraction of individual i that originated from population k.
  • Iteration
    • Guess Z(0)
    • For m = 1,2,..
      • Sample P(m),Q(m) from Pr(P,Q | X, Z(m-1))
      • Sample Z(m) from Pr(Z | X, P(m),Q(m))
estimating z admixture case
Estimating Z (admixture case)
  • Instead of estimating Pr(Z(i)=k|X,P,Q), (origin of individual i is k), we estimate Pr(Z(i,j,l)=k|X,P,Q)

i,1

i,2

j

results thrush data
Results: Thrush data
  • For each individual, q(i) is plotted as the distance to the opposite side of the triangle.
  • The assignment is reliable, and there is evidence of admixture.
population structure
Population Structure
  • 377 locations (loci) were sampled in 1000 people from 52 populations.
  • 6 genetic clusters were obtained, which corresponded to 5 geographic regions (Rosenberg et al. Science 2003)

Oceania

Eurasia

East Asia

America

Africa

population sub structure research problem
Population sub-structure:research problem
  • Systematically explore the effect of admixture. Can admixture be predicted for a locus, or for an individual
  • The sampling approach may or may not be appropriate. Formulate as an optimization/learning problem:
    • (w/out admixture). Assign individuals to sub-populations so as to maximize linkage equilibrium, and hardy weinberg equilibrium in each of the sub-populations
    • (w/ admixture) Assign (individuals, loci) to sub-populations