Graduate course, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences: Human Population Genetics

Human Population Genetics: Basic Principles and Analytical Methods

CAS-MPG Partner Institute for Computational Biology

Shuhua Xu (徐书华), Li Jin (金力)


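Lecture 6

Analysis of Population Genetic Structure (I)


Lecture 4

  • Basic concepts in population genetics (4)

  • Population genetic structure

  • Statistics for describing population genetic structure

    • Hierarchical F statistics

  • Software demonstration

    • Computing hierarchical F statistics for human populations with Arlequin
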

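What is genetic structure?

  • Discover structure from differences!

  • Genetic structure is the pattern in which genetic polymorphism is distributed differently across time and space.

    • Time: different eras; different generations.

    • Space: different geographic distributions; different populations within the same region; different genomic regions.


Population structure
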

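Why study human genetic structure?

  • Human origins, migration, evolutionary history, and future prospects

  • Relationships among modern populations (ethnic groups)

  • Genetic basis and gene mapping of complex diseases

    • Cancer

    • Obesity

    • Asthma

    • Mental illness

    • Type II diabetes

    • Cardiovascular disease

  • Public health care

  • Personalized medication and personalized treatment

  • Forensics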


An example

  • Population structures and association studies


Population structures make trouble in association studies

  • Population stratification in Epidemiology.

  • Analysis of mixed samples having different allele frequencies is a primary concern in human genetics, as it leads to false evidence for allelic association.


Odds ratio

                 Disease: yes   Disease: no   Total
Exposure: yes    a              b             a + b
Exposure: no     c              d             c + d
Total            a + c          b + d         a + b + c + d

Odds for case: a/c

Odds for control: b/d

Odds ratio: OR = (a/c) / (b/d) = ad / bc


Explanation of OR

  • OR>1: exposure factors increase the risk of disease; positive association

  • OR<1: exposure factors decrease the risk of disease; negative association

  • OR=1: no association


Example

Odds for case 50:50 = 1

Odds for control 20:80 = 0.25

Odds ratio = 50:50/20:80 = 1/0.25 = 4
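The arithmetic in this example can be checked with a minimal Python sketch. The function name `odds_ratio` and its argument layout (a, b = exposed cases and controls; c, d = unexposed cases and controls) are illustrative choices that follow the 2x2 table on the odds ratio slide, not code from the lecture.

```python
def odds_ratio(a: float, b: float, c: float, d: float) -> float:
    """Odds ratio for a 2x2 exposure-by-disease table.

    a: exposed cases      b: exposed controls
    c: unexposed cases    d: unexposed controls
    """
    odds_case = a / c        # odds of exposure among cases
    odds_control = b / d     # odds of exposure among controls
    return odds_case / odds_control


# Numbers from the example: cases 50:50, controls 20:80.
print(odds_ratio(a=50, b=20, c=50, d=80))  # 4.0
```

An OR of 4 is well above 1, which under the interpretation given earlier would be read as a positive association between exposure and disease.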


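Heterogeneity/Stratification

Subpopulation 1

            case    control   total
exp(+)        50         50     100
exp(-)       450        450     900
total        500        500   1,000

Subpopulation 2

            case    control   total
exp(+)         1          9      10
exp(-)        99        891     990
total        100        900   1,000

Total Population

            case    control   total
exp(+)        51         59     110
exp(-)       549      1,341   1,890
total        600      1,400   2,000

Exposure frequency among cases: 51/600 = 8.5%

Exposure frequency among controls: 59/1,400 = 4.2%

Within each subpopulation OR = 1, yet in the pooled sample the exposure frequency among cases is 2.02 times that among controls (8.5%/4.2%), and OR = (51 × 1,341)/(59 × 549) ≈ 2.11.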
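The stratification effect in these tables can be reproduced with a short Python sketch. The counts are taken from the slide; the helper `odds_ratio` and the tuple layout (exposed cases, exposed controls, unexposed cases, unexposed controls) are assumptions made for illustration.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio (a*d)/(b*c); a, b = exposed cases/controls, c, d = unexposed."""
    return (a * d) / (b * c)


# (exposed cases, exposed controls, unexposed cases, unexposed controls)
sub1 = (50, 50, 450, 450)
sub2 = (1, 9, 99, 891)

# Pooling the two subpopulations cell by cell gives the "Total Population" table.
pooled = tuple(x + y for x, y in zip(sub1, sub2))

or_sub1 = odds_ratio(*sub1)
or_sub2 = odds_ratio(*sub2)
or_pooled = odds_ratio(*pooled)

print(pooled)                # (51, 59, 549, 1341)
print(or_sub1, or_sub2)      # 1.0 1.0
print(round(or_pooled, 2))   # 2.11
```

Neither subpopulation shows any association on its own, but the pooled table gives an odds ratio of about 2.11 (and an exposure-frequency ratio of about 2.02), purely because the two subpopulations differ in both exposure frequency and case proportion.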


Human migration

  • Anatomically modern humans evolve in Africa > 160,000 ybp.

  • Some leave Africa sometime around 75,000 - 55,000 ybp.

  • Replace Neanderthals in Europe and archaic humans around the world.

  • Arrive in Western hemisphere between 34,000 and 18,000 ybp.

  • Multiple migrations in different pre-historic periods, followed by different migrations in historical periods.


Note on Definitions: Biological Race

  • morphology (phenotype)

  • Geographical location

  • Population based (frequency of genes)

Socially Constructed Race: Arbitrarily utilizes aspects of morphology, geography, culture, language, religion, etc. in the service of a social dominance hierarchy.


NIH & University of Michigan


Statistics for describing genetic structure

  • Hierarchical F statistics


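Fixation index

  • Fixation index (F):

  • For a locus with two alleles, A1 and A2, any deviation from Hardy-Weinberg proportions can be measured by a parameter F, called the fixation index. With x denoting the frequency of allele A1, the genotype frequencies are given by:

    $X_{11} = x^2 + F x (1 - x)$
    $X_{12} = 2 x (1 - x)(1 - F)$
    $X_{22} = (1 - x)^2 + F x (1 - x)$

  • From the second equation:

    $F = \frac{2x(1 - x) - X_{12}}{2x(1 - x)}$

  • This can be written as

    $F = \frac{h - h_0}{h}$

    where $h = 2x(1 - x)$ is the expected frequency of heterozygotes under random mating and $h_0 = X_{12}$ is the observed frequency of heterozygotes in the population.

  • The fixation index F can be positive or negative, depending on the situation.

  • When h0 is smaller than h, F is positive; when h0 is larger than h, F is negative. Under inbreeding, the observed heterozygote frequency decreases, so F is positive.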
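As a minimal sketch of the sign behaviour described above, the following Python function assumes the standard form F = (h − h0)/h, with h = 2x(1 − x) the Hardy-Weinberg expectation for a biallelic locus. The function name and the example values are hypothetical.

```python
def fixation_index(h_obs: float, x: float) -> float:
    """F = (h - h0) / h, where h = 2x(1 - x) is the expected heterozygote
    frequency under random mating and h0 = h_obs is the observed one."""
    h_exp = 2 * x * (1 - x)
    if h_exp == 0:
        raise ValueError("locus is monomorphic; F is undefined")
    return (h_exp - h_obs) / h_exp


# Allele frequency 0.5, so the Hardy-Weinberg expectation is h = 0.5.
print(fixation_index(h_obs=0.25, x=0.5))  # 0.5  (heterozygote deficit)
print(fixation_index(h_obs=0.50, x=0.5))  # 0.0  (Hardy-Weinberg proportions)
print(fixation_index(h_obs=0.75, x=0.5))  # -0.5 (heterozygote excess)
```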


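Sub-population

  • So far we have considered a single, simple population, whether or not it is inbred.

  • In practice, however, most natural populations can be subdivided into many breeding units, or sub-populations, even though these are not completely isolated from one another. In this situation it is important to study genetic variation both within and between populations.
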

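Genotype frequencies in a subdivided population

  • Suppose a population is divided into s subpopulations, each in Hardy-Weinberg equilibrium. Let $x_k$ be the frequency of allele A1 in the k-th subpopulation. The frequencies of genotypes A1A1, A1A2, and A2A2 in that subpopulation are then

    $x_k^2, \quad 2 x_k (1 - x_k), \quad (1 - x_k)^2$

  • Let $w_k$ denote the relative size of the k-th subpopulation, with $\sum_k w_k = 1$. The frequencies of A1A1, A1A2, and A2A2 in the whole population are:

    $X_{11} = \sum_k w_k x_k^2 = \bar{x}^2 + \sigma^2$
    $X_{12} = 2 \sum_k w_k x_k (1 - x_k) = 2\bar{x}(1 - \bar{x}) - 2\sigma^2$
    $X_{22} = \sum_k w_k (1 - x_k)^2 = (1 - \bar{x})^2 + \sigma^2$

    where $\bar{x} = \sum_k w_k x_k$ and $\sigma^2 = \sum_k w_k (x_k - \bar{x})^2$ are the mean and variance of the allele frequency among subpopulations.
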

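Fixation index in a subdivided population

  • Comparing the whole-population genotype frequencies, e.g. $X_{11} = \bar{x}^2 + \sigma^2$, with their fixation-index form, $X_{11} = \bar{x}^2 + F \bar{x}(1 - \bar{x})$, we see that

    $\sigma^2 = F \bar{x}(1 - \bar{x})$

    and therefore

    $F = \frac{\sigma^2}{\bar{x}(1 - \bar{x})}$
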

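Wahlund's law

  • Because the variance of allele frequency among subpopulations cannot be negative, F ≥ 0. This shows that if a population is divided into multiple mating units, the frequency of homozygotes is higher than the Hardy-Weinberg proportion. This property was first discovered by Wahlund (1928) and is known as Wahlund's law, or the Wahlund effect.

  • When allele frequencies are identical in all subpopulations, F is 0; when every subpopulation is fixed for one allele or the other, F is 1.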
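The two limiting cases can be checked numerically. The sketch below assumes the usual Wahlund relation F = σ²/(x̄(1 − x̄)), where x̄ and σ² are the weighted mean and variance of the allele frequency among subpopulations, each of which is in Hardy-Weinberg proportions. The function name `wahlund` and the example frequencies are hypothetical.

```python
import math


def wahlund(freqs, weights=None):
    """Return (mean, variance, F, pooled heterozygosity) for a list of
    subpopulation allele frequencies, each subpopulation being in
    Hardy-Weinberg proportions. Equal weights are assumed by default."""
    s = len(freqs)
    if weights is None:
        weights = [1.0 / s] * s
    mean = sum(w * x for w, x in zip(weights, freqs))
    var = sum(w * (x - mean) ** 2 for w, x in zip(weights, freqs))
    # Heterozygote frequency actually present in the pooled population.
    het_pooled = sum(w * 2 * x * (1 - x) for w, x in zip(weights, freqs))
    denom = mean * (1 - mean)
    f = var / denom if denom > 0 else math.nan
    return mean, var, f, het_pooled


for freqs in ([0.5, 0.5], [0.2, 0.8], [1.0, 0.0]):
    mean, var, f, het = wahlund(freqs)
    # The pooled heterozygosity equals 2 * mean * (1 - mean) * (1 - F).
    print(freqs, round(f, 3), round(het, 3))
```

With identical frequencies F is 0, with subpopulations fixed for different alleles F is 1, and intermediate divergence gives an intermediate F together with a pooled heterozygote deficit.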


Back to inbreeding


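Implications of the Wahlund effect

  • A heterozygote deficit (positive F) can reveal the existence of population structure!

  • Conversely, when F is negative, the heterozygote frequency is higher than expected under Hardy-Weinberg equilibrium, which suggests heterozygote advantage, that is, some degree of natural selection.

Heterozygote advantage and balancing selection (discussed in detail in the later chapter on natural selection)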


Wright’s Fixation Index (FST)

Sewall Wright

1889-1988


F-statistics

  • Different F-statistics for different scales

    • Individual (I)

    • Subpopulation (S)

    • Total population (T)

  • Those are the traditional scales, but in theory there is no limit to the number of levels of analysis.

  • Originally defined for 2 alleles

  • Extended to >2 alleles as G-statistics


F-statistics Derived from inbreeding coefficient

  • FIS

    • inbreeding in individuals relative to subpopulation (Weir and Cockerham’s f)

  • FST

    • inbreeding among subpopulations relative to total population (Weir and Cockerham’s θ)

  • FIT

    • inbreeding among individuals relative to total population (Weir and Cockerham’s F)


  • Remember that inbreeding coefficient, F, is related to loss of heterozygosity

    F = 1 – (Ho/He)

  • F-statistics can be expressed in the same way

    FIS = 1 – (HI/HS)

    FST = 1 – (HS/HT)

    FIT = 1 – (HI/HT)

HI = Ho averaged across subpopulations

HS = He averaged across subpopulations

HT = He for the total population
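A minimal Python sketch of these three definitions follows. The heterozygosity values are hypothetical numbers chosen only to illustrate the calculation, and the function name `f_statistics` is an assumption.

```python
def f_statistics(h_i: float, h_s: float, h_t: float):
    """Return (FIS, FST, FIT) from the three heterozygosity levels:
    h_i: observed heterozygosity averaged across subpopulations (HI)
    h_s: expected heterozygosity averaged across subpopulations (HS)
    h_t: expected heterozygosity of the total population (HT)"""
    f_is = 1 - h_i / h_s
    f_st = 1 - h_s / h_t
    f_it = 1 - h_i / h_t
    return f_is, f_st, f_it


# Hypothetical values: HI = 0.30, HS = 0.40, HT = 0.50.
f_is, f_st, f_it = f_statistics(h_i=0.30, h_s=0.40, h_t=0.50)
print(round(f_is, 3), round(f_st, 3), round(f_it, 3))  # 0.25 0.2 0.4
```

With these numbers (1 − FIS)(1 − FST) = 0.75 × 0.8 = 0.6 = 1 − FIT, so the three statistics are not independent.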


Deficit of heterozygotes

[Figure: two subpopulations, one containing only AA individuals and the other only aa individuals]

Subpopulation fixed for A: P(A) = p = 1, P(a) = q = 0

Subpopulation fixed for a: p = 0, q = 1

HS = He within subpopulation

$H_S = 1 - \sum_i p_i^2 = 1 - (1^2 + 0^2) = 0$ in each subpopulation

Mean HS = 0

HT = He for total population

For total population, p = 0.5 & q = 0.5

$H_T = 1 - \sum_i p_i^2 = 1 - [(0.5)^2 + (0.5)^2] = 0.5$

FST = 1 – (HS/HT) = 1 – (0/0.5) = 1


Deficit of homozygotes

[Figure: two subpopulations, each containing only Aa individuals]

In each subpopulation: P(A) = p = 0.5, P(a) = q = 0.5

HS = He within subpopulation

$H_S = 1 - \sum_i p_i^2 = 1 - (0.5^2 + 0.5^2) = 0.5$ in each subpopulation

Mean HS = 0.5

HT = He for total population

For total population, p = 0.5 & q = 0.5

$H_T = 1 - \sum_i p_i^2 = 1 - [(0.5)^2 + (0.5)^2] = 0.5$

FST = 1 – (HS/HT) = 1 – (0.5/0.5) = 0


[Figure: two subpopulations containing AA, Aa, and aa individuals, each with p = q = 0.5]

In each subpopulation: P(A) = p = 0.5, P(a) = q = 0.5

$H_S = 1 - \sum_i p_i^2 = 1 - (0.5^2 + 0.5^2) = 0.5$ in each subpopulation

Mean HS = 0.5

HT = He for total population

For total population, p = 0.5 & q = 0.5

$H_T = 1 - \sum_i p_i^2 = 1 - [(0.5)^2 + (0.5)^2] = 0.5$

FST = 1 – (HS/HT) = 1 – (0.5/0.5) = 0

FST uses expected heterozygosity, not observed heterozygosity!!
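The worked examples on these slides can be reproduced with a short Python sketch that uses only allele frequencies, which is exactly why observed genotype counts do not enter FST. It assumes a biallelic locus, equally sized subpopulations, and a polymorphic total population; the function names are illustrative.

```python
def expected_het(p: float) -> float:
    """Expected heterozygosity 1 - sum(p_i^2) for a biallelic locus."""
    return 1 - (p ** 2 + (1 - p) ** 2)


def fst_from_freqs(freqs):
    """FST = 1 - HS/HT from subpopulation allele frequencies.

    Assumes equally sized subpopulations and HT > 0.
    """
    h_s = sum(expected_het(p) for p in freqs) / len(freqs)  # mean HS
    p_total = sum(freqs) / len(freqs)                       # pooled frequency
    h_t = expected_het(p_total)                             # HT
    return 1 - h_s / h_t


# Subpopulations fixed for different alleles (all AA versus all aa).
print(fst_from_freqs([1.0, 0.0]))  # 1.0
# Both subpopulations at p = 0.5, whatever their genotype composition.
print(fst_from_freqs([0.5, 0.5]))  # 0.0
```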


F statistics

  • FIS tells us if there is inbreeding within subpopulations by comparing HI and HS:

    $F_{IS} = \frac{\bar{H}_S - \bar{H}_I}{\bar{H}_S}$

  • Bars mean that the values are the averages over all the subpopulations that we are considering.

  • So FIS measures whether there is, on average, a deficit of heterozygotes within subpopulations.


F statistics

  • FST is the statistic that tells us how differentiated the subpopulations are. Formally, FST tells us if there is a deficit of heterozygotes in the metapopulation, due to differentiation among subpopulations:

    $F_{ST} = \frac{H_T - \bar{H}_S}{H_T}$

  • Bars mean that the values are the averages over all the subpopulations that we are considering.


F statistics

  • FIT tells us how much population structure has affected the average heterozygosity of individuals within the population:

    $F_{IT} = \frac{H_T - \bar{H}_I}{H_T}$

  • Also (1 – FIS)(1 – FST) = (1 – FIT).


F-statistics Measure departure from Hardy-Weinberg equilibrium

  • FIS = departure from HW in local subpopulations

  • FST = genetic divergence among subpopulations

  • FIT = total departure from HW including that within and among subpopulations


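Partitioning of structure

[Figure: hierarchy of Individuals, Subpopulations, and Total population. FIS (inbreeding) relates individuals to subpopulations, FST (Wahlund effect or fragmentation) relates subpopulations to the total population, and FIT spans individuals to the total population.]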

1 – FIT= (1 – FST)(1 – FIS)

FIT = FIS + FST – (FIS)(FST)


The three F statistics are related to each other

  • FST = (FIT - FIS) / (1 - FIS)

  • FST is never negative as a parameter (it is 0 when all subpopulations share the same allele frequencies)

  • FIS is frequently positive, and is negative if there is systematic avoidance of inbreeding

  • FIT is positive unless there are no clear subdivisions and inbreeding is avoided
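The expression for FST in terms of FIT and FIS follows in two steps from the multiplicative relation (1 − FIT) = (1 − FIS)(1 − FST), assuming FIS ≠ 1:

```latex
\begin{align}
1 - F_{IT} &= (1 - F_{IS})(1 - F_{ST}) \\
1 - F_{ST} &= \frac{1 - F_{IT}}{1 - F_{IS}} \\
F_{ST} &= 1 - \frac{1 - F_{IT}}{1 - F_{IS}} = \frac{F_{IT} - F_{IS}}{1 - F_{IS}}
\end{align}
```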


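Extensions

  • FST can also be expressed through the variance of allele frequencies across subpopulations:

    $F_{ST} = \frac{\mathrm{Var}(q)}{\bar{q}(1 - \bar{q})}$

  • When the total population is in Hardy-Weinberg proportions (all subpopulations share the same allele frequency), Var(q) = 0, therefore FST = 0

  • As Var(q) increases, divergence of subpopulations increases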


Intuitive meaning of FST

  • The proportion of total genetic variation that is distributed among subpopulations, rather than within subpopulations.


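Unbiased estimates of FST

  • Unbiased estimates of FST were calculated as described by Weir and Hill 2002.

  • Suppose we have r subpopulations (indexed by i = 1, …, r), with sample size $n_i$ in the i-th. We denote the sample allele frequency in subpopulation i as $\tilde{p}_i$, and the average frequency over samples as

    $\bar{p} = \frac{\sum_{i=1}^{r} n_i \tilde{p}_i}{\sum_{i=1}^{r} n_i}$


  • The observed mean square for loci within populations is denoted by MSG:

    $MSG = \frac{1}{\sum_{i=1}^{r} (n_i - 1)} \sum_{i=1}^{r} n_i \tilde{p}_i (1 - \tilde{p}_i)$


  • The observed mean square between populations is denoted by MSP:

    $MSP = \frac{1}{r - 1} \sum_{i=1}^{r} n_i (\tilde{p}_i - \bar{p})^2$


  • Then FST can be estimated as follows:

    $\hat{F}_{ST} = \frac{MSP - MSG}{MSP + (n_c - 1)\,MSG}$

    where $n_c$ is the average sample size across samples that also incorporates and corrects for the variance in sample size over subpopulations:

    $n_c = \frac{1}{r - 1} \left( \sum_{i=1}^{r} n_i - \frac{\sum_{i=1}^{r} n_i^2}{\sum_{i=1}^{r} n_i} \right)$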
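The estimator can be sketched in Python as follows. This assumes the commonly cited single-locus, biallelic form built from MSG, MSP, and the corrected average sample size n_c, namely FST = (MSP − MSG) / (MSP + (n_c − 1) MSG); the function name, argument names, and example inputs are hypothetical, and the sketch is not taken from the lecture or from any particular software package.

```python
def fst_weir_hill(freqs, sizes):
    """Estimate FST at one biallelic locus.

    freqs: sample allele frequency in each subpopulation
    sizes: sample size of each subpopulation (same order as freqs)
    Assumes at least two subpopulations and a locus that is polymorphic
    somewhere in the sample (otherwise the denominator is zero).
    """
    r = len(freqs)
    n_total = sum(sizes)
    # Sample-size-weighted average allele frequency.
    p_bar = sum(n * p for n, p in zip(sizes, freqs)) / n_total
    # MSG: observed mean square within populations.
    msg = sum(n * p * (1 - p) for n, p in zip(sizes, freqs)) / sum(n - 1 for n in sizes)
    # MSP: observed mean square between populations.
    msp = sum(n * (p - p_bar) ** 2 for n, p in zip(sizes, freqs)) / (r - 1)
    # n_c: average sample size corrected for variance in sample size.
    n_c = (n_total - sum(n * n for n in sizes) / n_total) / (r - 1)
    return (msp - msg) / (msp + (n_c - 1) * msg)


print(fst_weir_hill([1.0, 0.0], [10, 10]))            # fixed differences
print(round(fst_weir_hill([0.5, 0.5], [10, 10]), 3))  # identical frequencies
```

Note that the estimate can be slightly negative when sample frequencies are identical, which is a property of the sampling correction rather than a sign of negative differentiation.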


Problems with FST

  • Assumes Infinite Alleles Model (IAM) or K-alleles model with very low mutation rates (not appropriate for microsat data)

  • All alleles differ equally from each other (magnitude of difference between alleles ignored)

  • Does not work well with high heterozygosity

  • Assumes alleles arrive in population via migration rather than mutation


Special version for microsatellites

  • RST (Slatkin 1995)

  • Analogue of FST

  • Assumes Stepwise Mutation Model (mutation model most appropriate for microsats)

  • Allows for high mutation rates

  • Allows differences in magnitude between alleles to be accounted for

  • RST = (S – SW) / S, where S = average sum of squared differences in allele size in the total population, and SW = average sum of squared differences within populations


Which to use for microsatellites?

  • FST and RST can differ when computed from the same data

  • If loci don’t conform to the SMM, RST will be underestimated

  • If mutation rates are large relative to migration rates, RST is superior

  • Longer divergence times between populations favor RST

  • RST favored under ideal conditions and with large samples

  • FST favored with small samples and when a more conservative estimator is desired


Distance measures for microsatellites

  • µ2 (Goldstein et al. 1995): ∑(µx – µy)² / L

    • µx is the mean allele size in population x

    • µy is the mean allele size in population y

    • Summed across all loci and divided by # of loci (L)

    • Allele size expressed as # repeat units

    • Stepwise mutation model (SMM)

    • E(µ2) = 2αt

      • α = mutation rate per generation

      • t = # generations

    • Problems

      • α not constant among different loci

      • Variance very high

      • µsats don’t strictly follow the SMM
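The distance defined above can be sketched in Python. The data layout (one list of allele sizes, in repeat units, per locus and per population) and the function name are assumptions made for illustration, and the allele sizes are invented.

```python
def delta_mu_squared(pop_x, pop_y):
    """Squared difference in mean allele size, summed over loci and
    divided by the number of loci L.

    pop_x, pop_y: lists of loci; each locus is a list of allele sizes
    expressed in repeat units. Both populations must share the same loci.
    """
    if len(pop_x) != len(pop_y):
        raise ValueError("both populations must be typed at the same loci")
    total = 0.0
    for alleles_x, alleles_y in zip(pop_x, pop_y):
        mu_x = sum(alleles_x) / len(alleles_x)  # mean allele size in x
        mu_y = sum(alleles_y) / len(alleles_y)  # mean allele size in y
        total += (mu_x - mu_y) ** 2
    return total / len(pop_x)


# Two hypothetical loci: means 11 vs 13 at the first, 21 vs 21 at the second.
pop_x = [[10, 12, 11, 11], [20, 22]]
pop_y = [[13, 13, 12, 14], [21, 21]]
print(delta_mu_squared(pop_x, pop_y))  # 2.0
```

Under the stepwise mutation model the expectation of this quantity is 2αt, so with a known per-generation mutation rate the distance can in principle be converted into a divergence time, subject to the problems listed above.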


Distance measures for microsatellites

  • DSA (Bowcock et al., 1994, Nature)

    • SA = shared alleles

    • PSA = (∑S)/2U

      • Where S = # shared alleles at a locus between 2 populations

      • U = # loci

    • DSA = 1 –PSA

    • IAM

    • May be superior to µ2 for closely related populations, even for µsat data


Degree of F statistics

According to Sewall Wright:

  • FST ranges from 0-1

  • 0 = no genetic differentiation; panmixia

  • 0.00–0.05 = little genetic diff

  • 0.05-0.15 = moderate genetic diff

  • 0.15-0.25 = great genetic diff

  • 0.25-1.00 = very great genetic diff

  • 1 = complete genetic differentiation
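These qualitative ranges can be turned into a small lookup. In this Python sketch the function name is hypothetical, and because the ranges on the slide share their endpoints, assigning a boundary value such as 0.05 to the higher category is an assumption.

```python
def describe_fst(fst: float) -> str:
    """Map an FST value to the qualitative scale attributed to Wright.

    Boundary values (0.05, 0.15, 0.25) are assigned to the higher
    category; the slide does not say which side they belong to.
    """
    if not 0.0 <= fst <= 1.0:
        raise ValueError("FST must lie between 0 and 1")
    if fst == 0.0:
        return "no genetic differentiation (panmixia)"
    if fst == 1.0:
        return "complete genetic differentiation"
    if fst < 0.05:
        return "little genetic differentiation"
    if fst < 0.15:
        return "moderate genetic differentiation"
    if fst < 0.25:
        return "great genetic differentiation"
    return "very great genetic differentiation"


for value in (0.0, 0.03, 0.12, 0.2, 0.4, 1.0):
    print(value, "->", describe_fst(value))
```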


Calculate hierarchical FST by Arlequin

Chromosome 21 SNP data

    #Asian
    Group ={
        "CHB"
        "JPT"
        "CHU"
        "HMO"
        "AVA"
    }

    #European
    Group ={
        "CEU"
        "NEuro"
        "Basque"
        "Italian"
        "Hungarian"
    }

    #African
    Group ={
        "YRI"
    }


AMOVA analysis of populations from three continents


Meta-population structure: Drift within populations, migration between populations

[Figure: six subpopulations labeled with allele frequency p and size N (p=0.7, N=15; p=0.4, N=70; p=0.6, N=50; p=0.3, N=10; p=0.5, N=150; p=1.0, N=20), connected by migration rates m=.02, m=.07, and m=.01]


Drift and migration have opposite effects

  • Drift makes subpopulations different

  • Migration homogenizes subpopulations


Useful for estimating gene flow

  • If you know FST and Ne, you can calculate m


In addition, very little migration is required to prevent substantial genetic divergence among subpopulations resulting from random genetic drift


This can be shown by the following equation:

    $F_{ST} \approx \frac{1}{4Nm + 1}$

where $F_{ST}$ is the equilibrium fixation index and $Nm$ is the number of migrants per generation.


Estimation of gene flow

  • Indirect (based on FST)

    • Nm = (1 - FST)/4FST

    • Some drawbacks but often acceptable if limitations are considered

    • High variance at low values of FST
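The indirect estimate and its sensitivity at low FST can be illustrated with a minimal Python sketch of the equilibrium relation FST ≈ 1/(4Nm + 1) and its inverse Nm = (1 − FST)/(4 FST). The function names and the example values are hypothetical.

```python
def fst_equilibrium(nm: float) -> float:
    """Equilibrium FST for Nm migrants per generation: 1 / (4Nm + 1)."""
    return 1.0 / (4.0 * nm + 1.0)


def nm_from_fst(fst: float) -> float:
    """Indirect estimate of gene flow: Nm = (1 - FST) / (4 FST)."""
    if not 0.0 < fst <= 1.0:
        raise ValueError("FST must be in (0, 1] for this estimate")
    return (1.0 - fst) / (4.0 * fst)


# FST drops quickly as the number of migrants per generation grows.
for nm in (0.1, 1, 10):
    print(nm, round(fst_equilibrium(nm), 3))   # 0.714, 0.2, 0.024

# Small changes in a low FST translate into large changes in Nm.
for fst in (0.01, 0.02, 0.2):
    print(fst, round(nm_from_fst(fst), 2))     # 24.75, 12.25, 1.0
```

One migrant per generation already brings the equilibrium FST down to 0.2, and halving an FST of 0.02 roughly doubles the inferred Nm, which is one way to see why the estimate has high variance when FST is low.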


Problems with FST

  • Assumptions of model not realistic

    • All populations have same N

    • Nm is equal among all demes

    • Mutations do not occur

    • Markers are truly neutral

    • Selection not operating (local adaptation causes overestimation of FST and underestimation of Nm; uniform selection causes underestimation of FST and overestimation of Nm)

    • Recent isolation of demes won’t be detected

  • Related to gene flow on evolutionary time scales

  • Not appropriate for ecological time scales

    • Ignores ongoing dynamics in allele frequencies (rare alleles)


  • Best in situations where

    • Spatial scale small (island model holds and spatially varying selection unlikely)

    • Migration rates high (rapid attainment of genetic equilibrium)

    • Sample sizes and number of loci used are large - accuracy of estimates

    • Long-term estimate of Nem “averaged” over many generations desired

    • Not useful for short-term nonequilibrium situations e.g. recently fragmented, rapidly declining populations


Population differentiation under migration and drift

  • If Ne and m are small, FST is large

  • If Nem < 1, then FST > 0.2

  • “If there is > 1 migrant per generation, populations do not diverge much.”


[Figure: fixation index FST (y-axis) against the number of migrants per generation, Nm (x-axis, 0 to 10)]


OMPG rule of thumb

  • From this analysis emerged a genetic rule of thumb that one migrant individual per local population per generation (OMPG) is sufficient to obscure any disruptive effects of drift.


Biologists concerned with population insularization caused by habitat fragmentation began advocating the application of this principle for conservation purposes


Examples:

1. Mace and Lande (1991) used the OMPG rule as a criterion in defining threatened species categories of the World Conservation Union

2. In the U.S. nearly every recovery plan that considers genetic issues and insularization applies the OMPG rule

3. Widely applied by managers charged with initiating connectivity between isolated populations - e.g., reduce concerns about inbreeding depression


Important Aspects of OMPG

Unlikely that polymorphism will be lost within subpopulations - unlikely to reach equilibrium gene frequencies where one allele or the other is lost or “fixed”

Provides a desirable balance between drift and gene flow by preventing the loss of alleles and minimizing loss of heterozygosity within subpopulations but allowing genetic divergence to exist among subpopulations


How much gene flow might be too much?

Difficult to answer without extensive genetic and demographic information on the population


Frankel and Soulé (1981) proposed an upper limit of 5 migrants per generation.

Mills and Allendorf (1996) suggest that a minimum of 1 and a maximum of 10 migrants per generation would be the appropriate general rule of thumb for genetic purposes.


Mutation has the same effect

[Figure: fixation index FST (y-axis) against the number of mutations per generation, Nu (x-axis, 0 to 10)]


Commonly used software

  • Arlequin 3.01

    • http://anthro.unige.ch/software/arlequin/


Exercise

  • Perform a population structure analysis using HapMap data:

    • http://www.hapmap.org