
Human Population Genetics: Basic Principles and Analytical Methods


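Graduate course in Human Population Genetics, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences

CAS-MPG Partner Institute for Computational Biology

Xu Shuhua (徐书华), Jin Li (金力)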

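Analysis of Population Genetic Structure (I)

- Basic concepts in population genetics (4)
- Population genetic structure
- Statistics describing population genetic structure
- Hierarchical F statistics

- Software demonstration
- Computing hierarchical F statistics for human populations with Arlequin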

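- Finding structure in differences!
- Genetic structure is the pattern in which genetic polymorphism is distributed differently across time and space.
- Time: different eras; different generations.

- Space: different geographic distributions; different populations within the same region; different genomic regions.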

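- Human origins, migration, evolutionary history, and prospects
- Relationships among modern populations (ethnic groups)
- Genetic basis and gene mapping of complex diseases
- Cancer
- Obesity
- Asthma
- Psychiatric disorders
- Type II diabetes
- Cardiovascular diseases

- Public health care
- Personalized medication and personalized treatment
- Forensics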

- Population structures and association studies

- Population stratification in Epidemiology.
- Analysis of mixed samples having different allele frequencies is a primary concern in human genetics, as it leads to false evidence for allelic association.

| Exposure | Disease: yes | Disease: no | Total |
|---|---|---|---|
| yes | a | b | a + b |
| no | c | d | c + d |
| total | a + c | b + d | a + b + c + d |

Odds for case: a/c

Odds for control: b/d

Odds ratio: OR = (a/c) / (b/d) = ad/bc

- OR>1: exposure factors increase the risk of disease; positive association
- OR<1: exposure factors decrease the risk of disease; negative association
- OR=1: no association

Example

Odds for case 50:50 = 1

Odds for control 20:80 = 0.25

Odds ratio = 50:50/20:80 = 1/0.25 = 4

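Subpopulation 1

| | case | control | total |
|---|---|---|---|
| exp(+) | 50 | 50 | 100 |
| exp(-) | 450 | 450 | 900 |
| total | 500 | 500 | 1,000 |

Subpopulation 2

| | case | control | total |
|---|---|---|---|
| exp(+) | 1 | 9 | 10 |
| exp(-) | 99 | 891 | 990 |
| total | 100 | 900 | 1,000 |

Total Population

| | case | control | total |
|---|---|---|---|
| exp(+) | 51 | 59 | 110 |
| exp(-) | 549 | 1,341 | 1,890 |
| total | 600 | 1,400 | 2,000 |

Exposure frequency in cases: 51/600 = 8.5%

Exposure frequency in controls: 59/1,400 = 4.2%

Within each subpopulation OR = 1, but in the pooled sample the ratio of exposure frequencies is 8.5%/4.2% ≈ 2.02 and the odds ratio is (51 × 1,341)/(549 × 59) ≈ 2.11. The apparent association is produced entirely by stratification.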
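The stratification example above can be checked numerically. The following is a minimal sketch; the helper name `odds_ratio` and the tuple layout are illustrative choices, not part of the original slides.

```python
def odds_ratio(exposed_cases, exposed_controls, unexposed_cases, unexposed_controls):
    """Odds ratio (a/c) / (b/d) = ad / bc for a 2x2 exposure-by-disease table."""
    return (exposed_cases * unexposed_controls) / (unexposed_cases * exposed_controls)

# Each tuple is (a, b, c, d):
# exposed cases, exposed controls, unexposed cases, unexposed controls.
subpop1 = (50, 50, 450, 450)
subpop2 = (1, 9, 99, 891)

# Pooling the two subpopulations cell by cell gives the "Total Population" table.
pooled = tuple(x + y for x, y in zip(subpop1, subpop2))

or1 = odds_ratio(*subpop1)
or2 = odds_ratio(*subpop2)
or_pooled = odds_ratio(*pooled)

print("OR subpopulation 1:", or1)
print("OR subpopulation 2:", or2)
print("OR pooled:", round(or_pooled, 2))
```

Neither subpopulation shows any association on its own; the pooled odds ratio departs from 1 only because the two subpopulations differ in both exposure frequency and case/control ratio.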

- Anatomically modern humans evolve in Africa > 160,000 ybp.
- Some leave Africa sometime around 75,000 - 55,000 ybp.
- Replace Neanderthals in Europe and archaic humans around the world.
- Arrive in Western hemisphere between 34,000 and 18,000 ybp.
- Multiple migrations in different pre-historic periods, followed by different migrations in historical periods.

- morphology (phenotype)
- Geographical location
- Population based (frequency of genes)

Socially Constructed Race: Arbitrarily utilizes aspects of morphology, geography, culture, language, religion, etc. in the service of a social dominance hierarchy.

NIH & University of Michigan

- Hierarchical F statistics

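- Fixation index (F):
- For a locus with two alleles, any deviation from Hardy-Weinberg proportions can be measured by a parameter F, called the fixation index. With x the frequency of allele A1, the genotype frequencies of A1A1, A1A2, and A2A2 are given by:

$$X_{11} = (1-F)x^2 + Fx, \quad X_{12} = 2(1-F)x(1-x), \quad X_{22} = (1-F)(1-x)^2 + F(1-x)$$

- From the second expression:

$$F = \frac{h - h_0}{h}$$

where $h = 2x(1-x)$ is the expected heterozygote frequency under random mating and $h_0 = X_{12}$ is the observed heterozygote frequency in the population.

- The fixation index F can be positive or negative, depending on the situation.
- When $h_0$ is smaller than $h$, F is positive; when $h_0$ is larger than $h$, F is negative. Under inbreeding the observed heterozygote frequency decreases, so F is positive.

The expression above can be written as

$$F = 1 - \frac{h_0}{h}$$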

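- The above considers a single simple population, whether or not it is inbred.
- In practice, however, most natural populations are subdivided into many breeding units or subpopulations, even though these are not completely isolated from one another. In this situation it becomes important to study genetic variation both within and between populations.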

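- Suppose a population is divided into s subpopulations, each in Hardy-Weinberg equilibrium. Let $x_k$ be the frequency of allele A1 in the k-th subpopulation; the frequencies of genotypes A1A1, A1A2, and A2A2 in that subpopulation are

$$x_k^2, \quad 2x_k(1-x_k), \quad (1-x_k)^2$$

- Let $w_k$ denote the relative size of the k-th subpopulation, with $\sum_k w_k = 1$. The frequencies of A1A1, A1A2, and A2A2 in the whole population are then:

$$X_{11} = \sum_k w_k x_k^2 = \bar{x}^2 + \sigma^2$$

$$X_{12} = 2\sum_k w_k x_k(1-x_k) = 2\bar{x}(1-\bar{x}) - 2\sigma^2$$

$$X_{22} = \sum_k w_k (1-x_k)^2 = (1-\bar{x})^2 + \sigma^2$$

where $\bar{x} = \sum_k w_k x_k$ and $\sigma^2 = \sum_k w_k (x_k - \bar{x})^2$ are the mean and variance of the allele frequency among subpopulations.

- Comparing these with the fixation-index form of the heterozygote frequency, $X_{12} = 2(1-F)\bar{x}(1-\bar{x})$, gives

$$F = \frac{\sigma^2}{\bar{x}(1-\bar{x})}$$

Since $\sigma^2 \ge 0$, we know that $F \ge 0$.

Wahlund's principle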

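- This shows that when a population is divided into several mating units, the frequency of homozygotes is higher than the Hardy-Weinberg proportion. This property was first found by Wahlund (1928) and is known as Wahlund's principle, or the Wahlund effect.
- When the allele frequency is the same in all subpopulations, F is 0; when every subpopulation is fixed for one allele or the other, F is 1.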
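The Wahlund effect described above can be illustrated with a small numerical sketch. It assumes the standard relation F = σ² / [x̄(1 − x̄)], where x̄ and σ² are the weighted mean and variance of the allele frequency among subpopulations; the function name `wahlund` and the example frequencies are hypothetical.

```python
def wahlund(freqs, weights):
    """Return (mean, variance, F) of an allele frequency across subpopulations.

    freqs:   allele frequency x_k in each subpopulation
    weights: relative sizes w_k (normalised here so they sum to 1)
    F is undefined when the mean frequency is 0 or 1 (division by zero).
    """
    total = sum(weights)
    w = [wk / total for wk in weights]
    mean = sum(wk * xk for wk, xk in zip(w, freqs))
    var = sum(wk * (xk - mean) ** 2 for wk, xk in zip(w, freqs))
    return mean, var, var / (mean * (1 - mean))

# Two equally sized subpopulations, each in Hardy-Weinberg equilibrium.
freqs = [0.2, 0.8]
x_bar, sigma2, F = wahlund(freqs, [1, 1])

# Heterozygote frequency actually present in the pooled population...
het_pooled = sum(0.5 * 2 * x * (1 - x) for x in freqs)
# ...versus the Hardy-Weinberg expectation from the pooled allele frequency.
het_hw = 2 * x_bar * (1 - x_bar)

print("F =", round(F, 2))
print("pooled heterozygosity =", round(het_pooled, 2))
print("HW expectation =", round(het_hw, 2))
```

The pooled heterozygosity equals the Hardy-Weinberg expectation multiplied by (1 − F), which is the heterozygote deficit (homozygote excess) that the principle predicts.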

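- This indicates the existence of population structure!
- Conversely, when F is negative, the heterozygote frequency is higher than expected under Hardy-Weinberg equilibrium, which suggests heterozygote advantage, that is, some form of natural selection.

Heterozygote advantage and balancing selection are discussed in detail in the later section on natural selection.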

Sewall Wright

1889-1988

- Different F-statistics for different scales
- Individual (I)
- Subpopulation (S)
- Total population (T)

- Those are the traditional scales, but in theory there is no limit to the number of levels of analysis.
- Originally defined for 2 alleles
- Extended to >2 alleles as G-statistics

- FIS
- inbreeding in individuals relative to subpopulation (Weir and Cockerham’s f)

- FST
- inbreeding among subpopulations relative to total population (Weir and Cockerham’s θ)

- FIT
- inbreeding among individuals relative to total population (Weir and Cockerham’s F)

- Remember that inbreeding coefficient, F, is related to loss of heterozygosity
F = 1 – (Ho/He)

- F-statistics can be expressed in the same way
FIS = 1 – (HI/HS)

FST = 1 – (HS/HT)

FIT = 1 – (HI/HT)

HI = Ho averaged across subpopulations

HS = He averaged across subpopulations

HT = He for the total population

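Deficit of heterozygotes

[Figure: two subpopulations, one containing only AA individuals and the other only aa individuals]

Subpopulation 1: P(A) = p = 1, P(a) = q = 0

Subpopulation 2: p = 0, q = 1

$H_S$ = $H_e$ within a subpopulation

$H_S = 1 - \sum p_i^2 = 1 - (1^2 + 0^2) = 0$ in each subpopulation, so mean $H_S = 0$

$H_T$ = $H_e$ for the total population

For the total population, p = 0.5 and q = 0.5

$H_T = 1 - \sum p_i^2 = 1 - [(0.5)^2 + (0.5)^2] = 0.5$

$F_{ST} = 1 - (H_S/H_T) = 1 - (0/0.5) = 1$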

Deficit of homozygotes

[Figure: two subpopulations, each containing only Aa individuals]

Subpopulation 1: P(A) = p = 0.5, P(a) = q = 0.5

Subpopulation 2: p = 0.5, q = 0.5

$H_S = 1 - \sum p_i^2 = 1 - (0.5^2 + 0.5^2) = 0.5$ in each subpopulation, so mean $H_S = 0.5$

$H_T$ = $H_e$ for the total population

For the total population, p = 0.5 and q = 0.5

$H_T = 1 - \sum p_i^2 = 1 - [(0.5)^2 + (0.5)^2] = 0.5$

$F_{ST} = 1 - (H_S/H_T) = 1 - (0.5/0.5) = 0$

[Figure: two subpopulations containing a mix of AA, Aa, and aa individuals]

Subpopulation 1: P(A) = p = 0.5, P(a) = q = 0.5

Subpopulation 2: p = 0.5, q = 0.5

$H_S = 1 - \sum p_i^2 = 1 - (0.5^2 + 0.5^2) = 0.5$ in each subpopulation, so mean $H_S = 0.5$

$H_T$ = $H_e$ for the total population

For the total population, p = 0.5 and q = 0.5

$H_T = 1 - \sum p_i^2 = 1 - [(0.5)^2 + (0.5)^2] = 0.5$

$F_{ST} = 1 - (H_S/H_T) = 1 - (0.5/0.5) = 0$

FST uses expected heterozygosity, not observed heterozygosity!!
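The worked examples above can be reproduced with a short sketch that computes FST = 1 – (HS/HT) from subpopulation allele frequencies alone. It assumes a biallelic locus and equally sized subpopulations; the function names are illustrative.

```python
def expected_het(p):
    """Expected heterozygosity 1 - sum(p_i^2) at a biallelic locus with allele frequency p."""
    return 1 - (p ** 2 + (1 - p) ** 2)

def fst_from_freqs(freqs):
    """FST = 1 - HS/HT, weighting all subpopulations equally.

    Only allele frequencies enter the calculation; observed genotype
    counts are never used, which is the point made on the slide.
    """
    h_s = sum(expected_het(p) for p in freqs) / len(freqs)   # mean within-subpopulation He
    p_total = sum(freqs) / len(freqs)                         # allele frequency in the total population
    h_t = expected_het(p_total)                               # He for the total population
    return 1 - h_s / h_t

fixed_differently = fst_from_freqs([1.0, 0.0])   # one subpopulation all AA, the other all aa
same_frequency = fst_from_freqs([0.5, 0.5])      # p = 0.5 in both, whatever the genotypes

print(fixed_differently, same_frequency)
```

Because any two sets of subpopulations with p = 0.5 give the same value, an all-Aa population and a mixed AA/Aa/aa population are indistinguishable to FST; distinguishing them requires FIS, which uses observed heterozygosity.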

- FIS tells us if there is inbreeding within subpopulations by comparing HI and HS:

$$F_{IS} = \frac{\bar{H}_S - \bar{H}_I}{\bar{H}_S}$$

- Bars mean that the values are the averages over all the subpopulations that we are considering.
- So FIS measures whether there is, on average, a deficit of heterozygotes within subpopulations.

- FST is the statistic that tells us how differentiated the subpopulations are. Formally, FST tells us if there is a deficit of heterozygotes in the metapopulation, due to differentiation among subpopulations:

$$F_{ST} = \frac{H_T - \bar{H}_S}{H_T}$$

- Bars mean that the values are the averages over all the subpopulations that we are considering.

- FIT tells us how much population structure has affected the average heterozygosity of individuals within the population:

$$F_{IT} = \frac{H_T - \bar{H}_I}{H_T}$$

- Also (1 - FIS)(1 - FST) = (1 - FIT).

- FIS = departure from HW in local subpopulations
- FST = genetic divergence among subpopulations
- FIT = total departure from HW including that within and among subpopulations

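[Figure: hierarchy of levels. FIS relates individuals to subpopulations (inbreeding); FST relates subpopulations to the total population (Wahlund effect or fragmentation); FIT relates individuals to the total population.]

1 – FIT = (1 – FST)(1 – FIS)

FIT = FIS + FST – (FIS)(FST)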

- FST = (FIT - FIS) / (1 - FIS)
- FST is never negative as a parameter (sample estimates can fall slightly below zero)
- FIS is frequently positive; it is negative if there is systematic avoidance of inbreeding
- FIT is positive unless there are no clear subdivisions and inbreeding is avoided
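As a sketch of how the three statistics fit together, the code below computes FIS, FST, and FIT from genotype counts using the heterozygosity definitions given earlier (FIS = 1 – HI/HS, FST = 1 – HS/HT, FIT = 1 – HI/HT). It assumes a biallelic locus and weights subpopulations equally; the genotype counts are made up for illustration.

```python
def f_statistics(subpops):
    """Return (FIS, FST, FIT) from a list of (n_AA, n_Aa, n_aa) genotype counts.

    Subpopulations are weighted equally, and no small-sample correction
    is applied; this is a textbook illustration, not an unbiased estimator.
    """
    h_obs, h_exp, freqs = [], [], []
    for n_AA, n_Aa, n_aa in subpops:
        n = n_AA + n_Aa + n_aa
        p = (2 * n_AA + n_Aa) / (2 * n)      # frequency of allele A
        h_obs.append(n_Aa / n)               # observed heterozygosity
        h_exp.append(2 * p * (1 - p))        # expected heterozygosity
        freqs.append(p)
    k = len(subpops)
    h_i = sum(h_obs) / k                     # HI: mean observed heterozygosity
    h_s = sum(h_exp) / k                     # HS: mean expected heterozygosity
    p_bar = sum(freqs) / k
    h_t = 2 * p_bar * (1 - p_bar)            # HT: expected heterozygosity of the total
    return 1 - h_i / h_s, 1 - h_s / h_t, 1 - h_i / h_t

# Hypothetical counts: both subpopulations show a heterozygote deficit,
# and they also differ in allele frequency (0.5 versus 0.2).
f_is, f_st, f_it = f_statistics([(40, 20, 40), (10, 20, 70)])

print(round(f_is, 3), round(f_st, 3), round(f_it, 3))
print("identity holds:", abs((1 - f_is) * (1 - f_st) - (1 - f_it)) < 1e-12)
```

With these definitions the identity (1 – FIS)(1 – FST) = (1 – FIT) holds exactly, because HS cancels in the product.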

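- FST can also be viewed as the standardized variance of allele frequencies across subpopulations: FST = Var(q) / [q̄(1 – q̄)]
- When the whole population is in HW (all subpopulations share the same allele frequency), Var(q) = 0, therefore FST = 0
- As Var(q) increases, divergence of subpopulations increases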

- The proportion of total genetic variation that is distributed among subpopulations, rather than within subpopulations.

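- Unbiased estimates of FST were calculated as described by Weir and Hill (2002).
- Suppose we have r subpopulation samples indexed by i (i = 1, …, r) with sample sizes $n_i$. We denote the sample allele frequency as $\tilde{p}_i$, and the average frequency over samples as

$$\bar{p} = \frac{\sum_i n_i \tilde{p}_i}{\sum_i n_i}$$

- The observed mean square within populations is denoted by MSG:

$$MSG = \frac{1}{\sum_i (n_i - 1)} \sum_i n_i \tilde{p}_i (1 - \tilde{p}_i)$$

- The observed mean square between populations is denoted by MSP:

$$MSP = \frac{1}{r - 1} \sum_i n_i (\tilde{p}_i - \bar{p})^2$$

- Then FST can be estimated as follows:

$$\hat{F}_{ST} = \frac{MSP - MSG}{MSP + (n_c - 1)\,MSG}$$

where $n_c$ is the average sample size across samples that also incorporates and corrects for the variance in sample size over subpopulations:

$$n_c = \frac{1}{r - 1}\left(\sum_i n_i - \frac{\sum_i n_i^2}{\sum_i n_i}\right)$$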
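The estimator described above can be sketched in a few lines. The equations did not survive in the slide text, so the code assumes the commonly cited single-allele form: MSP = Σ nᵢ(pᵢ – p̄)² / (r – 1), MSG = Σ nᵢpᵢ(1 – pᵢ) / Σ(nᵢ – 1), n_c = (Σ nᵢ – Σ nᵢ² / Σ nᵢ) / (r – 1), and FST = (MSP – MSG) / (MSP + (n_c – 1) MSG). The function name and inputs are hypothetical.

```python
def fst_mean_squares(freqs, sizes):
    """Moment estimate of FST from sample allele frequencies and sample sizes.

    freqs: sample frequency of one allele in each of r subpopulation samples
    sizes: corresponding sample sizes n_i
    Assumes the mean-square form stated in the lead-in; the result is
    undefined if every sample is fixed for the same allele.
    """
    r = len(freqs)
    n_total = sum(sizes)
    p_bar = sum(n * p for n, p in zip(sizes, freqs)) / n_total

    # Mean square between populations.
    msp = sum(n * (p - p_bar) ** 2 for n, p in zip(sizes, freqs)) / (r - 1)
    # Mean square within populations.
    msg = sum(n * p * (1 - p) for n, p in zip(sizes, freqs)) / sum(n - 1 for n in sizes)
    # Average sample size, corrected for variance in sample size.
    n_c = (n_total - sum(n ** 2 for n in sizes) / n_total) / (r - 1)

    return (msp - msg) / (msp + (n_c - 1) * msg)

fully_differentiated = fst_mean_squares([1.0, 0.0], [100, 100])
identical_samples = fst_mean_squares([0.5, 0.5], [100, 100])

print(fully_differentiated, identical_samples)
```

Note that identical sample frequencies give a slightly negative estimate rather than zero. This is expected of an unbiased estimator: sampling noise is subtracted even when no between-sample difference is observed.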

- Assumes Infinite Alleles Model (IAM) or K-alleles model with very low mutation rates (not appropriate for microsat data)
- All alleles differ equally from each other (magnitude of difference between alleles ignored)
- Does not work well with high heterozygosity
- Assumes alleles arrive in population via migration rather than mutation

- RST (Slatkin 1995)
- Analogue of FST
- Assumes Stepwise Mutation Model (mutation model most appropriate for microsats)
- Allows for high mutation rates
- Allows differences in magnitude between alleles to be accounted for
- RST = (S – SW)/S, where S = average sum of squared differences in allele sizes in the total population, and SW = average sum of squared differences within populations

- FST and RST can differ using same data
- If loci don’t conform to SMM model, RST will be underestimated
- If mutation rates are large relative to migration rates, RST is superior
- Longer divergence times between populations favors RST
- RST favored under ideal conditions and with large samples
- FST favored with small samples and when a more conservative estimator is desired

- (δµ)² (Goldstein et al. 1995): ∑(µx – µy)²/L
- µx is the mean allele size in population x
- µy is the mean allele size in population y
- Summed across all loci and divided by # of loci (L)
- Allele size expressed as # repeat units
- Stepwise mutation model (SMM)
- E[(δµ)²] = 2αt
- α = mutation rate per generation
- t = # generations

- Problems
- α not constant among different loci
- Variance very high
- µsats don’t strictly follow the SMM
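The Goldstein et al. distance above is simple to compute from allele sizes. The sketch below averages the squared difference in mean allele size over loci and then inverts E = 2αt to get a rough divergence time. The data layout, the function name, and the mutation rate are all hypothetical; as the problems listed above indicate, a single α for all loci is itself a strong assumption.

```python
def delta_mu_squared(pop_x, pop_y):
    """Average over loci of the squared difference in mean allele size.

    pop_x, pop_y: one list per locus, each holding the allele sizes
    (in repeat units) sampled from that population.
    """
    if len(pop_x) != len(pop_y):
        raise ValueError("both populations must be typed at the same loci")
    total = 0.0
    for alleles_x, alleles_y in zip(pop_x, pop_y):
        mu_x = sum(alleles_x) / len(alleles_x)
        mu_y = sum(alleles_y) / len(alleles_y)
        total += (mu_x - mu_y) ** 2
    return total / len(pop_x)

# Made-up allele sizes at two microsatellite loci.
pop_x = [[10, 11, 12, 11], [20, 20, 21, 21]]
pop_y = [[12, 13, 14, 13], [20, 21, 21, 22]]

distance = delta_mu_squared(pop_x, pop_y)

# Invert E = 2 * alpha * t with an assumed per-generation mutation rate.
alpha = 0.001  # hypothetical value, for illustration only
generations = distance / (2 * alpha)

print(distance, generations)
```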

- DSA (Bowcock et al., 1994, Nature)
- SA = shared alleles
- PSA = (∑S)/2U
- Where S = # shared alleles at a locus between 2 populations
- U = # loci

- DSA = 1 – PSA
- IAM
- May be superior to µ2 for closely related populations, even for µsat data

According to Sewall Wright:

- FST ranges from 0-1
- 0 = no genetic differentiation; panmixia
- 0.00–0.05 = little genetic diff
- 0.05-0.15 = moderate genetic diff
- 0.15-0.25 = great genetic diff
- 0.25-1.00 = very great genetic diff
- 1 = complete genetic differentiation

Chromosome 21 SNP data

#Asian
Group ={
"CHB"
"JPT"
"CHU"
"HMO"
"AVA"
}

#European
Group ={
"CEU"
"NEuro"
"Basque"
"Italian"
"Hungarian"
}

#African
Group ={
"YRI"
}

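[Figure: six subpopulations with allele frequencies and sizes (p = 0.7, N = 15), (p = 0.4, N = 70), (p = 0.6, N = 50), (p = 0.3, N = 10), (p = 0.5, N = 150), and (p = 1.0, N = 20), linked by migration at rates m = 0.02, 0.07, and 0.01]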

- Drift makes subpopulations different
- Migration homogenizes subpopulations

- If you know FST and Ne, you can calculate m

In addition, very little migration is required to prevent substantial genetic divergence among subpopulations resulting from random genetic drift

$$F_{ST} \approx \frac{1}{4Nm + 1}$$

[Figure: equilibrium fixation index (FST) plotted against the number of migrants per generation]

- Indirect (based on FST)
- Nm = (1 - FST)/(4 FST)
- Some drawbacks but often acceptable if limitations are considered
- High variance at low values of FST
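The indirect estimate is just the island-model equilibrium FST ≈ 1/(4Nm + 1) solved for Nm. A minimal sketch of both directions follows; the function names are illustrative, and the result inherits every assumption of the island model.

```python
def fst_from_nm(nm):
    """Equilibrium FST under the island model for Nm migrants per generation."""
    return 1.0 / (4.0 * nm + 1.0)

def nm_from_fst(fst):
    """Indirect estimate of Nm from FST, i.e. (1 - FST) / (4 FST)."""
    if not 0.0 < fst <= 1.0:
        raise ValueError("FST must lie in (0, 1]")
    return (1.0 - fst) / (4.0 * fst)

# One migrant per generation corresponds to FST = 0.2.
print(fst_from_nm(1))

# The curve is very steep near zero, so small errors in FST
# translate into large errors in the inferred Nm.
for fst in (0.01, 0.02, 0.05, 0.2):
    print(fst, round(nm_from_fst(fst), 2))
```

Going from FST = 0.01 to 0.02 roughly halves the inferred Nm, which is the high variance at low FST noted above.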

- Assumptions of model not realistic
- All populations have same N
- Nm is equal among all demes
- Mutations do not occur
- Markers are truly neutral
- Selection is not operating (local adaptation causes an overestimate of FST and an underestimate of Nm; uniform selection causes an underestimate of FST and an overestimate of Nm)
- Recent isolation of demes won’t be detected

- Related to gene flow on evolutionary time scales
- Not appropriate for ecological time scales
- Ignores ongoing dynamics in allele frequencies (rare alleles)

- Best in situations where
- Spatial scale small (island model holds and spatially varying selection unlikely)
- Migration rates high (rapid attainment of genetic equilibrium)
- Sample sizes and number of loci used are large - accuracy of estimates
- Long-term estimate of Nem “averaged” over many generations desired
- Not useful for short-term nonequilibrium situations e.g. recently fragmented, rapidly declining populations

- If Ne and m are small, FST is large
- If Nem < 1 then
- FST > 0.2
- “If there is > 1 migrant per generation, populations do not diverge much.”

[Figure: fixation index (FST) plotted against the number of migrants per generation (Nm), from 0 to 10]

- From this analysis emerged a genetic rule of thumb that one migrant individual per local population per generation (OMPG) is sufficient to obscure any disruptive effects of drift.

Biologists concerned with population insularization caused by habitat fragmentation began advocating the application of this principle for conservation purposes:

1. Mace and Lande (1991) used the OMPG rule as a criterion in defining threatened species categories of the World Conservation Union

2. In the U.S. nearly every recovery plan that considers genetic issues and insularization applies the OMPG rule

3. Widely applied by managers charged with initiating connectivity between isolated populations - e.g., reduce concerns about inbreeding depression

With one migrant per generation, it is unlikely that polymorphism will be lost within subpopulations, that is, that they will reach equilibrium gene frequencies where one allele or the other is lost or “fixed”.

This provides a desirable balance between drift and gene flow by preventing the loss of alleles and minimizing loss of heterozygosity within subpopulations while allowing genetic divergence to exist among subpopulations.

How much gene flow might be too much?

This is difficult to answer without extensive genetic and demographic information on the population.

Frankel and Soule (1981) proposed an upper limit of 5 migrants per generation.

Mills and Allendorf (1996) suggest that a minimum of 1 and a maximum of 10 migrants per generation would be the appropriate general rule of thumb for genetic purposes.

Mutation has the same effect

[Figure: fixation index (FST) plotted against the number of mutations per generation (Nu), from 0 to 10]

- Arlequin 3.01
- http://anthro.unige.ch/software/arlequin/

- Population structure analysis using HapMap data
- http://www.hapmap.org