Human Population Genetics: Basic Principles and Analytical Methods

Graduate course, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences: Human Population Genetics • CAS-MPG Partner Institute for Computational Biology • Shuhua Xu, Li Jin

Lecture 6: Analysis of Population Genetic Structure (I)

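Lecture 4 • Basic concepts in population genetics (4) • Population genetic structure • Statistics describing population genetic structure • Hierarchical F statistics • Software demonstration • Computing hierarchical F statistics for human populations with Arlequin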

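What is genetic structure? • Finding structure in differences! • Genetic structure is the pattern in which genetic polymorphism is distributed differently across time and space. • Time: different eras; different generations. • Space: different geographic distributions; different populations in the same region; different genomic regions.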

Population structure

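Why study human genetic structure? • Human origins, migration, evolutionary history, and future prospects • Relationships among modern populations (ethnic groups) • Genetic basis and gene mapping of complex diseases • Cancer • Obesity • Asthma • Psychiatric disorders • Type II diabetes • Cardiovascular disease • Public health care • Personalized medication and personalized treatment • Forensics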

An example • Population structures and association studies

Population structures make trouble in association studies • Population stratification in Epidemiology. • Analysis of mixed samples having different allele frequencies is a primary concern in human genetics, as it leads to false evidence for allelic association.

Odds ratio

                 Disease
Exposure     yes      no       total
yes          a        b        a + b
no           c        d        c + d
total        a + c    b + d    a + b + c + d

• Odds for case: a/c • Odds for control: b/d • Odds ratio: $\mathrm{OR} = \dfrac{a/c}{b/d} = \dfrac{ad}{bc}$

Explanation of OR • OR>1: exposure factors increase the risk of disease; positive association • OR<1: exposure factors decrease the risk of disease; negative association • OR=1: no association

Example • Odds for case: 50:50 = 1 • Odds for control: 20:80 = 0.25 • Odds ratio = (50:50)/(20:80) = 1/0.25 = 4
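The arithmetic in this example can be reproduced with a few lines of Python. This is a minimal sketch; the function name `odds_ratio` and the cell labels a, b, c, d (following the 2x2 table layout of exposed/unexposed by case/control) are illustrative choices.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 exposure-by-disease table.

    a = exposed cases,   b = exposed controls,
    c = unexposed cases, d = unexposed controls.
    """
    odds_case = a / c        # odds of exposure among cases
    odds_control = b / d     # odds of exposure among controls
    return odds_case / odds_control


# Slide example: cases 50 exposed vs 50 unexposed, controls 20 vs 80.
print(odds_ratio(50, 20, 50, 80))  # 4.0
```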

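Heterogeneity/Stratification

Subpopulation 1
             case     control   total
exp(+)       50       50        100
exp(-)       450      450       900
total        500      500       1,000

Subpopulation 2
             case     control   total
exp(+)       1        9         10
exp(-)       99       891       990
total        100      900       1,000

Total population
             case     control   total
exp(+)       51       59        110
exp(-)       549      1,341     1,890
total        600      1,400     2,000

• Exposure frequency in cases: 51/600 = 8.5% • Exposure frequency in controls: 59/1,400 = 4.2% • Within each subpopulation OR = 1, yet in the pooled sample the ratio of exposure frequencies is 8.5%/4.2% ≈ 2.02 and OR = (51 × 1,341)/(59 × 549) ≈ 2.11.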
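The stratification effect in these tables can be checked directly. The sketch below uses the cell counts from the slide; the tuple layout (exposed cases, exposed controls, unexposed cases, unexposed controls) and the function name are assumptions made for illustration.

```python
def odds_ratio(a, b, c, d):
    """OR = ad / bc for a 2x2 table (a, b = exposed; c, d = unexposed)."""
    return (a * d) / (b * c)


# (exposed cases, exposed controls, unexposed cases, unexposed controls)
sub1 = (50, 50, 450, 450)
sub2 = (1, 9, 99, 891)

# Pooling the two subpopulations cell by cell gives the "total population" table.
pooled = tuple(x + y for x, y in zip(sub1, sub2))

print(odds_ratio(*sub1))              # 1.0: no association in subpopulation 1
print(odds_ratio(*sub2))              # 1.0: no association in subpopulation 2
print(round(odds_ratio(*pooled), 2))  # 2.11: spurious association after pooling
```

The association appears only because the two subpopulations differ both in exposure frequency and in the case/control ratio.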

Human migration • Anatomically modern humans evolve in Africa > 160,000 ybp. • Some leave Africa sometime around 75,000 - 55,000 ybp. • Replace Neanderthals in Europe and archaic humans around the world. • Arrive in Western hemisphere between 34,000 and 18,000 ybp. • Multiple migrations in different pre-historic periods, followed by different migrations in historical periods.

Note on Definitions: Biological Race • morphology (phenotype) • Geographical location • Population based (frequency of genes) Socially Constructed Race: Arbitrarily utilizes aspects of morphology, geography, culture, language, religion, etc. in the service of a social dominance hierarchy.

NIH & University of Michigan

Statistics describing genetic structure • Hierarchical F statistics

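Fixation index • Fixation index (F): • For a locus with two alleles, A1 and A2, any deviation from Hardy-Weinberg proportions can be measured by a parameter F, called the fixation index. With x the frequency of A1, the frequencies of genotypes A1A1, A1A2, and A2A2 are: $X_{11} = (1-F)x^2 + Fx$, $X_{12} = 2(1-F)x(1-x)$, $X_{22} = (1-F)(1-x)^2 + F(1-x)$ • From the second equation: $F = \dfrac{2x(1-x) - X_{12}}{2x(1-x)} = \dfrac{h - h_0}{h}$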

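$h = 2x(1-x)$: expected frequency of heterozygotes under random mating, where x is the frequency of allele A1; $h_0 = X_{12}$: observed frequency of heterozygotes in the population • The fixation index F can be positive or negative, depending on the situation. • When $h_0 < h$, F is positive; when $h_0 > h$, F is negative. Under inbreeding the observed heterozygote frequency decreases, so F is positive. The genotype frequencies can also be written as: $X_{11} = x^2 + Fx(1-x)$, $X_{12} = 2x(1-x) - 2Fx(1-x)$, $X_{22} = (1-x)^2 + Fx(1-x)$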
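As a small illustration of F = (h - h0)/h, the sketch below estimates the fixation index from genotype counts at a biallelic locus. The function name and the three example count sets are hypothetical and chosen only to show the sign behaviour described above.

```python
def fixation_index(n11, n12, n22):
    """F = (h - h0) / h from counts of A1A1, A1A2, and A2A2 individuals."""
    n = n11 + n12 + n22
    x = (2 * n11 + n12) / (2 * n)   # frequency of allele A1
    h = 2 * x * (1 - x)             # expected heterozygosity under random mating
    h0 = n12 / n                    # observed heterozygosity
    if h == 0:
        raise ValueError("locus is monomorphic; F is undefined")
    return (h - h0) / h


print(round(fixation_index(25, 50, 25), 3))  # Hardy-Weinberg proportions: F is 0
print(round(fixation_index(30, 40, 30), 3))  # heterozygote deficit: F is positive
print(round(fixation_index(10, 80, 10), 3))  # heterozygote excess: F is negative
```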

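Subpopulations • So far we have considered a single simple population, whether or not it is inbred. • In practice, however, most natural populations can be subdivided into many breeding units, or subpopulations, even though these are not completely isolated from one another. In that case it becomes important to study genetic variation within and between populations.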

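Genotype frequencies in a subdivided population • Suppose a population is divided into s subpopulations, each in Hardy-Weinberg equilibrium. Let $x_k$ be the frequency of allele A1 in the k-th subpopulation; the frequencies of genotypes A1A1, A1A2, and A2A2 are then $x_k^2$, $2x_k(1-x_k)$, and $(1-x_k)^2$. • Let $w_k$ denote the relative size of the k-th subpopulation, with $\sum_k w_k = 1$. The frequencies of A1A1, A1A2, and A2A2 in the whole population are: $X_{11} = \sum_k w_k x_k^2 = \bar{x}^2 + \sigma^2$, $X_{12} = 2\sum_k w_k x_k(1-x_k) = 2\bar{x}(1-\bar{x}) - 2\sigma^2$, $X_{22} = \sum_k w_k (1-x_k)^2 = (1-\bar{x})^2 + \sigma^2$, where $\bar{x}$ and $\sigma^2$ are the mean and variance of the allele frequency among subpopulations.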

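Fixation index in a subdivided population • Comparing the whole-population frequency $X_{11} = \bar{x}^2 + \sigma^2$ with the general form $X_{11} = x^2 + Fx(1-x)$ (and likewise for $X_{12}$ and $X_{22}$), we see that $F = \dfrac{\sigma^2}{\bar{x}(1-\bar{x})}$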

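Wahlund's principle • The relation $F = \sigma^2 / [\bar{x}(1-\bar{x})] \ge 0$ shows that if a population is divided into several mating units, the frequency of homozygotes is higher than the Hardy-Weinberg proportion. This property was first noted by Wahlund (1928) and is called Wahlund's principle, or the Wahlund effect. • When allele frequencies are the same in all subpopulations, F is 0; when every subpopulation is fixed for one allele or the other, F is 1.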
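The Wahlund effect can be illustrated numerically. The sketch below assumes each subpopulation is itself in Hardy-Weinberg equilibrium and takes hypothetical allele frequencies and relative sizes; it compares the pooled heterozygote frequency with the Hardy-Weinberg expectation and computes F as the variance ratio.

```python
def wahlund(freqs, weights):
    """Pooled vs. Hardy-Weinberg heterozygosity for a subdivided population.

    freqs   : frequency of allele A1 in each subpopulation
    weights : relative subpopulation sizes (must sum to 1)
    """
    x_bar = sum(w * x for w, x in zip(weights, freqs))
    var = sum(w * (x - x_bar) ** 2 for w, x in zip(weights, freqs))
    # Heterozygote frequency actually present in the pooled population
    het_pooled = sum(w * 2 * x * (1 - x) for w, x in zip(weights, freqs))
    # Heterozygote frequency expected if the pooled population were panmictic
    het_hw = 2 * x_bar * (1 - x_bar)
    f = var / (x_bar * (1 - x_bar))
    return het_pooled, het_hw, f


het_pooled, het_hw, f = wahlund([0.2, 0.8], [0.5, 0.5])
print(het_pooled, het_hw, f)          # pooled heterozygosity falls below het_hw
print(1 - het_pooled / het_hw)        # agrees with f up to rounding error
```

With identical allele frequencies in every subpopulation the variance is zero and the deficit disappears.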

Back to inbreeding

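Implications of the Wahlund effect • Population structure exists! • Conversely, when F is negative, the heterozygote frequency is higher than expected under Hardy-Weinberg equilibrium, which suggests heterozygote advantage, i.e. some form of natural selection. Heterozygote advantage and balancing selection are discussed in detail in the later section on natural selection.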

Wright’s Fixation Index (FST) Sewall Wright 1889-1988

F-statistics • Different F-statistics for different scales • Individual (I) • Subpopulation (S) • Total population (T) • Those are the traditional scales, but in theory there is no limit to the number of levels of analysis. • Originally defined for 2 alleles • Extended to >2 alleles as G-statistics

F-statistics • Derived from the inbreeding coefficient • FIS • inbreeding in individuals relative to subpopulation (Weir and Cockerham’s f) • FST • inbreeding among subpopulations relative to total population (Weir and Cockerham’s θ) • FIT • inbreeding among individuals relative to total population (Weir and Cockerham’s F)

Remember that the inbreeding coefficient, F, is related to loss of heterozygosity: $F = 1 - H_O/H_E$ • F-statistics can be expressed in the same way: $F_{IS} = 1 - H_I/H_S$, $F_{ST} = 1 - H_S/H_T$, $F_{IT} = 1 - H_I/H_T$ • $H_I$ = $H_O$ averaged across subpopulations • $H_S$ = $H_E$ averaged across subpopulations • $H_T$ = $H_E$ for the total population

Deficit of heterozygotes • One subpopulation is entirely AA (P(A) = p = 1, P(a) = q = 0); the other is entirely aa (p = 0, q = 1) • $H_S$ = $H_E$ within a subpopulation: $H_S = 1 - \sum p_i^2 = 1 - (1^2 + 0^2) = 0$ in each, so mean $H_S = 0$ • $H_T$ = $H_E$ for the total population, where p = 0.5 and q = 0.5: $H_T = 1 - \sum p_i^2 = 1 - (0.5^2 + 0.5^2) = 0.5$ • $F_{ST} = 1 - H_S/H_T = 1 - 0/0.5 = 1$

Deficit of homozygotes • Both subpopulations consist entirely of Aa individuals: P(A) = p = 0.5, P(a) = q = 0.5 in each • $H_S = 1 - \sum p_i^2 = 1 - (0.5^2 + 0.5^2) = 0.5$ in each, so mean $H_S = 0.5$ • $H_T$ = $H_E$ for the total population, where p = 0.5 and q = 0.5: $H_T = 1 - \sum p_i^2 = 1 - (0.5^2 + 0.5^2) = 0.5$ • $F_{ST} = 1 - H_S/H_T = 1 - 0.5/0.5 = 0$

Both subpopulations contain a mix of AA, Aa, and aa individuals, with P(A) = p = 0.5 and P(a) = q = 0.5 in each • $H_S = 1 - \sum p_i^2 = 1 - (0.5^2 + 0.5^2) = 0.5$ in each, so mean $H_S = 0.5$ • $H_T$ = $H_E$ for the total population, where p = 0.5 and q = 0.5: $H_T = 1 - \sum p_i^2 = 1 - (0.5^2 + 0.5^2) = 0.5$ • $F_{ST} = 1 - H_S/H_T = 1 - 0.5/0.5 = 0$ • FST uses expected heterozygosity, not observed heterozygosity!
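The toy examples can be reproduced with a short function that takes only allele frequencies as input, which makes the point that FST depends on expected rather than observed heterozygosity. This is a sketch that assumes equally sized subpopulations; the function names are illustrative.

```python
def expected_het(freqs):
    """Expected heterozygosity, 1 - sum(p_i^2), for one set of allele frequencies."""
    return 1.0 - sum(p * p for p in freqs)


def fst_from_freqs(subpops):
    """FST = 1 - HS/HT from per-subpopulation allele frequencies.

    subpops: one list of allele frequencies per subpopulation.
    Assumes all subpopulations have the same size (equal weights).
    """
    k = len(subpops)
    hs = sum(expected_het(f) for f in subpops) / k           # mean within-subpop He
    n_alleles = len(subpops[0])
    total = [sum(f[j] for f in subpops) / k for j in range(n_alleles)]
    ht = expected_het(total)                                 # He of the pooled population
    return 1.0 - hs / ht


print(fst_from_freqs([[1.0, 0.0], [0.0, 1.0]]))  # 1.0: fixed for different alleles
print(fst_from_freqs([[0.5, 0.5], [0.5, 0.5]]))  # 0.0: identical allele frequencies
```

The second call gives 0 whether the individuals are all Aa or a mix of AA, Aa, and aa, because genotype composition never enters the calculation.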

F statistics • FIS tells us if there is inbreeding within subpopulations by comparing HI and HS: $F_{IS} = \dfrac{\bar{H}_S - \bar{H}_I}{\bar{H}_S}$ • Bars mean that the values are averages over all the subpopulations that we are considering. • So FIS measures whether there is, on average, a deficit of heterozygotes within subpopulations.

F statistics • FST is the statistic that tells us how differentiated the subpopulations are. Formally, FST tells us if there is a deficit of heterozygotes in the metapopulation due to differentiation among subpopulations: $F_{ST} = \dfrac{H_T - \bar{H}_S}{H_T}$ • Bars mean that the values are averages over all the subpopulations that we are considering.

F statistics • FIT tells us how much population structure has affected the average heterozygosity of individuals within the population: $F_{IT} = \dfrac{H_T - \bar{H}_I}{H_T}$ • Also $(1 - F_{IS})(1 - F_{ST}) = (1 - F_{IT})$.

F-statistics Measure departure from Hardy-Weinberg equilibrium • FIS = departure from HW in local subpopulations • FST = genetic divergence among subpopulations • FIT = total departure from HW including that within and among subpopulations

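Partitioning of structure • Levels: individuals, subpopulations, total population • FIS: individuals within subpopulations (inbreeding) • FST: subpopulations within the total population (Wahlund effect or fragmentation) • FIT: individuals within the total population • $1 - F_{IT} = (1 - F_{ST})(1 - F_{IS})$ • $F_{IT} = F_{IS} + F_{ST} - F_{IS}F_{ST}$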

The three F statistics are related to each other • $F_{ST} = (F_{IT} - F_{IS}) / (1 - F_{IS})$ • FST is never negative (0 ≤ FST ≤ 1) • FIS is frequently positive, and is negative if there is systematic avoidance of inbreeding • FIT is positive unless there are no clear subdivisions and there is avoidance of inbreeding
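The relationship among the three statistics can be verified on a small data set. The sketch below computes HI, HS, and HT from genotype counts at one biallelic locus, assuming equally weighted subpopulations; the counts in the example are hypothetical.

```python
def f_statistics(genotype_counts):
    """FIS, FST, FIT from genotype counts at a biallelic locus.

    genotype_counts: list of (n_hom1, n_het, n_hom2), one tuple per subpopulation.
    Subpopulations are weighted equally (an assumption of this sketch).
    """
    obs_het, exp_het, freqs = [], [], []
    for n_hom1, n_het, n_hom2 in genotype_counts:
        n = n_hom1 + n_het + n_hom2
        p = (2 * n_hom1 + n_het) / (2 * n)
        obs_het.append(n_het / n)          # observed heterozygosity
        exp_het.append(2 * p * (1 - p))    # expected heterozygosity
        freqs.append(p)
    k = len(genotype_counts)
    h_i = sum(obs_het) / k                 # HI: mean observed heterozygosity
    h_s = sum(exp_het) / k                 # HS: mean expected heterozygosity
    p_bar = sum(freqs) / k
    h_t = 2 * p_bar * (1 - p_bar)          # HT: expected heterozygosity, total population
    return 1 - h_i / h_s, 1 - h_s / h_t, 1 - h_i / h_t


fis, fst, fit = f_statistics([(40, 20, 40), (64, 32, 4)])
print(round(fis, 4), round(fst, 4), round(fit, 4))
# Both sides of the identity (1 - FIS)(1 - FST) = 1 - FIT
print((1 - fis) * (1 - fst), 1 - fit)
```

The identity holds algebraically because (HI/HS)(HS/HT) = HI/HT, so any data set with nonzero HS and HT will satisfy it.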

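Extensions • FST as the standardized variance of allele frequencies across subpopulations: $F_{ST} = \dfrac{\mathrm{Var}(q)}{\bar{p}\,\bar{q}}$ • When all subpopulations share the same allele frequency (the total population is in HW proportions), Var(q) = 0, therefore FST = 0 • As Var(q) increases, divergence of subpopulations increases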

Intuitive meaning of FST • The proportion of total genetic variation that is distributed among subpopulations, rather than within subpopulations.

Unbiased estimates of FST • Unbiased estimates of FST were calculated as described by Weir and Hill (2002). • Suppose we have r subpopulations, indexed by i = 1, …, r. We denote the sample allele frequency in subpopulation i as $\tilde{p}_i$ and the average frequency over samples as $\bar{p}$.

The observed mean square for loci within populations is denoted MSG.

The observed mean square between populations is denoted MSP.

FST can then be estimated from MSP, MSG, and $n_c$, where $n_c$ is the average sample size across samples that also incorporates and corrects for the variance in sample size over subpopulations.
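A minimal Python sketch of an estimator of this kind is given below. The specific expressions for MSG, MSP, n_c, and FST are an assumption: they follow the ANOVA-style mean-square form commonly used for biallelic markers and are not taken from the text above. The function and argument names are hypothetical.

```python
def fst_mean_squares(freqs, sizes):
    """Mean-square (ANOVA-style) FST estimate for one biallelic locus.

    freqs : sample allele frequency in each subpopulation
    sizes : sample size n_i of each subpopulation
    Assumed formulas (not confirmed by the source):
      MSG = sum n_i p_i (1 - p_i) / sum (n_i - 1)
      MSP = sum n_i (p_i - p_bar)^2 / (r - 1)
      n_c = (sum n_i - sum n_i^2 / sum n_i) / (r - 1)
      FST = (MSP - MSG) / (MSP + (n_c - 1) MSG)
    """
    r = len(freqs)
    n_total = sum(sizes)
    p_bar = sum(n * p for n, p in zip(sizes, freqs)) / n_total
    msg = sum(n * p * (1 - p) for n, p in zip(sizes, freqs)) / sum(n - 1 for n in sizes)
    msp = sum(n * (p - p_bar) ** 2 for n, p in zip(sizes, freqs)) / (r - 1)
    n_c = (n_total - sum(n * n for n in sizes) / n_total) / (r - 1)
    return (msp - msg) / (msp + (n_c - 1) * msg)


print(fst_mean_squares([1.0, 0.0], [50, 50]))  # 1.0: fixed for different alleles
print(fst_mean_squares([0.5, 0.5], [50, 50]))  # slightly negative: no differentiation
```

Unlike the parametric FST, an estimate of this form can be slightly negative when the samples show no differentiation; such values are usually interpreted as zero.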

Problems with FST • Assumes Infinite Alleles Model (IAM) or K-alleles model with very low mutation rates (not appropriate for microsat data) • All alleles differ equally from each other (magnitude of difference between alleles ignored) • Does not work well with high heterozygosity • Assumes alleles arrive in population via migration rather than mutation

Special version for microsatellites • RST (Slatkin 1995) • Analogue of FST • Assumes Stepwise Mutation Model (mutation model most appropriate for microsats) • Allows for high mutation rates • Allows differences in magnitude between alleles to be accounted for • $R_{ST} = \dfrac{S - S_W}{S}$, where S = average sum of squared differences in allele sizes in the total population, and $S_W$ = average sum within populations
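The idea behind RST can be sketched by comparing squared allele-size differences within populations with those in the pooled sample. This is a simplified illustration, not Slatkin's exact estimator: it uses the plain mean squared difference over all pairs of allele copies, weights populations equally, and omits sample-size corrections. The allele sizes (in repeat units) are hypothetical.

```python
from itertools import combinations


def mean_sq_diff(allele_sizes):
    """Mean squared size difference over all pairs of allele copies."""
    pairs = list(combinations(allele_sizes, 2))
    return sum((a - b) ** 2 for a, b in pairs) / len(pairs)


def rst_sketch(populations):
    """(S - S_W) / S with S from the pooled sample and S_W averaged within populations."""
    s_w = sum(mean_sq_diff(pop) for pop in populations) / len(populations)
    pooled = [size for pop in populations for size in pop]
    s_total = mean_sq_diff(pooled)
    return (s_total - s_w) / s_total


# Two populations whose alleles differ by about ten repeat units
print(rst_sketch([[10, 10, 11, 11], [20, 20, 21, 21]]))
```

Because differences are squared, alleles that are many repeats apart contribute more than alleles one step apart, which is what distinguishes this measure from FST under the stepwise mutation model.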
