
Chapter 5: Phylogenetic Analysis


kamin


  1. Bioinformatics. Chapter 5: Phylogenetic Analysis

  2. 2. Phylogenetic analysis • Analyzes the evolutionary relationships among genes or proteins • Phylogenetic (evolutionary) tree: a tree showing the evolutionary relationships among various biological species or other entities that are believed to have a common ancestor.

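  3. Methods for studying phylogeny • Classical evolutionary biology: compares morphology, physiological structure, and fossils • Molecular evolutionary biology: compares DNA and protein sequences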

  4. An alignment is a hypothesis of positional homology between bases or amino acids. • Easy when there are only substitutions • Difficult when there are also indels • Residues that are lined up in different sequences are considered to share a common ancestry (i.e., they are derived from a common ancestral residue).

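  5. Phylogenetic tree terminology (figure: a rooted tree with tips A to E) • Node • Branch • Terminal node (OTU, operational taxonomic unit): may be a species, a population, or a protein, DNA, or RNA molecule • Internal node / divergence point (HTU, hypothetical taxonomic unit): the possible ancestor of that branch • Ancestral node / root • The same tree in Newick format: ((A, (B,C)), (D, E))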
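The Newick string on this slide can be read into a nested structure with a few lines of code. The following is a minimal sketch, assuming a Newick string without branch lengths or internal labels; the function names `parse_newick` and `leaves` are illustrative and not taken from any library.

```python
def parse_newick(text):
    """Parse a simple Newick string (no branch lengths) into nested tuples."""
    text = text.strip().rstrip(";").replace(" ", "")
    pos = 0

    def parse_node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1  # skip the opening parenthesis
            children = [parse_node()]
            while text[pos] == ",":
                pos += 1  # skip the comma between siblings
                children.append(parse_node())
            if text[pos] != ")":
                raise ValueError("expected ')' at position %d" % pos)
            pos += 1  # skip the closing parenthesis
            return tuple(children)  # an internal node (HTU)
        start = pos
        while pos < len(text) and text[pos] not in "(),":
            pos += 1
        return text[start:pos]  # a terminal node (OTU)

    tree = parse_node()
    if pos != len(text):
        raise ValueError("unexpected trailing text")
    return tree


def leaves(node):
    """List the terminal nodes below a node, left to right."""
    if isinstance(node, str):
        return [node]
    return [name for child in node for name in leaves(child)]


tree = parse_newick("((A, (B,C)), (D, E))")
print(tree)
print(leaves(tree))
```

Each tuple corresponds to an internal node and each string to a terminal node, so the nesting mirrors the clades of the tree.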

  6. Phylogenetic tree terminology • A clade is a group of organisms that includes an ancestor and all descendants of that ancestor. • Cladogram: branch lengths have no meaning • Phylogram: branch lengths represent genetic change • Ultrametric tree: branch lengths represent time (Figure: the same four taxa, A to D, drawn as each of the three tree types.)

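  7. Phylogenetic tree terminology: rooted tree vs. unrooted tree • Two major ways to root trees: by outgroup, or by midpoint (distance) • Midpoint rooting example (figure: an unrooted tree of A, B, C, and D with branch lengths 10, 3, 2, 2, and 5): d(A,D) = 10 + 3 + 5 = 18, midpoint = 18 / 2 = 9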
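The midpoint calculation on this slide can be sketched in code. The exact topology of the figure is not recoverable from the transcript, so the tree below is an assumption: A and B attach to one internal node (called X here), C and D to another (Y), with branch lengths chosen so that d(A,D) = 10 + 3 + 5 = 18 as on the slide. All names are illustrative.

```python
from itertools import combinations

# Assumed unrooted tree (hypothetical topology): A, B join at X; C, D join at Y.
EDGES = {("A", "X"): 10, ("B", "X"): 2, ("X", "Y"): 3,
         ("C", "Y"): 2, ("D", "Y"): 5}


def build_graph(edges):
    """Turn the edge list into a symmetric adjacency dictionary."""
    graph = {}
    for (u, v), w in edges.items():
        graph.setdefault(u, {})[v] = w
        graph.setdefault(v, {})[u] = w
    return graph


def path_between(graph, start, goal, seen=None):
    """Return the unique path between two nodes of a tree."""
    seen = seen or {start}
    if start == goal:
        return [start]
    for nxt in graph[start]:
        if nxt not in seen:
            rest = path_between(graph, nxt, goal, seen | {nxt})
            if rest:
                return [start] + rest
    return None


def midpoint_root(edges):
    """Find the branch holding the midpoint of the longest tip-to-tip path."""
    graph = build_graph(edges)
    tips = [node for node, nbrs in graph.items() if len(nbrs) == 1]

    def length(path):
        return sum(graph[a][b] for a, b in zip(path, path[1:]))

    longest = max((path_between(graph, a, b)
                   for a, b in combinations(tips, 2)), key=length)
    half = length(longest) / 2
    walked = 0
    for a, b in zip(longest, longest[1:]):
        if walked + graph[a][b] >= half:
            # The root sits on branch (a, b), half - walked away from a.
            return (a, b), half - walked, length(longest)
        walked += graph[a][b]


branch, offset, total = midpoint_root(EDGES)
print(branch, offset, total)
```

Under the assumed topology the longest path runs from A to D, and the root falls on the branch leading to A, 9 units from the tip.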

  8. Rooted tree vs. unrooted tree (figure: the same animal, fungus, plant, and bacterium tips drawn as a rooted tree and as an unrooted tree; monophyletic groups are marked on the rooted tree.)

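  9. How to root a tree? Choose an outgroup • Choose one or more sequences known to be distantly related to the sequences under analysis as the outgroup • The outgroup helps to locate the root of the tree • Outgroup sequences must be homologous to the other sequences in the tree, but they must differ from those sequences more than those sequences differ from one another. (Figure: bacteria used as the outgroup for archaea and eukaryotes.)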

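  10. Steps in building a phylogenetic tree • Multiple sequence alignment (automatic alignment, manual correction) • Choose a tree-building method (and substitution model): distance methods (UPGMA; neighbor-joining, NJ; minimum evolution), maximum parsimony (MP), maximum likelihood (ML), or Bayesian inference • Build the tree • Evaluate the tree by statistical analysis: bootstrap, likelihood ratio test, ...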

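  11. Distance methods • Distance methods, also called distance-matrix methods, first compare the sequences with one another and, under certain assumptions (an evolutionary distance model), derive the evolutionary distances between taxa to build an evolutionary distance matrix. The tree is then built from the distance relationships in this matrix. • Step 1: compute the distances between sequences and build the distance matrix • Step 2: build the tree from the distance matrix (Figure: example tree of Cat, Rat, Dog, and Cow with branch lengths 2, 1, 1, 2, and 4.)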

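  12. Step 1. Compute the distances between sequences and build the distance matrix • Align the sequences and remove gaps • Choose a substitution model: uncorrected “p” distance (= observed percent sequence difference), or Kimura 2-parameter distance (an estimate of the true number of substitutions between taxa)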
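The two distances named on this slide can be computed as in the following minimal sketch. It assumes aligned DNA sequences that contain only A, C, G, T, and "-" for gaps, and it uses the standard Kimura 2-parameter formula d = -1/2 ln((1 - 2P - Q) sqrt(1 - 2Q)), where P and Q are the proportions of transitions and transversions. The function names are illustrative.

```python
import math

PURINES = {"A", "G"}


def count_differences(seq1, seq2):
    """Return (compared sites, transitions, transversions), skipping gapped columns."""
    sites = transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a == "-" or b == "-":
            continue  # remove gaps before counting
        sites += 1
        if a == b:
            continue
        if (a in PURINES) == (b in PURINES):
            transitions += 1  # purine <-> purine or pyrimidine <-> pyrimidine
        else:
            transversions += 1  # purine <-> pyrimidine
    return sites, transitions, transversions


def p_distance(seq1, seq2):
    """Uncorrected p distance: the observed proportion of differing sites."""
    sites, ts, tv = count_differences(seq1, seq2)
    return (ts + tv) / sites


def k2p_distance(seq1, seq2):
    """Kimura 2-parameter distance."""
    sites, ts, tv = count_differences(seq1, seq2)
    P, Q = ts / sites, tv / sites
    return -0.5 * math.log((1 - 2 * P - Q) * math.sqrt(1 - 2 * Q))


# Hypothetical sequences: one transition (A/G) and one transversion (A/C).
s1 = "AAAAAAAAAA"
s2 = "GAAAAAAAAC"
print(p_distance(s1, s2), k2p_distance(s1, s2))
```

The corrected distance is larger than the p distance because it allows for substitutions that are no longer visible in the alignment. Applying either function to every pair of sequences fills the distance matrix.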

  13. Step 2. Build the tree from the matrix. There are many methods for building a tree from evolutionary distances; common ones are: 1. Unweighted Pair Group Method with Arithmetic mean (UPGMA) 2. Neighbor-Joining method (NJ) 3. Minimum Evolution method (ME)
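Of the three methods listed, UPGMA is the simplest to write down: repeatedly join the two closest clusters and average their distances to the remaining clusters, weighted by cluster size. The sketch below is a minimal illustration under that description; the taxon names and the distance matrix are hypothetical.

```python
from itertools import combinations


def upgma(names, matrix):
    """Cluster taxa by UPGMA.

    names: list of taxon labels; matrix: symmetric list of lists of distances.
    Returns (tree, height): nested tuples of labels and the height of the root.
    """
    # Active clusters: id -> (subtree, number of tips, node height).
    clusters = {i: (name, 1, 0.0) for i, name in enumerate(names)}
    # Distances keyed by (smaller id, larger id).
    dist = {(i, j): matrix[i][j]
            for i, j in combinations(range(len(names)), 2)}
    next_id = len(names)
    while len(clusters) > 1:
        i, j = min(dist, key=dist.get)  # closest pair of clusters
        d_ij = dist.pop((i, j))
        tree_i, n_i, _ = clusters.pop(i)
        tree_j, n_j, _ = clusters.pop(j)
        for k in clusters:
            d_ik = dist.pop((min(i, k), max(i, k)))
            d_jk = dist.pop((min(j, k), max(j, k)))
            # Size-weighted mean keeps every original tip equally weighted.
            dist[(k, next_id)] = (n_i * d_ik + n_j * d_jk) / (n_i + n_j)
        clusters[next_id] = ((tree_i, tree_j), n_i + n_j, d_ij / 2)
        next_id += 1
    (tree, _, height), = clusters.values()
    return tree, height


# Hypothetical distance matrix for four taxa.
names = ["A", "B", "C", "D"]
matrix = [[0, 2, 6, 10],
          [2, 0, 6, 10],
          [6, 6, 0, 10],
          [10, 10, 10, 0]]
print(upgma(names, matrix))
```

UPGMA assumes a constant rate of change along all lineages, so it yields an ultrametric, rooted tree; NJ and ME do not make that assumption.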

  14. Maximum parsimony (MP): find the tree that explains the observed sequences with a minimal number of substitutions. Maximum parsimony originated in the study of morphological characters and has since been extended to the evolutionary analysis of molecular sequences. Its theoretical basis is the philosophical principle of Ockham's razor: every possible topology is evaluated, and the topology that requires the fewest substitutions is taken as the optimal tree.

  15. MP tree-building procedure (positions 1, 2, and 3 of four sequences) • Position 1: grouping (1,2) needs 1 change; (1,3) or (1,4) needs 2 changes • Position 2: (1,3): 1 change; (1,2) or (1,4): 2 changes • Position 3: (1,2): 1 change; (1,3) or (1,4): 2 changes • If 1 and 2 are grouped, a total of four changes is needed. If 1 and 3 are grouped, a total of five changes is needed. If 1 and 4 are grouped, a total of six changes is needed.

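  16. MP tree-building steps (figure: the three candidate trees need 6, 5, and 4 changes; the tree that needs 4 changes is marked BEST.)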
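The change counts in the MP example can be reproduced with Fitch's algorithm, a standard way to count the minimum number of changes on a fixed tree (the slides do not name a counting method, so this choice is an assumption). The original sequences are not in the transcript; the four three-base sequences below are hypothetical, chosen only so that the per-position counts match the slide.

```python
def fitch_changes(tree, column):
    """Return (possible ancestral bases, minimum changes) for one column.

    tree: nested 2-tuples of taxon names; column: dict taxon -> base.
    """
    if isinstance(tree, str):
        return {column[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_changes(left, column)
    right_set, right_cost = fitch_changes(right, column)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost  # no change needed here
    return left_set | right_set, left_cost + right_cost + 1  # one change


def parsimony_score(tree, seqs):
    """Sum the minimum number of changes over all alignment positions."""
    length = len(next(iter(seqs.values())))
    return sum(
        fitch_changes(tree, {name: seq[i] for name, seq in seqs.items()})[1]
        for i in range(length)
    )


# Hypothetical sequences consistent with the counts on the slide.
SEQS = {"1": "AAC", "2": "AGC", "3": "GAT", "4": "GGT"}
TREES = {
    "(1,2)": (("1", "2"), ("3", "4")),
    "(1,3)": (("1", "3"), ("2", "4")),
    "(1,4)": (("1", "4"), ("2", "3")),
}
scores = {label: parsimony_score(tree, SEQS) for label, tree in TREES.items()}
best = min(scores, key=scores.get)
print(scores, best)
```

With four sequences there are only three unrooted topologies, so an exhaustive comparison is trivial; the number of topologies grows very quickly with more sequences, which is why heuristic searches are used in practice.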

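  17. Maximum likelihood (ML). Maximum likelihood was first applied to the analysis of gene frequency data. In principle, a specific substitution model is chosen to analyze the given set of sequence data, the likelihood of each topology is maximized, and the topology with the highest likelihood is then selected as the optimal tree.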

  18. ML tree-building procedure (figure)

  19. Inferring the maximum likelihood tree • Pick an evolutionary model • For each position, generate all possible tree structures • Based on the evolutionary model, calculate the likelihood of these trees and sum them to get the column likelihood for each OTU cluster • Calculate the tree likelihood by multiplying the likelihoods for each position • Choose the tree with the greatest likelihood
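The steps above can be sketched for a small case. This is an illustration under several assumptions that are not on the slide: the evolutionary model is Jukes-Cantor (JC69), every branch has the same fixed length of 0.1 substitutions per site (real ML programs optimize branch lengths), and the four sequences are hypothetical. For each column the code sums over all possible ancestral bases (Felsenstein's pruning algorithm), then combines the columns by adding log-likelihoods, which is equivalent to multiplying the likelihoods.

```python
import math

BASES = "ACGT"


def jc69_prob(a, b, t):
    """JC69 probability that base a is replaced by base b over branch length t."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay


def conditional(node, column):
    """Likelihood of the tips below node, for each possible base at node."""
    if isinstance(node, str):  # a tip: only the observed base is possible
        return {s: 1.0 if s == column[node] else 0.0 for s in BASES}
    result = {s: 1.0 for s in BASES}
    for child, t in node:  # an internal node: tuple of (child, branch length)
        below = conditional(child, column)
        for s in BASES:
            result[s] *= sum(jc69_prob(s, x, t) * below[x] for x in BASES)
    return result


def log_likelihood(tree, seqs):
    """Sum of per-column log-likelihoods (equal base frequencies at the root)."""
    n_sites = len(next(iter(seqs.values())))
    total = 0.0
    for i in range(n_sites):
        column = {name: seq[i] for name, seq in seqs.items()}
        root = conditional(tree, column)
        total += math.log(sum(0.25 * root[s] for s in BASES))
    return total


def quartet(pair_a, pair_b, t=0.1):
    """Four-tip tree with every branch of (assumed) length t."""
    left = tuple((name, t) for name in pair_a)
    right = tuple((name, t) for name in pair_b)
    return ((left, t), (right, t))


# Hypothetical data and the three candidate topologies.
SEQS = {"1": "AAC", "2": "AGC", "3": "GAT", "4": "GGT"}
CANDIDATES = {
    "(1,2)": quartet(("1", "2"), ("3", "4")),
    "(1,3)": quartet(("1", "3"), ("2", "4")),
    "(1,4)": quartet(("1", "4"), ("2", "3")),
}
scores = {label: log_likelihood(tree, SEQS) for label, tree in CANDIDATES.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

Working with log-likelihoods avoids numerical underflow when many columns are multiplied together.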

  20. A newer method for building phylogenetic trees: Bayesian inference (BI). Holder & Lewis (2003) Nature Reviews Genetics 4, 275-284 • Maximum likelihood: what is the probability of seeing the observed data (D) given a model/theory (T)? Pr(D|T) • Bayesian inference: what is the probability that the model/theory is correct given the observed data? Pr(T|D) • Advantages of BI over ML: speed; no need for bootstrapping
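The two probabilities on this slide are connected by Bayes' theorem. In the notation of the slide, with Pr(T) denoting the prior probability of a tree and Pr(D) the marginal probability of the data (both symbols are introduced here and do not appear on the slide):

```latex
\Pr(T \mid D) = \frac{\Pr(D \mid T)\,\Pr(T)}{\Pr(D)},
\qquad
\Pr(D) = \sum_{T'} \Pr(D \mid T')\,\Pr(T')
```

The likelihood Pr(D|T) is the quantity that ML maximizes; Bayesian inference weights it by the prior to obtain the posterior Pr(T|D).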

  21. Comparison of Methods

  22. Choosing a Method for Phylogenetic Prediction Molecular Biology and Evolution 2005 22(3):792-802 Bioinformatics: Sequence and Genome Analysis, 2nd edition, by David W. Mount. p254 http://cshprotocols.cshlp.org/cgi/content/full/2008/5/pdb.ip49

  23. Assessing tree reliability Phylogenetic reconstruction is a problem of statistical inference. One must assess the reliability of the inferred phylogeny and its component parts. Questions: (1) how reliable is the tree? (2) which parts of the tree are reliable? (3) is this tree significantly better than another one?

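  24. Assessing tree reliability: the bootstrap method (bootstrapping). A statistical technique that uses intensive random resampling of data to estimate a statistic whose underlying distribution is unknown. • Randomly draw columns, with replacement, from the multiple sequence alignment to form a new alignment of the same length • Repeat this process to obtain many new alignments • Build a tree from each new alignment and check whether these trees differ from the original tree, to evaluate the reliability of the tree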

  25. The Bootstrap • Computational method to estimate the confidence level of a certain phylogenetic tree. Use many replicates (between 100 and 1000).
      Sample (columns 0123456789)
        rat       GAGGCTTATC
        human     GTGGCTTATC
        turtle    GTGCCCTATG
        fruitfly  CTCGCCTTTG
        oak       ATCGCTCTTG
        duckweed  ATCCCTCCGG
      Pseudo sample 1 (columns 0011222345)
        rat       GGAAGGGGCT
        human     GGTTGGGGCT
        turtle    GGTTGGGCCC
        fruitfly  CCTTCCCGCC
        oak       AATTCCCGCT
        duckweed  AATTCCCCCT
      Pseudo sample 2 (columns 4455567778)
        rat       CCTTTTAAAT
        human     CCTTTTAAAT
        turtle    CCCCCTAAAT
        fruitfly  CCCCCTTTTT
        oak       CCTTTCTTTT
        duckweed  CCTTTCCCCG
      (Figure: inferred tree with tips rat, human, turtle, fruit fly, oak, and duckweed.)
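The resampling shown on this slide can be sketched as follows. The alignment is the sample from the slide; the function names, the fixed random seed, and the sorting of the drawn column indices (done only to mirror the way the slide lists them) are illustrative choices.

```python
import random

SAMPLE = {
    "rat":      "GAGGCTTATC",
    "human":    "GTGGCTTATC",
    "turtle":   "GTGCCCTATG",
    "fruitfly": "CTCGCCTTTG",
    "oak":      "ATCGCTCTTG",
    "duckweed": "ATCCCTCCGG",
}


def resample(alignment, columns):
    """Build a pseudo sample from the given column indices."""
    return {name: "".join(seq[c] for c in columns)
            for name, seq in alignment.items()}


def bootstrap_columns(length, rng):
    """Draw `length` column indices with replacement."""
    return sorted(rng.randrange(length) for _ in range(length))


def bootstrap_replicates(alignment, n_replicates, seed=0):
    """Generate pseudo samples of the same length as the alignment."""
    rng = random.Random(seed)
    length = len(next(iter(alignment.values())))
    return [resample(alignment, bootstrap_columns(length, rng))
            for _ in range(n_replicates)]


# Reproduce pseudo sample 1 from the slide (columns 0011222345).
pseudo1 = resample(SAMPLE, [0, 0, 1, 1, 2, 2, 2, 3, 4, 5])
print(pseudo1["rat"])  # GGAAGGGGCT

replicates = bootstrap_replicates(SAMPLE, 100)
print(len(replicates))
```

A tree is then built from each replicate with the same method as the original tree, and the support for a branch is the fraction of replicate trees that contain it.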

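  26. Bootstrap test workflow • Bootstrapping doesn’t really assess the accuracy of a tree; it only indicates the consistency of the data • For ML, bootstrapping is too time-consuming; the aLRT (approximate likelihood-ratio test) can be used instead to test the reliability of the tree. Anisimova & Gascuel (2006) Syst. Biol. 55(4):539-552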

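  27. Multiple sequence alignment (MSA) is the key step in building a molecular phylogenetic tree • An MSA program will align any set of sequences, so choosing which sequences to align is very important! • The sequences used to build a phylogenetic tree must be homologous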

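  28. Building a molecular phylogenetic tree (ClustalW): the EBI ClustalW2-phylogeny analysis page http://www.ebi.ac.uk/Tools/phylogeny/clustalw2_phylogeny/ • Enter the aligned sequences (or upload an alignment file) • The cladogram tree is shown at the bottom of the page • Click “Show as Phylogram Tree” to display the phylogram tree • Not recommended: it offers only distance-based tree building and performs no evaluation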

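  29. Tree viewers • TreeView, software for editing and printing phylogenetic trees (at http://taxonomy.zoology.gla.ac.uk/rod/treeview.html) • On the EBI ClustalW2-phylogeny analysis page, enter the aligned sequences (or upload an alignment file) • Download the “Phylip tree file” (.ph file) • Open that file in TreeView • The tree can be displayed in different formats (1, 2, 3)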

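  30. Molecular evolution analysis software
      • PHYLIP http://evolution.genetics.washington.edu/phylip.html : free, integrated evolutionary analysis tools
      • PAUP http://paup.csit.fsu.edu/ : commercial software, integrated evolutionary analysis tools
      • MEGA http://www.megasoftware.net/ : free, graphical, integrated evolutionary analysis tool
      • PHYML http://atgc.lirmm.fr/phyml/ : the fastest ML tree-building tool
      • PAML http://abacus.gene.ucl.ac.uk/software/paml.html : ML tree-building tool
      • Tree-puzzle http://www.tree-puzzle.de/ : a relatively fast ML tree-building tool
      • MrBayes http://mrbayes.csit.fsu.edu/ : tree-building tool based on the Bayesian method
      • More tools: http://evolution.gs.washington.edu/phylip/software.html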

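  31. MEGA http://www.megasoftware.net/ • Tree-building methods: provides maximum parsimony (MP), maximum likelihood (ML), and distance methods. The distance methods comprise three algorithms: neighbor-joining (NJ), minimum evolution (ME), and UPGMA. • Advantages: graphical interface; integrates sequence search, alignment, and tree building; detailed help files; free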

  32. Buffon (1707-1788) Natural History of Animals

  33. Archaeopteryx fossil; reconstruction

  34. 2.7% difference

  35. xl, Xenopus laevis; xt, Xenopus tropicalis; gg, Gallus gallus; rn, Rattus norvegicus; mm, Mus musculus; hs, Homo sapiens. BMC Evolutionary Biology 2007 7:164

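  36. Degree of divergence vs. total number of substitutions (figure: an original sequence and its descendant sequence) • Because multiple substitutions occur at the same site, the observed number of differences is smaller than the actual number of substitutions: 13 mutations = 3 differences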

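  37. Substitution model • At any point in evolutionary time, the nucleotide at any site may undergo back mutation or parallel mutation. • To estimate the correct divergence time (the expected number of substitutions), the observed number of substitutions must be corrected.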
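One way to see what such a correction does is the Jukes-Cantor (JC69) model, the simplest substitution model. The slides do not name it at this point, so it is used here only as an illustration. It converts the observed proportion of differing sites p into an estimated number of substitutions per site, d = -3/4 ln(1 - 4p/3).

```python
import math


def jc69_distance(p):
    """Jukes-Cantor corrected distance from the observed proportion of differences p."""
    if not 0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for the JC69 correction")
    return -0.75 * math.log(1 - 4 * p / 3)


# The correction grows as the sequences become more divergent.
for p in (0.05, 0.3, 0.6):
    print(p, round(jc69_distance(p), 4))
```

For small p the corrected distance is close to p, because few sites have been hit more than once; as p approaches 0.75 the sequences look random with respect to each other and the estimate grows without bound.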

  38. Substitution models
