
Getting the Best out of Multiple Sequence Alignment Methods in the Genomic Era

Getting the best out of multiple sequence alignment methods in the genomic era. Cédric Notredame, Comparative Bioinformatics Group, Bioinformatics and Genomics Program.

Presentation Transcript


  1. Getting the best out of multiple sequence alignment methods in the genomic era (title slide in Chinese) Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics Program

  2. Getting the best out of multiple sequence alignment methods in the genomic era Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics Program

  3. Which Tool for Which Sequence ?

  4. In-SilVo Biology • In Silico Biology • Making Sense of digital data • In Vivo Biology • Recording data in a living Cell • In-SilVo Biology • Connect In-Vitro and In-Vivo • In-Vivo: High-throughput recording • In-Silico: High-throughput analysis

  5. Is it Possible to Compare all Types of Sequences ? • Non Transcribed World • Genes/Full Genomes • Lagan, TBA • Promoter Regions • Meta-Aligner • Motifs Finders • Nucleosome • ??? • Multiple Genome Aligners • Not Very Accurate • Very Fast • Deal with rearrangements

  6. Multiple Genome Alignments and Re-sequencing • Before • Re-sequence Human Genomes • Map the Reads onto the reference genome • Now • Re-sequence • Assemble • Align • Non-trivial with very large datasets

  7. Is it Possible to Compare all Types of Sequences ? • RNA Comparison • Less Accurate than Proteins • Secondary Structures • ncRNA World • Sankoff • Time O(L^(3n)) • Space O(L^(2n)) • Consan • R-Coffee

  8. Is it Possible to Compare all Types of Sequences ? • Protein Comparisons • Very Accurate • 3D-Structure Improves it • Protein Aligners • ClustalW • T-Coffee • 3D-Coffee

  9. What Changes with 1000 Genomes? Phylogeny

  10. Phylogeny Vs Function • Function • Low Level => Biochemistry => Protein Domains • High Level => Metabolic Pathway => Orthology • Orthology • Phylogenetic Analysis • Phylogenetic Analysis => Accurate Alignments

  11. Figure labels: apparent one2one • one2many • one2one • many2many • one2one • Duplication node • Speciation node or leaf (Adapted from “Going beyond AGC and T”, E. Birney)

  12. Using The Tree • Correct Tree => Correct Orthologous Assignment => Correct Functional Prediction

  13. The Alignment that Hides The Forest…

  14. Phylogenetic Trees and Multiple Sequence Alignments

  15. Phylogenetic Trees and Multiple Sequence Alignments

  16. Genomic Era: The Goal • 10,000 Sequences: interspecies • 1 Billion: Re-sequencing • Incorporation of ALL experimental Data • Structure, Genomic, ChIP-chip, ChIP-seq… • Alignments suitable for all applications of comparative genomics • Homology Modeling (function) • Functional Analysis • Phylogenetic Reconstruction • 3D-Modelling • Accurate Alignments for ALL kinds of data • Non Transcribed DNA • Transcribed DNA • Translated DNA

  17. Genomic Era Challenges • Accuracy • Proteins: 30% is the limit • DNA/RNA: 70% is the limit • Scale • With too many sequences, algorithms lose accuracy • Data Integration • Structure • Homology • Genomic Structure • Function • Proteomics • Methods • Wealth of alternative methods • Poorly Characterized

  18. Consistency and Data Integration • Most methods rely on the progressive algorithm • Consistency based methods have been designed as an extension • Consistency based alignment methods have been designed to: • Better extract the signal contained in the data • Integrate/Confront existing methods • Integrate/Confront heterogeneous types of Information

  19. The Progressive Alignment Algorithm

  20. T-Coffee and Consistency…
      Sequences:
      SeqA GARFIELD THE LAST FAT CAT
      SeqB GARFIELD THE FAST CAT
      SeqC GARFIELD THE VERY FAST CAT
      SeqD THE FAT CAT
      Alignment:
      SeqA GARFIELD THE LAST FA-T CAT
      SeqB GARFIELD THE FAST CA-T ---
      SeqC GARFIELD THE VERY FAST CAT
      SeqD -------- THE ---- FA-T CAT

  21. T-Coffee and Consistency…
      Prim. Weight = 88
      SeqA GARFIELD THE LAST FAT CAT
      SeqB GARFIELD THE FAST CAT ---
      Prim. Weight = 77
      SeqA GARFIELD THE LAST FA-T CAT
      SeqC GARFIELD THE VERY FAST CAT
      Prim. Weight = 100
      SeqA GARFIELD THE LAST FAT CAT
      SeqD -------- THE ---- FAT CAT
      Prim. Weight = 100
      SeqB GARFIELD THE ---- FAST CAT
      SeqC GARFIELD THE VERY FAST CAT
      Prim. Weight = 100
      SeqC GARFIELD THE VERY FAST CAT
      SeqD -------- THE ---- FA-T CAT
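
The primary weights on this slide can be read as the percent identity of each pairwise alignment, which is how T-Coffee weights its primary library. The following is a minimal Python sketch under stated assumptions: the two rows are already aligned, identity is counted only over columns where neither row has a gap (the exact normalisation is an assumption), and the function name `percent_identity` is hypothetical rather than part of T-Coffee.

```python
def percent_identity(row_a: str, row_b: str, gap: str = "-") -> float:
    """Percent identity over columns where both aligned rows hold a residue."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    # Keep only the columns in which neither sequence has a gap.
    paired = [(a, b) for a, b in zip(row_a, row_b) if a != gap and b != gap]
    if not paired:
        return 0.0
    matches = sum(1 for a, b in paired if a == b)
    return 100.0 * matches / len(paired)


if __name__ == "__main__":
    # Toy aligned pair; the slide's exact columns were lost in extraction,
    # so this is not expected to reproduce the weights shown above.
    print(percent_identity("THE-FAT-CAT", "THEFAST-CAT"))
```

Every residue pair taken from one pairwise alignment would then inherit that alignment's weight.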

  22. T-Coffee and Consistency…
      Primary library:
      Prim. Weight = 88
      SeqA GARFIELD THE LAST FAT CAT
      SeqB GARFIELD THE FAST CAT ---
      Prim. Weight = 77
      SeqA GARFIELD THE LAST FA-T CAT
      SeqC GARFIELD THE VERY FAST CAT
      Prim. Weight = 100
      SeqA GARFIELD THE LAST FAT CAT
      SeqD -------- THE ---- FAT CAT
      Prim. Weight = 100
      SeqB GARFIELD THE ---- FAST CAT
      SeqC GARFIELD THE VERY FAST CAT
      Prim. Weight = 100
      SeqC GARFIELD THE VERY FAST CAT
      SeqD -------- THE ---- FA-T CAT
      SeqA and SeqB aligned directly and through each third sequence:
      Weight = 88
      SeqA GARFIELD THE LAST FAT CAT
      SeqB GARFIELD THE FAST CAT ---
      Weight = 77
      SeqA GARFIELD THE LAST FA-T CAT
      SeqC GARFIELD THE VERY FAST CAT
      SeqB GARFIELD THE ---- FAST CAT
      Weight = 100
      SeqA GARFIELD THE LAST FA-T CAT
      SeqD -------- THE ---- FA-T CAT
      SeqB GARFIELD THE ---- FAST CAT

  23. T-Coffee and Consistency…
      Weight = 88
      SeqA GARFIELD THE LAST FAT CAT
      SeqB GARFIELD THE FAST CAT ---
      Weight = 77
      SeqA GARFIELD THE LAST FA-T CAT
      SeqC GARFIELD THE VERY FAST CAT
      SeqB GARFIELD THE ---- FAST CAT
      Weight = 100
      SeqA GARFIELD THE LAST FA-T CAT
      SeqD -------- THE ---- FA-T CAT
      SeqB GARFIELD THE ---- FAST CAT

  24. T-Coffee and Consistency…

  25. T-Coffee and Consistency…

  26. Methods Scalability Data

  27. A Brief History of Consistency A Long Chain of Small Contributions…

  28. Consistency Based Algorithms • Gotoh (1990) • Iterative strategy using consistency • Martin Vingron (1991) • Dot Matrices Multiplications • Accurate but too stringent • Dialign (1996, Morgenstern) • Consistency • Agglomerative Assembly • T-Coffee (2000, Notredame) • Consistency • Progressive algorithm • ProbCons (2004, Do) • T-Coffee with a Bayesian Treatment • Biphasic Gap Penalty • AMAP (2007, Schwartz) • ProbCons Consistency • Replace Progressive alignment with simulated Annealing • Hard to distinguish from ProbCons • FSA (2009, Pachter) • AMAP with automated parameter estimation • Hard to distinguish from ProbCons

  29. Choosing the right modeling method: M-Coffee

  30. Combining Many MSAs into ONE • ClustalW • MAFFT • T-Coffee • MUSCLE • ???????
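
One way to picture combining several MSAs into one is to count, for every pair of residues, how many of the input alignments place them in the same column. The Python sketch below does only that counting step, under stated assumptions: each MSA is a dictionary from sequence name to gapped row, all MSAs cover the same sequences, and the names `aligned_pairs` and `combine_msas` are hypothetical. It does not build the final consensus alignment.

```python
from collections import Counter
from itertools import combinations


def aligned_pairs(msa, gap="-"):
    """Yield ((name_a, idx_a), (name_b, idx_b)) for residues sharing a column."""
    names = sorted(msa)
    lengths = {len(msa[name]) for name in names}
    if len(lengths) > 1:
        raise ValueError("all rows of an MSA must have the same length")
    n_columns = lengths.pop() if lengths else 0
    position = {name: 0 for name in names}  # next ungapped index per sequence
    for col in range(n_columns):
        present = []
        for name in names:
            if msa[name][col] != gap:
                present.append((name, position[name]))
                position[name] += 1
        yield from combinations(present, 2)


def combine_msas(msas):
    """Count how many input alignments support each aligned residue pair."""
    support = Counter()
    for msa in msas:
        support.update(aligned_pairs(msa))
    return support


if __name__ == "__main__":
    # Two toy alignments standing in for the outputs of two different aligners.
    msa_1 = {"s1": "AC-T", "s2": "ACGT"}
    msa_2 = {"s1": "A-CT", "s2": "ACGT"}
    for pair, count in sorted(combine_msas([msa_1, msa_2]).items()):
        print(pair, count)
```

Pairs supported by every input alignment get the highest count, so regions where the aligners agree dominate any alignment later built from these counts.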

  31. Consistency and Accuracy

  32. Integrating New Types of Data: Template Based Sequence Alignments

  33. Figure labels: TARGET (sequences) • Templates (Experimental Data …) • Template-Sequence Alignment • Template Aligner • Template Alignment • Template based Alignment of the Sequences • Primary Library

  34. Exploring The Template World

  35. Exploring The Template World

  36. 3D-Coffee/Expresso: Incorporating Structural Information

  37. Expresso: Finding the Right Structure • Sources • BLAST • Templates • SAP • Template Alignment • Source Template Alignment • Library • Remove Templates

  38. PSI-Coffee: Homology Extension

  39. Exploring The Template World

  40. What is Homology Extension ? • Simple scoring schemes result in alignment ambiguities (figure: one L facing two candidate L positions)

  41. What is Homology Extension ? (figure: the L residues replaced by profile columns, Profile 1 and Profile 2, containing L, I and V)

  42. What is Homology Extension ? (figure: the L residues replaced by profile columns, Profile 1 and Profile 2, containing L, I and V)
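
The idea behind these slides can be sketched numerically: two positions that both show an L look identical to a simple residue-to-residue score, but they differ once each sequence is replaced by a column of homologues. The Python sketch below scores two profile columns as the expected match probability between their residue frequencies. The identity scoring function and the names `column_frequencies` and `profile_column_score` are assumptions for illustration; the slides do not state the scoring scheme PSI-Coffee uses.

```python
from collections import Counter


def column_frequencies(column, gap="-"):
    """Relative residue frequencies in one profile column, ignoring gaps."""
    residues = [r for r in column if r != gap]
    if not residues:
        return {}
    counts = Counter(residues)
    return {residue: n / len(residues) for residue, n in counts.items()}


def identity(a, b):
    """Toy substitution score (an assumption): 1 for a match, 0 otherwise."""
    return 1.0 if a == b else 0.0


def profile_column_score(col_a, col_b, score=identity):
    """Expected substitution score between two profile columns."""
    freq_a = column_frequencies(col_a)
    freq_b = column_frequencies(col_b)
    return sum(pa * pb * score(a, b)
               for a, pa in freq_a.items()
               for b, pb in freq_b.items())


if __name__ == "__main__":
    query = "LLLL"        # homologues of the query position: strictly conserved L
    candidate_1 = "LLLL"  # candidate position that is also strictly conserved
    candidate_2 = "LIVL"  # candidate position that merely tolerates L
    print(profile_column_score(query, candidate_1))
    print(profile_column_score(query, candidate_2))
```

The first candidate scores higher than the second, which is the kind of ambiguity that single-sequence scoring cannot resolve.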

  43. PSI-Coffee: Homology Extension Sources BLAST BLAST Profile Aligner Templates Templates Template Alignment Source Template Alignment Library Remove Templates

  44. Benchmarks

  45. Score: fraction of correct columns when compared with a structure based reference (BB11 of BaliBase).

  46. Consistency Score: fraction of correct columns when compared with a structure based reference (BB11 of BaliBase).

  47. Homology Extension Score: fraction of correct columns when compared with a structure based reference (BB11 of BaliBase).

  48. Structural Extension Score: fraction of correct columns when compared with a structure based reference (BB11 of BaliBase).
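
The score used on these benchmark slides, the fraction of correct columns, can be sketched as follows. This is a minimal Python illustration under stated assumptions: both alignments contain the same sequences, every non-empty reference column is counted (benchmark scorers may restrict the comparison to selected reference regions), and the function names `columns` and `column_score` are hypothetical.

```python
def columns(msa, gap="-"):
    """Return the set of columns, each as a tuple of residue indices (None = gap)."""
    names = sorted(msa)
    if len({len(msa[name]) for name in names}) > 1:
        raise ValueError("all rows of an MSA must have the same length")
    n_columns = len(msa[names[0]]) if names else 0
    position = {name: 0 for name in names}  # next ungapped index per sequence
    result = set()
    for col in range(n_columns):
        signature = []
        for name in names:
            if msa[name][col] == gap:
                signature.append(None)
            else:
                signature.append(position[name])
                position[name] += 1
        if any(index is not None for index in signature):
            result.add(tuple(signature))
    return result


def column_score(test, reference):
    """Fraction of reference columns reproduced exactly in the test alignment."""
    reference_columns = columns(reference)
    if not reference_columns:
        return 0.0
    return len(reference_columns & columns(test)) / len(reference_columns)


if __name__ == "__main__":
    reference = {"s1": "AC-T", "s2": "ACGT"}
    test = {"s1": "A-CT", "s2": "ACGT"}
    print(column_score(test, reference))
```

A column counts as correct only if every sequence contributes the same residue (or a gap) as in the reference, which makes this a strict measure compared with pairwise scores.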

  49. T-Coffee and The World • Some Templates are obtained with a BLAST • Queries can be sent to the EBI or the NCBI • No Need for a Local BLAST installation • (figure labels: Users sequences, BLAST/SOAP)
