Comparative genomics for biological discovery

Comparative genomics for biological discovery. Lior Pachter Dept. Mathematics, U.C. Berkeley lpachter@math.berkeley.edu February 3, 2004. Comparative Genomics. From: Hardison RC (2003) Comparative Genomics. PLoS Biol 1(2): e58. February 2001. December 2002. Rat 2004.

Presentation Transcript


  1. Comparative genomics for biological discovery Lior Pachter Dept. Mathematics, U.C. Berkeley lpachter@math.berkeley.edu February 3, 2004

  2. Comparative Genomics From: Hardison RC (2003) Comparative Genomics. PLoS Biol 1(2): e58.

  3. February 2001 December 2002

  4. Rat 2004 Picture credit: G.Bourque, P. Pevzner, G. Tesler and the Rat Genome Sequencing Consortium

  5. State of the Genomes (Jan 2004). Legend: Aligned (multiple); Working on it; As soon as released

  6. http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html

  7. Outline • VISTA/AVID tools for comparative genomics • Related biological stories • Human/Mouse/Rat • Phylogenetic Shadowing

  8. http://www-gsd.lbl.gov/vista Processed ~ 11000 queries on-line, distributed > 560 copies of the program in 34 countries

  9. VISTA/AVID package
     • AVID: Program for global alignment of DNA fragments of any length
       N. Bray and L. Pachter, MAVID: Constrained Ancestral Alignment of Multiple Sequences, Genome Research, in press.
       N. Bray, I. Dubchak, L. Pachter, AVID: A Global Alignment Program, Genome Research, 13 (2003), p 97-102.
     • VISTA: Visualization of alignment and various sequence features for any number of species
       C. Mayor, M. Brudno, J.R. Schwartz, A. Poliakov, E. M. Rubin, K. A. Frazer, L. Pachter and I. Dubchak, VISTA: Visualizing global DNA sequence alignments of arbitrary length, Bioinformatics, 16 (2000), p 1046-1047.

  10. Aligning large genomic regions
      • Long sequences lead to memory problems
      • Speed becomes an issue
      • Long alignments are very sensitive to parameters
      • Draft sequences present a nontrivial problem
      • Accuracy is difficult to measure and to achieve
      References for other existing programs:
      Glass: Domino Tiling, Gene Recognition, and Mice. Pachter, L. Ph.D. Thesis, MIT (1999). Human and Mouse Gene Structure: Comparative Analysis and Application to Exon Prediction. Batzoglou, S., Pachter, L., Mesirov, J., Berger, B., Lander, E. Genome Research (2000).
      MUMmer: Delcher, A.L., Kasif S., Fleischmann, R.D., Peterson J., White, O. and Salzberg, S.L. Alignment of whole genomes. Nucleic Acids Research (1999).
      PipMaker: PipMaker: A Web Server for Aligning Two Genomic DNA Sequences. Scott Schwartz, Zheng Zhang, Kelly A. Frazer, Arian Smit, Cathy Riemer, John Bouck, Richard Gibbs, Ross Hardison, and Webb Miller. Genome Research (2000).
      DIALIGN: Multiple DNA and protein sequence alignment based on segment-to-segment comparison. B. Morgenstern, A. Dress and T. Werner, Proc. Natl. Acad. Sci. USA 93 (1996).

  11. Variations on Sequence Alignment • Global alignment: find the best OVERALL alignment. • Local alignment: find ALL regions of similarity. • Optimal local alignment: find the BEST region of similarity.
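To make the difference between the variants on slide 11 concrete, here is a minimal sketch that scores the same pair of sequences two ways: a global score (Needleman-Wunsch style, every base of both sequences must be accounted for) and an optimal local score (Smith-Waterman style, only the best matching region counts). The scoring values and the function name `alignment_scores` are assumptions chosen for illustration; they are not taken from the slides, and the "find ALL regions" variant is not covered.

```python
def alignment_scores(a, b, match=1, mismatch=-1, gap=-2):
    """Return (global_score, best_local_score) using a linear gap penalty.

    Scoring parameters are illustrative assumptions, not values from the talk.
    """
    glob = [j * gap for j in range(len(b) + 1)]   # global DP: first row pays for gaps
    loc = [0] * (len(b) + 1)                      # local DP: first row is free
    best_local = 0
    for i in range(1, len(a) + 1):
        new_glob = [i * gap]
        new_loc = [0]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # Global: best of diagonal, gap in b, gap in a.
            new_glob.append(max(glob[j - 1] + s, glob[j] + gap, new_glob[j - 1] + gap))
            # Local: same moves, but a cell may restart from zero.
            cell = max(0, loc[j - 1] + s, loc[j] + gap, new_loc[j - 1] + gap)
            new_loc.append(cell)
            best_local = max(best_local, cell)
        glob, loc = new_glob, new_loc
    return glob[-1], best_local


print(alignment_scores("ACGT", "TTACGTTT"))
```

With these parameters, a short sequence embedded in a longer one gets a high local score but a poor global score, because the global alignment has to pay for the unmatched flanks.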

  12. AVID: the alignment engine behind VISTA • Very fast global alignment of megabases of sequence. • Provides details about ordered and oriented contigs, and accurate placement in the finished sequence. • Full integration with repeat masking. Algorithm steps: • ORDER and ORIENT • FIND all common k-long words (k-mers) • ALIGN k-mers scoring by local homology • FIX k-mers with good local homology • RECURSE with smaller k (shorter words)
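The FIND / FIX / RECURSE steps on slide 12 can be sketched roughly as below. This is a simplified illustration and not AVID's implementation: it keeps only k-mers that occur exactly once in each sequence (a stand-in for "scoring by local homology"), picks a collinear subset with a longest-increasing-subsequence pass, and then searches the gaps between fixed anchors with shorter words. The ORDER and ORIENT step for draft contigs is omitted. All function names and the `min_k` parameter are hypothetical.

```python
from bisect import bisect_left
from collections import defaultdict


def unique_common_kmers(a, b, k):
    """Sorted (i, j) pairs for k-mers that occur exactly once in a and once in b."""
    def positions(seq):
        table = defaultdict(list)
        for i in range(len(seq) - k + 1):
            table[seq[i:i + k]].append(i)
        return table

    pos_a, pos_b = positions(a), positions(b)
    return sorted(
        (hits[0], pos_b[word][0])
        for word, hits in pos_a.items()
        if len(hits) == 1 and len(pos_b.get(word, ())) == 1
    )


def collinear_chain(pairs):
    """Longest subset of pairs (sorted by i) whose j positions strictly increase."""
    tails, tail_idx, prev = [], [], [None] * len(pairs)
    for n, (_, j) in enumerate(pairs):
        pos = bisect_left(tails, j)
        if pos == len(tails):
            tails.append(j)
            tail_idx.append(n)
        else:
            tails[pos] = j
            tail_idx[pos] = n
        prev[n] = tail_idx[pos - 1] if pos > 0 else None
    chain, n = [], (tail_idx[-1] if tail_idx else None)
    while n is not None:
        chain.append(pairs[n])
        n = prev[n]
    return chain[::-1]


def find_anchors(a, b, k, min_k=4, off_a=0, off_b=0):
    """Return (pos_in_a, pos_in_b, k) anchors, ordered along both sequences."""
    if k < max(min_k, 1):
        return []
    fixed, end_a, end_b = [], 0, 0
    for i, j in collinear_chain(unique_common_kmers(a, b, k)):
        if i >= end_a and j >= end_b:          # FIX: keep non-overlapping anchors
            fixed.append((i, j))
            end_a, end_b = i + k, j + k
    anchors, prev_a, prev_b = [], 0, 0
    for i, j in fixed:
        # RECURSE: search the gap before this anchor with shorter words.
        anchors += find_anchors(a[prev_a:i], b[prev_b:j], k // 2, min_k,
                                off_a + prev_a, off_b + prev_b)
        anchors.append((off_a + i, off_b + j, k))
        prev_a, prev_b = i + k, j + k
    # Search the tail after the last anchor as well.
    anchors += find_anchors(a[prev_a:], b[prev_b:], k // 2, min_k,
                            off_a + prev_a, off_b + prev_b)
    return anchors
```

A call such as `find_anchors(seq_a, seq_b, k=8)` returns anchors that are collinear and non-overlapping; a base-level aligner would then only need to fill the short stretches between them.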

  13. Visualization
      tggtaacattcaaattatg-----ttctcaaagtgagcatgaca-acttttttccatgg
      || | ||||  |  |  ||     || | | |    |||||| |  ||   |   | ||
      tgatgacatctatttgctgtttcctttttagaaactgcatgagagcctggctagtaggg
      • Window of length L is centered at a particular nucleotide in the base sequence
      • Percent of identical nucleotides in L positions of the alignment is calculated and plotted
      • Move to the next nucleotide
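The window calculation on slide 13 can be sketched as follows. The function takes the two rows of a pairwise alignment (gaps written as "-") and returns one percent-identity value per nucleotide of the base sequence. Truncating the window at the ends of the alignment is an assumption made here; the slide does not say how the edges are handled, and the function name is hypothetical.

```python
def window_identity(base_aln, other_aln, window):
    """Percent identity in a window centered on each base-sequence nucleotide."""
    if len(base_aln) != len(other_aln):
        raise ValueError("aligned rows must have equal length")
    # Alignment columns that hold a nucleotide of the base sequence.
    base_cols = [i for i, ch in enumerate(base_aln) if ch != "-"]
    half = window // 2
    scores = []
    for col in base_cols:
        lo = max(0, col - half)                       # assumption: clip at the edges
        hi = min(len(base_aln), col + half + 1)
        matches = sum(
            1 for i in range(lo, hi)
            if base_aln[i] == other_aln[i] and base_aln[i] != "-"
        )
        scores.append(100.0 * matches / (hi - lo))
    return scores


curve = window_identity("ACGT-ACGT", "ACGTTACCT", window=3)
print([round(v, 1) for v in curve])
```

Plotting `curve` against base-sequence position gives the kind of identity curve that a VISTA plot displays.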

  14. Finding conserved regions with percentage and length cutoffs. Conserved segments with percent identity X and length Y are regions in which every contiguous subsegment of length Y is at least X% identical to its paired sequence. These segments are merged to define the conserved regions.
      Output:
      11054 - 11156 = 103bp at 77.670% NONCODING
      13241 - 13453 = 213bp at 87.793% EXON
      14698 - 14822 = 125bp at 84.800% EXON
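A minimal sketch of the cutoff rule on slide 14: every window of length Y is tested against the identity threshold X, and qualifying windows that overlap or touch are merged into one region. Coordinates here are half-open alignment columns rather than base-sequence positions, and the merge rule is one reasonable reading of "these segments are merged"; both are assumptions, as is the function name.

```python
def conserved_regions(base_aln, other_aln, min_identity, min_length):
    """Return (start, end, percent_identity) tuples in alignment columns, end exclusive."""
    if len(base_aln) != len(other_aln):
        raise ValueError("aligned rows must have equal length")
    match = [int(x == y and x != "-") for x, y in zip(base_aln, other_aln)]
    prefix = [0]                                   # prefix sums of matching columns
    for m in match:
        prefix.append(prefix[-1] + m)
    regions = []
    for start in range(len(match) - min_length + 1):
        end = start + min_length
        identity = 100.0 * (prefix[end] - prefix[start]) / min_length
        if identity >= min_identity:
            if regions and start <= regions[-1][1]:
                regions[-1][1] = end               # merge with the previous region
            else:
                regions.append([start, end])
    return [(s, e, 100.0 * (prefix[e] - prefix[s]) / (e - s)) for s, e in regions]


for s, e, pct in conserved_regions("AAAAACCCCCGGGGGTTTTT",
                                   "AAAAATTTTTGGGGGTTTTT", 80, 5):
    print(f"{s} - {e - 1} = {e - s}bp at {pct:.3f}%")
```

The print statement mimics the layout of the output shown on the slide; labels such as EXON or NONCODING would come from a separate annotation, which this sketch does not model.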

  15. Conserved NonCoding Sequences [Figure: VISTA plot of the KIF gene; vertical axis: % identity, 50% to 100%; horizontal axis: human sequence, 0 kb to 10 kb]

  16. Multi-Species Comparative Analysis (mVISTA): Apolipoprotein AI gene [Figure: stacked VISTA plots for human/macaque, human/pig, human/rabbit, human/mouse, human/rat, and human/chicken; vertical axis: 50% to 100% identity; liver enhancer marked]

  17. Some results obtained with VISTA J Mol Cell Cardiol 34, 1345-1356 (2002) Myocardin: A Component of a Molecular Switch for Smooth Muscle Differentiation. J. Chen, C. M. Kitchen, J. W. Streb and J. M. Miano University of Oxford VISTA used to solve the gene structures of rat and human myocardin.

  18. Blood, 100, 3450-3456 (2002) Deletion of the mouse α-globin regulatory element (HS -26) has an unexpectedly mild phenotype E. Anguita, J. A. Sharpe, J. A. Sloane-Stanley, C. Tufarelli, D. R. Higgs, and W. G. Wood University of Oxford.

  19. Genome Research 11, 78 (2001) Human and Mouse α-Synuclein Genes: Comparative Genomic Sequence Analysis and Identification of a Novel Gene Regulatory Element J. W. Touchman, et al. NIH Intramural Sequencing Center, National Institutes of Health Synuclein gene involved in Alzheimer’s disease

  20. EMBO reports 4:143 (2003) The kangaroo genome. Leaps and bounds in comparative genomics M. J. Wakefield and J. A. Marshall Graves Research School of Biological Sciences, The Australian National University, Canberra, ACT 0200, Australia ‘The kangaroo genome is a rich and unique resource for comparative genomics, a treasure trove of comparative genomics data’. Phylogenetic footprinting of the 3’ untranslated region of the SLC16A2 gene

  21. VISTA flavors • VISTA – comparing DNA of multiple organisms • for 3 species - analyzing cutoffs to define actively conserved non-coding sequences • cVISTA - comparing two closely related species • rVISTA – regulatory VISTA

  22. Identifying conserved non-coding sequences (CNSs) involved in transcriptional regulation

  23. rVISTA - prediction of transcription factor binding sites • Simultaneous searches of the major transcription factor binding site database (Transfac) and the use of global sequence alignment to sieve through the data • Combination of database searches with comparative sequence analysis reduces the number of predicted transcription factor binding sites by several orders of magnitude

  24. Regulatory VISTA (rVISTA)
      Predicted sites marked above the alignment: Ikaros-2, Ikaros-2, NFAT, Ikaros-2
      Human   TGATTTCTCGGCAGCAAGGGAGGGCCCCATGACAAAGCCATTTGAAATCCCAGAAGCAATTTTCTACTTACGACCTCACTTTCTGTTGCTGTCTCTCCCTTCCCCTCTG
      Mouse   TGATTTCTCGGCAGCCAGGGAGGGCCCCATGACGAAGCCACTCGAAATCCCAGAAGCAATTTTCTACTTACGACCTCACTTTCTGTTGCTCTCTCTTCCTCCCCCTCCA
      Dog     TGATTTCTCGGCAGCAAGGGAGGGCCCCATGACGAAGCCATTTGAAATCCCAGAAGCGATTTTCTACCTACGACCTCACTTTCTGTTGCGCTCACTCCCTTCCCCTGCA
      Rat     TGATTTCTCGGCAGCCAGGGAGGGCCCCATGACGAAGCCACTCGAAATCCCAGAAGCAATTTTCTACTTACGACCTCACTTTCTGTTGTTCTCTCTTCCTCCCCCTCCA
      Cow     TGATTTCTCGGCAGCCAGGGAGGGCCCCATGACGAAGCCATTTGAAATCCCAGAAGCAATTTTCTACTTACGACCTCACTTTCTGTTGCGTTCTCTCCCTTCCCCTCCT
      Rabbit  TGATTTCTCGGCAGCCAGGGAGGGCCCCACGAC-AAGCCATTCAAAATCCCAGAAGTGATTTTCTACTTACGACCTCACTTTCTGTTG----CTCTCTCCTTCCCTCCA
      20 bp dynamic shifting window, >80% ID
      1. Identify potential transcription factor binding sites for each sequence using library of matrices (TRANSFAC)
      2. Identify aligned sites using AVID
      3. Identify conserved sites using dynamic shifting window
      Percentage of conserved sites of the total: 3-5%
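The three rVISTA steps on slide 24 can be sketched as a filter over a pairwise alignment. This is an illustration under several assumptions, not the rVISTA code: a regular expression (here the commonly cited GATA consensus WGATAR) stands in for TRANSFAC weight matrices, the alignment is supplied as two gapped rows rather than produced by AVID, and the "dynamic shifting window" is read as "some window of the given length that fully contains the site reaches the identity cutoff". All function names are hypothetical.

```python
import re


def site_columns(aln_row, pattern):
    """Alignment-column spans (start, end) of motif hits in the ungapped row."""
    cols = [i for i, ch in enumerate(aln_row) if ch != "-"]
    seq = aln_row.replace("-", "")
    return {(cols[m.start()], cols[m.end() - 1] + 1)
            for m in re.finditer(pattern, seq)}


def rvista_filter(row_a, row_b, pattern, window=20, min_identity=80.0):
    """Return (aligned_sites, conserved_sites) as lists of column spans."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    match = [int(x == y and x != "-") for x, y in zip(row_a, row_b)]
    prefix = [0]
    for m in match:
        prefix.append(prefix[-1] + m)
    # Steps 1 and 2: predict sites in each row, keep those at the same columns.
    aligned = sorted(site_columns(row_a, pattern) & site_columns(row_b, pattern))
    conserved = []
    for start, end in aligned:
        # Step 3: shift the window over every placement that contains the site.
        first = max(0, end - window)
        last = min(start, len(match) - window)
        for w in range(first, last + 1):
            if 100.0 * (prefix[w + window] - prefix[w]) / window >= min_identity:
                conserved.append((start, end))
                break
    return aligned, conserved


GATA = "[AT]GATA[AG]"  # assumption: regex stand-in for a TRANSFAC matrix
print(rvista_filter("CCCCCCCCAGATAACCCCCCCCCC", "CCCCCCCCAGATAACCCCCCCCCC", GATA))
print(rvista_filter("CCCCCCCCAGATAACCCCCCCCCC", "GGGGGGGGAGATAAGGGGGGGGGG", GATA))
```

In the second call the motif is present in both rows at the same columns, so it counts as aligned, but the flanking sequence differs too much for any 20-column window to reach 80% identity, so it is not reported as conserved.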

  25. ~1 Mb region, 5q31 (human interval)
                                                                  Coding   Noncoding
      Transfac predictions for GATA sites                            839       20654
      Aligned with the same predicted site in the mouse seq.         450        2618
      Aligned sites conserved at 80% / 24 bp dynamic window          303         731
      Random DNA sequence of the same length: 29280
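As a quick check of how strongly each filter on slide 25 reduces the predictions, the counts can be turned into percentages. Reading the first number of each pair as "coding" and the second as "noncoding" follows the column order on the slide and is an assumption.

```python
# Counts copied from slide 25; the coding/noncoding assignment is assumed
# from the order of the column headings.
counts = {
    "coding":    {"predicted": 839,   "aligned": 450,  "conserved": 303},
    "noncoding": {"predicted": 20654, "aligned": 2618, "conserved": 731},
}


def percent_retained(row, stage):
    """Percent of the original predictions that survive to the given stage."""
    return 100.0 * row[stage] / row["predicted"]


for name, row in counts.items():
    print(name,
          f"aligned {percent_retained(row, 'aligned'):.1f}%",
          f"conserved {percent_retained(row, 'conserved'):.1f}%")

total_predicted = sum(row["predicted"] for row in counts.values())
total_conserved = sum(row["conserved"] for row in counts.values())
overall = 100.0 * total_conserved / total_predicted
print(f"overall conserved {overall:.1f}%")
```

Under this reading, roughly 3.5% of noncoding predictions and about 4.8% of all predictions remain after the conservation filter, which appears consistent with the 3-5% figure quoted on the preceding slide.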

  26. 2 Exp. Verified GATA-3 Sites IL 5 GATA-3 (28) GATA-3 Conserved (4)

  27. [Figure: three panels (A, B, C) of VISTA-style plots, vertical axis 50% to 100% identity. Tracks shown: Ik-2 (All, Aligned, conserved); AP-1, NFAT, and GATA-3 (conserved); AP-1 and NFAT (All, Aligned, Conserved)]

  28. Main features of AVID • Alignments up to several megabases • Works with finished and draft sequences • Fast • Accurate for close and distant organisms

  29. Main features of VISTA • Clear, configurable output • Ability to visualize several global alignments on the same scale • Available source code and web site

  30. Large scale VISTA/AVID applications: • Cardiovascular comparative genomics database http://pga.lbl.gov • Berkeley Genome Pipeline – comparing the human and mouse genome http://pipeline.lbl.gov/ • Multiple whole genome comparisons using MAVID http://bio.math.berkeley.edu/genome/

  31. Automatic computational system for comparative analysis of pairs of genomes http://pipeline.lbl.gov
      Alignments (all pair-wise combinations):
      • Human Genome: Golden Path Assembly
      • Mouse assemblies: Arachne, Phusion (2001); MGSC v3 (2002)
      • Rat assemblies: November 2002, February 2003
      D. melanogaster vs D. pseudoobscura: February 2003

  32. Main modules of the system • Mapping and alignment of mouse contigs against the human genome • Visualization • Analysis of conservation

  33. Tandem Local/Global Alignment Approach • Finding a likely mapping for a contig • Multi-step verification of potential regions by global alignment

  34. Specificity test: the ratio of the number of bp on each human chromosome covered by alignments of the reversed mouse genome to the number of bp covered by alignments of the actual mouse genome.
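The ratio described on slide 34 can be computed from two sets of alignment intervals on a chromosome, one from the reversed-mouse control and one from the real alignment. The sketch below assumes half-open (start, end) intervals in human coordinates and merges overlaps before counting; the interval values in the example are invented purely for illustration and are not data from the talk.

```python
def covered_bp(intervals):
    """Number of bases covered by a set of half-open (start, end) intervals."""
    total, cur_start, cur_end = 0, None, None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                total += cur_end - cur_start       # close the previous merged block
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)            # extend the current merged block
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def specificity_ratio(reversed_intervals, actual_intervals):
    """bp covered by the reversed-genome control divided by bp covered by the real alignment."""
    actual = covered_bp(actual_intervals)
    if actual == 0:
        raise ValueError("no bases covered by the actual alignment")
    return covered_bp(reversed_intervals) / actual


# Invented intervals for one hypothetical chromosome.
actual = [(0, 100), (50, 150), (200, 250)]
control = [(10, 20)]
print(specificity_ratio(control, actual))
```

A small ratio means that few bases align when the mouse sequence is reversed, which is the sense in which the slide uses the test as a measure of specificity.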

  35. Apolipoprotein(a) region. The expressed gene is confined to a subset of primates. Our method is the only one to predict that apo(a) has NO homology in the mouse.

  36. VistaBrowser

  37. Input your own sequence to align against the Reference Genomes: Human, Mouse, Rat, D. melanogaster

  38. GenomeVISTA: Opossum BAC versus Human Genome

  39. Examples of Results • Understanding the structure of conservation • Identification of putative functional sites • Discovery of new genes • Detection of contamination and misassemblies

  40. Two assemblies are better than one

  41. Highly Conserved Region ApoA4 ApoC3 ApoA1 Zoom In Identification of a New Apo Gene on Human 11q23 Gene Name

  42. New Gene (ApoA5) Pennacchio LA et al. Science. 2001, 294:169-73. Identification of a New Apo Gene on Human 11q23

  43. Finding regulatory regions Muscle Specific Regulatory Region: human beta enolase intronic enhancer

  44. Comparative analysis of genomic intervals containing important cardiovascular genes http://pga.lbl.gov

  45. http://pga.lbl.gov/cvcgd.html

  46. Example of CVCGD entry

  47. Short annotation of the region

  48. Detailed annotation in AceDB format

  49. VISTA plot of the region
