Vertebrate Comparative Genomics: Sequence Conservation and Function

This lecture explores vertebrate comparative genomics, focusing on sequence conservation and its implications for understanding function. Topics include DNA replication, imperfect replication, sequence conservation, and our place in the tree of life.

Presentation Transcript


  1. MW  11:00-12:15 in Redwood G19 Profs: Serafim Batzoglou, Gill Bejerano TA: Cory McLean http://cs273a.stanford.edu [Bejerano Aut07/08]

  2. Lecture 6 • Vertebrate Comparative Genomics • Sequence Conservation and Function • Chains & Nets

  3. Meet Your Genome contd. [Human Molecular Genetics, 3rd Edition]

  4. Comparative Genomics • “Nothing in Biology Makes Sense Except in the Light of Evolution” (Theodosius Dobzhansky) [Figure: two diagrams of the species human, chimp, macaque, mouse, rat, cow, dog, opossum, platypus, chicken, zfish, tetra, fugu along a time axis t; the second is labeled “Intelligent Designer”] [Adam Siepel, Cornell]

  5. DNA: Functional and Non-Functional • DNA = linear molecule that carries instructions for making living organisms ~ long string(s) over a small alphabet • Alphabet of four {A,C,G,T} • Strings of length 10^4 to 10^11 ...ACGTACGACTGACTAGCATCGACTACGACTAGCAC... [Figure labels: genetic instructions (how to... when to... where to...); “junk” DNA]

  6. One Cell, One Genome, One Replication • Every cell holds a copy (actually 2) of all its DNA = its genome. • The genome is replicated every cell division. • The human body is made of ~10^14 cells. • All originate from a single cell through repeated cell divisions. [Figure labels: egg cell, DNA string, cell division, genome = all DNA; chicken ≈ 10^14 copies of the egg's DNA]

  7. DNA Replication is Imperfect • Small scale: single letters are substituted, erased, added. [Figure: egg sequence ...ACGTACGACTGACTAGCATCGACTACGA... and its chicken copy with changes (TT, CAT); regions labeled junk (“anything goes”) and functional (many changes are not tolerated)] Thus, sequence conservation over generations implies function!

  8. Sequence Conservation implies Function • (but which function/s?...) Comparative genomics of distantly related species: functional region! [Figure: human ...CTTTGCGA-TGAGTAGCATCTACTATTT...; common ancestor; ...ACGTGGGACTGACTA-CATCGACTACGA... another species] Note: the inverse, “no conservation ⇒ no function”, is a much weaker statement given current knowledge.
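
As a rough illustration of how conservation can be read off an alignment, the sketch below scores identity per column and in sliding windows. It assumes the two rows on the slide are the human and other-species sequences of one pairwise alignment; the function names and the window size are hypothetical choices for this example, not part of the lecture.

```python
def identity_flags(seq_a, seq_b):
    """Return one boolean per alignment column: True when both rows
    carry the same base. A column containing a gap ('-') never matches."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    return [a == b and a != "-" for a, b in zip(seq_a, seq_b)]


def windowed_identity(seq_a, seq_b, window):
    """Fraction of identical columns in every window of the given width."""
    flags = identity_flags(seq_a, seq_b)
    if window <= 0 or window > len(flags):
        raise ValueError("window must be between 1 and the alignment length")
    return [
        sum(flags[i:i + window]) / window
        for i in range(len(flags) - window + 1)
    ]


# The two aligned rows shown on the slide.
human = "CTTTGCGA-TGAGTAGCATCTACTATTT"
other = "ACGTGGGACTGACTA-CATCGACTACGA"

flags = identity_flags(human, other)
print(sum(flags), "identical columns out of", len(flags))
print(windowed_identity(human, other, window=10))
```

Windows with a high identity fraction are candidates for conserved, possibly functional, sequence; real conservation tracks use many species and a model of the neutral rate rather than a raw identity count.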

  9. Our Place in the Tree of Life • Which species to compare to? Too close, and regions under purifying selection will be largely indistinguishable from those evolving at the neutral rate. Too far, and many functional orthologs will have diverged beyond our ability to align them accurately. [Figure label: you are here] [Human Molecular Genetics, 3rd Edition]

  10. Metazoans (multicellular animals) [Figure label: you are here] [Human Molecular Genetics, 3rd Edition]

  11. Vertebrates: what to sequence? Stickleback, Lizard, Opossum [Figure labels: too far, sweet spot, too close, you are here] [Human Molecular Genetics, 3rd Edition]

  12. The Dawn of Whole Genome Comparative Genomics 2001 2002 40% DNA alignable 95% coding genes shared

  13. Phylogenetic Shadowing • “Too close” can actually be a boon if you have enough closely related genomes. Stickleback, Lizard, Opossum [Figure labels: too close, you are here] [Human Molecular Genetics, 3rd Edition]

  14. More Species Have Joined Since Are you sure it’s all orthologous DNA??

  15. Paralogy & Orthology Chains & Nets

  16. Chaining Alignments • Chaining bridges the gulf between syntenic blocks and base-by-base alignments. • Local alignments tend to break at transposon insertions, inversions, duplications, etc. • Global alignments tend to force non-homologous bases to align. • Chaining is a rigorous way of joining together local alignments into larger structures. [Jim Kent’s slides]

  17. Chains join together related local alignments Protease Regulatory Subunit 3

  18. Chains • a chain is a sequence of gapless aligned blocks, where there must be no overlaps of blocks' target or query coords within the chain. • Within a chain, target and query coords are monotonically non-decreasing. (i.e. always increasing or flat) • double-sided gaps are a new capability (blastz can't do that) that allow extremely long chains to be constructed. • not just orthologs, but paralogs too, can result in good chains. but that's useful! • chains should be symmetrical -- e.g. swap human-mouse -> mouse-human chains, and you should get approx. the same chains as if you chain swapped mouse-human blastz alignments. • chained blastz alignments are not single-coverage in either target or query unless some subsequent filtering (like netting) is done. • chain tracks can contain massive pileups when a piece of the target aligns well to many places in the query. Common causes of this include insufficient masking of repeats and high-copy-number genes (or paralogs). [Angie Hinrichs, UCSC wiki]
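
The first two properties above (no overlapping blocks, coordinates never going backwards) can be checked mechanically. The following is a minimal sketch under an assumed block representation of (target start, query start, size); the `Block` type and the function names are hypothetical and do not mirror the UCSC chain file format.

```python
from typing import List, NamedTuple, Tuple


class Block(NamedTuple):
    """A gapless aligned block (hypothetical representation)."""
    t_start: int  # start in the target genome
    q_start: int  # start in the query genome
    size: int     # number of aligned bases, same in target and query


def is_valid_chain(blocks: List[Block]) -> bool:
    """True when each block starts at or after the end of the previous
    block in both target and query, so nothing overlaps or goes back."""
    for prev, nxt in zip(blocks, blocks[1:]):
        if nxt.t_start < prev.t_start + prev.size:
            return False
        if nxt.q_start < prev.q_start + prev.size:
            return False
    return True


def gaps(blocks: List[Block]) -> List[Tuple[int, int]]:
    """(target gap, query gap) between consecutive blocks.
    A gap is double-sided when both numbers are greater than zero."""
    return [
        (nxt.t_start - (prev.t_start + prev.size),
         nxt.q_start - (prev.q_start + prev.size))
        for prev, nxt in zip(blocks, blocks[1:])
    ]


chain = [Block(0, 0, 10), Block(15, 10, 5), Block(30, 40, 8)]
print(is_valid_chain(chain))
print(gaps(chain))
```

In this example the first gap is single-sided (unaligned bases in the target only) and the second is double-sided (unaligned bases in both genomes).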

  19. Affine penalties are too harsh for long gaps Log count of gaps vs. size of gaps in mouse/human alignment correlated with sizes of transposon relics. Affine gap scores model red/blue plots as straight lines.
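
To see why a straight-line penalty is harsh on long gaps, the sketch below compares an affine cost with a cost that grows logarithmically in gap length. Both functions and all parameter values are hypothetical, chosen only for illustration; the logarithmic form is not claimed to be the scoring used by any particular tool.

```python
import math


def affine_gap_cost(length, gap_open=400.0, gap_extend=30.0):
    """Affine cost: a fixed opening charge plus a constant charge per base."""
    return gap_open + gap_extend * length


def log_gap_cost(length, gap_open=400.0, scale=300.0):
    """Illustrative concave cost: each additional base costs less than the last."""
    return gap_open + scale * math.log(length)


for length in (1, 10, 100, 1000, 10000):
    print(length, affine_gap_cost(length), round(log_gap_cost(length), 1))
```

Under the affine cost a 10,000-base gap, such as one left by a large insertion, is charged about a thousand times more than a 10-base gap, while the concave cost grows far more slowly and so leaves such gaps affordable inside a long chain.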

  20. Before and After Chaining

  21. Chaining Algorithm • Input: blocks of gapless alignments from blastz • Dynamic program based on the recurrence relationship: $\mathrm{score}(B_i) = \max_{j<i}\bigl(\mathrm{score}(B_j) + \mathrm{match}(B_i) - \mathrm{gap}(B_i, B_j)\bigr)$ • Uses Miller’s KD-tree algorithm to limit which parts of the dynamic programming graph are traversed. Running time is O(N log N), where N is the number of blocks (in the hundreds of thousands).
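
The recurrence can be sketched directly as a quadratic-time dynamic program. This version leaves out the KD-tree speedup, so it runs in O(N^2) rather than O(N log N). The `Block` type, the linear `gap_cost`, and the rule that a block may also start a new chain on its own are assumptions made for this example.

```python
from typing import List, NamedTuple, Tuple


class Block(NamedTuple):
    """A gapless aligned block with its own match score (hypothetical)."""
    t_start: int
    q_start: int
    size: int
    score: float


def gap_cost(prev: Block, cur: Block) -> float:
    """Hypothetical gap penalty: one point per unaligned base on either side."""
    t_gap = cur.t_start - (prev.t_start + prev.size)
    q_gap = cur.q_start - (prev.q_start + prev.size)
    return 1.0 * (t_gap + q_gap)


def best_chain(blocks: List[Block]) -> Tuple[float, List[Block]]:
    """Return the score and blocks of the highest-scoring chain."""
    if not blocks:
        return 0.0, []
    # Sorting by target start guarantees every legal predecessor comes first.
    order = sorted(blocks, key=lambda b: (b.t_start, b.q_start))
    best = [b.score for b in order]  # base case: the block alone
    back = [-1] * len(order)         # predecessor index, -1 for none
    for i, cur in enumerate(order):
        for j in range(i):
            prev = order[j]
            ends_before = (prev.t_start + prev.size <= cur.t_start
                           and prev.q_start + prev.size <= cur.q_start)
            if not ends_before:
                continue
            candidate = best[j] + cur.score - gap_cost(prev, cur)
            if candidate > best[i]:
                best[i], back[i] = candidate, j
    # Trace back from the best-scoring end block.
    k = max(range(len(order)), key=best.__getitem__)
    total = best[k]
    chain = []
    while k != -1:
        chain.append(order[k])
        k = back[k]
    chain.reverse()
    return total, chain


blocks = [
    Block(0, 0, 10, 100.0),
    Block(12, 11, 10, 100.0),
    Block(5, 500, 10, 50.0),   # aligns elsewhere in the query, cannot join
    Block(30, 28, 5, 40.0),
]
score, chain = best_chain(blocks)
print(score, chain)
```

The block at query position 500 is left out because linking it would require the query coordinate to go backwards afterwards; the other three blocks form one chain.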

  22. Netting Alignments • Commonly, multiple mouse alignments can be found for a particular human region, particularly for coding regions. • Net finds the best mouse match for each human region. • Highest-scoring chains are used first. • Lower-scoring chains fill in gaps within chains, inducing a natural hierarchy.
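
The greedy idea on this slide can be sketched as follows: chains are visited from highest to lowest score, and each one claims only the target bases that no better chain has already claimed. This is a simplified, flat version that assumes each chain is given as a name, a score, and a list of disjoint target intervals; it does not track the nesting levels of a real net, and all names are hypothetical.

```python
def subtract(interval, taken):
    """Parts of `interval` not covered by `taken`.
    `taken` must be a sorted list of disjoint (start, end) intervals."""
    start, end = interval
    free = []
    for t_start, t_end in taken:
        if t_end <= start or t_start >= end:
            continue  # no overlap with what is left of the interval
        if t_start > start:
            free.append((start, t_start))
        start = max(start, t_end)
        if start >= end:
            break
    if start < end:
        free.append((start, end))
    return free


def net_chains(chains):
    """Assign every target base to the best-scoring chain that covers it.
    `chains` is a list of (name, score, target_intervals) tuples.
    Returns sorted (start, end, name) pieces with single target coverage."""
    taken = []
    assigned = []
    for name, _score, blocks in sorted(chains, key=lambda c: c[1], reverse=True):
        pieces = [p for block in blocks for p in subtract(block, taken)]
        assigned.extend((start, end, name) for start, end in pieces)
        taken = sorted(taken + pieces)
    return sorted(assigned)


chains = [
    ("A", 1000, [(0, 40), (60, 100)]),  # best chain, with a gap at 40-60
    ("B", 300, [(30, 70)]),             # fills the gap left by A
    ("C", 50, [(35, 45)]),              # fully covered by better chains
]
print(net_chains(chains))
```

Chain B survives only inside the gap of chain A, and chain C is dropped entirely, which is the single-coverage behavior in the target that netting is meant to provide.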

  23. Net Focuses on Ortholog

  24. Nets • a net is a hierarchical collection of chains, with the highest-scoring non-overlapping chains on top, and their gaps filled in where possible by lower-scoring chains, for several levels. • a net is single-coverage for target but not for query. • because it's single-coverage in the target, it's no longer symmetrical. • the netter has two outputs, one of which we usually ignore: the target-centric net in query coordinates. The reciprocal best process uses that output: the query-referenced (but target-centric / target single-cov) net is turned back into component chains, and then those are netted to get single coverage in the query too; the two outputs of that netting are reciprocal-best in query and target coords. Reciprocal-best nets are symmetrical again. • nets do a good job of filtering out massive pileups by collapsing them down to (usually) a single level. [Angie Hinrichs, UCSC wiki]

  25. "LiftOver chains" are actually chains extracted from nets, or chains filtered by the netting process. [Angie Hinrichs, UCSC wiki]

  26. Before and After Netting

  27. Net highlights rearrangements A large gap in the top level of the net is filled by an inversion containing two genes. Numerous smaller gaps are filled in by local duplications and processed pseudo-genes.

  28. Useful in finding pseudogenes Ensembl and Fgenesh++ automatic gene predictions confounded by numerous processed pseudogenes. Domain structure of resulting predicted protein must be interesting!

  29. Mouse/Human Rearrangement Statistics Number of rearrangements of given type per megabase excluding known transposons.

  30. A Rearrangement Hot Spot Rearrangements are not evenly distributed. Roughly 5% of the genome is in hot spots of rearrangements such as this one. This 350,000 base region is between two very long chains on chromosome 7.

  31. Cautionary Note 1

  32. Cautionary Note 2

  33. Same Region… same in all the other fish

  34. Orthology vs. Paralogy

  35. Conservation Track Documentation

  36. What People Largely Expected to Find • gene (how to) • control region (when & where) [Figure labels: DNA; proximal: in 10^3 letters; 3kb; genome.ucsc.edu]

  37. What They Found • Human Genome: 3×10^9 letters • 1.5% known function • compare to other species • >50% junk • >5% human genome functional • 3x more functional DNA than known! • ~10^6 substrings do not code for protein • What do they do then? [Science 2004 Breakthrough of the Year, 5th runner up]
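
The "3x" figure follows from the percentages on the slide. The short calculation below treats ">5%" as exactly 5%, which is an assumption and therefore gives a lower bound; the variable names are chosen for this example.

```python
from fractions import Fraction

GENOME_LETTERS = 3 * 10**9              # human genome size from the slide
KNOWN_FRACTION = Fraction(15, 1000)     # 1.5% with known function
FUNCTIONAL_FRACTION = Fraction(5, 100)  # ">5%" taken as exactly 5%

known_letters = GENOME_LETTERS * KNOWN_FRACTION
functional_letters = GENOME_LETTERS * FUNCTIONAL_FRACTION
ratio = FUNCTIONAL_FRACTION / KNOWN_FRACTION

print(int(known_letters), "letters with known function")
print(int(functional_letters), "letters estimated functional (lower bound)")
print(float(ratio), "times more functional DNA than known")
```

Exact fractions are used instead of floats so the letter counts come out as whole numbers.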
