Phylogenetic Shadowing

Daniel L. Ong

Abstract
  • The human genome contains about 3 billion base pairs
  • Algorithms that analyze these sequences must run in linear time to be tractable
  • Finding genes is important to molecular biologists: it is a first step toward understanding the genome

RUGS, UC Berkeley

Outline
  • Introduction
  • Alignments
  • Phylogenetic trees
  • Sequence models
    • Example: mRNA and scRNA models
  • Conclusions

Introduction to Biosequences
  • 4 nucleotides: A pairs with T; G pairs with C
    • In RNA, U replaces T
  • The NIH GenBank has 188 GB of sequence data; UC Santa Cruz has another 128 GB
  • The central dogma: DNA is transcribed to RNA, which is translated to protein
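
The base-pairing rules above can be written down in a few lines. This is only an illustrative sketch: the function names `reverse_complement` and `transcribe` are made up for this example, and the input is assumed to contain only the uppercase letters A, C, G, and T.

```python
# Watson-Crick pairing for DNA: A pairs with T, G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(dna):
    """Return the opposite strand, read in its own 5'-to-3' direction."""
    return "".join(COMPLEMENT[base] for base in reversed(dna))


def transcribe(dna):
    """Return the RNA copy of a coding strand: same letters, with U in place of T."""
    return dna.replace("T", "U")


print(reverse_complement("ATGC"))
print(transcribe("ATGC"))
```

Both functions make a single pass over the input, so they scale linearly with sequence length.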

Alignments
  • Alignment: given two sequences, insert gaps and allow mismatches so as to minimize a cost function
    • Similar to edit distance
    • Generalizes to n sequences
  • Alignments are exploited to predict genes
    • Protein-coding genes show greater similarity across species
    • In structural RNA genes, base-paired positions mutate together as a pair
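
As a rough sketch of the cost-minimizing alignment described above, the following dynamic program computes the minimum cost of aligning two sequences. The unit costs for `mismatch` and `gap` are assumptions chosen for illustration; with both set to 1 the result equals the edit distance. Note that this textbook formulation takes time proportional to the product of the two lengths, not linear time.

```python
def alignment_cost(a, b, mismatch=1, gap=1):
    """Minimum total cost of aligning sequences a and b.

    prev[j] holds the best cost of aligning the first i-1 characters of a
    with the first j characters of b; cur is the row for i characters.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]  # aligning a[:i] against the empty prefix needs i gaps
        for j in range(1, len(b) + 1):
            substitute = prev[j - 1] + (0 if a[i - 1] == b[j - 1] else mismatch)
            gap_in_b = prev[j] + gap      # a[i-1] is aligned to a gap
            gap_in_a = cur[j - 1] + gap   # b[j-1] is aligned to a gap
            cur.append(min(substitute, gap_in_b, gap_in_a))
        prev = cur
    return prev[-1]


print(alignment_cost("GATTACA", "GATACA"))
```

Only two rows of the table are kept, so memory use stays linear in the length of `b`.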

(Chakrabarti & Pachter, 2004)

Multiple alignment
  • Considering multiple sequences allows us to leverage the comparative genomics paradigm
    • Functionally important regions of the genome are more likely to be conserved across species
    • Conversely, regions conserved across species are more likely to be functionally important
  • Genomes should be closely related
    • About 5-7 species of a family (Boffelli et al., 2003)
    • Additional genomes increase sensitivity (true-positive rate) and decrease specificity (true-negative rate)

[Durbin et al., 1998]

Phylogenetic Trees
  • Use a directed binary tree to track the relationships between organisms
  • Each node represents the nucleotide at a particular position in an aligned sequence
    • Current organisms are the leaves of the tree (observed)
    • Internal nodes are the common ancestors (unobserved)
  • Edges represent lineages between speciation events; each carries an “evolutionary distance” (branch length) as an extra parameter
  • Assume each nucleotide evolves independently (site independent evolution)
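
Under site-independent evolution, the probability of one alignment column can be computed by summing over the unobserved ancestral nucleotides, working from the leaves up to the root. The sketch below assumes the Jukes-Cantor substitution model and a uniform distribution at the root; the slides do not say which substitution model was used, so both are assumptions. The tree encoding (a leaf is a species name, an internal node is a tuple of `(child, branch_length)` pairs) and the branch lengths are hypothetical.

```python
import math

BASES = "ACGT"


def jc69(x, y, t):
    """Jukes-Cantor probability that base x is base y after branch length t."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if x == y else 0.25 - 0.25 * decay


def partial(node, column):
    """For each base at `node`, the likelihood of the observed leaves below it."""
    if isinstance(node, str):  # leaf: the nucleotide is observed
        return {x: 1.0 if column[node] == x else 0.0 for x in BASES}
    result = {x: 1.0 for x in BASES}
    for child, t in node:  # internal node: sum over each child's possible bases
        below = partial(child, column)
        for x in BASES:
            result[x] *= sum(jc69(x, y, t) * below[y] for y in BASES)
    return result


def column_likelihood(tree, column):
    """Probability of one alignment column, with a uniform prior at the root."""
    root = partial(tree, column)
    return sum(0.25 * root[x] for x in BASES)


# Hypothetical three-species tree; branch lengths are made up for illustration.
tree = (((("human", 0.01), ("chimp", 0.01)), 0.02), ("baboon", 0.05))
print(column_likelihood(tree, {"human": "A", "chimp": "A", "baboon": "A"}))
print(column_likelihood(tree, {"human": "A", "chimp": "A", "baboon": "C"}))
```

Because columns are treated as independent, the likelihood of a whole alignment is the product of the per-column values, and the cost grows linearly with the number of columns.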

Phylogenetic Tree

  • The site-independent model computes the probability of independent columns
    • Used for protein-coding genes
  • The pairwise site-dependent model computes the probability of base-paired columns
    • Used for scRNA genes

Marty Yanofsky

How to find a Phylogenetic Tree?
  • Given n sequences, we want to find the correct tree topology
    • Exhaustive search works only for small n
    • Maximum likelihood: choose the tree that maximizes the probability of the alignment
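
To see why exhaustive search is limited to small n, it helps to count the candidates. The number of distinct rooted binary tree topologies on n labeled leaves is the double factorial (2n-3)!! = 3 · 5 · ... · (2n-3). The function name below is made up for this example.

```python
def rooted_topologies(n):
    """Number of rooted binary tree topologies on n labeled leaves (n >= 2)."""
    count = 1
    for k in range(3, 2 * n - 2, 2):  # multiply the odd numbers 3, 5, ..., 2n-3
        count *= k
    return count


for n in (3, 5, 7, 10):
    print(n, rooted_topologies(n))
```

At the 5-7 species mentioned earlier the count is still in the hundreds to low thousands, so scoring every topology by likelihood remains feasible; the count reaches tens of millions by n = 10.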

Biosequence analysis
  • Phylogenetic trees encapsulate evolutionary time across sequences
  • A sequence model predicts changes along the length of a particular sequence
    • Sequence models are typically hidden Markov models (HMMs)
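
A minimal sketch of how an HMM labels positions along a sequence is the Viterbi algorithm, which finds the most probable state path. The two states (`N` for neutral, `F` for functional), the observation alphabet (`C` for a conserved column, `V` for a variable one), and all probabilities below are invented for illustration and are not taken from the models cited in these slides.

```python
import math


def viterbi(obs, states, start, trans, emit):
    """Most probable state path for a non-empty observation string."""
    # score[s] = best log-probability of any path that ends in state s
    score = {s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}
    backpointers = []
    for symbol in obs[1:]:
        prev, score, pointer = score, {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] + math.log(trans[r][s]))
            score[s] = prev[best] + math.log(trans[best][s]) + math.log(emit[s][symbol])
            pointer[s] = best
        backpointers.append(pointer)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: score[s])]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    return "".join(reversed(path))


# Hypothetical parameters: states tend to persist, and functional regions
# emit mostly conserved columns.
states = "NF"
start = {"N": 0.5, "F": 0.5}
trans = {"N": {"N": 0.9, "F": 0.1}, "F": {"N": 0.1, "F": 0.9}}
emit = {"N": {"C": 0.2, "V": 0.8}, "F": {"C": 0.9, "V": 0.1}}

print(viterbi("VVVVCCCCCCVVVV", states, start, trans, emit))
```

The running time is linear in the sequence length for a fixed number of states, which fits the tractability requirement stated in the abstract.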

Example: mRNA genes
  • Suppose we want to identify coding genes with an HMM
    • Exon: DNA segment that is retained in the mature mRNA after transcription and splicing
    • Have states in the HMM corresponding to exon regions (Alexandersson et al., 2003)
  • Other types of RNA that are transcribed from DNA but not translated into protein are noncoding

Structural RNA (scRNA)

  • A sequence with many self-binding sites, forming a stable structure.
  • Implicated in regulating critical biochemical pathways

Michael W. King

[Chakrabarti & Ong, 2004]

Example: Structural RNA
  • Due to the semi-palindromic structure, the sequence model would be a probabilistic context-free grammar (PCFG)
    • Violates the site-independent assumption of phylogenetic trees
    • Modify the tree model to allow pairwise site dependencies in addition to non-matches
  • Gene length can be in the thousands
    • Limit the length of scRNA to a constant L; time O(L^3 + N*L^2), where N = length of the multiple alignment
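
The cubic term is typical of dynamic programs over nested base pairs. As a stand-in for PCFG parsing, which is not reproduced here, the sketch below uses the simpler Nussinov recurrence: it maximizes the number of nested complementary pairs in a single RNA sequence and has the same three nested loops. The `min_loop` parameter (minimum number of unpaired bases inside a hairpin) and the restriction to A-U and G-C pairs are assumptions made for this example.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def max_base_pairs(seq, min_loop=3):
    """Maximum number of nested complementary pairs in an RNA sequence."""
    n = len(seq)
    # best[i][j] = maximum number of pairs within seq[i..j], inclusive
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):       # distance between i and j
        for i in range(n - span):
            j = i + span
            score = max(best[i + 1][j], best[i][j - 1])  # i or j left unpaired
            if (seq[i], seq[j]) in PAIRS:                # i pairs with j
                score = max(score, best[i + 1][j - 1] + 1)
            for k in range(i + 1, j):                    # two independent substructures
                score = max(score, best[i][k] + best[k + 1][j])
            best[i][j] = score
    return best[0][n - 1] if n else 0


print(max_base_pairs("GGGAAACCC"))
```

The loops over `span`, `i`, and `k` each run up to the sequence length, which gives cubic time; bounding the candidate length by a constant L keeps that term independent of the alignment length N.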

[Chakrabarti & Ong, 2004]

Example completed
  • The HMM and the PCFG can be combined into a single supermodel
  • Use a generic framework to identify mRNA, scRNA, and other regions

Phylogenetic shadowing
  • Use a multiple alignment of several closely related genomes
  • Analysis of the data becomes more reliable (Boffelli et al., 2003)
    • More genomes reduce probability of false positives
    • Still need closely related species to decrease chance of false negatives
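
A simplified way to picture the idea is to slide a window along the multiple alignment and measure how much of it varies between species; windows with unusually little variation are candidates for functional regions. This is a deliberately crude stand-in and not the scoring used by Boffelli et al.; the function name, the window size, and the toy alignment are all hypothetical, and a gap character is treated like any other symbol.

```python
def variable_fraction(alignment, window):
    """Fraction of variable columns in each window of a multiple alignment.

    `alignment` is a list of equal-length rows, one per species.
    """
    ncols = len(alignment[0])
    # A column is variable if the species do not all share the same character.
    variable = [len({row[c] for row in alignment}) > 1 for c in range(ncols)]
    return [
        sum(variable[start:start + window]) / window
        for start in range(ncols - window + 1)
    ]


# Toy alignment of three hypothetical species.
alignment = [
    "ACGTAC",
    "ACGTTC",
    "ACGAAC",
]
print(variable_fraction(alignment, window=3))
```

With only a few closely related genomes most columns are identical by chance, so adding species is what gives the per-window variation enough resolution to be informative.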

Conclusions
  • Phylogenetic shadowing uses a multiple alignment to analyze several genomes simultaneously, which makes the analysis more reliable
  • AI techniques have proven useful in computational biology
    • Still many more problems to solve

References
  • M. Alexandersson, S. Cawley, and L. Pachter. “SLAM: Cross-Species Gene Finding and Alignment with a Generalized Pair Hidden Markov Model.” Genome Research 13 (2003), pp. 496-502.
  • D. Boffelli, J. McAuliffe, D. Ovcharenko, K.D. Lewis, I. Ovcharenko, L. Pachter, and E.M. Rubin. “Phylogenetic shadowing of primate sequences to find functional regions of the human genome.” Science 299 (2003), pp. 1391-1394.
  • K. Chakrabarti and D.L. Ong. “Computational Identification of Noncoding RNA Genes through Phylogenetic Shadowing.” ACM/ISCB RECOMB 8 (2004), poster.
  • K. Chakrabarti and L. Pachter. “Visualization of multiple genome annotations and alignments with the K-BROWSER.” Genome Research 14 (2004), pp. 716-720.
  • R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. “Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids.” New York: Cambridge University Press, 1998.
