
Phylogenetic Shadowing

Daniel L. Ong


Abstract

  • The human genome contains about 3 billion base pairs!

  • Algorithms to analyze these sequences must run in linear time to be tractable

  • Finding genes matters to molecular biologists: it is a first step to understanding the genome

RUGS, UC Berkeley


Outline

  • Introduction

  • Alignments

  • Phylogenetic trees

  • Sequence models

    • Example: mRNA and scRNA models

  • Conclusions

Introduction to Biosequences

  • DNA has 4 nucleotides (A, C, G, T): A pairs with T; G pairs with C

    • In RNA, U replaces T

  • The NIH GenBank has 188 GB of sequence data; UC Santa Cruz has another 128 GB

  • The central dogma: DNA is transcribed into RNA, and RNA is translated into protein
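The pairing rules and the T-to-U substitution can be written down in a few lines. This is a minimal sketch; the function names are illustrative and not part of the talk.

```python
# Watson-Crick pairing for DNA: A pairs with T, G pairs with C.
DNA_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the opposite strand, read in its own 5'-to-3' direction."""
    return dna.translate(DNA_COMPLEMENT)[::-1]


def transcribe(coding_strand: str) -> str:
    """Write the coding strand as RNA: same letters, with U replacing T."""
    return coding_strand.replace("T", "U")


print(reverse_complement("ATGC"))  # GCAT
print(transcribe("ATGC"))          # AUGC
```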

Alignments

  • Alignment: given two sequences, insert gaps or allow mismatches in input sequences to minimize a cost function

    • Similar to edit distance

    • Generalizes to n sequences

  • Exploited to predict genes

    • Greater similarity in protein-coding genes

    • Base-paired positions mutate together (as a pair) in structural RNA genes
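The link between pairwise alignment and edit distance can be made concrete with the standard dynamic program. This is a sketch under assumed costs (mismatch = 1, gap = 1, match = 0); real aligners use tuned scoring schemes, and the function name is hypothetical.

```python
def alignment_cost(a: str, b: str, mismatch: int = 1, gap: int = 1) -> int:
    """Minimum total cost of globally aligning a and b.

    With mismatch = gap = 1 this is exactly the edit (Levenshtein) distance.
    """
    rows, cols = len(a) + 1, len(b) + 1
    cost = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against the empty string needs only gaps.
    for i in range(1, rows):
        cost[i][0] = i * gap
    for j in range(1, cols):
        cost[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = 0 if a[i - 1] == b[j - 1] else mismatch
            cost[i][j] = min(
                cost[i - 1][j - 1] + substitution,  # align a[i-1] with b[j-1]
                cost[i - 1][j] + gap,               # gap in b
                cost[i][j - 1] + gap,               # gap in a
            )
    return cost[-1][-1]


print(alignment_cost("ACGT", "AGT"))  # 1: one gap opposite the C
```

The table has (len(a) + 1) x (len(b) + 1) cells, so the running time is quadratic in sequence length for two sequences.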

[Chakrabarti & Pachter, 2004]

Multiple alignment

  • Considering multiple sequences allows us to leverage the comparative genomics paradigm

    • Functionally important regions of the genome are more likely to be conserved across species

    • The converse also holds: regions conserved across species are more likely to be functionally important

  • Genomes should be closely related

    • About 5-7 species of a family (Boffelli et al., 2003)

    • Additional genomes increase sensitivity (true positive rate) and decrease specificity (true negative rate)


[Durbin et al., 1998]

Phylogenetic Trees

  • Use a directed binary tree to track the relationships between organisms

  • Each node represents the nucleotide at a particular position in an aligned sequence

    • Current organisms are leaves of tree (observed)

    • Internal nodes are common ancestors (unobserved)

  • Edges are speciation events and represent “evolutionary distance” as an extra parameter

  • Assume each nucleotide evolves independently (site-independent evolution)
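Under the site-independent assumption, the probability of one alignment column can be computed by summing over the unobserved ancestral nucleotides from the leaves upward (Felsenstein's pruning recursion). The sketch below assumes a Jukes-Cantor substitution model and a uniform root distribution, neither of which is specified in the talk; the tree encoding (a leaf is a base, an internal node is a pair of (child, branch length) tuples) is also an assumption made for illustration.

```python
import math

BASES = "ACGT"


def jc_prob(x: str, y: str, t: float) -> float:
    """Jukes-Cantor probability that base x becomes y over branch length t."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if x == y else 0.25 - 0.25 * decay


def partial(node) -> dict:
    """Map each base x to P(observed leaves below node | node has base x)."""
    if isinstance(node, str):  # leaf: the observed nucleotide
        return {b: 1.0 if b == node else 0.0 for b in BASES}
    (left, t_left), (right, t_right) = node
    p_left, p_right = partial(left), partial(right)
    return {
        x: sum(jc_prob(x, y, t_left) * p_left[y] for y in BASES)
        * sum(jc_prob(x, y, t_right) * p_right[y] for y in BASES)
        for x in BASES
    }


def column_likelihood(tree) -> float:
    """Probability of one alignment column, with a uniform prior at the root."""
    return sum(0.25 * p for p in partial(tree).values())


# Three species: two close relatives showing A, and a more distant one showing G.
tree = (((("A", 0.1), ("A", 0.1)), 0.05), ("G", 0.2))
print(column_likelihood(tree))
```

Because columns are independent, the likelihood of a whole alignment is the product of its column likelihoods (or the sum of their logarithms).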


Phylogenetic Tree

  • Site-independent model computes probability of independent columns

    • Used for protein-coding genes

  • Pairwise site-dependent model computes probability of base-paired columns

    • Used for scRNA genes

[Figure credit: Marty Yanofsky]

How to find a Phylogenetic Tree?

  • Given n sequences, we want to find the correct tree topology

    • Exhaustive search over topologies works only for small n

    • Maximum likelihood: choose the tree that maximizes the probability of the alignment
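To see why exhaustive search is limited to small n, it helps to count the candidates. The number of rooted binary tree topologies on n labeled leaves is the double factorial (2n-3)!!; the helper below is an illustrative sketch, not code from the talk.

```python
def rooted_topologies(n: int) -> int:
    """Number of rooted binary tree topologies on n labeled leaves: (2n-3)!!."""
    count = 1
    for k in range(3, 2 * n - 2, 2):  # multiply 3 * 5 * ... * (2n - 3)
        count *= k
    return count


for n in (3, 5, 10):
    print(n, rooted_topologies(n))
# 3 3
# 5 105
# 10 34459425
```

A maximum-likelihood search would have to score each candidate topology against the alignment, which is why heuristics take over beyond a handful of species.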

Biosequence analysis

  • Phylogenetic trees encapsulate evolutionary time across sequences

  • Sequence model predicts changes along the length of a particular sequence

    • Sequence models are typically hidden Markov models (HMMs)

Example: mRNA genes

  • Suppose we want to identify coding genes with an HMM

    • Exon: DNA segment that is transcribed and retained in the mature mRNA

    • Have states in HMM corresponding to exon regions (Alexandersson et al., 2003)

  • RNAs that are transcribed from DNA but not translated into protein are noncoding
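A toy version of the idea: an HMM with one "exon" state and one "other" state, decoded with the Viterbi algorithm. All probabilities below are made up for illustration (a GC-rich exon state and an AT-rich background); real gene finders such as the generalized pair HMM cited above have many more states and trained parameters.

```python
import math


def viterbi(obs, states, start, trans, emit):
    """Most probable state path for obs under an HMM (log-space Viterbi)."""
    score = [{s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}]
    back = []
    for symbol in obs[1:]:
        prev = score[-1]
        column, pointer = {}, {}
        for s in states:
            # Best predecessor state for ending in s at this position.
            best = max(states, key=lambda r: prev[r] + math.log(trans[r][s]))
            column[s] = prev[best] + math.log(trans[best][s]) + math.log(emit[s][symbol])
            pointer[s] = best
        score.append(column)
        back.append(pointer)
    # Trace back from the best final state.
    state = max(states, key=lambda s: score[-1][s])
    path = [state]
    for pointer in reversed(back):
        state = pointer[state]
        path.append(state)
    return path[::-1]


# Hypothetical parameters, chosen only to make the example readable.
STATES = ("exon", "other")
START = {"exon": 0.5, "other": 0.5}
TRANS = {"exon": {"exon": 0.9, "other": 0.1}, "other": {"exon": 0.1, "other": 0.9}}
EMIT = {
    "exon": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "other": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}

path = viterbi("AAAAAAGGGGGGAAAAAA", STATES, START, TRANS, EMIT)
print("".join("E" if s == "exon" else "-" for s in path))  # ------EEEEEE------
```

Each position is visited once and compared against a fixed number of states, so decoding is linear in sequence length.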


Structural RNA (scRNA)

  • A sequence with many self-binding sites, forming a stable structure.

  • Implicated in regulating critical biochemical pathways

[Figure credit: Michael W. King]


[Chakrabarti & Ong, 2004]

Example: Structural RNA

  • Because of its semi-palindromic structure, the sequence model would be a probabilistic context-free grammar (PCFG)

    • Violates the site-independent assumption of phylogenetic trees

    • Modify to allow pairwise site-dependencies in addition to non-matches

  • Gene length can be in the thousands

    • Limit the length of scRNA to constant L; time O(L^3 + N*L^2), N = length of multi-alignment
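The cubic term comes from the nested structure of base pairs: every interval (i, j) must consider every possible pairing partner k. The sketch below is not the PCFG from the talk; it swaps in the simpler Nussinov base-pair maximization, which has the same O(L^3) dynamic-programming shape. The `min_loop` parameter (minimum unpaired bases inside a hairpin) is an assumption added for illustration.

```python
# Canonical Watson-Crick pairs in RNA.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def max_base_pairs(rna: str, min_loop: int = 3) -> int:
    """Maximum number of nested base pairs in rna (Nussinov-style DP)."""
    n = len(rna)
    if n == 0:
        return 0
    # best[i][j] = max pairs within rna[i..j]; O(L^2) cells, O(L) work each.
    best = [[0] * n for _ in range(n)]
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: position j is unpaired
            for k in range(i, j - min_loop):  # case 2: j pairs with some k
                if (rna[k], rna[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    inner = best[k + 1][j - 1] if k + 1 <= j - 1 else 0
                    score = max(score, left + inner + 1)
            best[i][j] = score
    return best[0][n - 1]


print(max_base_pairs("GGGAAACCC"))  # 3: a stem of three G-C pairs around an AAA loop
```

Capping the structure length at a constant L bounds this cost per window, which is what keeps the overall scan over a long alignment tractable.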


[Chakrabarti & Ong, 2004]

Example completed

  • Can combine HMM and the PCFG to form a supermodel

  • Use a generic framework to identify mRNA, scRNA, and other regions

Phylogenetic shadowing

  • Use multiple alignment of several closely related genomes

  • Analysis of data becomes more reliable (Boffelli et al., 2003)

    • More genomes reduce probability of false positives

    • Still need closely related species to decrease chance of false negatives
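As a rough illustration of scanning a multiple alignment for conserved regions, the sketch below scores each column by the fraction of sequences sharing its most common symbol, then reports windows whose mean score clears a threshold. This is a deliberately simple stand-in: the published method scores columns with a phylogenetic likelihood model rather than plain identity, and the function names, window width, and threshold here are all hypothetical.

```python
def column_identity(alignment):
    """Fraction of sequences sharing the most common symbol, per column.

    Gap characters ('-') are treated as just another symbol.
    """
    scores = []
    for column in zip(*alignment):
        most_common = max(column.count(c) for c in set(column))
        scores.append(most_common / len(column))
    return scores


def conserved_windows(alignment, width, threshold):
    """Start positions of windows whose mean column identity >= threshold."""
    scores = column_identity(alignment)
    hits = []
    for start in range(len(scores) - width + 1):
        mean = sum(scores[start:start + width]) / width
        if mean >= threshold:
            hits.append(start)
    return hits


# Toy alignment of three sequences: the first four columns are identical.
alignment = ["ACGTAC", "ACGTTT", "ACGTGG"]
print(conserved_windows(alignment, width=3, threshold=1.0))  # [0, 1]
```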

Conclusions

  • Phylogenetic shadowing uses a multiple alignment to analyze several genomes simultaneously, which makes predictions more reliable

  • AI techniques have proven useful in computational biology

    • Still many more problems to solve

References

  • M. Alexandersson, S. Cawley, and L. Pachter. “SLAM: Cross-Species Gene Finding and Alignment with a Generalized Pair Hidden Markov Model.” Genome Research 13 (2003), pp. 496-502.

  • D. Boffelli, J. McAuliffe, D. Ovcharenko, K.D. Lewis, I. Ovcharenko, L. Pachter, and E.M. Rubin. “Phylogenetic shadowing of primate sequences to find functional regions of the human genome.” Science 299 (2003), pp. 1391-1394.

  • K. Chakrabarti and D.L. Ong. “Computational Identification of Noncoding RNA Genes through Phylogenetic Shadowing.” ACM/ISCB RECOMB 8 (2004), poster.

  • K. Chakrabarti and L. Pachter. “Visualization of multiple genome annotations and alignments with the K-BROWSER.” Genome Research 14 (2004), pp. 716-720.

  • R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. “Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids.” New York: Cambridge University Press, 1998.
