
Gene Prediction: Similarity-Based Methods

Gene Prediction: Similarity-Based Methods. (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 15, 2005 ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign. Many slides are taken/adapted from http://www.bioalgorithms.info/slides.htm.


Presentation Transcript


  1. Gene Prediction: Similarity-Based Methods (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 15, 2005 ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign Many slides are taken/adapted from http://www.bioalgorithms.info/slides.htm

  2. The Gene Prediction Problem • Given genome sequences, determine where the genes are • The problem is easier for prokaryotes (no introns) • The problem is significantly harder for eukaryotes (introns and alternative splicing)

  3. Splicing Causes Problem…

  4. Exons vs. Introns • Exon: A portion of the gene that appears in both the primary and the mature mRNA transcripts. • Intron: A portion of the gene that is transcribed but excised prior to translation.

  5. Definition of a Gene • Regulatory regions: up to 50 kb upstream of the +1 site • Exons: protein-coding and untranslated regions (UTRs); 1 to 178 exons per gene (mean 8.8); 8 bp to 17 kb per exon (mean 145 bp) • Introns: splice acceptor and donor sites, junk DNA; average 1 kb to 50 kb per intron • Gene size: largest 2.4 Mb (dystrophin); mean 27 kb

  6. Different Views of a Gene [Figure: a gene shown at four levels: DNA (ATGCTTGCCAAAT…TCG…); pre-mRNA with exons e1, e2, e3 separated by introns; mRNA with e1, e2, e3 joined; protein (MSRTAQ…)]

  7. Approaches to Gene Prediction • Similarity-based approaches: • Exploit the fact that many genes are conserved across species • Can be highly reliable • Only good for finding genes that resemble already-known genes • Statistical approaches: • Exploit statistical characteristics of coding and non-coding regions, plus other knowledge about genes • Can potentially detect new genes • May not be reliable • The two can and should be combined • Currently there are no principled approaches for doing this

  8. Outline • The idea of similarity-based approach to gene prediction • Exon Chaining Problem • Spliced Alignment Problem

  9. Using Known Genes to Predict New Genes • Some organisms’ genomes are very well documented, with many genes having been experimentally verified • Closely related organisms may have similar genes • Unknown genes in one species may be compared to known genes in a closely related species

  10. Comparing Genes in Two Genomes • Small islands of similarity corresponding to similarities between exons

  11. Reverse Translation • Given a known protein, find a gene in the genome that codes for it • One might infer the coding DNA of the given protein by reversing the translation process • Inexact: most amino acids are encoded by more than one codon, so a protein does not determine a unique DNA sequence • The problem is therefore essentially reduced to an alignment problem
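To make the ambiguity of reverse translation concrete, the sketch below counts how many distinct DNA sequences could encode a given protein under the standard genetic code. The names `CODONS` and `count_coding_sequences` are illustrative and not from the lecture; the example protein reuses the `MSRTAQ` prefix shown on slide 6.

```python
from math import prod

# Standard genetic code, listed in the usual T, C, A, G order for each
# codon position ("*" marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each amino acid to the list of codons that encode it.
CODONS = {}
index = 0
for b1 in BASES:
    for b2 in BASES:
        for b3 in BASES:
            CODONS.setdefault(AMINO[index], []).append(b1 + b2 + b3)
            index += 1


def count_coding_sequences(protein):
    """Number of distinct DNA sequences that translate to `protein`."""
    return prod(len(CODONS[aa]) for aa in protein)


# M has 1 codon, S and R have 6 each, T and A have 4 each, Q has 2.
print(count_coding_sequences("MSRTAQ"))  # 1 * 6 * 6 * 4 * 4 * 2 = 1152
```

Even a six-residue peptide has over a thousand possible coding sequences, which is why the search is posed as an alignment against the genome rather than an exact string lookup.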

  12. Comparing Genomic DNA Against mRNA [Figure: mRNA (codon sequence) aligned against a portion of the genome consisting of exon1, intron1, exon2, intron2, exon3]

  13. Using Similarities to Find the Exon Structure • The known frog gene is aligned to different locations in the human genome • Find the “best” path to reveal the exon structure of the human gene [Figure: known frog gene aligned against the human genome]

  14. Finding Local Alignments • Use local alignments to find all islands of similarity [Figure: known frog genes aligned against the human genome]
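One standard way to find an island of similarity is Smith-Waterman local alignment. The sketch below is an illustration, not the lecture's own code: the function name `best_local_alignment` and the scoring values (match +1, mismatch -1, gap -2) are assumptions. It reports only the single best-scoring island; collecting all islands would need repeated or thresholded reporting on top of this.

```python
def best_local_alignment(genome, gene, match=1, mismatch=-1, gap=-2):
    """Smith-Waterman: return (left, right, score) for the best island.

    `left` and `right` delimit the matching genome substring as a
    half-open range genome[left:right]; `score` is its alignment score.
    """
    rows, cols = len(genome) + 1, len(gene) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    def sub(i, j):
        # Substitution score for aligning genome[i-1] with gene[j-1].
        return match if genome[i - 1] == gene[j - 1] else mismatch

    # Fill the table; a cell never drops below 0, which makes it local.
    for i in range(1, rows):
        for j in range(1, cols):
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub(i, j),
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score reaches 0.
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + sub(i, j):
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return i, best_pos[0], best


print(best_local_alignment("TTTTACGTACGTTTTT", "ACGTACG"))  # (4, 11, 7)
```

The returned triple has the same shape as the (l, r, w) candidate exons used for chaining, with the local alignment score serving as the weight.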

  15. Chaining Local Alignments • Find substrings that match a given gene sequence (candidate exons) • Represent each candidate exon as a triple (l, r, w): left end, right end, and weight, where the weight is the score of the local alignment • Look for a maximum-weight chain of substrings • Chain: a set of non-overlapping, non-adjacent intervals
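The (l, r, w) representation and the chain condition can be sketched as follows. `Exon`, `is_chain`, and `chain_weight` are hypothetical names, and the sketch assumes that two intervals are compatible only when the first ends strictly before the second begins, so intervals that share an endpoint do not form a chain.

```python
from typing import NamedTuple


class Exon(NamedTuple):
    """A candidate exon: left end, right end, local alignment score."""
    left: int
    right: int
    weight: float


def is_chain(intervals):
    """True if, in sorted order, each interval ends before the next begins."""
    ordered = sorted(intervals)
    return all(a.right < b.left for a, b in zip(ordered, ordered[1:]))


def chain_weight(intervals):
    """Total weight of a set of intervals."""
    return sum(exon.weight for exon in intervals)


candidates = [Exon(2, 3, 3), Exon(5, 8, 6)]
print(is_chain(candidates), chain_weight(candidates))       # True 9
print(is_chain([Exon(1, 5, 5), Exon(4, 8, 6)]))             # False (overlap)
```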

  16. Exon Chaining Problem • Locate the beginning and end of each interval (2n points) • Find the “best” path [Figure: seven weighted intervals (weights 5, 5, 15, 9, 11, 4, 3) on an axis with endpoints 0, 2, 3, 5, 6, 11, 13, 16, 20, 25, 27, 28, 30, 32]

  17. Exon Chaining Problem: Formulation • Exon Chaining Problem: Given a set of putative exons, find a maximum-weight set of non-overlapping putative exons • Input: A set of weighted intervals (putative exons) • Output: A chain of intervals from this set with maximum total weight

  18. Exon Chaining: Graph Representation • This problem can be solved with dynamic programming in O(n) time once the 2n interval endpoints are sorted (sorting itself takes O(n log n)).

  19. Exon Chaining Algorithm
      ExonChaining(G, n) // graph, number of intervals
        for i ← 0 to 2n
          s_i ← 0
        for i ← 1 to 2n
          if vertex v_i in G corresponds to the right end of an interval I
            j ← index of the vertex for the left end of interval I
            w ← weight of interval I
            s_i ← max {s_j + w, s_{i-1}}
          else
            s_i ← s_{i-1}
        return s_{2n}
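The dynamic program on slide 19 can be sketched in Python as below. This is a minimal sketch under stated assumptions rather than the lecture's own code: the function name `exon_chaining` and the input format (a list of `(left, right, weight)` tuples with left ≤ right) are illustrative, and when a left end and a right end share a coordinate the left end is ordered first, so intervals that touch are not chained together.

```python
def exon_chaining(intervals):
    """Maximum total weight of a chain of non-overlapping intervals.

    `intervals` is a list of (left, right, weight) tuples.
    """
    # Build the 2n endpoint "vertices": kind 0 = left end, 1 = right end.
    # Sorting by (coordinate, kind) puts left ends first on ties.
    events = []
    for idx, (left, right, _weight) in enumerate(intervals):
        events.append((left, 0, idx))
        events.append((right, 1, idx))
    events.sort()

    # s[i] = best chain weight using only the first i endpoints; s[0] = 0.
    s = [0] * (len(events) + 1)
    left_vertex = {}  # interval index -> position of its left end
    for i, (_coord, kind, idx) in enumerate(events, start=1):
        if kind == 0:
            left_vertex[idx] = i
            s[i] = s[i - 1]
        else:
            j = left_vertex[idx]
            w = intervals[idx][2]
            # Either end a chain with this interval, or skip it.
            s[i] = max(s[j] + w, s[i - 1])
    return s[-1]


putative_exons = [(1, 5, 5), (2, 3, 3), (4, 8, 6), (6, 12, 10), (7, 17, 12),
                  (9, 10, 1), (11, 15, 7), (13, 14, 0), (16, 18, 4)]
print(exon_chaining(putative_exons))  # 21
```

In this example the best chain is (2, 3), (4, 8), (9, 10), (11, 15), (16, 18) with weights 3 + 6 + 1 + 7 + 4 = 21. The scan itself is linear in the number of endpoints; the initial sort dominates the running time.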

  20. Exon Chaining: Deficiencies • Poor definition of the putative exon endpoints • Optimal chain of intervals may not correspond to any valid alignment • First interval may correspond to a suffix, whereas second interval may correspond to a prefix • Combination of such intervals is not a valid alignment

  21. Spliced Alignment • Proposed in 1996 by Mikhail Gelfand and colleagues • Goal: Use a protein from one genome to reconstruct the exon-intron structure of a (related) gene in another genome • Method • Begin by selecting candidate exons: either all putative exons between potential acceptor and donor sites, or all substrings similar to the target protein (as in the Exon Chaining Problem) • Then find the chain of putative exons that has the highest similarity to the target protein

  22. Spliced Alignment Problem: Formulation • Goal: Find a chain of blocks in a genomic sequence that best fits a target sequence • Input: Genomic sequence G, target sequence T, and a set of candidate exons (blocks) B • Output: A chain of exons Γ such that the global alignment score s(Γ*, T) is maximum among all chains of blocks from B, where Γ* is the string formed by concatenating the strings in Γ • Essentially an alignment problem…
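To illustrate the formulation only, the sketch below solves a toy instance by brute force: it enumerates every chain of blocks, concatenates the blocks into Γ*, and scores Γ* against T with a simple global alignment. This is exponential in the number of blocks and is not the dynamic programming algorithm of Gelfand and colleagues. The function names, the scoring values (match +1, mismatch -1, gap -1), and the use of inclusive `(left, right)` block coordinates are all assumptions made for the example.

```python
from itertools import combinations


def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global (Needleman-Wunsch) alignment score of strings a and b."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def spliced_alignment_bruteforce(genome, target, blocks):
    """Return (best score, best chain) over all chains of blocks.

    `blocks` holds inclusive (left, right) positions in `genome`.
    The empty chain (all of `target` aligned to gaps) is the baseline.
    """
    best_score, best_chain = global_score("", target), ()
    ordered = sorted(blocks)
    for size in range(1, len(ordered) + 1):
        for chain in combinations(ordered, size):
            # Keep only chains whose blocks do not overlap or touch.
            if all(a[1] < b[0] for a, b in zip(chain, chain[1:])):
                spliced = "".join(genome[l:r + 1] for l, r in chain)
                score = global_score(spliced, target)
                if score > best_score:
                    best_score, best_chain = score, chain
    return best_score, best_chain


G = "ATGCCTTTTTGGTAACCCCC"
T = "ATGCCGGTAA"
B = [(0, 4), (2, 7), (10, 14), (15, 19)]
print(spliced_alignment_bruteforce(G, T, B))  # (10, ((0, 4), (10, 14)))
```

Here the blocks at 0..4 and 10..14 concatenate to exactly the target, so the chain skips the intervening stretch as an intron and reaches the maximum possible score of 10.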

  23. Lewis Carroll Example

  24. The solution to the Spliced Alignment Problem will be discussed later, when we talk about sequence alignment…

  25. What You Should Know • Why splicing causes difficulty in gene prediction • The formulation and algorithm for Exon Chaining • Why Spliced Alignment is a better formulation than Exon Chaining
