A knowledge-based approach to integrated genome annotation


  1. A knowledge-based approach to integrated genome annotation. Michael Brent, Washington University

  2. EST-, mRNA-, and protein-based methods

  3. Outline of our process
     • MGC validated clones + RefSeq NM’s
     • Remove all with frame shifts
     • Fill with spliced Hs mRNA & EST
     • Threaded de novo predictions
     Diagram labels: Paragon aligner; BLAT; N-SCAN +EST
     ENCODE Workshop

  4. Paragon aligner Manimozhiyan Arumugam with Chaochun Wei

  5. Better EST/cDNA-to-genome alignment
     • Idea
     • Go beyond minimizing mismatches and gaps
     • Accurate probabilities in correct alignments
     • Estimate parameters for each sequence set

  6. Better EST/cDNA alignment
     • Two sources of mismatches & gaps
     • Error (sequencing, RT)
     • Quals give local probs. Not used here.
     • Polymorphism (RNA vs. genome strains)
     • Gap vs. indel rates are different
     • Parameters must vary with sequence quality & source strains/polymorphism rates
     • E.g. prefer non-matches in low quality bases

  7. Better EST/cDNA alignment
     • Introns
     • Accurate probabilities in correct alignments
     • GT/AG vs. GC/AG vs. AT/AC
     • Absolutely no junk splice sites
     • Not clear what to do with polymorphic sites
     • Long introns are rarer than short introns
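The intron constraints on slide 7 could enter an alignment score roughly as in the Python sketch below. The splice-pair probabilities, the mean intron length, and the geometric length model are illustrative assumptions, not values or choices stated in the talk, and `intron_log_prob` is a hypothetical helper rather than part of Paragon.

```python
import math

# Illustrative numbers only: the talk names the three splice-site pairs
# but gives no probabilities for them.
SPLICE_PAIR_PROB = {
    ("GT", "AG"): 0.98,
    ("GC", "AG"): 0.015,
    ("AT", "AC"): 0.005,
}

# Assumed mean intron length for a geometric length model.
MEAN_INTRON_LENGTH = 5000.0


def intron_log_prob(donor, acceptor, length):
    """Log probability of an intron with the given boundary dinucleotides and length.

    Any pair outside the table gets probability zero ("no junk splice sites"),
    and the geometric length term makes long introns less probable than short ones.
    """
    pair_prob = SPLICE_PAIR_PROB.get((donor.upper(), acceptor.upper()), 0.0)
    if pair_prob == 0.0 or length < 1:
        return float("-inf")
    p_stop = 1.0 / MEAN_INTRON_LENGTH
    # Geometric distribution: P(L = length) = (1 - p)^(length - 1) * p
    length_log_prob = (length - 1) * math.log1p(-p_stop) + math.log(p_stop)
    return math.log(pair_prob) + length_log_prob
```

With these assumed numbers, `intron_log_prob("GT", "AG", 400)` scores higher than `intron_log_prob("GT", "AG", 15000)`, and any non-canonical pair is rejected outright.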

  8. Small exon in finished cDNA

     STANDARD TOOL (EST_GENOME)
     GENOME     351 GCGGGCGCGTTGGTGCGGAAAGCGGCGGACTATGTCCGAAGCAAGGATTT 400
                    ||||||||||||||||||||||||||||||||||||||||||||||||||
     BC000810    51 GCGGGCGCGTTGGTGCGGAAAGCGGCGGACTATGTCCGAAGCAAGGATTT 100

     GENOME     401 CCGGGACTACCTCATGAGGTGACG-Agcgcc.......tgtagCACTTCT 16339
                    ||||||||||||||||| || ||| |>>>>> 15907 >>>>>  |||||
     BC000810   101 CCGGGACTACCTCATGA-GT-ACGCA.................--CTTCT 129

     GENOME   16340 GGGGCCCAGTAGCCAACTGGGGTCTTCCCATTGCTGCCATCAATGATATG 16389
                    ||||||||||||||||||||||||||||||||||||||||||||||||||
     BC000810   130 GGGGCCCAGTAGCCAACTGGGGTCTTCCCATTGCTGCCATCAATGATATG 179

     OUR PAIR HMM
     GENOME     351 GCGGGCGCGTTGGTGCGGAAAGCGGCGGACTATGTCCGAAGCAAGGATTT 400
                    ||||||||||||||||||||||||||||||||||||||||||||||||||
     BC000810    51 GCGGGCGCGTTGGTGCGGAAAGCGGCGGACTATGTCCGAAGCAAGGATTT 100

     GENOME     401 CCGGGACTACCTCATGAGGTGAC.......AATAGTACGGTAAG...... 13006
                    ||||||||||||||||||>>>>> 12584 >>>>>||||>>>>> 3326
     BC000810   101 CCGGGACTACCTCATGAG.................TACG........... 122

     GENOME   13007 TGTAGCACTTCTGGGGCCCAGTAGCCAACTGGGGTCTTCCCATTGCTGCC 13046
                    >>>>>|||||||||||||||||||||||||||||||||||||||||||||
     BC000810   123 .....CACTTCTGGGGCCCAGTAGCCAACTGGGGTCTTCCCATTGCTGCC 167

  10. Blind test
     • Test set
     • 100 alignment pairs of MGC clones to genome
     • Paragon & EST_genome differ on all of them
     • Output format identical
     • Evaluation
     • Curator attempting to explain discrepancies
     • Result
     • 37 cases where biological evidence favors one of the two alignments
     • In 31/37, the Paragon alignment is supported
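As a rough check on the 31/37 result, a one-sided sign test can be sketched in Python. Treating the 37 decided cases as independent coin flips under the null hypothesis of no difference between the tools is an assumption made here for illustration; the talk does not report a significance test.

```python
from math import comb


def one_sided_binomial_tail(successes, trials, p=0.5):
    """P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# 31 of the 37 decided cases favored the Paragon alignment.
p_value = one_sided_binomial_tail(31, 37)
print(f"one-sided p-value: {p_value:.2e}")
```

Under that assumption the tail probability is on the order of 2e-5, so a 31 to 6 split would be very unlikely if the two aligners were equally good on the decided cases.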

  11. Future directions
     • UTR vs. ORF
     • Polymorphism is more common in UTR
     • And 3rd position in ORF
     • Conservation
     • Use alignments to distinguish true from false
     • Splice sites, introns
     • Codons
     • Polymorphisms (analogous to quality values)

  12. Conceptual shift
     • Traditional view
     • cDNA data “speaks for itself”. Theory neutral.
     • Alignment = counting matches, mismatches, gaps
     • cDNA = genome annotation

  13. Conceptual shift
     • Our view
     • More knowledge = better alignments & annotations
     • cDNA is very useful evidence re: gene structure
     • Need to align it correctly
     • Need to determine its completeness
     • If not complete, predict the remainder
     • Gene prediction & cDNA alignment are the same problem
     • cDNA/EST just adds another information source

  14. N-SCAN_EST Chaochun Wei

  15. TWINSCAN/N-SCAN_EST
     • Goal:
     • Integrate EST information with TWINSCAN to
     • improve accuracy where EST evidence exists
     • without losing the ability to predict novel genes.

  16. Twinscan_est

  17. Generating EST-alignment Sequence

  18. Modeling EST alignment sequence
     • Probability models
     • In each HMM state
     • Separate models for EST alignment sequence
     • Probabilities of DNA, conservation sequence, and EST sequence are multiplied.
     • Very similar to models of genomic alignments
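The multiplication on slide 18 can be sketched in Python as a sum of log probabilities, one per information source. The function name and the idea of passing the three probabilities as plain numbers are assumptions for illustration; the talk does not show how N-SCAN_EST implements its emission models.

```python
import math


def state_emission_log_prob(dna_prob, conservation_prob, est_prob):
    """Combined log emission probability for one HMM state at one position.

    The three sources (DNA, conservation sequence, EST alignment sequence)
    are scored by separate models and multiplied, which in log space
    becomes a sum.
    """
    probs = (dna_prob, conservation_prob, est_prob)
    if any(p <= 0.0 for p in probs):
        # A zero from any one model rules the state out at this position.
        return float("-inf")
    return sum(math.log(p) for p in probs)
```

Working in log space avoids numerical underflow when these per-position terms are accumulated along a long sequence, which is the usual reason to sum logs rather than multiply raw probabilities.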

  19. Multi-genome methods: N-SCAN. Samuel Gross with Randall Brown

  20. N-SCAN: Using multi-genome alignments
     • Motivation
     • Many genomes should give stronger signal of negative selection than two
     • Lots of genomes are being sequenced
     • Methods
     • Extend Twinscan to a phylogenetic tree model
     • At each site, mutation rate & pattern of tolerated substitutions depend on function

  21. Example
     • A multiple alignment that (A) is and (B) is not typical of the splice boundary shown

  22. Using mutation patterns for improving gene prediction
     • Tree hidden Markov model
     • Each state
     • generates columns of a multiple alignment
     • by a substitution process
     • along the branches of a phylogenetic tree

  23. Challenges
     • Columns are not correct, orthologous
     • Sequencing error
     • Alignment error
     • Change of function (I am not a mouse!)

  24. Differences from EXONIPHY
     • Approach
     • Estimate models of actual alignments, not evolutionary processes
     • Model
     • Independent substitution probabilities on each branch of the tree
     • 6 characters: A, C, G, T, gap, unaligned
     • Condition backwards from target genome

  25. Using mutation patterns for improving gene prediction
     • Traditional factorization
     • Pr(a2) Pr(a1|a2) Pr(h|a1) Pr(m|a1) Pr(c|a2)
     • N-SCAN factorization
     • Pr(h) Pr(a1|h) Pr(a2|a1) Pr(m|a1) Pr(c|a2)
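The N-SCAN factorization on slide 25 can be sketched in Python as a sum over the two unobserved nodes. This assumes that h, m, and c are the observed characters in three aligned genomes with h as the target, and that a1 and a2 are ancestral nodes; the slide does not define the symbols. The toy transition tables, the characters chosen for "gap" and "unaligned", and all function names are hypothetical.

```python
from itertools import product

# Four bases plus gap and unaligned; "-" and "." are assumed symbols.
ALPHABET = ["A", "C", "G", "T", "-", "."]


def simple_conditional(p_same):
    """Toy table[x][y] = Pr(y | x): p_same if y == x, remainder spread evenly."""
    p_diff = (1.0 - p_same) / (len(ALPHABET) - 1)
    return {
        x: {y: (p_same if x == y else p_diff) for y in ALPHABET}
        for x in ALPHABET
    }


def column_probability(h, m, c, prior_h, a1_given_h, a2_given_a1,
                       m_given_a1, c_given_a2):
    """Pr(h, m, c) = sum over a1, a2 of
    Pr(h) Pr(a1|h) Pr(a2|a1) Pr(m|a1) Pr(c|a2).

    Each table is indexed as table[conditioning_value][generated_value].
    """
    total = 0.0
    for a1, a2 in product(ALPHABET, repeat=2):
        total += (
            prior_h[h]
            * a1_given_h[h][a1]
            * a2_given_a1[a1][a2]
            * m_given_a1[a1][m]
            * c_given_a2[a2][c]
        )
    return total
```

A usage example with a uniform prior on the target character and the same toy table on every branch:

```python
uniform = {x: 1.0 / len(ALPHABET) for x in ALPHABET}
branch = simple_conditional(0.9)
prob = column_probability("A", "A", "A", uniform, branch, branch, branch, branch)
```

Because the factorization starts from Pr(h), every other term is conditioned, directly or through the ancestors, on the target genome's character, which matches the "condition backwards from target genome" point on slide 24.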

  26. Preliminary study in human

  27. Preliminary study in human

  28. Fin
