Gene Feature Recognition

Presentation Transcript


  1. Gene Feature Recognition. Limsoon Wong. For written notes on this lecture, please read Chapters 4 and 7 of The Practical Bioinformatician.

  2. Recognition of Splice Sites: a simple example to start the day

  3. Splice Sites • Donor • Acceptor

  4. Acceptor Site (Human Genome) • If we align all known acceptor sites (with their splice junction sites aligned), we have the following nucleotide distribution • Acceptor site: CAG | TAG | coding region • Image credit: Xu

  5. Donor Site (Human Genome) • If we align all known donor sites (with their splice junction sites aligned), we have the following nucleotide distribution • Donor site: coding region | GT • Image credit: Xu

  6. What Positions Have “High” Information Content? • For a weight matrix, the information content of each column is calculated as $\sum_{X \in \{A,C,G,T\}} \mathrm{Prob}(X) \cdot \log_{10}\bigl(\mathrm{Prob}(X)/0.25\bigr)$ • When a column has evenly distributed nucleotides, its information content is lowest (zero) • Only need to look at positions having high information content

  7. Information Content Around Donor Sites in Human Genome • Information content (logarithms to base 10) • column -3 = $0.34 \log_{10}(0.34/0.25) + 0.363 \log_{10}(0.363/0.25) + 0.183 \log_{10}(0.183/0.25) + 0.114 \log_{10}(0.114/0.25) = 0.04$ • column -1 = $0.092 \log_{10}(0.092/0.25) + 0.033 \log_{10}(0.033/0.25) + 0.803 \log_{10}(0.803/0.25) + 0.073 \log_{10}(0.073/0.25) = 0.30$ • Image credit: Xu
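
The two column values on this slide can be checked with a few lines of Python. This is only a sketch: the function name `information_content` is made up for illustration, and it assumes the terms are summed with base-10 logarithms against a uniform background of 0.25, which is the reading that reproduces the slide's 0.04 and 0.30.

```python
import math

def information_content(probs, background=0.25):
    """Sum of p * log10(p / background) over the nucleotides of one column."""
    return sum(p * math.log10(p / background) for p in probs if p > 0)

# Column probabilities as read off the slide (the order does not affect the sum).
col_minus3 = [0.34, 0.363, 0.183, 0.114]
col_minus1 = [0.092, 0.033, 0.803, 0.073]

print(round(information_content(col_minus3), 2))  # 0.04
print(round(information_content(col_minus1), 2))  # 0.3
```

A perfectly uniform column (all four probabilities equal to 0.25) gives zero, which is the lowest value this measure can take.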

  8. Weight Matrix Model for Splice Sites • Weight matrix model • Build a weight matrix for donor, acceptor, and translation start sites, respectively • Use positions of high information content • Image credit: Xu

  9. Splice Site Prediction: A Procedure • Add up the frequencies of the corresponding letters at the corresponding positions: • AAGGTAAGT: .34 + .60 + .80 + 1.0 + 1.0 + .52 + .71 + .81 + .46 = 6.24 • TGTGTCTCA: .11 + .12 + .03 + 1.0 + 1.0 + .02 + .07 + .05 + .16 = 2.56 • Make a prediction on the splice site based on some threshold • Image credit: Xu
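
The scoring procedure on this slide can be sketched as below. The table `KNOWN_FREQ` holds only the frequencies that appear in the slide's two worked examples, so it is not the full donor weight matrix; the names `score`, `is_donor` and the cut-off `THRESHOLD = 4.0` are hypothetical, chosen only so the two examples fall on opposite sides.

```python
# Frequencies read off the slide's two worked examples, one dict per position.
# Entries the slide does not show are absent: this is NOT the full matrix.
KNOWN_FREQ = [
    {"A": 0.34, "T": 0.11},
    {"A": 0.60, "G": 0.12},
    {"G": 0.80, "T": 0.03},
    {"G": 1.0},
    {"T": 1.0},
    {"A": 0.52, "C": 0.02},
    {"A": 0.71, "T": 0.07},
    {"G": 0.81, "C": 0.05},
    {"T": 0.46, "A": 0.16},
]

THRESHOLD = 4.0  # hypothetical cut-off, not taken from the lecture

def score(site):
    """Add up the frequency of each letter at its position in the 9-mer."""
    return sum(KNOWN_FREQ[i][base] for i, base in enumerate(site))

def is_donor(site, threshold=THRESHOLD):
    """Predict a donor site when the summed frequency reaches the threshold."""
    return score(site) >= threshold

print(round(score("AAGGTAAGT"), 2), is_donor("AAGGTAAGT"))  # 6.24 True
print(round(score("TGTGTCTCA"), 2), is_donor("TGTGTCTCA"))  # 2.56 False
```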

  10. Recognition of Translation Initiation Sites • An introduction to the world’s simplest TIS recognition system • A simple approach to accuracy and understandability

  11. Translation Initiation Site

  12. A Sample cDNA • What makes the second ATG the TIS?

      299 HSU27655.1 CAT U27655 Homo sapiens
      CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG 80
      CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA 160
      GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA 240
      CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT
      ............................................................ 80
      ................................iEEEEEEEEEEEEEEEEEEEEEEEEEEE 160
      EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE 240
      EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE

  13. Approach • Training data gathering • Signal generation • k-grams, distance, domain know-how, ... • Signal selection • Entropy, χ², CFS, t-test, domain know-how, ... • Signal integration • SVM, ANN, PCL, CART, C4.5, kNN, ...

  14. Training & Testing Data • Vertebrate dataset of Pedersen & Nielsen [ISMB’97] • 3312 sequences • 13503 ATG sites • 3312 (24.5%) are TIS • 10191 (75.5%) are non-TIS • Used for 3-fold cross-validation experiments

  15. Signal Generation • K-grams (i.e., k consecutive letters) • K = 1, 2, 3, 4, 5, … • Window size vs. fixed position • Upstream, downstream vs. anywhere in window • In-frame vs. any frame

  16. Signal Generation: An Example

      299 HSU27655.1 CAT U27655 Homo sapiens
      CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG 80
      CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA 160
      GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA 240
      CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT

      • Window = 100 bases
      • In-frame, downstream: GCT = 1, TTT = 1, ATG = 1, …
      • Any-frame, downstream: GCT = 3, TTT = 2, ATG = 2, …
      • In-frame, upstream: GCT = 2, TTT = 0, ATG = 0, ...
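
A minimal sketch of this kind of k-gram counting follows. It uses a short made-up sequence, not the cDNA on the slide, so its counts are not the slide's counts. The function name `kgram_counts` and its parameters are hypothetical, and it assumes the downstream window starts at the candidate ATG itself while the upstream window ends just before it.

```python
from collections import Counter

def kgram_counts(seq, atg_pos, k=3, window=100, upstream=False, in_frame=True):
    """Count k-grams in a window next to the candidate ATG at 0-based atg_pos."""
    if upstream:
        lo, hi = max(0, atg_pos - window), atg_pos
    else:
        lo, hi = atg_pos, min(len(seq), atg_pos + window)
    counts = Counter()
    for p in range(lo, hi - k + 1):
        # In-frame k-grams start a multiple of 3 bases away from the ATG.
        if in_frame and (p - atg_pos) % 3 != 0:
            continue
        counts[seq[p:p + k]] += 1
    return counts

toy = "GCTGCTAAAATGGCTTTTGCTTAA"   # made-up sequence for illustration
atg = toy.find("ATG")              # candidate start codon at index 9

print(kgram_counts(toy, atg)["GCT"])                  # in-frame, downstream: 2
print(kgram_counts(toy, atg, in_frame=False)["TTT"])  # any-frame, downstream: 2
print(kgram_counts(toy, atg, upstream=True)["GCT"])   # in-frame, upstream: 2
```

Each (k-gram, region, frame) combination becomes one numeric feature of the candidate ATG.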

  17. An Example File Resulting From Feature Generation

  18. Too Many Signals • For each value of k, there are 4^k * 3 * 2 k-grams • If we use k = 1, 2, 3, 4, 5, we have 24 + 96 + 384 + 1536 + 6144 = 8184 features! • This is too many for most machine learning algorithms
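
The feature count can be reproduced directly. In this sketch the factors 3 and 2 are read as three window regions (upstream, downstream, anywhere) and two framings (in-frame, any frame); that reading is an assumption based on the Signal Generation slide, and the parameter names are made up.

```python
def n_features(k, regions=3, frames=2):
    """Number of distinct k-gram features: 4^k letter combinations per region and framing."""
    return 4 ** k * regions * frames

counts = [n_features(k) for k in range(1, 6)]
print(counts)       # [24, 96, 384, 1536, 6144]
print(sum(counts))  # 8184
```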

  19. Signal Selection (Basic Idea) • Choose a signal with low intra-class distance • Choose a signal with high inter-class distance

  20. Signal Selection (e.g., t-statistics)
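
The slide's formula is not preserved in this transcript, so the following is only a sketch of one common form of the t-statistic (the unequal-variance form), which rewards a large gap between class means and small spread within each class. The function name and the toy feature values are hypothetical.

```python
import math
from statistics import mean, variance

def t_statistic(pos_values, neg_values):
    """|mean difference| divided by the pooled standard error (unequal-variance form)."""
    n1, n2 = len(pos_values), len(neg_values)
    v1, v2 = variance(pos_values), variance(neg_values)  # sample variances
    return abs(mean(pos_values) - mean(neg_values)) / math.sqrt(v1 / n1 + v2 / n2)

# Made-up counts of one k-gram around TIS and non-TIS candidates.
strong_tis, strong_non = [3, 4, 5, 4], [1, 0, 1, 2]
weak_tis, weak_non = [1, 2, 1, 2], [1, 2, 2, 1]

print(round(t_statistic(strong_tis, strong_non), 3))  # 5.196
print(t_statistic(weak_tis, weak_non))                # 0.0
```

Signals would then be ranked by this score and the top-scoring ones kept.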

  21. Signal Selection (e.g., χ²)
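
The slide's own formula is not preserved here, so this is a sketch of the standard Pearson chi-square statistic on a contingency table of signal value versus class. The function name and the toy counts are hypothetical, and the sketch assumes every row and column total is non-zero.

```python
def chi_square(table):
    """Pearson chi-square; table[i][j] = observed count for signal value i and class j."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Rows: signal present / absent. Columns: TIS / non-TIS. Made-up counts.
print(chi_square([[30, 10], [10, 30]]))  # 20.0
print(chi_square([[20, 20], [20, 20]]))  # 0.0
```

A larger value means the signal's distribution differs more between the two classes.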

  22. Signal Selection (e.g., CFS) • Instead of scoring individual signals, how about scoring a group of signals as a whole? • CFS • Correlation-based Feature Selection • A good group contains signals that are highly correlated with the class, and yet uncorrelated with each other
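
The idea of scoring a group as a whole can be illustrated with the merit heuristic commonly given for CFS, $k\,\bar r_{cf} / \sqrt{k + k(k-1)\,\bar r_{ff}}$, where $\bar r_{cf}$ is the mean signal-class correlation and $\bar r_{ff}$ the mean signal-signal correlation. The slide does not show this formula, so treat it as an assumption; the function name and the correlation values are made up.

```python
import math
from statistics import mean

def cfs_merit(class_corrs, pair_corrs):
    """Merit of a signal group from signal-class and signal-signal correlations."""
    k = len(class_corrs)
    r_cf = mean(class_corrs)
    r_ff = mean(pair_corrs) if pair_corrs else 0.0
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)

# Two signals, each correlated 0.6 with the class.
independent = cfs_merit([0.6, 0.6], [0.0])  # signals uncorrelated with each other
redundant = cfs_merit([0.6, 0.6], [0.9])    # signals nearly duplicate each other

print(independent > redundant)  # True
```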

  23. Sample k-grams Selected by CFS • Position -3 (Kozak consensus) • In-frame upstream ATG (leaky scanning) • In-frame downstream TAA, TAG, TGA (stop codons) • In-frame downstream CTG, GAC, GAG, and GCC (codon bias?)

  24. Signal Integration • kNN • Given a test sample, find the k training samples that are most similar to it. Let the majority class win • SVM • Given a group of training samples from two classes, determine a separating plane that maximises the margin between the classes • Naïve Bayes, ANN, C4.5, ...

  25. Illustration of kNN (k=8) • Neighborhood: 5 samples of one class, 3 of the other • Typical “distance” measure: [formula not preserved in transcript] • Image credit: Zaki
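
A minimal kNN sketch is given below. The transcript does not preserve the slide's distance measure, so Euclidean distance is used here as an assumption; the function name `knn_predict` and the toy training points are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """train is a list of (feature_vector, label); return the majority label of the k nearest."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Made-up two-feature training samples.
train = [
    ((0, 0), "non-TIS"), ((0, 1), "non-TIS"), ((1, 0), "non-TIS"),
    ((5, 5), "TIS"), ((5, 6), "TIS"), ((6, 5), "TIS"),
]

print(knn_predict(train, (5, 4)))      # TIS
print(knn_predict(train, (0.5, 0.5)))  # non-TIS
```

An odd k avoids tied votes when there are only two classes.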

  26. Using WEKA for TIS Prediction

  27. Results (3-fold cross-validation) * Using top 20 χ²-selected features from amino-acid features

  28. Validation Results (on Chr X and Chr 21) • Using top 100 features selected by entropy and trained on Pedersen & Nielsen’s dataset • [Figure: our method vs. ATGpr]

  29. Technique Comparisons
      • Pedersen & Nielsen [ISMB’97]: 85% accuracy; neural network; no explicit features
      • Zien [Bioinformatics’00]: 88% accuracy; SVM + kernel engineering; no explicit features
      • Hatzigeorgiou [Bioinformatics’02]: 94% accuracy (with scanning rule); multiple neural networks; no explicit features
      • Our approach: 89% accuracy (94% with scanning rule); explicit feature generation; explicit feature selection; use any machine learning method without any form of complicated tuning

  30. Recognition of Transcription Start Sites • An introduction to the world’s best TSS recognition system • A heavy tuning approach

  31. Transcription Start Site

  32. Structure of Dragon Promoter Finder • -200 to +50 window size • Model selected based on desired sensitivity

  33. Each model has two submodels based on GC content • (C+G) = (#C + #G) / Window Size • GC-rich submodel • GC-poor submodel
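
The C+G fraction that routes a window to one of the two submodels can be computed as below. The cut-off `GC_RICH_CUTOFF = 0.5` is a hypothetical value chosen for illustration; the transcript does not state the threshold actually used, and the function names are made up.

```python
GC_RICH_CUTOFF = 0.5  # hypothetical threshold, not taken from the lecture

def gc_content(window):
    """(#C + #G) / window size."""
    window = window.upper()
    return (window.count("C") + window.count("G")) / len(window)

def pick_submodel(window, cutoff=GC_RICH_CUTOFF):
    """Route the window to the GC-rich or GC-poor submodel."""
    return "GC-rich" if gc_content(window) >= cutoff else "GC-poor"

print(gc_content("GGCCGGAT"), pick_submodel("GGCCGGAT"))  # 0.75 GC-rich
print(gc_content("ATATATGC"), pick_submodel("ATATATGC"))  # 0.25 GC-poor
```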

  34. Data Analysis Within Submodel • [Figure: k-gram (k = 5) positional weight matrix; signals sp, se, si]

  35. Promoter, Exon, Intron Sensors • These sensors are positional weight matrices of k-grams, k = 5 (aka pentamers) • They are calculated as s below using promoter, exon, and intron data respectively • [Formula for s not preserved in transcript; its labelled terms are: window size; pentamer at the i-th position in the input; j-th pentamer at the i-th position in the training window; frequency of the j-th pentamer at the i-th position in the training window]

  36. Data Preprocessing & ANN • Simple feedforward ANN trained by the Bayesian regularisation method • Signals: sE, sI, sIE • Tuning parameters: wi • $\mathrm{net} = \sum_i s_i \cdot w_i$ • Output: $\tanh(\mathrm{net})$, where $\tanh(x) = \dfrac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$ • Tuned threshold
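
The weighted sum and tanh squashing on this slide can be sketched as a single unit. This is a simplification under stated assumptions, not the trained network: the sensor values, weights, and threshold below are made-up numbers, and the function names are hypothetical.

```python
import math

def promoter_signal(sensors, weights):
    """net = sum of s_i * w_i, squashed with tanh."""
    net = sum(s * w for s, w in zip(sensors, weights))
    return math.tanh(net)

def is_tss(sensors, weights, threshold):
    """Predict a transcription start site when the output exceeds the threshold."""
    return promoter_signal(sensors, weights) > threshold

sensors = [0.8, 0.1, 0.3]    # made-up sensor signals
weights = [1.0, -0.5, -0.5]  # made-up tuning parameters w_i

print(round(promoter_signal(sensors, weights), 3))  # 0.537
print(is_tss(sensors, weights, threshold=0.4))      # True
```

Raising the threshold trades sensitivity for fewer false positives, which matches the slide's note that the threshold is tuned.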

  37. Accuracy Comparisons • [Figure: with C+G submodels vs. without C+G submodels]

  38. Notes

  39. References (TIS Recognition)
      • A. G. Pedersen, H. Nielsen, “Neural network prediction of translation initiation sites in eukaryotes”, ISMB 5:226-233, 1997
      • H. Liu, L. Wong, “Data Mining Tools for Biological Sequences”, Journal of Bioinformatics and Computational Biology, 1(1):139-168, 2003
      • A. Zien et al., “Engineering support vector machine kernels that recognize translation initiation sites”, Bioinformatics 16:799-807, 2000
      • A. G. Hatzigeorgiou, “Translation initiation start prediction in human cDNAs with high accuracy”, Bioinformatics 18:343-350, 2002

  40. References (TSS Recognition)
      • V. B. Bajic et al., “Computer model for recognition of functional transcription start sites in RNA polymerase II promoters of vertebrates”, J. Mol. Graph. & Mod. 21:323-332, 2003
      • J. W. Fickett, A. G. Hatzigeorgiou, “Eukaryotic promoter recognition”, Gen. Res. 7:861-878, 1997
      • A. G. Pedersen et al., “The biology of eukaryotic promoter prediction: a review”, Computers & Chemistry 23:191-207, 1999
      • M. Scherf et al., “Highly specific localisation of promoter regions in large genome sequences by PromoterInspector”, JMB 297:599-606, 2000