Limsoon Wong Laboratories for Information Technology Singapore



Presentation Transcript


  1. From Datamining to Bioinformatics Limsoon Wong Laboratories for Information Technology Singapore

  2. What is Bioinformatics?

  3. Themes of Bioinformatics • Bioinformatics = Data Mgmt + Knowledge Discovery • Data Mgmt = Integration + Transformation + Cleansing • Knowledge Discovery = Statistics + Algorithms + Databases

  4. Benefits of Bioinformatics • To the patient: • Better drug, better treatment • To the pharma: • Save time, save cost, make more $ • To the scientist: • Better science

  5. From Informatics to Bioinformatics • 8 years of bioinformatics R&D in Singapore • Timeline: 1994, 1996, 1998, 2000, 2002 (ISS, KRDL, LIT) • MHC-Peptide Binding (PREDICT) • Protein Interactions Extraction (PIES) • Gene Expression & Medical Record Datamining (PCL) • Cleansing & Warehousing (FIMM) • Gene Feature Recognition (Dragon) • Integration Technology (Kleisli) • Venom Informatics

  6. Quick Samplings

  7. Epitope Prediction TRAP-559AA MNHLGNVKYLVIVFLIFFDLFLVNGRDVQNNIVDEIKYSE EVCNDQVDLYLLMDCSGSIRRHNWVNHAVPLAMKLIQQLN LNDNAIHLYVNVFSNNAKEIIRLHSDASKNKEKALIIIRS LLSTNLPYGRTNLTDALLQVRKHLNDRINRENANQLVVIL TDGIPDSIQDSLKESRKLSDRGVKIAVFGIGQGINVAFNR FLVGCHPSDGKCNLYADSAWENVKNVIGPFMKAVCVEVEK TASCGVWDEWSPCSVTCGKGTRSRKREILHEGCTSEIQEQ CEEERCPPKWEPLDVPDEPEDDQPRPRGDNSSVQKPEENI IDNNPQEPSPNPEEGKDENPNGFDLDENPENPPNPDIPEQ KPNIPEDSEKEVPSDVPKNPEDDREENFDIPKKPENKHDN QNNLPNDKSDRNIPYSPLPPKVLDNERKQSDPQSQDNNGN RHVPNSEDRETRPHGRNNENRSYNRKYNDTPKHPEREEHE KPDNNKKKGESDNKYKIAGGIAGGLALLACAGLAYKFVVP GAATPYAGEPAPFDETLGEEDKDLDEPEQFRLPEENEWN

  8. Epitope Prediction Results • Prediction by our ANN model for HLA-A11 • 29 predictions • 22 epitopes • 76% specificity • Prediction by BIMAS matrix for HLA-A*1101 • [Figure: number of experimental binders by BIMAS rank, with rank marks at 1, 66, and 100: 19 (52.8%), 5 (13.9%), 12 (33.3%)]

  9. Transcription Start Prediction

  10. Transcription Start Prediction Results

  11. Medical Record Analysis • Looking for patterns that are • valid • novel • useful • understandable

  12. Gene Expression Analysis • Classifying gene expression profiles • find stable differentially expressed genes • find significant gene groups • derive coordinated gene expression

  13. Medical Record & Gene Expression Analysis Results • PCL, a novel “emerging pattern” method • Beats C4.5, CBA, LB, NB, TAN in 21 out of 32 UCI benchmarks • Works well for gene expressions • Cancer Cell, March 2002, 1(2)

  14. Behind the Scene • Vladimir Bajic • Vladimir Brusic • Jinyan Li • See-Kiong Ng • Limsoon Wong • Louxin Zhang • Allen Chong • Judice Koh • SPT Krishnan • Huiqing Liu • Seng Hong Seah • Soon Heng Tan • Guanglan Zhang • Zhuo Zhang • and many more: students, folks from geneticXchange, MolecularConnections, and other collaborators...

  15. Questions?

  16. A More Detailed Account

  17. What is Datamining? • Jonathan’s blocks • Jessica’s blocks • Whose block is this? • Jonathan’s rules: Blue or Circle • Jessica’s rules: All the rest

  18. What is Datamining? Question: Can you explain how?

  19. The Steps of Data Mining • Training data gathering • Signal generation • k-grams, colour, texture, domain know-how, ... • Signal selection • Entropy, χ², CFS, t-test, domain know-how, ... • Signal integration • SVM, ANN, PCL, CART, C4.5, kNN, ...

  20. Translation Initiation Recognition

  21. A Sample cDNA 299 HSU27655.1 CAT U27655 Homo sapiens CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG 80 CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA 160 GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA 240 CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT ............................................................ 80 ................................iEEEEEEEEEEEEEEEEEEEEEEEEEEE 160 EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE 240 EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE What makes the second ATG the translation initiation site?

  22. Signal Generation • K-grams (i.e., k consecutive letters) • K = 1, 2, 3, 4, 5, … • Window size vs. fixed position • Upstream, downstream vs. anywhere in window • In-frame vs. any frame
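As a rough illustration of the choices listed on this slide, the sketch below counts k-grams upstream and downstream of a candidate ATG. The function name `kgram_signals`, the default window of 99 nucleotides, and the reading of "in-frame" as "starting a multiple of 3 away from the ATG" are assumptions made for this example, not details taken from the slides.

```python
from collections import Counter

def kgram_signals(seq, atg_pos, k=3, window=99, in_frame=True):
    """Count k-grams on each side of the ATG that starts at atg_pos.

    Hypothetical helper: `window` nucleotides are examined on each side,
    and an in-frame k-gram starts a multiple of 3 away from the ATG.
    """
    signals = Counter()
    # Upstream k-grams must end before the ATG begins.
    lo = max(0, atg_pos - window)
    for start in range(lo, atg_pos - k + 1):
        if in_frame and (atg_pos - start) % 3 != 0:
            continue
        signals[("up", seq[start:start + k])] += 1
    # Downstream k-grams begin after the ATG codon itself.
    hi = min(len(seq), atg_pos + 3 + window)
    for start in range(atg_pos + 3, hi - k + 1):
        if in_frame and (start - atg_pos) % 3 != 0:
            continue
        signals[("down", seq[start:start + k])] += 1
    return signals

signals = kgram_signals("GCCATGGCTTAA", atg_pos=3, k=3)
print(signals)
```

Keying each count by region as well as by k-gram keeps "upstream GCC" and "downstream GCC" as separate signals, which is what makes the feature count grow so quickly.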

  23. Too Many Signals • For each value of k, there are 4^k * 3 * 2 k-grams • If we use k = 1, 2, 3, 4, 5, we have 24 + 96 + 384 + 1536 + 6144 = 8184 features! • This is too many for most machine learning algorithms
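The count can be checked with a few lines of arithmetic. The sketch assumes the factor 3 stands for the region choice (upstream, downstream, anywhere in window) and the factor 2 for the frame choice (in-frame, any frame), as suggested by the Signal Generation slide; the function name is hypothetical. Under that formula the five terms for k = 1 to 5 sum to 8184.

```python
def num_kgram_features(k, alphabet=4, regions=3, frames=2):
    # alphabet**k distinct k-grams, times the region and frame choices
    # (interpretation of the 3 and 2 is an assumption, see lead-in).
    return alphabet ** k * regions * frames

counts = [num_kgram_features(k) for k in range(1, 6)]
total = sum(counts)
print(counts, total)
```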

  24. Signal Selection (Basic Idea) • Choose a signal w/ low intra-class distance • Choose a signal w/ high inter-class distance • Which of the following 3 signals is good?

  25. Signal Selection (e.g., t-statistics)
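The formula from this slide did not survive in the transcript. As a stand-in, here is a minimal sketch of the common unequal-variance form of the t-statistic, which scores a signal by how far apart the two class means are relative to the spread within each class. Treat the exact form as an assumption rather than the one shown in the talk.

```python
from math import sqrt
from statistics import mean, variance

def t_statistic(class1, class2):
    """Score one signal from its values in class 1 and class 2."""
    n1, n2 = len(class1), len(class2)
    # Difference of class means over the pooled standard error.
    spread = sqrt(variance(class1) / n1 + variance(class2) / n2)
    return abs(mean(class1) - mean(class2)) / spread

print(t_statistic([1, 2, 3], [4, 5, 6]))
```

A large value means high inter-class distance and low intra-class distance, matching the basic idea of signal selection.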

  26. Signal Selection (e.g., MIT-correlation)
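The slide body is missing here as well. A minimal sketch, assuming "MIT-correlation" refers to the signal-to-noise score that divides the difference of class means by the sum of class standard deviations; the use of the sample standard deviation is also an assumption.

```python
from statistics import mean, stdev

def mit_correlation(class1, class2):
    """Signal-to-noise style score for one signal (assumed form)."""
    return abs(mean(class1) - mean(class2)) / (stdev(class1) + stdev(class2))

print(mit_correlation([1, 2, 3], [4, 5, 6]))
```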

  27. Signal Selection (e.g., χ²)
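For a signal whose values have been split into intervals, the chi-square score compares the observed class counts in each interval with the counts expected if the signal were independent of the class. The sketch below assumes the signal has already been discretised into a contingency table; the table layout and function name are choices made for this example.

```python
def chi_square(table):
    """table[i][j] = number of samples in interval i that belong to class j."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    score = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected under independence of signal and class.
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                score += (observed - expected) ** 2 / expected
    return score

print(chi_square([[10, 0], [0, 10]]))  # signal separates the classes
print(chi_square([[5, 5], [5, 5]]))    # signal carries no class information
```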

  28. Signal Selection (e.g., CFS) • Instead of scoring individual signals, how about scoring a group of signals as a whole? • CFS • A good group contains signals that are highly correlated with the class, and yet uncorrelated with each other • Homework: find a formula that captures the key idea of CFS above
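One commonly cited way to capture this idea is the merit heuristic from correlation-based feature selection: merit = k * r_cs / sqrt(k + k * (k - 1) * r_ss), where k is the group size, r_cs the mean signal-class correlation, and r_ss the mean signal-signal correlation. The sketch below takes those two averages as inputs; it is one possible answer to the homework, not necessarily the one intended in the talk, and the parameter names are hypothetical.

```python
from math import sqrt

def cfs_merit(k, mean_signal_class_corr, mean_signal_signal_corr):
    """Score a group of k signals as a whole (assumed merit formula)."""
    numerator = k * mean_signal_class_corr
    denominator = sqrt(k + k * (k - 1) * mean_signal_signal_corr)
    return numerator / denominator

print(cfs_merit(4, 0.5, 0.0))  # four uncorrelated signals
print(cfs_merit(4, 0.5, 1.0))  # four fully redundant signals
```

With these inputs, four mutually uncorrelated signals score twice as high as a single one, while four fully redundant signals score no better than one.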

  29. Sample k-grams Selected • Position –3 (Kozak consensus) • in-frame upstream ATG (leaky scanning) • in-frame downstream TAA, TAG, TGA (stop codon) • CTG, GAC, GAG, and GCC (codon bias)

  30. Signal Integration • kNN: Given a test sample, find the k training samples that are most similar to it. Let the majority class win. • SVM: Given a group of training samples from two classes, determine a separating plane that maximises the margin between the classes. • Naïve Bayes, ANN, C4.5, ...
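The kNN rule described above is short enough to sketch directly. This version assumes numeric signal vectors and Euclidean distance as the similarity measure; both are assumptions, since the slide does not say how similarity is measured.

```python
from collections import Counter
from math import dist

def knn_predict(train, test_point, k=3):
    """train is a list of (signal_vector, label) pairs."""
    # Keep the k training samples closest to the test sample.
    neighbours = sorted(train, key=lambda pair: dist(pair[0], test_point))[:k]
    # Let the majority class among them win.
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

train = [((0, 0), "A"), ((0, 1), "A"), ((1, 0), "A"),
         ((5, 5), "B"), ((5, 6), "B"), ((6, 5), "B")]
print(knn_predict(train, (0.5, 0.5)))
```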

  31. Results (on Pedersen & Nielsen’s mRNA)

  32. Acknowledgements • Roland Yap • Zeng Fanfan • A.G. Pedersen • H. Nielsen

  33. Questions?

  34. Common Mistakes

  35. Self-fulfilling Oracle • Consider this scenario • Given classes C1 and C2 w/ explicit signals • Apply χ² to C1 and C2 to select signals s1, s2, s3 • Run 3-fold x-validation on C1 and C2 using s1, s2, s3 and get accuracy of 90% • Is the accuracy really 90%? • What can be wrong with this?

  36. Phil Long’s Experiment • Let there be classes C1 and C2 w/ 100000 features having randomly generated values • Use χ² to select 20 features • Run k-fold x-validation on C1 and C2 w/ these 20 features • Expect: 50% accuracy • Get: 90% accuracy! • Lesson: choose features at each fold
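A scaled-down simulation makes the lesson concrete. This is a sketch under stated assumptions, not a reproduction of the original experiment: it uses 2000 random features instead of 100000, ranks features by absolute difference of class means instead of χ², and uses a nearest-centroid classifier. All function names are hypothetical.

```python
import random

def avg(values):
    return sum(values) / len(values)

def select_features(X, y, rows, n_keep):
    """Rank features by class-mean difference, using only `rows`."""
    def score(j):
        a = [X[i][j] for i in rows if y[i] == 0]
        b = [X[i][j] for i in rows if y[i] == 1]
        return abs(avg(a) - avg(b))
    return sorted(range(len(X[0])), key=score, reverse=True)[:n_keep]

def cv_accuracy(X, y, folds, select_inside_fold, n_keep=20):
    all_rows = list(range(len(X)))
    # The flawed protocol: features chosen once, using every sample.
    chosen_globally = select_features(X, y, all_rows, n_keep)
    correct = 0
    for test_rows in folds:
        train_rows = [i for i in all_rows if i not in test_rows]
        feats = (select_features(X, y, train_rows, n_keep)
                 if select_inside_fold else chosen_globally)
        centroids = {
            c: [avg([X[i][j] for i in train_rows if y[i] == c]) for j in feats]
            for c in (0, 1)
        }
        for i in test_rows:
            d = {c: sum((X[i][j] - m) ** 2 for j, m in zip(feats, centroids[c]))
                 for c in (0, 1)}
            correct += (min(d, key=d.get) == y[i])
    return correct / len(X)

random.seed(0)
n_samples, n_features = 40, 2000
X = [[random.gauss(0, 1) for _ in range(n_features)] for _ in range(n_samples)]
y = [i % 2 for i in range(n_samples)]           # labels unrelated to X
folds = [list(range(f, n_samples, 5)) for f in range(5)]

biased = cv_accuracy(X, y, folds, select_inside_fold=False)
honest = cv_accuracy(X, y, folds, select_inside_fold=True)
print(f"selection before cross-validation: {biased:.2f}")
print(f"selection inside each fold:        {honest:.2f}")
```

Because the labels are independent of the features, any accuracy well above 50% in the first figure comes from letting the test samples influence feature selection.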

  37. Apples vs Oranges • Consider this scenario: • Fanfan reported 89% accuracy on his TIS prediction method • Hatzigeorgiou reported 94% accuracy on her TIS prediction method • So Hatzigeorgiou’s method is better • What is wrong with this conclusion?

  38. Apples vs Oranges • Differences in datasets used: • Fanfan’s expt used Pedersen’s dataset • Hatzigeorgiou’s used her own dataset • Differences in counting: • Fanfan’s expt was on a per ATG basis • Hatzigeorgiou’s expt used the scanning rule and thus was on a per cDNA basis • When Fanfan ran on the same dataset and counted the same way as Hatzigeorgiou, he also got 94%!

  39. Questions?
