  1. From Informatics to Bioinformatics Limsoon Wong Institute for Infocomm Research Singapore

  2. What is Bioinformatics?

  3. Themes of Bioinformatics Bioinformatics = Data Mgmt + Knowledge Discovery Data Mgmt = Integration + Transformation + Cleansing Knowledge Discovery = Statistics + Algorithms + Databases

  4. Benefits of Bioinformatics • To the patient: • Better drug, better treatment • To the pharma: • Save time, save cost, make more $ • To the scientist: • Better science

  5. From Informatics to Bioinformatics 8 years of bioinformatics R&D in Singapore (1994–2002, at ISS, KRDL, and LIT/I2R): Protein Interactions Extraction (PIES), MHC-Peptide Binding (PREDICT), Gene Expression & Medical Record Datamining (PCL), Cleansing & Warehousing (FIMM), Gene Feature Recognition (Dragon), Integration Technology (Kleisli), Venom Informatics

  6. Data Integration A DOE “impossible query”: For each gene on a given cytogenetic band, find its non-human homologs.

  7. Data Integration Results
sybase-add (#name: "GDB", ...);
create view L from locus_cyto_location using GDB;
create view E from object_genbank_eref using GDB;
select #accn: g.#genbank_ref, #nonhuman-homologs: H
from L as c, E as g,
     {select u
      from g.#genbank_ref.na-get-homolog-summary as u
      where not(u.#title string-islike "%Human%")
      andalso not(u.#title string-islike "%H.sapien%")} as H
where c.#chrom_num = "22"
andalso g.#object_id = c.#locus_id
andalso not (H = { });
• Using Kleisli: • Clear • Succinct • Efficient • Handles heterogeneity and complexity

  8. Data Warehousing {(#uid: 6138971, #title: "Homo sapiens adrenergic ...", #accession: "NM_001619", #organism: "Homo sapiens", #taxon: 9606, #lineage: ["Eukaryota", "Metazoa", …], #seq: "CTCGGCCTCGGGCGCGGC...", #feature: {(#name: "source", #continuous: true, #position: [(#accn: "NM_001619", #start: 0, #end: 3602, #negative: false)], #anno: [(#anno_name: "organism", #descr: "Homo sapiens"), …]), …})} • Motivation: efficiency, availability (avoiding “denial of service”), data cleansing • Requirements: efficient to query, easy to update, model data naturally

  9. Data Warehousing Results
! Log in
oracle-cplobj-add (#name: "db", ...);
! Define table
create table GP (#uid: "NUMBER", #detail: "LONG") using db;
! Populate table with GenPept reports
select #uid: x.#uid, #detail: x into GP from aa-get-seqfeat-general "PTP" as x using db;
! Map GP to that table
create view GP from GP using db;
! Run a query to get title of 131470
select x.#detail.#title from GP as x where x.#uid = 131470;
Relational DBMS is insufficient because it forces us to fragment data into 3NF. Kleisli turns a flat relational DBMS into a nested relational DBMS. It can use a flat relational DBMS such as Sybase, Oracle, MySQL, etc. as its updatable complex object store.
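The nested-record idea can be mimicked outside Kleisli by keeping each complex object intact instead of fragmenting it into 3NF. A minimal Python sketch (illustrative only, not Kleisli; fields abbreviated from the GenPept example above) stores each report as a single (uid, detail) row and answers the title query without joins:

    import json
    import sqlite3

    # Store each nested GenPept-like report as one (uid, detail) row,
    # mirroring the GP table above, instead of fragmenting it into 3NF.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE GP (uid INTEGER PRIMARY KEY, detail TEXT)")

    report = {
        "uid": 131470,
        "title": "Homo sapiens adrenergic ...",
        "organism": "Homo sapiens",
        "feature": [{"name": "source", "position": [{"start": 0, "end": 3602}]}],
    }
    conn.execute("INSERT INTO GP VALUES (?, ?)", (report["uid"], json.dumps(report)))

    # Query: get the title of uid 131470 without reassembling joined fragments.
    row = conn.execute("SELECT detail FROM GP WHERE uid = 131470").fetchone()
    print(json.loads(row[0])["title"])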

  10. Epitope Prediction TRAP-559AA MNHLGNVKYLVIVFLIFFDLFLVNGRDVQNNIVDEIKYSE EVCNDQVDLYLLMDCSGSIRRHNWVNHAVPLAMKLIQQLN LNDNAIHLYVNVFSNNAKEIIRLHSDASKNKEKALIIIRS LLSTNLPYGRTNLTDALLQVRKHLNDRINRENANQLVVIL TDGIPDSIQDSLKESRKLSDRGVKIAVFGIGQGINVAFNR FLVGCHPSDGKCNLYADSAWENVKNVIGPFMKAVCVEVEK TASCGVWDEWSPCSVTCGKGTRSRKREILHEGCTSEIQEQ CEEERCPPKWEPLDVPDEPEDDQPRPRGDNSSVQKPEENI IDNNPQEPSPNPEEGKDENPNGFDLDENPENPPNPDIPEQ KPNIPEDSEKEVPSDVPKNPEDDREENFDIPKKPENKHDN QNNLPNDKSDRNIPYSPLPPKVLDNERKQSDPQSQDNNGN RHVPNSEDRETRPHGRNNENRSYNRKYNDTPKHPEREEHE KPDNNKKKGESDNKYKIAGGIAGGLALLACAGLAYKFVVP GAATPYAGEPAPFDETLGEEDKDLDEPEQFRLPEENEWN

  11. Epitope Prediction Results • Prediction by our ANN model for HLA-A11: 29 predictions, of which 22 are epitopes (76% specificity) • Prediction by BIMAS matrix for HLA-A*1101: number of experimental binders by BIMAS rank: 19 (52.8%), 5 (13.9%), 12 (33.3%)

  12. Transcription Start Prediction

  13. Transcription Start Prediction Results

  14. Medical Record Analysis • Looking for patterns that are • valid • novel • useful • understandable

  15. Gene Expression Analysis • Classifying gene expression profiles • find stable differentially expressed genes • find significant gene groups • derive coordinated gene expression

  16. Medical Record & Gene Expression Analysis Results • PCL, a novel “emerging pattern” method • Beats C4.5, CBA, LB, NB, TAN in 21 out of 32 UCI benchmarks • Works well for gene expression profiles (Cancer Cell, March 2002, 1(2))
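For intuition only (a generic illustration, not the PCL implementation itself), an emerging pattern is an item set whose support jumps sharply from one class to the other. A minimal sketch, assuming the gene-expression values have already been discretized into binary items:

    def support(pattern, samples):
        # Fraction of samples containing every item in the pattern.
        return sum(1 for s in samples if pattern.issubset(s)) / len(samples)

    # Toy discretized data: each sample is the set of "gene high" items it contains.
    tumour = [{"g1", "g2"}, {"g1", "g2", "g3"}, {"g1", "g2"}, {"g2"}]
    normal = [{"g3"}, {"g1"}, {"g2", "g3"}, set()]

    pattern = {"g1", "g2"}
    s_t, s_n = support(pattern, tumour), support(pattern, normal)
    growth = s_t / s_n if s_n > 0 else float("inf")   # large growth = emerging pattern
    print(f"tumour={s_t:.2f} normal={s_n:.2f} growth={growth}")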

  17. Protein Interaction Extraction “What are the protein-protein interaction pathways from the latest reported discoveries?”

  18. Protein Interaction Extraction Results • Rule-based system for processing free texts in scientific abstracts • Specialized in • extracting protein names • extracting protein-protein interactions

  19. Behind the Scene Vladimir Bajic, Vladimir Brusic, Jinyan Li, See-Kiong Ng, Limsoon Wong, Louxin Zhang, Allen Chong, Judice Koh, SPT Krishnan, Huiqing Liu, Seng Hong Seah, Soon Heng Tan, Guanglan Zhang, Zhuo Zhang, and many more: students, folks from geneticXchange, MolecularConnections, and other collaborators…

  20. Using Feature Generation & Feature Selection for Accurate Prediction of Translation Initiation Sites A more detailed example of post-genome knowledge discovery

  21. Translation Initiation Recognition

  22. A Sample cDNA 299 HSU27655.1 CAT U27655 Homo sapiens CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG 80 CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA 160 GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA 240 CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT ............................................................ 80 ................................iEEEEEEEEEEEEEEEEEEEEEEEEEEE 160 EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE 240 EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE What makes the second ATG the translation initiation site?
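Enumerating the candidate ATGs is the easy part; deciding which one is the true TIS is the prediction task. A small illustrative sketch that lists every ATG in a cDNA string (the fragment below is the first 120 nt of the sample above, where the second ATG is the annotated TIS):

    def atg_positions(seq):
        # Return the 0-based position of every ATG in the sequence.
        seq, out, pos = seq.upper(), [], 0
        while (pos := seq.find("ATG", pos)) != -1:
            out.append(pos)
            pos += 1
        return out

    # First 120 nt of the sample cDNA above.
    cdna = ("CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG"
            "CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA")
    print(atg_positions(cdna))   # two candidates; the second is the annotated TIS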

  23. Approach • Training data gathering • Signal generation • k-grams, distance, domain know-how, ... • Signal selection • Entropy, χ², CFS, t-test, domain know-how... • Signal integration • SVM, ANN, PCL, CART, C4.5, kNN, ...

  24. Training & Testing Data • Vertebrate dataset of Pedersen & Nielsen [ISMB’97] • 3312 sequences • 13503 ATG sites • 3312 (24.5%) are TIS • 10191 (75.5%) are non-TIS • Use for 3-fold x-validation expts

  25. Signal Generation • k-grams (i.e., k consecutive letters) • k = 1, 2, 3, 4, 5, … • Window size vs. fixed position • Up-stream, down-stream vs. anywhere in window • In-frame vs. any frame
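A minimal sketch of this kind of signal generation (illustrative; the window size and exact feature naming here are assumptions, not the published ones): count in-frame k-grams separately up-stream and down-stream of a candidate ATG:

    from collections import Counter

    def kgram_features(seq, atg_pos, k=3, window=99, in_frame=True):
        # Count k-grams in the windows up-stream and down-stream of a candidate ATG.
        # Features look like 'UP_GCC' / 'DOWN_ATG'; if in_frame is True, only k-grams
        # aligned to the reading frame of the ATG are counted.
        step = 3 if in_frame else 1
        up = seq[max(0, atg_pos - window):atg_pos]
        down = seq[atg_pos + 3:atg_pos + 3 + window]
        feats = Counter()
        for label, region, offset in (("UP", up, len(up) % 3 if in_frame else 0),
                                      ("DOWN", down, 0)):
            for i in range(offset, len(region) - k + 1, step):
                feats[label + "_" + region[i:i + k]] += 1
        return feats

    cdna = "CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACT"
    print(kgram_features(cdna, atg_pos=33).most_common(5))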

  26. Too Many Signals • For each value of k, there are 4^k * 3 * 2 k-grams • If we use k = 1, 2, 3, 4, 5, we have 24 + 96 + 384 + 1536 + 6144 = 8184 features! • This is too many for most machine learning algorithms

  27. Signal Selection (Basic Idea) • Choose a signal w/ low intra-class distance • Choose a signal w/ high inter-class distance • Which of the following 3 signals is good?
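As a numeric stand-in for the three example signals on the slide (hypothetical values), a good signal keeps each class tight while pushing the class means apart:

    import statistics as st

    def separation(tis_vals, non_vals):
        # Inter-class distance divided by the pooled intra-class spread.
        inter = abs(st.mean(tis_vals) - st.mean(non_vals))
        intra = (st.stdev(tis_vals) + st.stdev(non_vals)) / 2
        return inter / intra

    # Three hypothetical signals measured on the same TIS / non-TIS samples.
    signals = {
        "A": ([5.1, 4.9, 5.0, 5.2], [1.0, 1.2, 0.9, 1.1]),   # tight and well separated
        "B": ([3.0, 7.0, 1.0, 9.0], [2.0, 8.0, 4.0, 6.0]),   # overlapping and noisy
        "C": ([5.0, 5.1, 4.9, 5.0], [4.8, 5.2, 5.1, 4.9]),   # tight but not separated
    }
    for name, (tis, non) in signals.items():
        print(name, round(separation(tis, non), 2))   # only signal A scores high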

  28. Signal Selection (e.g., t-statistics)
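Restated as a formula (the slide's figure is not reproduced here), the two-sample t-statistic scores a signal by the difference of its class means relative to the within-class variances; the larger |t|, the better the signal separates TIS from non-TIS. A minimal sketch:

    import math
    import statistics as st

    def t_statistic(pos_vals, neg_vals):
        # Welch-style two-sample t-statistic for a single signal.
        m1, m2 = st.mean(pos_vals), st.mean(neg_vals)
        v1, v2 = st.variance(pos_vals), st.variance(neg_vals)
        return (m1 - m2) / math.sqrt(v1 / len(pos_vals) + v2 / len(neg_vals))

    # Rank all candidate signals by |t| and keep the top-scoring ones.
    print(round(t_statistic([5.1, 4.9, 5.0, 5.2], [1.0, 1.2, 0.9, 1.1]), 1))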

  29. Signal Selection (e.g., CFS) • Instead of scoring individual signals, how about scoring a group of signals as a whole? • CFS • A good group contains signals that are highly correlated with the class, and yet uncorrelated with each other
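A minimal sketch of the CFS merit score, merit(S) = k·avg(r_cf) / sqrt(k + k(k−1)·avg(r_ff)), where r_cf is signal-to-class correlation and r_ff is signal-to-signal correlation. Hall's CFS proper uses a symmetrical-uncertainty measure; plain Pearson correlation (statistics.correlation, Python 3.10+) is used here purely for illustration:

    import statistics as st

    def cfs_merit(subset, labels):
        # merit = k*avg(r_cf) / sqrt(k + k*(k-1)*avg(r_ff)) for a signal subset.
        k = len(subset)
        r_cf = st.mean(abs(st.correlation(col, labels)) for col in subset)
        r_ff = st.mean(abs(st.correlation(subset[i], subset[j]))
                       for i in range(k) for j in range(i + 1, k)) if k > 1 else 0.0
        return k * r_cf / (k + k * (k - 1) * r_ff) ** 0.5

    labels = [1, 1, 1, 0, 0, 0]
    f1 = [5.0, 5.2, 4.9, 1.1, 0.9, 1.0]   # tracks the class
    f2 = [5.1, 5.0, 5.1, 1.0, 1.2, 0.8]   # tracks the class but is redundant with f1
    f3 = [0.3, 0.9, 0.1, 0.8, 0.2, 0.7]   # noise
    for name, subset in [("f1", [f1]), ("f1+f2", [f1, f2]), ("f1+f3", [f1, f3])]:
        print(name, round(cfs_merit(subset, labels), 2))
    # Adding the redundant f2 does not raise the merit; adding the noisy f3 lowers it.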

  30. Sample k-grams Selected by CFS • Position –3 (Kozak consensus) • In-frame up-stream ATG (leaky scanning) • In-frame down-stream stop codons TAA, TAG, TGA • CTG, GAC, GAG, and GCC (codon bias?)

  31. Signal Integration • kNN: Given a test sample, find the k training samples that are most similar to it. Let the majority class win. • SVM: Given a group of training samples from two classes, determine a separating plane that maximises the margin between the two classes. • Naïve Bayes, ANN, C4.5, ...
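A minimal sketch of the integration step, assuming the selected signals have already been turned into a numeric feature matrix (scikit-learn is used purely for illustration; it is not what the original work used):

    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_score

    # X: one row of selected signal values per candidate ATG; y: 1 = TIS, 0 = non-TIS.
    X = [[3, 0, 1], [2, 1, 0], [0, 3, 2], [1, 2, 3], [3, 1, 0], [0, 2, 3]]
    y = [1, 1, 0, 0, 1, 0]

    for name, clf in [("kNN", KNeighborsClassifier(n_neighbors=3)),
                      ("SVM", SVC(kernel="linear"))]:
        scores = cross_val_score(clf, X, y, cv=3)   # 3-fold cross-validation
        print(name, round(scores.mean(), 2))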

  32. Results (3-fold x-validation)

  33. Improvement by Voting • Apply any 3 of Naïve Bayes, SVM, Neural Network, & Decision Tree. Decide by majority.
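A minimal sketch of such majority voting, again with scikit-learn purely for illustration and a toy feature matrix:

    from sklearn.ensemble import VotingClassifier
    from sklearn.naive_bayes import GaussianNB
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier

    X = [[3, 0, 1], [2, 1, 0], [0, 3, 2], [1, 2, 3], [3, 1, 0], [0, 2, 3]]
    y = [1, 1, 0, 0, 1, 0]

    # Three of the base learners; "hard" voting predicts the majority label.
    vote = VotingClassifier(estimators=[("nb", GaussianNB()),
                                        ("svm", SVC(kernel="linear")),
                                        ("tree", DecisionTreeClassifier(max_depth=3))],
                            voting="hard")
    vote.fit(X, y)
    print(vote.predict([[3, 1, 0]]))   # majority decision for a new candidate ATG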

  34. Improvement by Scanning • Apply Naïve Bayes or SVM left-to-right until first ATG predicted as positive. That’s the TIS. • Naïve Bayes & SVM models were trained using TIS vs. Up-stream ATG
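A minimal sketch of the scanning rule; the classifier clf and the featurize function are hypothetical stand-ins (e.g. a model trained on TIS vs. up-stream ATGs, and the k-gram extractor sketched earlier):

    def predict_tis_by_scanning(seq, clf, featurize):
        # Scan left to right; the first ATG the classifier calls positive is the TIS.
        for pos in range(len(seq) - 2):
            if seq[pos:pos + 3] == "ATG":
                feats = featurize(seq, pos)            # e.g. k-gram window features
                if clf.predict([feats])[0] == 1:
                    return pos
        return None                                    # no TIS predicted in this sequence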

  35. Performance Comparisons * result not directly comparable

  36. Technique Comparisons • Pedersen & Nielsen [ISMB’97]: neural network, no explicit features • Zien [Bioinformatics’00]: SVM + kernel engineering, no explicit features • Hatzigeorgiou [Bioinformatics’02]: multiple neural networks, scanning rule, no explicit features • Our approach: explicit feature generation, explicit feature selection, use any machine learning method w/o any form of complicated tuning, scanning rule is optional

  37. Acknowledgements • A.G. Pedersen • H. Nielsen • Roland Yap • Fanfan Zeng
