
Machine-learning in building bioinformatics databases for infectious diseases

ASEAN-China International Bioinformatics Workshop 2008, 17 Apr 2008. Victor Tong, Institute for Infocomm Research, A*STAR, Singapore.



  1. Machine-learning in building bioinformatics databases for infectious diseases ASEAN-China International Bioinformatics Workshop 2008 17 Apr 2008 Victor Tong Institute for Infocomm Research A*STAR, Singapore

  2. Overview • Definitions and background • Architectures of existing immunological databases • Machine-learning for biological databases • Conclusion

  3. The information-centric world • Biology produces more data than we can process • >3000 HLA alleles • 10^7-10^15 different T-cell receptors • 10^11 linear 9mer epitopes • Post-translational spliced epitopes • Data are stored in databases, literature, laboratory records, clinical records, … • A major issue: turning data into knowledge

  4. Use of bioinformatics • Impractical to do manual curation • ≥ 16 million PubMed abstracts • ~80K immunology related references • Large amounts of data that are difficult to interpret • Protein-protein interaction extraction from text • Bioinformatics: systematic construction and updating of databases

  5. Ad hoc bioinformatics Biological system Computational analysis Biological interpretation

  6. More systematic use of bioinformatics Biological system Formal description Computational analysis Mathematical problem Conversion of results Biological interpretation

  7. Knowledge discovery from databases is the process of automated extraction of useful information or knowledge from individual or multiple databases

  8. 1) Data explosion • Current databases: • Volume of data increasing exponentially • GenBank, SWISS-PROT, IMGT, PubMed, etc • New databases: • Growth in numbers • Increase in size • More complex • Biologists: • Maintain personal data bank • Information relevant to their research • Define objectives for data mining and analysis

  9. 2) Data quality • Data cleaning: • Limit on the percentage error that can be tolerated in the data • Prevent propagation of errors to our databases • Prevent depreciation of data quality • Nature of biological data: • Fuzzy and complex • Varying interpretations • Problems with raw data: • Inconsistent • Inaccurate • Redundant • Irrelevant • Incomplete • Incorrect

  10. 3) Database creation and maintenance • Software tools and programming efforts: • Data collection • Constructing databases • Integrating data mining tools • Updating the databases • Nature of the databases: • Short lifespan • Hard to maintain

  11. 4) Data integration • Disparities in data sources: • Data structures • Data formats • Views • Search mechanisms • Location

  12. Overview • Definitions and background • Architectures of existing immunological databases • Machine-learning for biological databases • Conclusion

  13. Web-resources for immune epitope information • Immune Epitope Database and Analysis Resource (IEDB) Contains B-cell epitopes, T-cell epitopes, MHC ligands for humans, non-human primates, rodents, and other animal species. URL: http://www.immuneepitope.org • The international ImMunoGeneTics information system (IMGT) Specializes in Ig, T-cell receptors, MHC, Ig superfamily, MHC superfamily, and related proteins of the immune system of human and other vertebrate species URL: http://imgt.cines.fr/ • SYFPEITHI Contains ~3,500 T-cell epitopes, MHC ligands and peptide motifs for humans and rodents URL: http://www.syfpeithi.de/

  14. Web-resources for immune epitope information • MHCBN Contains T-cell epitopes, TAP ligands, MHC binding peptides and MHC non-binding peptides for humans and rodents URL: http://www.imtech.res.in/raghava/mhcbn/ • MPID-T Contains 3D structural information of 187 T-cell receptors, MHCs and interacting epitopes for humans and rodents, spanning 40 alleles URL: http://surya.bic.nus.edu.sg/mpidt/ • AntiJen/JenPep Contains T-cell epitopes, MHC ligands, TAP ligands and B-cell epitopes. URL: http://www.jenner.ac.uk/antijen/

  15. The IEDB class diagram

  16. Relationships between an epitope & contexts

  17. Overview • Definitions and background • Architectures of existing immunological databases • Machine-learning for biological databases • Conclusion

  18. Naïve Bayes classifiers • Attribute values are assumed to be conditionally independent given the target value • Goal: to assign a new instance, described by a set of attribute values <a_1, a_2, …, a_n>, the most probable target value V_target among the candidate values v_j • The target class may be defined as: $V_{\text{target}} = \arg\max_{v_j} P(v_j) \prod_{i=1}^{n} P(a_i \mid v_j)$
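
The decision rule on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the talk: the function names `train_naive_bayes` and `classify`, the add-one smoothing, the use of log probabilities, and the toy training data are all assumptions introduced here.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(documents, labels):
    """Estimate log P(v_j) and log P(a_i | v_j) with add-one smoothing."""
    vocabulary = {word for doc in documents for word in doc}
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(documents, labels):
        word_counts[label].update(doc)

    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        # Denominator: words seen in class c plus one pseudo-count per vocabulary word.
        total = sum(word_counts[c].values()) + len(vocabulary)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + 1) / total) for w in vocabulary
        }
    return log_prior, log_likelihood


def classify(doc, log_prior, log_likelihood):
    """Return the class maximizing log P(v_j) + sum_i log P(a_i | v_j)."""
    scores = {}
    for c, prior in log_prior.items():
        # Words never seen in training are skipped (an assumption of this sketch).
        scores[c] = prior + sum(
            log_likelihood[c][w] for w in doc if w in log_likelihood[c]
        )
    return max(scores, key=scores.get)


# Toy, made-up training data: tokenized "abstracts" with relevance labels.
train_docs = [
    ["epitope", "binds", "mhc"],
    ["t-cell", "epitope", "response"],
    ["protein", "folding", "kinetics"],
    ["crystal", "structure", "kinetics"],
]
train_labels = ["relevant", "relevant", "irrelevant", "irrelevant"]

log_prior, log_likelihood = train_naive_bayes(train_docs, train_labels)
prediction = classify(["mhc", "epitope"], log_prior, log_likelihood)
print(prediction)  # relevant
```

Summing logarithms instead of multiplying probabilities gives the same argmax while avoiding numerical underflow on long documents.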

  19. Comparison of popular text classification algorithms • Dataset • 20,910 PubMed abstracts • 181,299 unique words • AROC • NBC: 0.838 • ANN: 0.831 • SVM: 0.825 • DT: 0.809 Wang et al., BMC Bioinformatics 2007, 8:269
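
For readers unfamiliar with the AROC metric reported here, the area under the ROC curve can be computed as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below assumes binary labels (1 = relevant, 0 = irrelevant); the function name `area_under_roc` and the scores are made up for illustration and are unrelated to the figures on the slide.

```python
def area_under_roc(scores, labels):
    """AROC as the fraction of positive/negative pairs ranked correctly (ties count half)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up classifier scores for six abstracts.
example_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
example_labels = [1, 1, 0, 1, 0, 0]
aroc = area_under_roc(example_scores, example_labels)
print(round(aroc, 3))  # 0.889
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect separation of relevant from irrelevant abstracts.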

  20. Feature selection (FS) • Data source • PubMed abstracts • Medical Subject Headings (MeSH) - National Library of Medicine's controlled vocabulary used for indexing articles, for cataloging books and other holdings • Publication title • Author(s) • etc

  21. Feature selection (FS) • Algorithms • Document frequency (DF) – ranks features by the number of abstracts they appear in • Information gain (IG) – measures the number of bits of information obtained for category prediction from knowing whether a feature is present or absent in a document: $IG(u) = -\sum_{i=1}^{m} P(c_i) \log P(c_i) + P(u) \sum_{i=1}^{m} P(c_i \mid u) \log P(c_i \mid u) + P(\bar{u}) \sum_{i=1}^{m} P(c_i \mid \bar{u}) \log P(c_i \mid \bar{u})$ where u is the feature of interest, ū denotes its absence, and c_i (i = 1, …, m) are the categories the documents belong to
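
Both ranking criteria can be sketched directly from their definitions. The code below is an illustration under stated assumptions: documents are represented as Python sets of words, logarithms are base 2 so that the result is in bits, the third term is weighted by the probability that the feature is absent, P(ū) = 1 − P(u), as in the standard definition, and the function names and toy corpus are hypothetical.

```python
import math


def document_frequency(term, documents):
    """Number of documents in which the term appears at least once."""
    return sum(1 for doc in documents if term in doc)


def _sum_p_log_p(labels, categories):
    """Return sum_i P(c_i) * log2 P(c_i) over the given labels (0.0 if empty)."""
    if not labels:
        return 0.0
    total = 0.0
    for c in categories:
        p = labels.count(c) / len(labels)
        if p > 0:
            total += p * math.log2(p)
    return total


def information_gain(term, documents, labels):
    """IG(u) in bits for feature u, using the three-term formula."""
    categories = sorted(set(labels))
    with_term = [y for doc, y in zip(documents, labels) if term in doc]
    without_term = [y for doc, y in zip(documents, labels) if term not in doc]
    p_u = len(with_term) / len(documents)  # P(u); P(u-bar) is 1 - p_u
    return (
        -_sum_p_log_p(labels, categories)
        + p_u * _sum_p_log_p(with_term, categories)
        + (1 - p_u) * _sum_p_log_p(without_term, categories)
    )


# Toy, made-up corpus: each abstract is a set of words.
docs = [
    {"epitope", "mhc"},
    {"epitope", "binding"},
    {"kinetics", "binding"},
    {"kinetics", "folding"},
]
cats = ["relevant", "relevant", "irrelevant", "irrelevant"]

print(document_frequency("binding", docs))        # 2
print(information_gain("epitope", docs, cats))    # 1.0
print(information_gain("binding", docs, cats))    # 0.0
```

In this toy corpus "epitope" separates the two categories perfectly and so carries one full bit, while "binding" occurs equally often in both categories and carries none, even though the two words have the same document frequency.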

  22. Feature condensation (FC) • Stemming • Reduces words to their common root, e.g. “binding”, “binds”, “bind” to “bind” • Porter stemmer – AROC = 0.846 to AROC = 0.842 • Domain-specific vocabulary may be reduced to unsuitable terms
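
The idea of stemming, and the risk it poses to domain vocabulary, can be illustrated with a deliberately simplistic suffix stripper. The function `toy_stem` below is hypothetical and is not the Porter algorithm, which applies a much larger, ordered set of rules; it only shows the principle.

```python
def toy_stem(word):
    """Strip a few common English suffixes; a toy rule set, NOT the Porter stemmer."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        # Require at least three characters of stem to remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            # Leave words ending in "ss" (e.g. "class") untouched.
            if suffix == "s" and word.endswith("ss"):
                continue
            return word[: -len(suffix)]
    return word


stems = [toy_stem(w) for w in ("binding", "binds", "bind")]
print(stems)              # ['bind', 'bind', 'bind']
print(toy_stem("virus"))  # viru
```

The three variants collapse into one feature, which shrinks the feature space, but the same rules turn "virus" into "viru", a small example of how suffix stripping can damage domain-specific terms.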

  23. Feature extraction (FE) • Rules to capture immune related expressions and group them together • Reduction of feature space (i.e. no. of unique words) • Enrichment of information content • Better performance?

  24. Feature extraction (FE) • Examples: • Sequence length – identify sequence lengths and replace them with “~range<50~” or “~range>50~”, depending on whether the sequence is shorter or longer than 50 amino acids • MHC alleles – identify MHC alleles and replace them with “~mhc_allele~” • Protein sequences – identify sequences as strings that a) contain only the characters representing the 20 amino acids, b) are in upper case and longer than a threshold, and replace them with “~sequence~”
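
Rules of this kind map naturally onto regular-expression substitutions. The sketch below is one possible reading of the three examples, not the rule set used in the cited study: the regular expressions, the restriction to HLA-style allele names, the minimum sequence length of 8, and the choice to map a length of exactly 50 to “~range>50~” are all assumptions made for illustration.

```python
import re

# Assumed pattern for HLA-style allele names such as HLA-A2 or HLA-DRB1*0401.
MHC_ALLELE = re.compile(r"\bHLA-[A-Z]{1,3}\d*(?:\*\d{2,4}(?::\d{2,3})*)?")
# Upper-case runs of the 20 amino-acid letters; the threshold of 8 is hypothetical.
PROTEIN_SEQUENCE = re.compile(r"\b[ACDEFGHIKLMNPQRSTVWY]{8,}\b")
# Assumed pattern for stated lengths such as "9 amino acid" or "120 aa".
SEQUENCE_LENGTH = re.compile(r"\b(\d+)\s*(?:-\s*)?(?:aa|amino acids?)\b")


def _length_token(match):
    """Map a stated length to one of the two range tokens (50 itself goes to '>50')."""
    return "~range<50~" if int(match.group(1)) < 50 else "~range>50~"


def extract_features(text):
    """Replace immune-related expressions with shared placeholder tokens."""
    text = MHC_ALLELE.sub("~mhc_allele~", text)
    text = PROTEIN_SEQUENCE.sub("~sequence~", text)
    return SEQUENCE_LENGTH.sub(_length_token, text)


sentence = (
    "The 9 amino acid peptide GILGFVFTL binds HLA-A*0201, "
    "unlike the 120 aa fragment."
)
normalized = extract_features(sentence)
print(normalized)
# The ~range<50~ peptide ~sequence~ binds ~mhc_allele~, unlike the ~range>50~ fragment.
```

Because every allele, sequence and length collapses into one of a handful of tokens, many rare surface forms become a few frequent features, which is the reduction of feature space and enrichment of information content described on the previous slide.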

  25. Performance comparison Wang et al., BMC Bioinformatics 2007, 8:269

  26. Overview • Definitions and background • Architectures of existing immunological databases • Machine-learning for biological databases • Conclusion

  27. Conclusion • Machine-learning algorithms enable a systematic approach to database construction and facilitate scientific discovery • They must be applied with due care and must be scientifically and technically sound
