Presentation Transcript


  1. Exploiting Likely-Positive and Unlabeled Data to Improve the Identification of Protein-Protein Interaction Articles Richard Tzong-Han Tsai & Wen-Lian Hsu et al. Presenter: Wen-Lian Hsu Intelligent Agent Systems Lab, IIS, Academia Sinica Taiwan Aug. 29, 2007

  2. Outline • Background • Traditional Method • Formulation • Traditional weighting functions • Our Proposed Method • New term-weighting functions • Select likely data • Exploiting likely-positive and negative data • Results

  3. BACKGROUND

  4. Relevant vs. Irrelevant Articles • Relevant: “Physical interactions among circadian clock proteins KaiA, KaiB and KaiC in cyanobacteria” • Irrelevant: “Differential protein expression in human gliomas and molecular insights”

  5. Curation of Protein-Protein Interaction Databases (PPI-DB) • Flowchart stages: Unstructured Texts, Filtering and Ranking, Human Verification, Information Extraction, PPI Database

  6. Example of a PPI Record UniProt Protein ID

  7. Difficulties of PPI-Text Classification • Annotation cost is very high • Annotators must be experienced biological researchers • Unbalanced numbers of documents in the relevant and irrelevant classes • Various definitions of “PPI-relevance” • PPI taxonomy is defined in the GO ontology • MINT: physical interaction • BIND: physical interaction, genetic interaction

  8. Data Source • Second BioCreAtIvE Challenge Workshop: Critical Assessment of Information Extraction in Molecular Biology

  9. TRADITIONAL METHOD

  10. Formulation • PPI Abstract Identification is formulated as a text classification problem (PPI-TC) • Class 1: PPI-relevant (+) • extracted from PPI-DBs • Class 2: PPI-irrelevant (-) • annotated by experts

  11. Schema • Example sentence (relevant to PPI): “An APAF-1 cytochrome c multimeric complex is a functional apoptosome that activates procaspase-9” • Words to feature vector by weighting functions • Classify by SVM (relevant vs. irrelevant to PPI)
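The words-to-feature-vector step of this schema can be sketched as follows; the weight values and function name are illustrative placeholders, not the system's actual implementation:

```python
from collections import Counter

def to_feature_vector(text, weights):
    """Map an abstract's words to a weighted feature vector (a sketch;
    the actual system uses BM25-style weighting functions, and the
    resulting vectors are fed to an SVM)."""
    tf = Counter(text.lower().split())
    return {w: c * weights.get(w, 0.0) for w, c in tf.items() if w in weights}

# Hypothetical per-term discrimination weights.
weights = {"interaction": 2.0, "binds": 1.5, "expression": 0.2}
vec = to_feature_vector("KaiA binds KaiC interaction interaction", weights)
```

Words outside the weighted vocabulary simply drop out of the vector, which mirrors how a term-weighting function suppresses non-discriminative words.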

  12. About BM25 • BM25 is one of the best-known ranking functions used by search engines • It ranks documents by their relevance to a given search query • It is also commonly used as a term-weighting function in text classification [1] S. Robertson, “Understanding Inverse Document Frequency: On theoretical arguments for IDF,” Journal of Documentation, 60, 503–520, 2004

  13. BM25 (1) • A weighting function for estimating each word’s discrimination ability, computed as a function of the term frequency, abbreviated TFd(wi) [the Relative Frequency and Balanced Relative Frequency formulas appeared as images in the original slide]

  14. BM25 (2) • A weighting function for estimating each word’s discrimination ability • For articles containing wi: positive / negative • For articles not containing wi: negative / positive • Baseball analogy: (wi, player i), (article, game), (positive, winning), (negative, losing) • Denominators cancel out: (winning % when he is in) × (losing % when he is out)
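The positive/negative odds intuition on this slide matches the classic Robertson–Sparck Jones relevance weight that underlies BM25; a minimal sketch, assuming the standard 0.5 smoothing constants (the slide's exact variant is not recoverable from the transcript):

```python
import math

def rsj_weight(r, n, R, N):
    """Robertson-Sparck Jones relevance weight, the discrimination
    weight behind BM25.  r: positive articles containing the word,
    n: all articles containing it, R: positive articles, N: all
    articles.  The 0.5 terms smooth zero counts."""
    return math.log(((r + 0.5) * (N - n - R + r + 0.5)) /
                    ((n - r + 0.5) * (R - r + 0.5)))

# A word found mostly in positive (PPI-relevant) articles scores high;
# a word spread evenly across classes scores near zero.
w_good = rsj_weight(r=90, n=100, R=200, N=1000)
w_flat = rsj_weight(r=20, n=100, R=200, N=1000)
```

The ratio inside the log is exactly the "odds when the word is in" times the "odds when it is out" of the slide's analogy, with the shared denominators cancelled.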

  15. PROPOSED METHOD

  16. Basic Idea • Develop better weighting functions • Expand the training set—select likely data • Select PPI-relevant articles recorded in other PPI-DBs • Select articles not recorded in any PPI-DBs • Exploit likely data

  17. Proposed Variants of BM25 (Robertson, 2004) [the Relative Frequency and Balanced Relative Frequency variant formulas appeared as images in the original slide]

  18. Likely (positive, negative) Data • Advantage • Improve the generality and robustness of the classification model • Reduce the number of unseen features • Source for PPI article classification • Likely positive • PPI articles recorded in other PPI-DBs • Likely negative • PubMed articles which are not recorded in any PPI-DBs

  19. Select Likely Data • A filter model is generated from the labeled positive and negative articles, then used to filter unlabeled data into likely-positive and likely-negative sets • BIND: contains both genetic & physical interactions of PPI • MINT: contains only physical interactions of PPI
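The selection rule described in slides 18-19 can be sketched with set operations; the identifiers and argument names below are illustrative placeholders:

```python
def select_likely_data(unlabeled_ids, training_db_ids, other_db_ids):
    """Sketch of the likely-data selection rule: an unlabeled article
    is likely positive if another PPI-DB (e.g. BIND or MINT) records
    it, and likely negative if no PPI-DB records it at all."""
    likely_pos = [a for a in unlabeled_ids if a in other_db_ids]
    likely_neg = [a for a in unlabeled_ids
                  if a not in other_db_ids and a not in training_db_ids]
    return likely_pos, likely_neg

# Hypothetical PubMed IDs: pmid2 is curated in another PPI-DB,
# pmid1 and pmid3 appear in no PPI-DB.
pos, neg = select_likely_data(
    unlabeled_ids=["pmid1", "pmid2", "pmid3"],
    training_db_ids={"pmid9"},
    other_db_ids={"pmid2"},
)
```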

  20. EXPLOITING LIKELY DATA

  21. Mixed Model • Likely data is used as additional training data

  22. Hierarchical Model • The filter model’s output value is used as an additional feature
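The two ways of exploiting likely data in slides 21-22 can be contrasted in a few lines; the data representations here are simplified placeholders, not the system's actual interfaces:

```python
def mixed_training_set(labeled, likely):
    """Mixed model: likely data is simply appended as extra training
    examples for a single classifier."""
    return labeled + likely

def hierarchical_features(features, filter_score):
    """Hierarchical model: the first-stage filter model's output score
    is appended as one additional feature for the final classifier."""
    return features + [filter_score]

# Mixed: one classifier sees both labeled and likely examples.
train = mixed_training_set([("doc1", 1), ("doc2", 0)], [("doc3", 1)])
# Hierarchical: the filter score (e.g. 0.92) becomes an extra feature.
feats = hierarchical_features([0.3, 0.7], 0.92)
```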

  23. EXPERIMENTS & RESULTS

  24. Evaluation Metrics for Classification [the metric formulas appeared as images in the original slide]
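The formula images on this slide did not survive transcription; assuming the standard precision / recall / F-measure definitions used in BioCreAtIvE-style evaluations, they are:

```python
def precision_recall_f1(tp, fp, fn):
    """Standard definitions: precision = TP/(TP+FP),
    recall = TP/(TP+FN), F1 = harmonic mean of the two."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# 8 relevant articles found, 2 false alarms, 2 relevant articles missed.
p, r, f = precision_recall_f1(tp=8, fp=2, fn=2)
```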

  25. Evaluation Metrics for Ranking • Receiver Operating Characteristic (ROC) curve: plots the true positive rate (sensitivity) on the y-axis against the false positive rate (1 − specificity) on the x-axis • AUC: the area under the ROC curve
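AUC can be computed directly from classifier scores without drawing the curve, since it equals the probability that a randomly chosen relevant article is scored above a randomly chosen irrelevant one; a minimal sketch:

```python
def auc(scores_pos, scores_neg):
    """AUC as the rank statistic: the fraction of (positive, negative)
    pairs where the positive article outscores the negative one,
    counting ties as half a win."""
    wins = sum((sp > sn) + 0.5 * (sp == sn)
               for sp in scores_pos for sn in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))

# A perfect ranker puts every relevant article above every irrelevant one.
perfect = auc([0.9, 0.8], [0.1, 0.2])
```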

  26. Model for Filtering The Hierarchical model is most appropriate for filtering out irrelevant articles

  27. Model for Ranking The mixed model is most appropriate for ranking articles

  28. Weighting Scheme for Filtering

  29. Weighting Scheme for Ranking

  30. Conclusion • PPI-TC can save a great deal of annotation effort • Integrating multiple PPI resources: likely data effectively improves PPI-TC for both filtering and ranking • Suitable term-weighting functions: BM25 and its variants are the most effective • For filtering: Hierarchical Model with BM25 • For ranking: Mixed Model with TFBRF

  31. Thank you for your attention

  32. Architecture of IASL PPI-TC System

  33. Classification Algorithm • Support Vector Machines (SVM) • Find a maximal-margin separating hyperplane ⟨w, φ(x)⟩ − b = 0, where x(i) is the i-th training instance, y(i) ∈ {1, −1} is its label, ξ(i) denotes its training error (slack), and C is the cost factor
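The soft-margin objective implied by the slide's symbols (slack ξ(i), cost factor C) can be written out as follows; this is the standard linear-kernel formulation, sketched with variable names matching the slide:

```python
def svm_objective(w, b, X, y, C):
    """Soft-margin SVM objective (linear kernel sketch):
    0.5*||w||^2 + C * sum_i xi_i, where the slack of instance i is
    xi_i = max(0, 1 - y_i * (<w, x_i> - b))."""
    def margin(x):
        return sum(wj * xj for wj, xj in zip(w, x)) - b
    slacks = [max(0.0, 1.0 - yi * margin(xi)) for xi, yi in zip(X, y)]
    return 0.5 * sum(wj * wj for wj in w) + C * sum(slacks)

# Both points lie outside the margin, so all slacks are zero and only
# the regularizer 0.5*||w||^2 remains.
obj = svm_objective(w=[1.0, 0.0], b=0.0,
                    X=[[2.0, 0.0], [-2.0, 0.0]], y=[1, -1], C=1.0)
```

Training searches for the (w, b) minimizing this objective; C trades margin width against training errors.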
