
IntEx: A Syntactic Role Driven Protein-Protein Interaction Extractor for Bio-Medical Text

Syed Toufeeq Ahmed

Deepthi Chidambaram

Hasan Davulcu

Chitta Baral

Outline
  • Introduction
  • Issues and Challenges
  • Our Approach (IntEx System)
  • Evaluation
  • Future Work
  • Conclusion
  • Demo
Introduction
  • Genomic research in the last decade has produced an enormous amount of data, and most of the findings are reported as free text.
  • PubMed/MEDLINE has around 12 million abstracts online.
  • An automated tool for extracting information from biomedical free text would be of great use to researchers (biologists).
Issues that make extraction difficult (Seymore, McCallum et al., 1999)
  • The task involves free text – hence there are many ways of stating the same fact.
  • The genre of text is not grammatically simple.
  • The text includes a lot of technical terminology unfamiliar to existing natural language processing systems.
  • Information may need to be combined across several sentences.
  • There are many sentences from which nothing should be extracted.
Challenges
  • Interactions specified in different ways
    • HMBA inhibits MEC-1 cell proliferation.
    • GBMs commonly overexpress the oncogenes EGFR and PDGFR, and contain mutations and deletions of tumor suppressor genes PTEN and TP53.
    • Protein kinase B (PKB) has emerged as the focal point for many signal transduction pathways, regulating multiple cellular processes such as glucose metabolism, transcription, apoptosis, cell proliferation, angiogenesis, and cell motility.
Challenges (cont.)
  • Anaphora resolution
    • Pronominals – “It activates HMBA”.
    • Sortal anaphora – “Both enzymes are phosphorylated”.
    • Event anaphora – “This reaction acts in a mediated environment.”
  • Multiple interactions in complex sentences
    • Most of the tumor-suppressive properties of Pten are dependent on its lipid phosphatase activity, which inhibits the phosphatidylinositol-3'-kinase (PI3K)/Akt signaling pathway through dephosphorylation of phosphatidylinositol-(3,4,5)-triphosphate

Our Approach (IntEx System)
  • Identify syntactic roles, such as Subject, Object, Verb and modifiers of a sentence.
  • Using these syntactic roles, transform complex sentences into multiple simple clauses.
  • Extract protein-protein interactions from these simple clausal structures.
  • Apply simple pronoun resolution to identify references across multiple sentences.
IntEx System Components
  • Pronoun Resolution
  • Tagging: tagging biological entities with the help of biomedical and linguistic gazetteers.
  • Complex Sentence Processing: splitting complex sentences into simple clausal structures made up of syntactic roles.
  • Interaction Extractor: extracting complete interactions by analyzing the matching contents of syntactic roles and their linguistically significant combinations.
Pronoun Resolution

Ku loads onto dsDNA ends and it can diffuse along the DNA in an energy-independent manner.

  • Pronouns in abstracts are third person: it, itself, them, themselves.

  • Replace each pronoun with the first noun group that agrees in person and number.

Ku loads onto dsDNA ends and Ku can diffuse along the DNA in an energy-independent manner.
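
The replacement rule can be sketched roughly as follows. This is a minimal illustration, not the IntEx implementation: it assumes the noun groups and their number (singular or plural) are already available from an earlier step, and the names `PRONOUN_NUMBER`, `resolve_pronouns` and `noun_groups` are hypothetical. Person agreement is not checked because every pronoun on the slide is third person.

```python
import re

# Third-person pronouns listed on the slide, with the number they agree with.
PRONOUN_NUMBER = {
    "it": "singular",
    "itself": "singular",
    "them": "plural",
    "themselves": "plural",
}


def resolve_pronouns(sentence, noun_groups):
    """Replace each pronoun with the first noun group of matching number.

    noun_groups is a list of (text, number) pairs in sentence order. It is
    a hypothetical input; a real system would obtain it from a chunker.
    """
    def substitute(match):
        word = match.group(0)
        number = PRONOUN_NUMBER[word.lower()]
        for text, group_number in noun_groups:
            if group_number == number:
                return text
        return word  # no agreeing noun group: leave the pronoun unchanged

    pattern = r"\b(" + "|".join(PRONOUN_NUMBER) + r")\b"
    return re.sub(pattern, substitute, sentence, flags=re.IGNORECASE)


sentence = ("Ku loads onto dsDNA ends and it can diffuse along the DNA "
            "in an energy-independent manner.")
noun_groups = [("Ku", "singular"), ("dsDNA ends", "plural")]
resolved = resolve_pronouns(sentence, noun_groups)
print(resolved)
```

With the hand-supplied noun groups, "it" is replaced by "Ku", the first singular noun group, while "dsDNA ends" is skipped because it is plural.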

Tagging
  • Dictionary lookup using gene/protein gazetteers from UMLS, LocusLink, etc.
  • To tag new gene names, we used regular expressions (alphanumeric names, combinations of lower-case and upper-case characters, etc.).
  • Heuristics such as proper-noun detection and NP chunking improve recall.
  • ‘Interaction word’ list is derived from UMLS and WordNet.
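
A rough sketch of the two tagging steps (gazetteer lookup first, regular expressions as a fallback) might look like this. The gazetteer contents and both patterns are illustrative assumptions; the slides do not give the actual expressions, and `GAZETTEER`, `NAME_PATTERNS` and `tag_tokens` are hypothetical names.

```python
import re

# Hypothetical stand-in for a gazetteer built from UMLS or LocusLink.
GAZETTEER = {"hmba", "pten", "egfr", "pdgfr", "tp53"}

# Illustrative patterns for unseen names (assumptions, not the authors' own):
NAME_PATTERNS = [
    # letters combined with at least one digit, e.g. "MEC-1" or "PI3K"
    re.compile(r"^(?=.*[A-Za-z])(?=.*\d)[A-Za-z0-9-]+$"),
    # a lower-case letter directly followed by an upper-case one, e.g. "dsDNA"
    re.compile(r"^[A-Za-z]*[a-z][A-Z][A-Za-z]*$"),
]


def tag_tokens(tokens):
    """Tag each token as GENE (gazetteer hit), GENE? (pattern hit) or O."""
    tags = []
    for token in tokens:
        if token.lower() in GAZETTEER:
            tags.append((token, "GENE"))
        elif any(pattern.match(token) for pattern in NAME_PATTERNS):
            tags.append((token, "GENE?"))
        else:
            tags.append((token, "O"))
    return tags


tagged = tag_tokens(["HMBA", "inhibits", "MEC-1", "cell", "proliferation"])
print(tagged)
```

Keeping pattern hits under a separate `GENE?` tag makes it easy to measure how much recall the regular expressions add and how much precision they cost.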
Complex Sentence Processing

Upon growth factor stimulation of quiescent cells, Gene100 declines late in Gene101 and Gene102 is replaced by Gene103, which is absent in quiescent cells.

  • Upon growth factor stimulation of quiescent cells, Gene100 declines late in Gene101.
  • Gene102 is replaced by Gene103.
  • Gene103 is absent in quiescent cells.

Complex Sentence Processing
  • Verb-based approach.
  • Identify clauses in complex sentences using Link Grammar Linkages
  • Build simple clause sentences from them (for each main verb) in the following Clause Format:

Subject | Verb | Object | Modifying phrase
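
One way to hold a clause in this format is a small record type. The sketch below is an assumption about representation only: the `Clause` class and its field names are hypothetical, and the example clause is filled in by hand rather than produced by a Link Grammar parse.

```python
from dataclasses import dataclass


@dataclass
class Clause:
    """One simple clause built around a single main verb."""
    subject: str
    verb: str
    obj: str = ""
    modifier: str = ""

    def to_row(self):
        # Render in the "Subject | Verb | Object | Modifying phrase" format.
        return " | ".join([self.subject, self.verb, self.obj, self.modifier])


clause = Clause(
    subject="HMBA",
    verb="inhibit",
    obj="the MEC-1 cell proliferation",
    modifier="by down-regulation of PCNA expression",
)
row = clause.to_row()
print(row)
```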

Link Grammar Parser (Sleator and Temperley, 1993)

Sentence: “The cat chased a snake”

Link Grammar Representation:

Interaction Extractor: Role Type Matching

Syntactic roles (such as Subject, Object and Modifying phrase) and their linguistically significant combinations make up the roles used for matching.

Roles: Examples

“HMBA could inhibit the MEC-1 cell proliferation by down-regulation of PCNA expression.”

  • Elementary (Subject)
  • Elementary (Object)
  • Interaction (Verb)
  • Partial (Modifying Phrase)

Interaction Extractor Algorithm

Flowchart summary:

  • Decision: is the Main Verb an Interaction (I)?
  • Roles examined: Elementary (G1), Partial (I, G2), Elementary (G2)
  • Result: Interaction: { G1, I, G2 }
  • Completion rule: complete (G, I, G) gives interact: { G, I, G }

Interaction Extractor Example

“HMBA could inhibit the MEC-1 cell proliferation by down-regulation of PCNA expression.”

  • Main Verb with Elementary roles: { “HMBA”, “inhibit”, “the MEC-1 cell proliferation” }
  • Partial: { “HMBA”, “down-regulation”, “PCNA expression” }
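
The role type matching for the example sentence above can be sketched as follows. This is one reading of the Elementary / Partial / complete labels, not the IntEx algorithm itself: the entity phrases and interaction words are hard-coded stand-ins for tagger output, and every name here (`ENTITIES`, `INTERACTION_WORDS`, `role_type`, `extract_interactions`) is hypothetical.

```python
# Hypothetical tagger output for the example sentence.
ENTITIES = ["HMBA", "the MEC-1 cell proliferation", "PCNA expression"]
INTERACTION_WORDS = ["inhibit", "down-regulation"]


def find_all(text, phrases):
    """Return the phrases that occur in text, in order of appearance."""
    hits = [(text.find(phrase), phrase) for phrase in phrases if phrase in text]
    return [phrase for _, phrase in sorted(hits)]


def role_type(text):
    """Classify a syntactic role by the tagged content it carries."""
    entities = find_all(text, ENTITIES)
    words = find_all(text, INTERACTION_WORDS)
    if len(entities) >= 2 and words:
        return "complete"    # (G, I, G)
    if entities and words:
        return "partial"     # (I, G2)
    if entities:
        return "elementary"  # (G)
    return "none"


def extract_interactions(subject, verb, obj, modifier):
    """Combine roles of one clause into (G1, I, G2) triples."""
    interactions = []
    if role_type(subject) != "elementary":
        return interactions
    g1 = find_all(subject, ENTITIES)[0]
    # Case 1: the main verb is an interaction word and the object is elementary.
    verb_words = find_all(verb, INTERACTION_WORDS)
    if verb_words and role_type(obj) == "elementary":
        interactions.append((g1, verb_words[0], find_all(obj, ENTITIES)[0]))
    # Case 2: a partial modifying phrase supplies its own (I, G2) pair.
    if role_type(modifier) == "partial":
        interactions.append((
            g1,
            find_all(modifier, INTERACTION_WORDS)[0],
            find_all(modifier, ENTITIES)[0],
        ))
    return interactions


result = extract_interactions(
    "HMBA",
    "could inhibit",
    "the MEC-1 cell proliferation",
    "by down-regulation of PCNA expression",
)
print(result)
```

Under these assumptions the sketch yields the two triples shown for the example. The sketch leaves out the "complete" case, where a single role already contains a whole interaction.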

Evaluation (Recall comparison with BioRAT)

Recall of IntEx and BioRAT on 229 abstracts, measured against the DIP database. DIP (Database of Interacting Proteins) is a database of proteins that interact, curated from both abstracts and full text.

Evaluation (Precision comparison with BioRAT)

Precision comparison of IntEx and BioRAT on 229 abstracts.
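
For reference, precision and recall in their standard form can be computed as below. The slides do not state how an extracted interaction was matched against a DIP entry, so exact pair equality is assumed here, and the protein pairs are toy placeholders, not results from the evaluation.

```python
def precision_recall(extracted, reference):
    """Standard precision and recall over two collections of interactions."""
    extracted, reference = set(extracted), set(reference)
    true_positives = len(extracted & reference)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Toy data for illustration only.
extracted = {("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")}
reference = {("A", "B"), ("B", "D"), ("D", "E"), ("E", "F")}
precision, recall = precision_recall(extracted, reference)
print(precision, recall)
```

Two of the four toy extractions appear in the reference set, and two of the four reference pairs are found, so both values are 0.5 in this example.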

Future Work in Interaction Extraction
  • Handling negations in the sentences (such as “not interact”, “fails to induce”, “does not inhibit”).
  • Extraction of detailed contextual attributes of interactions (such as bio-chemical context or location) by interpreting modifiers:
      • Location/Position modifiers (in, at, on, into, up, over…)
      • Agent/Accompaniment modifiers (by, with…)
      • Purpose modifiers (for…)
      • Theme/association modifiers (of…)
  • Extraction of relationships between interactions from among multiple sentences within and across abstracts/full text articles. (Protein Interaction Pathways)
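
As a starting point for interpreting modifiers, a modifying phrase could be classified by its leading preposition, using the four groups listed above. This is only a sketch of the proposed future work: the lookup table copies the prepositions from the list, while `MODIFIER_TYPES` and `modifier_type` are hypothetical names and the first-word rule is an assumption.

```python
# Preposition groups taken from the modifier list above.
MODIFIER_TYPES = {
    "location/position": {"in", "at", "on", "into", "up", "over"},
    "agent/accompaniment": {"by", "with"},
    "purpose": {"for"},
    "theme/association": {"of"},
}


def modifier_type(phrase):
    """Return the modifier type for a phrase, based on its first word."""
    words = phrase.lower().split()
    if not words:
        return None
    for label, prepositions in MODIFIER_TYPES.items():
        if words[0] in prepositions:
            return label
    return None


kind = modifier_type("by down-regulation of PCNA expression")
print(kind)
```

Only the first word is inspected, so the nested "of PCNA expression" in the example is not classified separately.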
A bigger future: combining automated extraction with mass collaboration
  • ‘Curation’ is expensive.
  • Automated extraction – miles to go
  • Vision: automated extraction with mass curation
  • The CBioC system: www.cbioc.org
Conclusion
  • Verb-based approach to extract protein-protein interactions
  • Handles complex sentences
  • Easy to scale up and to adapt to other domains (work on other domains is in progress).
  • Protein name tagging needs improvement; we are working on other methods.
  • First release version is almost ready for both Windows and Linux platforms.
References
  • Link Grammar: http://www.link.cs.cmu.edu/link
  • LocusLink (now Entrez Gene): http://www.ncbi.nlm.nih.gov/LocusLink
  • UMLS: http://www.nlm.nih.gov/research/umls/umlsmain.html

References (cont.)
  • Blaschke, C., M. A. Andrade, et al. (1999). "Automatic extraction of biological information from scientific text: Protein-protein interactions." Proceedings of International Symposium on Molecular Biology: 60-67.
  • Corney, D. P. A., B. F. Buxton, et al. (2004). "BioRAT: extracting biological information from full-length papers." Bioinformatics 20(17): 3206-3213.
  • Friedman, C., P. Kra, et al. (2001). "GENIES: a natural-language processing system for the extraction of molecular pathways from journal articles." Proceedings of the International Conference on Intelligent Systems for Molecular Biology: 574-82.
  • Rzhetsky, A., I. Iossifov, et al. (2004). "GeneWays: a system for extracting, analyzing, visualizing, and integrating molecular pathway data." J. of Biomedical Informatics 37(1): 43-53.
  • Seymore, K., A. McCallum, et al. (1999). "Learning hidden Markov model structure for information extraction." AAAI 99 Workshop on Machine Learning for Information Extraction.
  • Sleator, D. and D. Temperley (1993). "Parsing English with a Link Grammar." Third International Workshop on Parsing Technologies.