
Towards an Automated Analysis of Biomedical Abstracts

Towards an Automated Analysis of Biomedical Abstracts. Barbara Gawronska, Björn Erlendsson, Björn Olsson School of Humanities and Informatics, University of Skövde, Sweden. The goal of the project: text analysis for candidate path extraction.



Presentation Transcript


  1. Towards an Automated Analysis of Biomedical Abstracts Barbara Gawronska, Björn Erlendsson, Björn Olsson School of Humanities and Informatics, University of Skövde, Sweden

  2. The goal of the project: text analysis for candidate path extraction

  3. The characteristics of the language of biomedical texts. A typical PubMed abstract (PMID: 16301995): The tumor suppressor gene hypermethylated in cancer 1 (HIC1), located on human chromosome 17p13.3, is frequently silenced in cancer by epigenetic mechanisms. Hypermethylated in cancer 1 belongs to the bric a brac/poxviruses and zinc-finger family of transcription factors and acts by repressing target gene expression. It has been shown that enforced p53 expression leads to increased HIC1 mRNA, and recent data suggest that p53 and Hic1 cooperate in tumorigenesis. In order to elucidate the regulation of HIC1 expression, we have analysed the HIC1 promoter region for p53-dependent induction of gene expression. (…) Other members of the p53 family, notably TAp73beta and DeltaNp63alpha, can also act through this HIC1.PRE to induce transcription of HIC1, and finally, hypermethylation of the HIC1 promoter attenuates inducibility by p53.

  4. Results of POS-tagging of two large corpora (30 million words each): 1) texts on stem cell research, and 2) general English prose. [Chart legend: light = stem cell corpus, dark = prose]

  5. Results of POS-tagging of a smaller sample corpus of biomedical abstracts

  6. The general architecture of the Information Extraction system

  7. Patterns for domain-specific Named Entity Recognition
     • Pattern 1: n lower case chars (n>=1) + m integers (m>=2) + optionally: any character (p53, cdc25C, bcl2)
     • Pattern 2: n lower case chars (n>=1) + m upper case chars (m>=1) + k integers (k>=0) (mRNA)
     • Pattern 3: integer + lower case + n integers (n>=0) (1alpha)
     • Pattern 4: n integers (n>=1) + m upper case (m>=1) (7BL)
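The four patterns on this slide can be read as regular expressions. The following is a minimal sketch under stated assumptions, not the authors' implementation: "integers" is taken to mean digits, "optionally: any character" to mean at most one trailing character, and "integer" in Pattern 3 to mean one or more digits. The names `NER_PATTERNS` and `matching_patterns` are hypothetical. Read literally (m>=2), Pattern 1 does not cover the listed example bcl2, which has a single digit; the sketch keeps the stated bound rather than guessing the intended one.

```python
import re

# Assumed regex reading of the four slide patterns (not the authors' code).
NER_PATTERNS = {
    1: re.compile(r"[a-z]+[0-9]{2,}.?"),   # lower case + >=2 digits + optional char
    2: re.compile(r"[a-z]+[A-Z]+[0-9]*"),  # lower case + upper case + optional digits
    3: re.compile(r"[0-9]+[a-z]+[0-9]*"),  # digit(s) + lower case + optional digits
    4: re.compile(r"[0-9]+[A-Z]+"),        # digit(s) + upper case
}


def matching_patterns(token):
    """Return the numbers of all patterns that match the whole token."""
    return [n for n, rx in NER_PATTERNS.items() if rx.fullmatch(token)]


for tok in ["p53", "cdc25C", "bcl2", "mRNA", "1alpha", "7BL", "cells"]:
    print(tok, matching_patterns(tok))
```

Whole-token matching (`fullmatch`) is used so that ordinary words containing a matching substring are not flagged.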

  8. Linking acronyms to full names of biological objects (flowchart)
     • From previous procedure: place pointer at the first word in the sentence.
     • Find next acronym A. Found? No: to next procedure (other parts of the NER-module).
     • Yes: L1 := first letter of A; N := number of letters in A.
     • Is A within (…)? Yes: find the N:th word beginning in L1 to the left of the '(', link that word and its right context to A.
     • No: is A followed by '(' and L1*? Yes: mark the words inside the (…), link to A. No: find next acronym.
     Examples: There are also tumor-related genes like NF2 (neurofibromatose of type 2). p16INK4a belongs to a cell cycle regulator group called cyclin dependent kinase inhibitors (CDKI).
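The acronym-linking procedure on this slide can be sketched roughly as follows. This is an illustration under assumptions, not the system's actual NER module: `looks_like_acronym` is a hypothetical stand-in for the real acronym detector, "the N:th word ... to the left" is read as the word N positions before the opening parenthesis, and `link_acronyms` is an invented name.

```python
import re


def looks_like_acronym(token):
    # Hypothetical heuristic: one alphanumeric token with >= 2 capitals.
    return token.isalnum() and sum(c.isupper() for c in token) >= 2


def link_acronyms(sentence, is_acronym=looks_like_acronym):
    """Map each acronym to the full name found next to a parenthesis."""
    links = {}
    for m in re.finditer(r"\(([^()]+)\)", sentence):
        inside = m.group(1).strip()
        left_words = sentence[:m.start()].split()
        if is_acronym(inside):
            # Case 1: the acronym is inside (...); look N words to the left.
            n = sum(c.isalpha() for c in inside)
            l1 = inside[0].lower()
            if 0 < n <= len(left_words) and left_words[-n].lower().startswith(l1):
                links[inside] = " ".join(left_words[-n:])
        elif left_words and is_acronym(left_words[-1]):
            # Case 2: the acronym precedes '(' and the text inside starts with L1.
            acronym = left_words[-1]
            if inside.lower().startswith(acronym[0].lower()):
                links[acronym] = inside
    return links


print(link_acronyms(
    "There are also tumor-related genes like NF2 (neurofibromatose of type 2)."))
print(link_acronyms(
    "p16INK4a belongs to a cell cycle regulator group called "
    "cyclin dependent kinase inhibitors (CDKI)."))
```

The two calls use the example sentences from the slide, one for each branch of the flowchart.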

  9. Sample semantico-syntactic tags
     Sentence: Our finding implicates that TNF-alpha released from the mesangium after IgA deposition activates renal tubular cells.
     [semcat('Our',our,[[],poss([])]),
      semcat(finding,find,[wnn,[]]),
      semcat(implicates,implicate,[[],[speech_act_verb([1])]),
      semcat(that,that,[[],rel([])]),
      semcat('TNF',[propername]),
      semcat(alpha,alpha,[wnn,[]]),
      semcat(released,release,[[],bioverb([[],production])]),
      semcat(from,from,[[],prep([])]),
      semcat(the,the,[[],det([])]),
      semcat(mesangium,mesangium,[[],[]]),
      semcat(after,after,[[],prep([])]),
      semcat('IgA',[propername]),
      semcat(deposition,deposition,[wnn,[]]),
      semcat(activates,activate,[[],bioverb([[],activation])]),
      semcat(renal,renal,[adj,[]]),
      semcat(tubular,tubular,[adj,[]]),
      semcat(cells,cell,[[],cell([])]),
      semcat('.',[[],[]])]

  10. Tags (occurrences) in the test set in relation to knowledge sources

  11. The next step: finding background and foreground in abstracts

  12. Background/foreground in abstracts. ID: 16284406. The transcription factors dehydration-responsive element-binding protein 1s (DREB1s)/C-repeat-binding factors (CBFs) specifically interact with the DRE/CRT cis-acting element and control the expression of many stress-inducible genes in Arabidopsis. The genes for DREB1 orthologs, OsDREB1A and OsDREB1B from rice, are induced by cold stress, and overexpression of DREB1 or OsDREB1 induced strong expression of stress-responsive genes in transgenic Arabidopsis plants, resulting in increased tolerance to high-salt and freezing stresses. In this study, we generated transgenic rice plants overexpressing the OsDREB1 or DREB1 genes. These transgenic rice plants showed not only growth retardation under normal growth conditions but also improved tolerance to drought, high-salt and low-temperature stresses like the transgenic Arabidopsis plants overexpressing OsDREB1 or DREB1. We also detected elevated contents of osmoprotectants such as free proline and various soluble sugars in the transgenic rice as in the transgenic Arabidopsis plants. (…)

  13. Retrieval of Relevant Text Parts
      • Presence of the string this study/current study/present study/our study, or synonyms of study in the same context (work, research, investigation)
      • Presence of the pronoun we preceded or followed by a verb denoting an event in the world of the researcher (i.e., a cognition, communication, or manipulation verb) and not combined with a time adverb referring to past time, such as previously, earlier
      • Presence of the string our goal/our aim
      • Presence of a cognition/communication verb combined with the adverb now, presently or here
      • Tense shift from present to past
      Success rate: 92.5%

  14. Retrieval of Relevant Text Parts (2)
      if Foreground < 6 and word is in [study, work, research, investigation]
            and word-1 is in [this, current, present, our]
         -> Foreground = 6
      else if Foreground < 5 and word is a CCverb and foundWe = 1
            and set_found{ "previously", "earlier" } = 0
         -> Foreground = 5
      else if Foreground < 4 and foundCCverb = 1
            and (foundWe = 1 and word is not in [previously, earlier])
         -> Foreground = 4
      else if Foreground < 3 and word is in [goal, aim] and word-1 is [our]
         -> Foreground = 3
      else if Foreground < 2 and word is a CCverb
         -> if set_found{ "now", "presently", "here" } = 1 -> Foreground = 2
            else foundCCverb = 1
      if Foreground < 2 and word is in [now, presently, here]
         -> if foundCCverb = 1 -> Foreground = 2
            else set_found{ "now", "presently", "here" } = 1
      if word indicates tense shift from present to past
         -> Foreground = 1
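The scoring rules on this slide can be approximated in a few lines of Python. This is a hedged sketch, not the project's code: the verb list `CC_VERBS` is a tiny hypothetical stand-in for the system's lexicon of cognition/communication verbs, the way the `foundWe` and past-adverb flags get set is an assumption (the slide does not show it), and the final tense-shift rule is omitted because it needs tense information that a plain word list cannot supply. The function name `foreground_score` is invented.

```python
import re

STUDY_NOUNS = {"study", "work", "research", "investigation"}
STUDY_DETERMINERS = {"this", "current", "present", "our"}
GOAL_NOUNS = {"goal", "aim"}
PAST_ADVERBS = {"previously", "earlier"}
NOW_ADVERBS = {"now", "presently", "here"}
# Hypothetical stand-in for the lexicon of cognition/communication verbs.
CC_VERBS = {"generated", "detected", "analysed", "reported", "show", "hypothesise"}


def foreground_score(sentence, cc_verbs=CC_VERBS):
    """Score a sentence from 0 (no foreground cue) to 6 (strongest cue)."""
    words = re.findall(r"[a-z]+", sentence.lower())
    score = 0
    found_we = found_cc_verb = found_past = found_now = False
    for i, word in enumerate(words):
        prev = words[i - 1] if i > 0 else ""
        if word == "we":
            found_we = True      # assumption: flag is set on seeing "we"
        if word in PAST_ADVERBS:
            found_past = True    # assumption: flag is set on a past adverb
        if score < 6 and word in STUDY_NOUNS and prev in STUDY_DETERMINERS:
            score = 6
        elif score < 5 and word in cc_verbs and found_we and not found_past:
            score = 5
        elif score < 4 and found_cc_verb and found_we and word not in PAST_ADVERBS:
            score = 4
        elif score < 3 and word in GOAL_NOUNS and prev == "our":
            score = 3
        elif score < 2 and word in cc_verbs:
            if found_now:
                score = 2
            else:
                found_cc_verb = True
        if score < 2 and word in NOW_ADVERBS:
            if found_cc_verb:
                score = 2
            else:
                found_now = True
    return score


print(foreground_score("In this study, we generated transgenic rice plants."))
print(foreground_score("The genes are induced by cold stress."))
```

The branch order follows the slide as written, so a higher-ranked cue always wins over a lower-ranked one found in the same sentence.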

  15. Extracting relations from syntactic trees. Example sentence: We hypothesise that mediators released from human mesangial cells (HMC) triggered by IgA deposition may lead to activation of proximal tubular epithelial cells (PTEC).
      [Figure: syntactic tree of the example sentence. Recoverable nodes: S (subj: we; pred: hypothesize; obj: Sdsent); Sdsent (subj: mediators + Relcl i; pred: may lead to; obj: NUX activation, PTEC); Sdsent i (subj: Ref (mediators); pred: release, pass; advl: P NP from HMC + Relcl j); Sdsent j (subj: Ref (HMC); pred: trigger, pass; advl: agent IgA deposition). Recoverable side labels: Hypothese; v: trigger; v: release; relation type: production; KEGG relation: activation.]

  16. Allelic loss at TP53 seems to arise independently of LOH at the RB1 gene in carcinomas of the uterine corpus in humans

  17. The syntactic tree after application of the tree search algorithm

  18. A possible graphical representation of the compressed tree

  19. Results
      • Test corpus: about 15 000 words selected from PubMed using p53 as keyword
      • Tagging: 95.2% recall
      • Retrieval of relevant text parts: success rate 92.5%
      • Syntactic parsing: 79% recall, 86% precision
      • Relation retrieval: tested only manually, success rate about 94%

  20. Current and Future Work
      • A revised tagging procedure; tagging using a smaller lexicon and domain-specific prefix list
      • Parsing improvements
      • Implementation of the tree search algorithm
      • The question of the final output format
