
New Search Tools for Bioscience Journal Articles

New Search Tools for Bioscience Journal Articles. Marti Hearst, UC Berkeley School of Information. UIUC Comp-Bio Seminar February 12, 2007. Supported by NSF DBI-0317510 And a gift from Genentech. Outline. Biotext Project Introduction Simple Abbreviation Definition Recognition Citances



  1. New Search Tools for Bioscience Journal Articles Marti Hearst, UC Berkeley School of Information UIUC Comp-Bio Seminar February 12, 2007 Supported by NSF DBI-0317510 And a gift from Genentech

  2. Outline • Biotext Project Introduction • Simple Abbreviation Definition Recognition • Citances • A New Search Interface Idea

  3. Double Exponential Growth in Bioscience Journal Articles From Hunter & Cohen, Molecular Cell 21, 2006

  4. BioText Project Goals • Provide flexible, useful, appealing search for bioscientists. • Focus on: • Full text journal articles • New language analysis algorithms • New search interfaces

  5. Bioscience Text is Challenging • Complex sentence structure • Huge vocabulary • Including LOTS of abbreviations • Gene/protein name recognition a major task • Full text documents have complex structure – which parts are key?

  6. BioText Architecture Sophisticated Text Analysis Annotations in Database Improved Search Interface

  7. Project Team • Project Leaders: • PI: Marti Hearst • Co-PI: Adam Arkin • Computational Linguistics and Databases • Preslav Nakov • Jerry Ye • Ariel Schwartz (alum) • Brian Wolf (alum) • Barbara Rosario (alum) • Gaurav Bhalotia (alum) • User Interface / IR • Mike Wooldridge • Rowena Luk (alum) • Dr. Emilia Stoica (alum) • Bioscience • Dr. Anna Divoli • Janice Hamerja (alum) • Dr. TingTing Zhang (alum)

  8. The Problem: Identify Acronym Definitions • methyl methanesulfonate sulfate (MMS) • heat shock transcription factor (HSF) • Gcn5-related N-acetyltransferase (GNAT) • We investigated the redox regulation of the stress response and report here that in the human pre-monocytic line U937 cells, H2O2 induced a concentration-dependent transactivation and DNA-binding activity of heat-shock factor-1 (HSF-1)

  9. Identifying Acronym Definitions • To identify <“short form”, “long form”> pairs from biomedical text: • The short form is an abbreviation of the long form • There exists a character mapping from the short form to the long form • Example: • Gcn5-related N-acetyltransferase (GNAT) • A non-trivial problem: • Words in the long form may be skipped • Internal letters in the long form may be used

  10. Previous Work • Machine learning approaches • Linear regression (Chang et al.) • Encoding and compression (Yeates et al.) • Cubic time or worse • Heuristic approach • Rule-based (Park & Byrd) • Factors considered include: • Distance between definition and abbreviation • Number of stop words • Capitalization • Can’t reproduce this algorithm

  11. Step 1: Identifying Candidates • Consider only two cases: • long form ‘(‘ short form ‘)’ • short form ‘(‘ long form ‘)’ • Short form: • No more than 2 words • Between 2 and 10 chars • At least one letter • First char alphanumeric • Long form: • Adjacent to short form • No more than min(|A| + 5, |A| * 2) words
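
The short-form constraints and the long-form length bound above can be sketched in Java as follows. This is an illustrative sketch, not the project's code: the class and method names are hypothetical, and |A| is assumed here to mean the number of characters in the short form, which the slide does not spell out.

```java
final class CandidateFilter {

    /** Checks the short-form constraints listed on the slide. */
    static boolean isValidShortForm(String candidate) {
        String s = candidate.trim();
        if (s.length() < 2 || s.length() > 10) {
            return false;                       // between 2 and 10 characters
        }
        if (s.split("\\s+").length > 2) {
            return false;                       // no more than 2 words
        }
        if (!Character.isLetterOrDigit(s.charAt(0))) {
            return false;                       // first character alphanumeric
        }
        for (char c : s.toCharArray()) {
            if (Character.isLetter(c)) {
                return true;                    // at least one letter
            }
        }
        return false;
    }

    /** Upper bound on long-form length in words: min(|A| + 5, |A| * 2). */
    static int maxLongFormWords(String shortForm) {
        int a = shortForm.trim().length();      // assumption: |A| = character count
        return Math.min(a + 5, a * 2);
    }

    public static void main(String[] args) {
        System.out.println(isValidShortForm("HSF"));   // true
        System.out.println(isValidShortForm("123"));   // false: no letter
        System.out.println(maxLongFormWords("HSF"));   // 6
    }
}
```

Under this reading, a three-character short form such as HSF allows a long-form window of at most six words next to the parenthesis.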

  12. Step 2: Identifying Correct Long Forms • heat shock transcription factor (HSF) • [heat shock transcription factor](HSF)

  13. Step 2: Identifying Correct Long Forms • Gcn5-related N-acetyltransferase (GNAT)

  14. Step 2: Identifying Correct Long Forms • Scanning from right to left, choose the shortest long form that matches the short form: • Each character in the short form must match a character in the long form • The first character of the short form must match the initial character of the first word of the long form

  15. Java Code for Finding the Best Long Form for a Given Short Form
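
The code shown on this slide is not reproduced in the transcript. A minimal Java sketch of the right-to-left matching described on the previous slide could look like the following. The class name is hypothetical. Skipping non-alphanumeric short-form characters and matching case-insensitively are assumptions, not details stated on the slides.

```java
final class LongFormFinder {

    /**
     * Scans both strings right to left and returns the shortest suffix of
     * longForm (starting at a word boundary) that matches shortForm,
     * or null if no match exists.
     */
    static String findBestLongForm(String shortForm, String longForm) {
        int lIndex = longForm.length() - 1;
        for (int sIndex = shortForm.length() - 1; sIndex >= 0; sIndex--) {
            char curr = Character.toLowerCase(shortForm.charAt(sIndex));
            if (!Character.isLetterOrDigit(curr)) {
                continue;                       // assumption: ignore punctuation
            }
            // Move left until the character matches. The first short-form
            // character must additionally sit at the start of a word.
            while ((lIndex >= 0
                        && Character.toLowerCase(longForm.charAt(lIndex)) != curr)
                    || (sIndex == 0 && lIndex > 0
                        && Character.isLetterOrDigit(longForm.charAt(lIndex - 1)))) {
                lIndex--;
            }
            if (lIndex < 0) {
                return null;                    // ran out of long-form text
            }
            lIndex--;                           // consume the matched character
        }
        // Extend the match back to the start of the word it begins in.
        int start = longForm.lastIndexOf(" ", lIndex) + 1;
        return longForm.substring(start);
    }

    public static void main(String[] args) {
        // Prints: heat shock transcription factor
        System.out.println(findBestLongForm(
                "HSF", "DNA-binding activity of heat shock transcription factor"));
        // Prints: Gcn5-related N-acetyltransferase
        System.out.println(findBestLongForm(
                "GNAT", "Gcn5-related N-acetyltransferase"));
    }
}
```

The scan makes a single pass over the candidate text, in contrast to the cubic-time approaches mentioned under Previous Work.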

  16. Evaluation • 1000 randomly selected MEDLINE abstracts • 82% recall, 95% precision • Medstract Gold Standard Evaluation Corpus • 82% recall, 96% precision • Compared with • 83% recall, 80% precision (Chang et al., linear regression) • 72% recall, 98% precision (Pustejovsky et al., heuristics)

  17. Missing Pairs • Skipped characters in short form • <CNS1, cyclophilin seven suppressor> • No match • <5-HT, serotonin> • Out of order • <ATN, anterior thalamus> • Partial match • <Pol I, RNA polymerase I>

  18. Other NLP Work • Relation labeling • (Work primarily by Barbara Rosario) • Protein-protein interactions: which ones are happening? • They also demonstrate that the GAG protein from membrane-containing viruses, such as HIV, binds to Alix/AIP1, thereby recruiting the ESCRT machinery to allow budding of the virus from the cell surface [cite]. • Distinguished among 10 different relations • Binds, degrades, synergizes with, upregulates … • Simple supervised approach gets surprisingly high results (~60% accuracy)

  19. Acquiring Labeled Data using Citances

  20. A discovery is made … A paper is written …

  21. That paper is cited … and cited … and cited … … as the evidence for some fact(s) F.

  22. Each of these in turn is cited for some fact(s) … … until it is the case that all important facts in the field can be found in citation sentences alone!

  23. Citances • Nearly every statement in a bioscience journal article is backed up with a cite. • It is quite common for papers to be cited 30-100 times. • The text around the citation tends to state biological facts. (Call these citances.) • Different citances will state the same facts in different ways … • … so can we use these for creating models of language expressing semantic relations?

  24. Using citances • Potential uses of citation sentences (citances) • creation of training and testing data for semantic analysis, • synonym set creation, • database curation, • document summarization, • and information retrieval generally. • All of the above require citance word alignments.

  25. Sample Citance “Recent research, in proliferating cells, has demonstrated that interaction of E2F1 with the p53 pathway could involve transcriptional up-regulation of E2F1 target genes such as p14/p19ARF, which affect p53 accumulation [67,68], E2F1-induced phosphorylation of p53 [69], or direct E2F1-p53 complex formation [70].”

  26. Related Work • Traditional citation analysis dates back to the 1960’s (Garfield). Includes: • Citation categorization, • Context analysis, • Citer motivation. • Citation indexing systems, such as ISI’s SCI, and CiteSeer. • Mercer and Di Marco (2004) propose to improve citation indexing using citation types. • Bradshaw (2003) introduces Reference Directed Indexing (RDI), which indexes documents using the terms in the citances citing them.

  27. Related Work (cont.) • Teufel and Moens (2002) identify citances to improve summarization of the citing paper. They give lower weight to citances as candidate sentences for summarization. • Nanba et al. (2000) use citances as features for classifying papers into topics. • A field related to citation indexing is the use of link structure and anchor text of Web pages. • Applications include: IR, classification, Web crawlers, and summarization. • See the full paper for references.

  28. Issues for Processing Citances • Text span • Identification of the appropriate phrase, clause, or sentence that constitutes a citance. • Correct mapping of citations when shown as lists or groups (e.g., “[22-25]”). • Grouping citances by topic • Citances that cite the same document should be grouped by the facts they state. • Normalizing or paraphrasing concepts in citances
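
One part of the citation-mapping issue above, expanding grouped markers such as “[22-25]” or “[67,68]” into individual reference numbers, can be sketched in Java as follows. The class and method names are hypothetical, and the sketch assumes numeric, bracketed markers only. Real articles use many other citation styles that it does not handle.

```java
import java.util.ArrayList;
import java.util.List;

final class CitationGroupSketch {

    /** Expands a marker like "[22-25]" or "[67,68]" into reference numbers. */
    static List<Integer> expand(String marker) {
        String inner = marker.trim();
        if (inner.startsWith("[") && inner.endsWith("]")) {
            inner = inner.substring(1, inner.length() - 1);
        }
        List<Integer> ids = new ArrayList<>();
        for (String part : inner.split(",")) {
            // A part is either a single number or a range "first-last"
            // (hyphen or en dash).
            String[] ends = part.trim().split("[-\u2013]");
            int first = Integer.parseInt(ends[0].trim());
            int last = Integer.parseInt(ends[ends.length - 1].trim());
            for (int id = first; id <= last; id++) {
                ids.add(id);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(expand("[22-25]"));  // [22, 23, 24, 25]
        System.out.println(expand("[67,68]"));  // [67, 68]
    }
}
```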

  29. How Do Citances Differ From Abstracts? • (This part primarily by Anna Divoli.) • We did a detailed analysis of facts that appear in citances. • 6 target papers, molecular interactions domain • We did the same for the abstracts of the target papers.

  30. Distributions of Concept Types

  31. How Do Citances Differ From Abstracts? • Main results: • All of the facts in the abstract are covered by the citances (collectively). • However, some facts in citances do not appear in the abstracts. • Mainly Entities and Experimental Methods • This suggests there is important information in the full text that is not represented by the abstract, title, and metadata alone.

  32. Paraphrasing Citances • (This part primarily by Preslav Nakov) • Problem: many citances say the same thing in different ways • The sentence structure is very complex and contains irrelevant information • We want to first “normalize” those citances that talk about similar things, so we can then determine which sentences repeat the same information. • This will then allow us to determine what the key points are and thus convert them into summaries.

  33. Want to Normalize These: • NGF withdrawal from sympathetic neurons induces Bim, which then contributes to death. • Nerve growth factor withdrawal induces the expression of Bim and mediates Bax dependent cytochrome c release and apoptosis. • Recently, Bim has been shown to be upregulated following both nerve growth factor withdrawal from primary sympathetic neurons, and serum and potassium withdrawal from granule neurons. • The proapoptotic Bcl-2 family member Bim is strongly induced in sympathetic neurons in response to NGF withdrawal. • In neurons, the BH3 only Bcl2 member, Bim, and JNK are both implicated in apoptosis caused by nerve growth factor deprivation.

  34. The Resulting Paraphrases • NGF withdrawal induces Bim. • Nerve growth factor withdrawal induces the expression of Bim. • Bim has been shown to be upregulated following nerve growth factor withdrawal. • Bim is induced in sympathetic neurons in response to NGF withdrawal. • Bim implicated in apoptosis caused by nerve growth factor deprivation. • They all paraphrase: Bim is induced after NGF withdrawal.

  35. Paraphrase Creation Algorithm 1. Extract the sentences that cite the target. 2. Mark the NEs of interest (genes/proteins, MeSH terms) and normalize. 3. Dependency parse. 4. For each parse, and for each pair of NEs of interest: (i) Extract the path between them. (ii) Create a paraphrase from the path. 5. Rank the candidates for a given pair of NEs. 6. Select only the ones above a threshold. 7. Generalize.

  36. Creating a Paraphrase Given the path from the dependency parse: • Restore the original word order. • Add words to improve grammaticality. • Bim … shown … be … following nerve growth factor withdrawal. • Bim [has] [been] shown [to] be [upregulated] following nerve growth factor withdrawal.
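
The path-extraction and word-order steps can be illustrated with a small Java sketch. It is an assumption-laden toy, not the project's implementation: the class name is hypothetical, the dependency heads below are assigned by hand for illustration rather than produced by a parser, and the step that adds words for grammaticality is not modeled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

final class DependencyPathSketch {

    /**
     * head[i] is the index of token i's head, or -1 for the root.
     * Returns the tokens on the tree path between a and b, in sentence order.
     */
    static List<Integer> pathInSentenceOrder(int[] head, int a, int b) {
        List<Integer> ancestorsOfA = new ArrayList<>();
        for (int i = a; i != -1; i = head[i]) {
            ancestorsOfA.add(i);
        }
        TreeSet<Integer> path = new TreeSet<>();    // sorted = original word order
        int i = b;
        while (i != -1 && !ancestorsOfA.contains(i)) {
            path.add(i);
            i = head[i];
        }
        if (i == -1) {
            throw new IllegalArgumentException("tokens are not in the same tree");
        }
        for (int j : ancestorsOfA) {                // i is the lowest common ancestor
            path.add(j);
            if (j == i) {
                break;
            }
        }
        return new ArrayList<>(path);
    }

    static String render(String[] tokens, List<Integer> path) {
        StringBuilder sb = new StringBuilder();
        for (int idx : path) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(tokens[idx]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] tokens = {"NGF", "withdrawal", "from", "sympathetic",
                           "neurons", "induces", "Bim"};
        int[] head = {1, 5, 1, 4, 2, -1, 5};        // hand-assigned, illustrative
        // Prints: NGF withdrawal induces Bim
        System.out.println(render(tokens, pathInSentenceOrder(head, 0, 6)));
    }
}
```

With these hand-assigned heads, the prepositional phrase "from sympathetic neurons" lies off the path between the two entities and drops out.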

  37. 2-word Heuristic Demonstration • NGF withdrawal induces Bim. • Nerve growth factor withdrawal induces [the] expression of Bim. • Bim [has] [been] shown [to] be [upregulated] following nerve growth factor withdrawal. • Bim [is] induced in [sympathetic] neurons in response to NGF withdrawal. • member Bim implicated in apoptosis caused by nerve growth factor deprivation.

  38. Evaluation (1) • An influential journal paper from Neuron: • J. Whitfield, S. Neame, L. Paquet, O. Bernard, and J. Ham. Dominant-negative c-jun promotes neuronal survival by reducing bim expression and inhibiting mitochondrial cytochrome c release. Neuron, 29:629–643, 2001. • 99 journal papers citing it • 203 citances in total • 36 different types of important biological factoids • But we concentrated on one of them: “Bim is induced after NGF withdrawal.”

  39. Evaluation (2) • Set 1: 67 citances pointing to the target paper and manually found to contain a good or acceptable paraphrase (do not necessarily contain Bim or NGF); • Set 2: 65 citances pointing to the target paper and containing both Bim and NGF; • Set 3: 102 sentences from the 99 texts, containing both Bim and NGF • Cluster: all 203 citances: • Spectral clustering • Polynomial kernel • Clusters for which more than 80% of the citances include both NGF and Bim • Set 1 – assess the system under ideal conditions. • Set 2 vs. 3 – Do citances produce better paraphrases?

  40. Results % - good (1.0) or acceptable (0.5)

  41. The Citance Fact Extraction Problem • (This part primarily by Ariel Schwartz.) • Find groups of words/phrases that are semantically “similar” in target paper’s context. • Orthographic similarity is important but does not always entail semantic similarity. • This is another step needed for normalizing the content. • Can use the results of this algorithm to determine which entities to use in the paraphrasing just described.

  42. Example of original citances

  43. Entities Identified and Labeled as Equivalent to One Another response genotoxic stress Chk1 Chk2 phosphorylate Cdc25A N terminal sites target rapidly ubiquitin dependent degradation thought central S G2 cell cycle checkpoints Given Chk1 promotes Cdc25A turnover response DNA damage vivo Chk1 required Cdc25A ubiquitination SCF beta TRCP vitro explored role Cdc25A phosphorylation ubiquitination process activated phosphorylated Chk2 T68 involved phosphorylation degradation Cdc25A examined levels Cdc25A 2fTGH U3A cells exposed gamma IR

  44. Features for citance word alignment • Orthographic features • exact string match, • normalized edit distance, • prefix, suffix match, • word lengths, • capitalization. • Local contextual features • distance between target words of adjacent source words, • word-specific tendency to align like the previous/next word, • transition to, from, and between (un)aligned words. • Biological ontology based features • Medical Subject Headings (MeSH), • Gene synonyms (Entrez Gene, UniProt, OMIM). • Lexical features • WordNet similarity (Lin, 1998)
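
Two of the orthographic features listed above can be sketched in Java as follows. The class and method names are hypothetical. Dividing the Levenshtein distance by the length of the longer word is one common normalization and is an assumption here, since the slide does not say which normalization was used.

```java
final class OrthographicFeatures {

    /** Classic Levenshtein distance using two rolling rows. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Edit distance scaled to [0, 1] by the longer word (assumed normalization). */
    static double normalizedEditDistance(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        return longest == 0 ? 0.0 : (double) editDistance(a, b) / longest;
    }

    /** Length of the shared prefix, usable as a prefix-match feature. */
    static int commonPrefixLength(String a, String b) {
        int n = Math.min(a.length(), b.length());
        int i = 0;
        while (i < n && a.charAt(i) == b.charAt(i)) {
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        // 3 edits over 15 characters, so this prints 0.2
        System.out.println(normalizedEditDistance("phosphorylate", "phosphorylation"));
        System.out.println(commonPrefixLength("phosphorylate", "phosphorylation")); // 12
    }
}
```

A simple cutoff on the normalized distance alone corresponds to the kind of baseline mentioned on the Data sets slide.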

  45. Approach: Posterior Decoding • Use Conditional Random Fields • Compute posterior probabilities using EM • For every target word w, compute the combination of source words that maximizes the expected score of w • Take the union of individual word optimal alignments and produce a multiple alignment • Use a match-factor to reward/penalize a combination based on the number of words that align to the same target word

  46. Data sets • 3 sets of citances annotated by a PhD with biological training (Anna Divoli) • Training set - 4 groups, 10 citances each (360 pairs). • Development set – 51 citances (2550 pairs). • Test set – 45 citances (1980 pairs). • Feature engineering was done using the training and development sets. • Final results based on a model trained on training and development sets combined, and tested on the test set. • Baseline – using only normalized edit distance with a simple cutoff.
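
The pair counts above are consistent with counting ordered pairs of distinct citances within each group, n(n − 1). This reading is inferred from the numbers and is not stated on the slide:

```latex
\begin{align*}
\text{Training:}    \quad & 4 \times (10 \times 9) = 360 \\
\text{Development:} \quad & 51 \times 50 = 2550 \\
\text{Test:}        \quad & 45 \times 44 = 1980
\end{align*}
```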

  47. Results

  48. A Full Text Search Interface (This work in part by Mike Wooldridge and Jerry Ye)

  49. The Importance of Figures and Captions • Observations of biologists’ reading habits: • It has often been observed that biologists focus on figures+captions along with title and abstract. • KDD Cup 2002 • The objective was to extract only the papers that included experimental results regarding expression of gene products and to identify, from all the genes mentioned in each document, the genes and products for which experimental results were provided. • ClearForest+Celera did well in part by focusing on figure captions, which contain critical experimental evidence.
