
Scaling Up BioNLP: Application of a Text Annotation Architecture to Noun Compound Bracketing

Preslav Nakov, Ariel Schwartz, Brian Wolf, Marti Hearst. Computer Science Division and SIMS, University of California, Berkeley. http://biotext.berkeley.edu


Presentation Transcript


  1. Scaling Up BioNLP: Application of a Text Annotation Architecture to Noun Compound Bracketing • Preslav Nakov, Ariel Schwartz, Brian Wolf, Marti Hearst • Computer Science Division and SIMS, University of California, Berkeley • http://biotext.berkeley.edu • Supported by NSF DBI-0317510 and a gift from Genentech

  2. Plan • Overview • Noun compound (NC) bracketing • Problems with Web Counts • Layers of annotation • Applying LQL to NC bracketing • Evaluation

  3. Overview • Motivation: Need to re-use results of NLP processing: • for additional processing • for end applications: data mining etc. • Proposed solution: • Layers of annotations over text • Illustration: • Application to noun compound bracketing

  4. Plan • Overview • Noun compound (NC) bracketing • Problems with Web Counts • Layers of annotation • Applying LQL to NC bracketing • Evaluation

  5. Noun Compound Bracketing (a) [ [ liver cell ] antibody ] (left bracketing) (b) [ liver [ cell line ] ] (right bracketing) • In (a), the antibody targets the liver cell. • In (b), the cell line is derived from the liver.

  6. Related Work • Pustejovsky et al. (1993) • adjacency model: Pr(w1|w2) vs. Pr(w2|w3) • Lauer (1995) • dependency model: Pr(w1|w2) vs. Pr(w1|w3) • Keller & Lapata (2004): • use the Web • unigrams and bigrams • Nakov & Hearst (2005): will be presented at CoNLL! • use the Web, chi-squared • n-grams • paraphrases • surface features
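The two models differ only in which word pair is compared against (w1, w2): in Lauer's (1995) formulation, the adjacency model compares it with (w2, w3) and the dependency model with (w1, w3). A minimal Python sketch, assuming a hypothetical `assoc` function that returns some association strength for a word pair (a probability, a raw frequency, or a chi-squared score). The toy scores and the tie-breaking toward "left" are assumptions made for illustration, not values from the talk:

```python
from typing import Callable

Assoc = Callable[[str, str], float]

def bracket_adjacency(w1: str, w2: str, w3: str, assoc: Assoc) -> str:
    """Adjacency model: compare (w1, w2) with (w2, w3)."""
    return "left" if assoc(w1, w2) >= assoc(w2, w3) else "right"

def bracket_dependency(w1: str, w2: str, w3: str, assoc: Assoc) -> str:
    """Dependency model: compare (w1, w2) with (w1, w3)."""
    return "left" if assoc(w1, w2) >= assoc(w1, w3) else "right"

# Toy association scores, invented for illustration only.
TOY_SCORES = {
    ("liver", "cell"): 0.30,
    ("cell", "antibody"): 0.10,
    ("liver", "antibody"): 0.05,
    ("cell", "line"): 0.60,
    ("liver", "line"): 0.40,
}

def toy_assoc(a: str, b: str) -> float:
    """Look up a toy score; unseen pairs get 0.0."""
    return TOY_SCORES.get((a, b), 0.0)
```

With these toy scores both models bracket "liver cell antibody" to the left and "liver cell line" to the right; on real counts the two models can disagree.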

  7. Nakov & Hearst (2005) • Web page hits: proxy for n-gram frequencies • Sample surface features • amino-acid sequence → left • brain stem’s cell → left • brain’s stem cell → right • Majority vote to combine different models • Accuracy 89.34%
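Combining models by majority vote can be sketched as follows. This is an illustrative Python version, not the authors' code: the convention that a model may abstain with `None` and the fallback used on ties are both assumptions.

```python
from collections import Counter
from typing import Iterable, Optional

def majority_vote(votes: Iterable[Optional[str]],
                  default: Optional[str] = "left") -> Optional[str]:
    """Combine per-model predictions ('left', 'right', or None to abstain).

    Returns the label with more votes, or `default` on a tie
    (including the case where every model abstains).
    """
    counts = Counter(v for v in votes if v is not None)
    if counts["left"] > counts["right"]:
        return "left"
    if counts["right"] > counts["left"]:
        return "right"
    return default
```

For example, `majority_vote(["left", "right", "left"])` yields `"left"`, while passing `default=None` makes ties visible to the caller instead of resolving them silently.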

  8. Plan • Overview • Noun compound (NC) bracketing • Problems with Web Counts • Layers of annotation • Applying LQL to NC bracketing • Evaluation

  9. Web Counts: Problems • The Web lacks linguistic annotation • Pr(health|care) = #(“health care”) / #(care) • “health”: returns nouns • “care”: returns both verbs and nouns • can be adjacent by chance • can come from different sentences • Cannot find: • stem cells VERB PREPOSITION brain • protein synthesis’ inhibition • Page hits are inaccurate
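The effect of missing part-of-speech information on the estimate Pr(health|care) = #("health care") / #(care) can be illustrated with a small Python sketch. All counts below are invented; the point is only that a denominator which mixes noun and verb uses of "care" deflates the estimate.

```python
def conditional_prob(bigram_count: int, word_count: int) -> float:
    """Estimate Pr(w1 | w2) as #(w1 w2) / #(w2); 0.0 when w2 was never seen."""
    if word_count == 0:
        return 0.0
    return bigram_count / word_count

# Invented counts for illustration only.
HEALTH_CARE = 400   # occurrences of the bigram "health care"
CARE_AS_NOUN = 800  # occurrences of "care" tagged as a noun
CARE_AS_VERB = 200  # occurrences of "care" tagged as a verb

# A search engine cannot separate the two uses of "care".
web_style = conditional_prob(HEALTH_CARE, CARE_AS_NOUN + CARE_AS_VERB)
# With POS annotation, only the noun uses enter the denominator.
pos_aware = conditional_prob(HEALTH_CARE, CARE_AS_NOUN)
```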

  10. Plan • Overview • Noun compound (NC) bracketing • Problems with Web Counts • Layers of annotation • Applying LQL to NC bracketing • Evaluation

  11. Solution: MEDLINE+LQL • MEDLINE: ~13 million abstracts • We annotated: • 1.4 million abstracts • ~10 million sentences • ~320 million annotations • Layered Query Language: demo at ACL! • http://biotext.berkeley.edu/lql/

  12. The System • Built on top of an RDBMS • Supports layers of annotations over text • hierarchical, overlapping • cannot be represented by a single XML file • Specialized query language • LQL (Layered Query Language)
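One way to picture layered annotations in a relational store is a single table of typed spans, where nesting is just interval containment. The sketch below uses Python's built-in `sqlite3`; the table name, column names, and token-offset convention are hypothetical and are not the schema of the actual system.

```python
import sqlite3

def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory table of annotations (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE annotation (
               id INTEGER PRIMARY KEY,
               layer TEXT, tag_type TEXT,
               start_pos INTEGER, end_pos INTEGER)"""
    )
    rows = [  # half-open token offsets [start_pos, end_pos)
        ("sentence", "S", 0, 5),
        ("shallow_parse", "NP", 0, 3),
        ("pos", "noun", 0, 1),
        ("pos", "noun", 1, 2),
        ("pos", "noun", 2, 3),
        ("pos", "verb", 3, 4),
        ("demo_overlap", "span", 2, 4),  # crosses the NP boundary
    ]
    conn.executemany(
        "INSERT INTO annotation (layer, tag_type, start_pos, end_pos) "
        "VALUES (?, ?, ?, ?)",
        rows,
    )
    return conn

def contained(conn, outer_layer: str, inner_layer: str):
    """Return inner-layer spans that lie inside some outer-layer span."""
    sql = """SELECT inner_a.tag_type, inner_a.start_pos, inner_a.end_pos
             FROM annotation AS outer_a
             JOIN annotation AS inner_a
               ON inner_a.start_pos >= outer_a.start_pos
              AND inner_a.end_pos <= outer_a.end_pos
             WHERE outer_a.layer = ? AND inner_a.layer = ?
             ORDER BY inner_a.start_pos"""
    return conn.execute(sql, (outer_layer, inner_layer)).fetchall()
```

Because every span is an independent row, a span that crosses another layer's boundary (the `demo_overlap` row) is stored as easily as a nested one, which is what a single XML tree cannot express.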

  13. Annotated Example

  14. Plan • Overview • Noun compound (NC) bracketing • Problems with Web Counts • Layers of annotation • Applying LQL to NC bracketing • Evaluation

  15. Noun Compound Extraction (1)
      FROM
        [layer='shallow_parse' && tag_type='NP'
          ^ [layer='pos' && tag_type="noun"]
            [layer='pos' && tag_type="noun"]
            [layer='pos' && tag_type="noun"] $
        ] AS compound
      SELECT compound.content
      • ^ : layers’ beginnings should match • $ : layers’ endings should match

  16. Noun Compound Extraction (2)
      SELECT LOWER(compound.content) AS lc, COUNT(*) AS freq
      FROM
      BEGIN_LQL
        FROM
          [layer='shallow_parse' && tag_type='NP'
            ^ [layer='pos' && tag_type="noun"]
              [layer='pos' && tag_type="noun"]
              [layer='pos' && tag_type="noun"] $
          ] AS compound
        SELECT compound.content
      END_LQL
      GROUP BY lc
      ORDER BY freq DESC
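The effect of this query (noun phrases that consist of exactly three nouns, lowercased, grouped, and sorted by frequency) can be imitated in plain Python. This is only an emulation over in-memory data, assuming each noun phrase arrives as a list of hypothetical `(token, tag)` pairs; it does not run LQL.

```python
from collections import Counter
from typing import List, Tuple

TaggedChunk = List[Tuple[str, str]]  # [(token, pos_tag), ...]

def count_three_noun_compounds(noun_phrases: List[TaggedChunk]):
    """Count lowercased NPs made of exactly three nouns, most frequent first."""
    counts = Counter()
    for chunk in noun_phrases:
        if len(chunk) == 3 and all(tag == "noun" for _, tag in chunk):
            counts[" ".join(tok.lower() for tok, _ in chunk)] += 1
    return counts.most_common()

# Toy input, invented for illustration.
DEMO_NPS = [
    [("Liver", "noun"), ("cell", "noun"), ("line", "noun")],
    [("bone", "noun"), ("marrow", "noun"), ("cells", "noun")],
    [("liver", "noun"), ("cell", "noun"), ("line", "noun")],
    [("the", "DT"), ("liver", "noun"), ("cell", "noun"), ("line", "noun")],
]
```

The last toy phrase is skipped because the determiner means the three nouns do not span the whole noun phrase, mirroring the requirement that both the beginnings and the endings of the layers match.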

  17. Noun Compound Extraction (3)
      SELECT LOWER(compound.content) AS lc, COUNT(*) AS freq
      FROM
      BEGIN_LQL
        FROM
          [layer='shallow_parse' && tag_type='NP'
            ^ ( { ALLOW GAPS }
                ![layer='pos' && tag_type="noun"]
                ( [layer='pos' && tag_type="noun"]
                  [layer='pos' && tag_type="noun"]
                  [layer='pos' && tag_type="noun"] ) $
              ) $
          ] AS compound
        SELECT compound.content
      END_LQL
      GROUP BY lc
      ORDER BY freq DESC
      • ! : layer negation • ( ... ) : artificial range

  18. Finding Bigram Counts
      SELECT COUNT(*) AS freq
      FROM
      BEGIN_LQL
        FROM
          [layer='shallow_parse' && tag_type='NP'
            [layer='pos' && tag_type="noun" && content="immunodeficiency"] AS word1
            [layer='pos' && tag_type="noun" && (content="virus" || content="viruses")]
          ]
        SELECT word1.content
      END_LQL

  19. Paraphrases • Types of paraphrases (Warren, 1978): • Prepositional • immunodeficiency virus in humans → right • Verbal • virus causing human immunodeficiency → left • immunodeficiency virus found in humans → right • Copula • immunodeficiency virus that is human → right

  20. Prepositional Paraphrases
      SELECT LOWER(prep.content) AS lp, COUNT(*) AS freq
      FROM
      BEGIN_LQL
        FROM
          [layer='sentence'
            [layer='pos' && tag_type="noun" && content = "immunodeficiency"]
            [layer='pos' && tag_type="noun" && content IN ("virus","viruses")]
            [layer='pos' && tag_type='IN'] AS prep
            ?[layer='pos' && tag_type='DT' && content IN ("the","a","an")]
            [layer='pos' && tag_type="noun" && content IN ("human","humans")]
          ]
        SELECT prep.content
      END_LQL
      GROUP BY lp
      ORDER BY freq DESC
      • ? : optional layer
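The pattern this query looks for (w2 w3, a preposition, an optional determiner, then w1) can be emulated over POS-tagged sentences in Python. The sketch assumes sentences arrive as lists of hypothetical `(token, tag)` pairs using the same tag names as the query; the function name and the toy sentences are invented for illustration.

```python
from collections import Counter
from typing import List, Set, Tuple

DETERMINERS = {"the", "a", "an"}

def count_prep_paraphrases(tagged_sentences: List[List[Tuple[str, str]]],
                           w1_forms: Set[str], w2: str, w3_forms: Set[str]):
    """Count prepositions p in the pattern: w2 w3 p [determiner] w1."""
    counts = Counter()
    for sent in tagged_sentences:
        toks = [(tok.lower(), tag) for tok, tag in sent]
        for i in range(len(toks) - 3):
            (a, tag_a), (b, tag_b), (p, tag_p) = toks[i], toks[i + 1], toks[i + 2]
            if not (tag_a == "noun" and a == w2
                    and tag_b == "noun" and b in w3_forms
                    and tag_p == "IN"):
                continue
            j = i + 3
            # Skip one optional determiner.
            if j < len(toks) and toks[j][1] == "DT" and toks[j][0] in DETERMINERS:
                j += 1
            if j < len(toks) and toks[j][1] == "noun" and toks[j][0] in w1_forms:
                counts[p] += 1
    return counts.most_common()

# Toy tagged sentences, invented for illustration.
DEMO_SENTENCES = [
    [("immunodeficiency", "noun"), ("virus", "noun"), ("in", "IN"), ("humans", "noun")],
    [("Immunodeficiency", "noun"), ("viruses", "noun"), ("in", "IN"),
     ("the", "DT"), ("human", "noun")],
    [("immunodeficiency", "noun"), ("virus", "noun"), ("of", "IN"), ("humans", "noun")],
    [("immunodeficiency", "noun"), ("virus", "noun"), ("in", "IN"), ("mice", "noun")],
]
```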

  21. Plan • Overview • Noun compound (NC) bracketing • Problems with Web Counts • Layers of annotation • Applying LQL to NC bracketing • Evaluation

  22. Evaluation • obtained 418,678 noun compounds (NCs) • annotated the top 232 NCs (after cleaning) • agreement 88% • kappa .606 • baseline (left): 83.19% • n-grams: Pr, #, χ2 • prepositional paraphrases • for inflections, we used UMLS
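For readers unfamiliar with the kappa statistic reported alongside raw agreement, here is a minimal Python sketch of Cohen's kappa for two annotators. The slide does not say which kappa variant was computed, so the choice of Cohen's kappa is an assumption, and the toy labels are invented.

```python
from collections import Counter
from typing import Sequence

def cohen_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa: (observed - expected) / (1 - expected) agreement."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("need two non-empty label sequences of equal length")
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each annotator's own label distribution.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)

# Toy annotations (L = left, R = right), invented for illustration.
ANNOTATOR_A = list("LLLLRRLLRL")
ANNOTATOR_B = list("LLLRRRLLLL")
```

Because most compounds are left-bracketed, chance agreement is high, which is why a raw agreement near 0.9 can correspond to a much lower kappa.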

  23. Results [chart not preserved; legend: wrong, N/A, correct]

  24. Discussion • Semantics of bone marrow cells • top verbal paraphrases • cells derived from bone marrow (22 instances) • cells isolated from bone marrow (14 instances) • top prepositional paraphrases • cells in bone marrow (456 instances) • cells from bone marrow (108 instances) • Finding hard examples for NC bracketing • w1 w2 w3 such that both w1 w2 and w2 w3 are MeSH terms
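The last bullet suggests a simple filter for hard cases: keep a three-word compound when both of its adjacent bigrams appear in a terminology list. A minimal Python sketch, assuming the terminology is available as a set of lowercased strings; the toy set below is a stand-in and has not been checked against MeSH.

```python
from typing import Iterable, List, Set

def find_hard_compounds(compounds: Iterable[str], terms: Set[str]) -> List[str]:
    """Return three-word compounds whose bigrams w1 w2 and w2 w3 are both terms."""
    hard = []
    for compound in compounds:
        words = compound.lower().split()
        if len(words) != 3:
            continue
        w1, w2, w3 = words
        if f"{w1} {w2}" in terms and f"{w2} {w3}" in terms:
            hard.append(compound)
    return hard

# Toy terminology, invented for illustration (not verified MeSH entries).
TOY_TERMS = {"bone marrow", "marrow cells", "liver cell"}
```

Such compounds are hard because both bracketings are supported by an attested term, so bigram evidence alone cannot separate them.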

  25. The End Thank you!
