Scaling Up BioNLP: Application of a Text Annotation Architecture to Noun Compound Bracketing

Preslav Nakov, Ariel Schwartz, Brian Wolf, Marti Hearst
Computer Science Division and SIMS, University of California, Berkeley
http://biotext.berkeley.edu
Supported by NSF DBI-0317510 and a gift from Genentech

Plan • Overview • Noun compound (NC) bracketing • Problems with Web Counts • Layers of annotation • Applying LQL to NC bracketing • Evaluation

Overview • Motivation: Need to re-use results of NLP processing: • for additional processing • for end applications: data mining etc. • Proposed solution: • Layers of annotations over text • Illustration: • Application to noun compound bracketing

Noun Compound Bracketing (a) [ [ liver cell ] antibody ] (left bracketing) (b) [ liver [ cell line ] ] (right bracketing) • In (a), the antibody targets the liver cell. • In (b), the cell line is derived from the liver.

Related Work • Pustejovsky et al. (1993) • adjacency model: Pr(w1|w2) vs. Pr(w2|w3) • Lauer (1995) • dependency model: Pr(w1|w2) vs. Pr(w1|w3) • Keller & Lapata (2004): • use the Web • unigrams and bigrams • Nakov & Hearst (2005): will be presented at CoNLL! • use the Web, chi-squared • n-grams • paraphrases • surface features
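The adjacency and dependency models can be sketched as two comparisons of word-pair association. This is a minimal illustration, not the authors' implementation: the `PAIR_COUNTS` values are made-up toy numbers, the function names are hypothetical, and a raw pair count stands in for whatever association measure (probability, chi-squared) a real system would use. Following Lauer's formulation, the dependency model here compares the w1-w2 association with the w1-w3 association. Ties default to left, which is an assumption.

```python
from typing import Dict, Tuple

# Hypothetical toy counts of word pairs; real values would come from a corpus.
PAIR_COUNTS: Dict[Tuple[str, str], int] = {
    ("liver", "cell"): 120,
    ("cell", "antibody"): 15,
    ("liver", "antibody"): 4,
    ("cell", "line"): 300,
}


def assoc(a: str, b: str) -> int:
    """Association score for the ordered pair (a, b); here simply its count."""
    return PAIR_COUNTS.get((a, b), 0)


def adjacency_bracketing(w1: str, w2: str, w3: str) -> str:
    """Adjacency model: compare (w1, w2) against (w2, w3)."""
    return "left" if assoc(w1, w2) >= assoc(w2, w3) else "right"


def dependency_bracketing(w1: str, w2: str, w3: str) -> str:
    """Dependency model: compare (w1, w2) against (w1, w3)."""
    return "left" if assoc(w1, w2) >= assoc(w1, w3) else "right"


antibody_adj = adjacency_bracketing("liver", "cell", "antibody")
antibody_dep = dependency_bracketing("liver", "cell", "antibody")
line_adj = adjacency_bracketing("liver", "cell", "line")
```

With these toy counts both models bracket "liver cell antibody" to the left, while the adjacency model brackets "liver cell line" to the right because "cell line" outweighs "liver cell".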

Nakov & Hearst (2005) • Web page hits: proxy for n-gram frequencies • Sample surface features • amino-acid sequence → left • brain stem’s cell → left • brain’s stem cell → right • Majority vote to combine different models • Accuracy 89.34%
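Combining models by majority vote can be sketched as follows. The slide does not say how abstaining models or ties were handled, so the choices here (abstentions are ignored, a tie yields no decision) are assumptions, and `majority_vote` is a hypothetical name.

```python
from collections import Counter
from typing import Iterable, Optional


def majority_vote(predictions: Iterable[Optional[str]]) -> Optional[str]:
    """Return 'left' or 'right' by majority.

    Predictions other than 'left'/'right' (for example None for a model
    that abstains) are ignored. Returns None on a tie or if nobody voted.
    """
    votes = Counter(p for p in predictions if p in ("left", "right"))
    if votes["left"] == votes["right"]:
        return None
    return "left" if votes["left"] > votes["right"] else "right"


decision = majority_vote(["left", "left", "right", None])
```

A caller that needs a decision for every compound could fall back to a default (such as left bracketing) whenever `None` is returned.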

Web Counts: Problems • The Web lacks linguistic annotation • Pr(health|care) = #(“health care”) / #(care) • “health”: returns nouns • “care”: returns both verbs and nouns • can be adjacent by chance • can come from different sentences • Cannot find: • stem cells VERB PREPOSITION brain • protein synthesis’ inhibition • Page hits are inaccurate
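One of the problems listed above, words that are adjacent only because they come from different sentences, can be illustrated with a toy example. The sentences and function names below are invented for illustration; the point is only that counting over unsegmented text inflates the bigram count compared with counting inside sentence boundaries.

```python
import re
from typing import List


def count_bigram_flat(text: str, w1: str, w2: str) -> int:
    """Count w1 immediately followed by w2, ignoring sentence boundaries."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(1 for a, b in zip(tokens, tokens[1:]) if (a, b) == (w1, w2))


def count_bigram_by_sentence(sentences: List[str], w1: str, w2: str) -> int:
    """Count the bigram only when both words fall in the same sentence."""
    return sum(count_bigram_flat(s, w1, w2) for s in sentences)


# Invented toy sentences: "health" ends the first, "Care" starts the second.
sentences = [
    "Rising costs threaten public health.",
    "Care must be taken when interpreting counts.",
    "Access to health care varies by region.",
]

flat_count = count_bigram_flat(" ".join(sentences), "health", "care")
sentence_count = count_bigram_by_sentence(sentences, "health", "care")
```

Here the flat count sees two occurrences of "health care" while the sentence-aware count sees one; note also that "Care" in the second sentence is not part of a compound at all.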

Solution: MEDLINE+LQL • MEDLINE: ~13 million abstracts • We annotated: • 1.4 million abstracts • ~10 million sentences • ~320 million annotations • Layered Query Language: demo at ACL! • http://biotext.berkeley.edu/lql/

The System • Built on top of an RDBMS • Supports layers of annotations over text • hierarchical, overlapping • cannot be represented by a single XML file • Specialized query language • LQL (Layered Query Language)
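To make the idea of annotation layers stored in a relational database concrete, here is a minimal sketch using SQLite. The slides do not give the actual schema, so the `annotation` table, its columns, and the character offsets are all assumptions made for illustration; this is not how LQL is implemented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: one row per annotation, located by character offsets.
conn.execute(
    """
    CREATE TABLE annotation (
        layer     TEXT,
        tag_type  TEXT,
        start_pos INTEGER,
        end_pos   INTEGER,
        content   TEXT
    )
    """
)

# Two layers over the text "liver cell line": a shallow parse and POS tags.
rows = [
    ("shallow_parse", "NP", 0, 15, "liver cell line"),
    ("pos", "noun", 0, 5, "liver"),
    ("pos", "noun", 6, 10, "cell"),
    ("pos", "noun", 11, 15, "line"),
]
conn.executemany("INSERT INTO annotation VALUES (?, ?, ?, ?, ?)", rows)

# Containment across layers becomes a self-join on the offsets.
nouns_in_np = conn.execute(
    """
    SELECT tok.content
    FROM annotation AS np
    JOIN annotation AS tok
      ON tok.start_pos >= np.start_pos AND tok.end_pos <= np.end_pos
    WHERE np.layer = 'shallow_parse' AND np.tag_type = 'NP'
      AND tok.layer = 'pos' AND tok.tag_type = 'noun'
    ORDER BY tok.start_pos
    """
).fetchall()
```

Because each annotation is an independent row, layers may overlap or nest freely, which is the property a single XML tree cannot express.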

Annotated Example

Noun Compound Extraction (1)
FROM
  [layer='shallow_parse' && tag_type='NP'
    ^ [layer='pos' && tag_type="noun"]
      [layer='pos' && tag_type="noun"]
      [layer='pos' && tag_type="noun"] $
  ] AS compound
SELECT compound.content
• ^: layers' beginnings should match
• $: layers' endings should match

Noun Compound Extraction (2)
SELECT LOWER(compound.content) AS lc, COUNT(*) AS freq
FROM
BEGIN_LQL
  FROM
    [layer='shallow_parse' && tag_type='NP'
      ^ [layer='pos' && tag_type="noun"]
        [layer='pos' && tag_type="noun"]
        [layer='pos' && tag_type="noun"] $
    ] AS compound
  SELECT compound.content
END_LQL
GROUP BY lc
ORDER BY freq DESC
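The logic of this query (noun phrases that consist of exactly three nouns, aligned with the phrase at both ends, counted by lowercased content) can be emulated in plain Python. This is only a sketch of what the query asks for, not of how LQL evaluates it; the `Span` type, the offsets, and the toy annotations are assumptions.

```python
from collections import Counter
from typing import List, NamedTuple


class Span(NamedTuple):
    layer: str
    tag_type: str
    start: int
    end: int
    content: str


def three_noun_compounds(spans: List[Span]) -> Counter:
    """Count NPs whose POS tokens are exactly three nouns spanning the NP."""
    freq: Counter = Counter()
    for phrase in spans:
        if phrase.layer != "shallow_parse" or phrase.tag_type != "NP":
            continue
        # POS annotations that lie inside this noun phrase, in text order.
        tokens = sorted(
            (s for s in spans
             if s.layer == "pos" and phrase.start <= s.start and s.end <= phrase.end),
            key=lambda s: s.start,
        )
        if (len(tokens) == 3
                and all(t.tag_type == "noun" for t in tokens)
                and tokens[0].start == phrase.start    # beginnings match (^)
                and tokens[-1].end == phrase.end):     # endings match ($)
            freq[phrase.content.lower()] += 1
    return freq


# Invented toy annotations: two three-noun NPs and one NP with a determiner.
spans = [
    Span("shallow_parse", "NP", 0, 15, "Liver cell line"),
    Span("pos", "noun", 0, 5, "Liver"),
    Span("pos", "noun", 6, 10, "cell"),
    Span("pos", "noun", 11, 15, "line"),
    Span("shallow_parse", "NP", 20, 34, "the liver cell"),
    Span("pos", "DT", 20, 23, "the"),
    Span("pos", "noun", 24, 29, "liver"),
    Span("pos", "noun", 30, 34, "cell"),
    Span("shallow_parse", "NP", 40, 55, "liver cell line"),
    Span("pos", "noun", 40, 45, "liver"),
    Span("pos", "noun", 46, 50, "cell"),
    Span("pos", "noun", 51, 55, "line"),
]

compound_freq = three_noun_compounds(spans)
```

Calling `compound_freq.most_common()` gives the equivalent of the `ORDER BY freq DESC` listing.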

Noun Compound Extraction (3)
SELECT LOWER(compound.content) AS lc, COUNT(*) AS freq
FROM
BEGIN_LQL
  FROM
    [layer='shallow_parse' && tag_type='NP'
      ^ ( { ALLOW GAPS }
          ![layer='pos' && tag_type="noun"]
          ( [layer='pos' && tag_type="noun"]
            [layer='pos' && tag_type="noun"]
            [layer='pos' && tag_type="noun"] ) $
        ) $
    ] AS compound
  SELECT compound.content
END_LQL
GROUP BY lc
ORDER BY freq DESC
• ! before a layer: layer negation
• ( ) grouping: artificial range

Finding Bigram Counts
SELECT COUNT(*) AS freq
FROM
BEGIN_LQL
  FROM
    [layer='shallow_parse' && tag_type='NP'
      [layer='pos' && tag_type="noun" && content="immunodeficiency"] AS word1
      [layer='pos' && tag_type="noun" && (content="virus" || content="viruses")]
    ]
  SELECT word1.content
END_LQL

Paraphrases • Types of paraphrases (Warren, 1978): • Prepositional • immunodeficiency virus in humans → right • Verbal • virus causing human immunodeficiency → left • immunodeficiency virus found in humans → right • Copula • immunodeficiency virus that is human → right

Prepositional Paraphrases
SELECT LOWER(prep.content) lp, COUNT(*) AS freq
FROM
BEGIN_LQL
  FROM
    [layer='sentence'
      [layer='pos' && tag_type="noun" && content = "immunodeficiency"]
      [layer='pos' && tag_type="noun" && content IN ("virus","viruses")]
      [layer='pos' && tag_type='IN'] AS prep
      ?[layer='pos' && tag_type='DT' && content IN ("the","a","an")]
      [layer='pos' && tag_type="noun" && content IN ("human", "humans")]
    ]
  SELECT prep.content
END_LQL
GROUP BY lp
ORDER BY freq DESC
• ? before a layer: optional layer
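The pattern this query looks for (w2 w3, a preposition, an optional determiner, then w1) can be approximated over plain sentences with a regular expression. This is a rough stand-in that assumes no POS layer is available: the short `PREPOSITIONS` list, the function name, and the toy sentences are all hypothetical, whereas the query above relies on the annotated `IN` and `DT` tags.

```python
import re
from collections import Counter
from typing import Iterable, Sequence

# Hypothetical short list; the real query matches any token tagged IN.
PREPOSITIONS = ("in", "of", "from", "for", "on", "with")


def alternation(words: Sequence[str]) -> str:
    """Build a regex alternation from literal word forms."""
    return "|".join(re.escape(w) for w in words)


def prepositional_paraphrases(
    sentences: Iterable[str],
    w1_forms: Sequence[str],
    w2_forms: Sequence[str],
    w3_forms: Sequence[str],
) -> Counter:
    """Count 'w2 w3 PREP (the|a|an)? w1' matches, keyed by preposition."""
    pattern = re.compile(
        rf"\b(?:{alternation(w2_forms)})\s+(?:{alternation(w3_forms)})\s+"
        rf"({alternation(PREPOSITIONS)})\s+(?:(?:the|a|an)\s+)?"
        rf"(?:{alternation(w1_forms)})\b",
        re.IGNORECASE,
    )
    freq: Counter = Counter()
    for sentence in sentences:
        for match in pattern.finditer(sentence):
            freq[match.group(1).lower()] += 1
    return freq


# Invented toy sentences.
toy_sentences = [
    "Immunodeficiency virus in humans is well studied.",
    "The immunodeficiency viruses in the human were sequenced.",
    "An immunodeficiency virus of humans was described.",
    "The virus causes immunodeficiency.",
]

prep_freq = prepositional_paraphrases(
    toy_sentences,
    w1_forms=("human", "humans"),
    w2_forms=("immunodeficiency",),
    w3_forms=("virus", "viruses"),
)
```

Every match keeps w2 and w3 together and moves w1 behind a preposition, so each one counts as evidence for right bracketing.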

Evaluation • obtained 418,678 noun compounds (NCs) • annotated the top 232 NCs (after cleaning) • agreement 88% • kappa .606 • baseline (left): 83.19% • n-grams: Pr, #, χ2 • prepositional paraphrases • for inflections, we used UMLS
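The agreement and kappa figures relate as chance-corrected agreement. The slide does not say which kappa variant was used; the sketch below assumes Cohen's kappa for two annotators, kappa = (p_o - p_e) / (1 - p_e), and the label sequences are invented toy data, not the annotations from the evaluation.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: fraction of items given the same label.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Expected chance agreement from each annotator's label distribution.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Invented toy labels: the annotators disagree on one item out of ten.
annotator_a = ["left"] * 8 + ["right"] * 2
annotator_b = ["left"] * 7 + ["right"] * 3

kappa = cohen_kappa(annotator_a, annotator_b)
```

When one label dominates, as left bracketing does here, chance agreement is high, which is why kappa can be much lower than raw agreement.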

Results (chart; legend: correct, N/A, wrong)

Discussion • Semantics of bone marrow cells • top verbal paraphrases • cells derived from bone marrow (22 instances) • cells isolated from bone marrow (14 instances) • top prepositional paraphrases • cells in bone marrow (456 instances) • cells from bone marrow (108 instances) • Finding hard examples for NC bracketing • w1 w2 w3 such that both w1 w2 and w2 w3 are MeSH terms
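The filter for hard examples can be sketched as a lookup of both embedded bigrams in a term list. The `toy_terms` set below is an invented stand-in and has not been checked against MeSH; a real run would load the actual vocabulary, and `hard_examples` is a hypothetical name.

```python
from typing import Iterable, List, Set, Tuple

Compound = Tuple[str, str, str]


def hard_examples(compounds: Iterable[Compound], terms: Set[str]) -> List[Compound]:
    """Keep compounds where both 'w1 w2' and 'w2 w3' appear in the term list."""
    return [
        (w1, w2, w3)
        for w1, w2, w3 in compounds
        if f"{w1} {w2}" in terms and f"{w2} {w3}" in terms
    ]


# Invented stand-in for a controlled vocabulary; not verified against MeSH.
toy_terms = {"bone marrow", "marrow cells", "liver cell"}

candidates = [
    ("bone", "marrow", "cells"),
    ("liver", "cell", "antibody"),
]

hard = hard_examples(candidates, toy_terms)
```

Such compounds are hard because the term list supports both bracketings equally, so lexicon membership alone cannot decide between them.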

The End Thank you!
