

Marti Hearst

School of Information, UC Berkeley

UCB Neyman Seminar

October 25, 2006

Unambiguous + Unlimited = Unsupervised: Using the Web for Natural Language Processing Problems

This research supported in part by NSF DBI-0317510


Natural Language Processing

  • The ultimate goal: write programs that read and understand stories and conversations.

    • This is too hard! Instead we tackle sub-problems.

  • There have been notable successes lately:

    • Machine translation is vastly improved

    • Speech recognition is decent in limited circumstances

    • Text categorization works with some accuracy


Automatic Help Desk Translation at MS


How can a machine understand these differences?

Get the cat with the gloves.


How can a machine understand these differences?

Get the sock from the cat with the gloves.

Get the glove from the cat with the socks.


How can a machine understand these differences?

  • Decorate the cake with the frosting.

  • Decorate the cake with the kids.

  • Throw out the cake with the frosting.

  • Throw out the cake with the kids.


Why is this difficult?

  • Same syntactic structure, different meanings.

  • Natural language processing algorithms have to deal with the specifics of individual words.

  • Enormous vocabulary sizes.

    • The average English speaker’s vocabulary is around 50,000 words,

    • Many of these can be combined with many others,

    • And they mean different things when they do!


How to tackle this problem?

  • The field was stuck for quite some time.

    • Hand-enter all semantic concepts and relations

  • A new approach started around 1990

    • Get large text collections

    • Compute statistics over the words in those collections

  • There are many different algorithms.


Size Matters

Recent realization: bigger is better than smarter!

Banko and Brill ’01: “Scaling to Very, Very Large Corpora for Natural Language Disambiguation”, ACL


Example Problem

  • Grammar checker example:

    Which word to use?

    principal vs. principle

  • Solution: use well-edited text and look at which words surround each use:

    • I am in my third year as the principal of Anamosa High School.

    • School-principal transfers caused some upset.

    • This is a simple formulation of the quantum mechanical uncertainty principle.

    • Power without principle is barren, but principle without power is futile. (Tony Blair)


Using Very, Very Large Corpora

  • Keep track of which words are the neighbors of each spelling in well-edited text, e.g.:

    • Principal: “high school”

    • Principle: “rule”

  • At grammar-check time, choose the spelling best predicted by the surrounding words.

  • Surprising results:

    • Log-linear improvement even to a billion words!

    • Getting more data is better than fine-tuning algorithms!
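
A minimal sketch of this recipe, assuming a tiny corpus of well-edited text (the training sentences, the two-word context window, and the confusion set below are illustrative toy choices, not the actual Banko & Brill setup):

```python
from collections import Counter

def train_neighbor_counts(sentences, confusion_words):
    """Count the words that appear within +/-2 positions of each confusable spelling."""
    counts = {w: Counter() for w in confusion_words}
    for sent in sentences:
        tokens = sent.lower().split()
        for i, tok in enumerate(tokens):
            if tok in counts:
                window = tokens[max(0, i - 2):i] + tokens[i + 1:i + 3]
                counts[tok].update(window)
    return counts

def choose_spelling(counts, context_words):
    """Pick the spelling whose learned neighbors best match the current context."""
    return max(counts, key=lambda w: sum(counts[w][c] for c in context_words))

# Toy "well-edited" training text (illustrative only)
corpus = ["she is the high school principal",
          "the uncertainty principle is fundamental"]
counts = train_neighbor_counts(corpus, ["principal", "principle"])
print(choose_spelling(counts, ["school", "transfers"]))     # -> principal
print(choose_spelling(counts, ["uncertainty", "quantum"]))  # -> principle
```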


The Effects of LARGE Datasets

  • From Banko & Brill ‘01


How to Extend this Idea?

  • This is an exciting result …

  • BUT relies on having huge amounts of text that has been appropriately annotated!


How to Avoid Manual Labeling?

  • “Web as a baseline” (Lapata & Keller 04,05)

  • Main idea: apply web-determined counts to every problem imaginable.

    • Example: for t in {principal, principle}

    • Compute f(w-1, t, w+1)

    • The largest count wins
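
A sketch of that recipe, with f(w-1, t, w+1) approximated by page-hit counts (hit_count and the numbers below are hypothetical stand-ins for a real search-engine API):

```python
def pick_by_web_counts(hit_count, prev_word, candidates, next_word):
    """Web-as-baseline sketch: for each candidate t, look up the count of the
    trigram "prev_word t next_word"; the candidate with the largest count wins.

    hit_count is a stand-in for a search-engine API call returning page hits."""
    return max(candidates,
               key=lambda t: hit_count(f'"{prev_word} {t} {next_word}"'))

# Usage with a fake count function (real counts would come from a search engine)
fake_counts = {'"school principal transfers"': 120, '"school principle transfers"': 2}
choice = pick_by_web_counts(lambda q: fake_counts.get(q, 0),
                            "school", ["principal", "principle"], "transfers")
print(choice)  # -> principal
```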


Web as a Baseline

  • Works very well in some cases

    • machine translation candidate selection

    • article generation

    • noun compound interpretation

    • noun compound bracketing

    • adjective ordering

  • But lacking in others

    • spelling correction

    • countability detection

    • prepositional phrase attachment

  • How to push this idea further?

Significantly better than the best supervised algorithm.

Not significantly different from the best supervised.


Using Unambiguous Cases

  • The trick: look for unambiguous cases to start

  • Use these to improve the results beyond what co-occurrence statistics indicate.

  • An Early Example:

    • Hindle and Rooth, “Structural Ambiguity and Lexical Relations”, ACL ’90, Comp Ling’93

    • Problem: Prepositional Phrase attachment

      • I eat/v spaghetti/n1 with/p a fork/n2.

      • I eat/v spaghetti/n1 with/p sauce/n2.

    • Question: does n2 attach to v or to n1?


Using Unambiguous Cases

  • How to do this with unlabeled data?

  • First try:

    • Parse some text into phrase structure

    • Then compute certain co-occurrences

      f(v, n1, p), f(n1, p), f(v, n1)

    • Problem: results not accurate enough

  • The trick: look for unambiguous cases:

    • Spaghetti with sauce is delicious. (pre-verbal)

    • I eat with a fork. (no direct object)

  • Use these to improve the results beyond what co-occurrence statistics indicate.


Unambiguous + Unlimited = Unsupervised

  • Apply the Unambiguous Case Idea to the Very, Very Large Corpora idea

    • The potential of these approaches is not fully realized

  • Our work (with Preslav Nakov):

    • Structural Ambiguity Decisions

      • PP-attachment

      • Noun compound bracketing

      • Coordination grouping

    • Semantic Relation Acquisition

      • Hypernym (ISA) relations

      • Verbal relations between nouns

        • SAT Analogy problems


Applying U + U = U to Structural Ambiguity

  • We introduce the use of (nearly) unambiguous features:

    • Surface features

    • Paraphrases

  • Combined with n-gram counts from very, very large corpora

  • Achieve state-of-the-art results without labeled examples.


Noun Compound Bracketing

(a) [ [ liver cell ] antibody ] (left bracketing)

(b) [ liver [ cell line ] ] (right bracketing)

In (a), the antibody targets the liver cell.

In (b), the cell line is derived from the liver.


Dependency Model

  • right bracketing: [ w1 [ w2 w3 ] ]

    • w2 w3 is a compound (modified by w1)

      • home health care

    • w1 and w2 independently modify w3

      • adult male rat

  • left bracketing: [ [ w1 w2 ] w3 ]

    • only 1 modificational choice possible

      • law enforcement officer



Our U + U + U Algorithm

  • Compute bigram estimates

  • Compute estimates from surface features

  • Compute estimates from paraphrases

  • Combine these scores with a voting algorithm to choose left or right bracketing.

  • We use the same general approach for two other structural ambiguity problems.


Computing Bigram Statistics

  • Dependency Model, Frequencies

    • Compare #(w1,w2) to #(w1,w3)

  • Dependency model, Probabilities

    • Pr(left) = Pr(w1→w2 | w2) × Pr(w2→w3 | w3)

    • Pr(right) = Pr(w1→w3 | w3) × Pr(w2→w3 | w3)

  • So we compare Pr(w1→w2 | w2) to Pr(w1→w3 | w3)



Using ngrams to estimate probabilities

  • Using page hits as a proxy for n-gram counts

    • Pr(w1w2|w2) = #(w1,w2) / #(w2)

      • #(w2) word frequency; query for “w2”

      • #(w1,w2) bigram frequency; query for “w1 w2”

    • smoothed by 0.5

  • Use 2 to determine if w1 is associated with w2 (thus indicating left bracketing), and same for w1 with w3


Our U + U + U Algorithm

  • Compute bigram estimates

  • Compute estimates from surface features

  • Compute estimates from paraphrases

  • Combine these scores with a voting algorithm to choose left or right bracketing.


Web-derived Surface Features

  • Authors often disambiguate noun compounds using surface markers, e.g.:

    • amino-acid sequence → left

    • brain stem’s cell → left

    • brain’s stem cell → right

  • The enormous size of the Web makes these frequent enough to be useful.


Web-derived Surface Features: Dash (hyphen)

  • Left dash

    • cell-cycle analysis → left

  • Right dash

    • donor T-cell → right

  • Double dash

    • T-cell-depletion → unusable…


Web-derived Surface Features: Possessive Marker

  • Attached to the first word

    • brain’s stem cell → right

  • Attached to the second word

    • brain stem’s cell → left

  • Combined features

    • brain’s stem-cell → right


Web-derived Surface Features: Capitalization

  • anycase – lowercase – uppercase

    • Plasmodium vivax Malaria → left

    • plasmodium vivax Malaria → left

  • lowercase – uppercase – anycase

    • brain Stem cell → right

    • brain Stem Cell → right

  • Disable this on:

    • Roman digits

    • Single-letter words: e.g. vitamin D deficiency
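
A toy sketch of turning this capitalization cue into a single left/right vote, with the exclusions from the last bullets (the exact matching rules in the original system may differ):

```python
def capitalization_vote(w1, w2, w3):
    """Return 'left', 'right', or None from the capitalization pattern of w1 w2 w3."""
    def excluded(w):
        # Skip Roman digits and single-letter words (e.g. "vitamin D deficiency")
        return len(w) == 1 or set(w) <= set("IVXLCDM")
    if any(excluded(w) for w in (w1, w2, w3)):
        return None
    if w2.islower() and w3[0].isupper():   # anycase - lowercase - Uppercase
        return "left"
    if w1.islower() and w2[0].isupper():   # lowercase - Uppercase - anycase
        return "right"
    return None

print(capitalization_vote("plasmodium", "vivax", "Malaria"))  # left
print(capitalization_vote("brain", "Stem", "cell"))           # right
print(capitalization_vote("vitamin", "D", "deficiency"))      # None (excluded)
```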


Web-derived Surface Features: Embedded Slash

  • Left embedded slash

    • leukemia/lymphoma cell → right


Web-derived Surface Features: Parentheses

  • Single-word

    • growth factor (beta) → left

    • (brain) stem cell → right

  • Two-word

    • (growth factor) beta → left

    • brain (stem cell) → right


Web-derived Surface Features: Comma, dot, semi-colon

  • Following the first word

    • home. health care → right

    • adult, male rat → right

  • Following the second word

    • health care, provider → left

    • lung cancer: patients → left


Web-derived Surface Features: Dash to External Word

  • External word to the left

    • mouse-brain stem cell → right

  • External word to the right

    • tumor necrosis factor-alpha → left


Other Web-derived Features: Abbreviation

  • After the second word

    • tumor necrosis (TN) factor → left

  • After the third word

    • tumor necrosis factor (NF) → right

  • We query for, e.g., “tumor necrosis tn factor”

  • Problems:

    • Roman digits: IV, VI

    • States: CA

    • Short words: me


Other Web-derived Features: Concatenation

  • Consider health care reform

    • healthcare : 79,500,000

    • carereform : 269

    • healthreform: 812

  • Adjacency model

    • healthcare vs. carereform

  • Dependency model

    • healthcare vs. healthreform

  • Triples

    • “healthcarereform” vs. “health carereform”
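
A sketch of how these concatenation counts could be compared under the adjacency and dependency views (hit_counts is a hypothetical lookup seeded with the numbers quoted above):

```python
# Counts quoted on the slide (normally obtained from search-engine queries)
hit_counts = {"healthcare": 79_500_000, "carereform": 269, "healthreform": 812}

def concatenation_vote(w1, w2, w3, model="dependency"):
    """Evidence for left bracketing = count of w1+w2 concatenated;
    the rival is w2+w3 (adjacency view) or w1+w3 (dependency view)."""
    left_evidence = hit_counts.get(w1 + w2, 0)
    rival = w2 + w3 if model == "adjacency" else w1 + w3
    right_evidence = hit_counts.get(rival, 0)
    return "left" if left_evidence > right_evidence else "right"

print(concatenation_vote("health", "care", "reform", model="adjacency"))   # left
print(concatenation_vote("health", "care", "reform", model="dependency"))  # left
```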


Other Web-derived Features: Reorder

  • Reorders for “health care reform”

    • “care reform health” → right

    • “reform health care” → left


Other Web-derived Features: Internal Inflection Variability

  • Vary inflection of second word

    • tyrosine kinase activation

    • tyrosine kinases activation


Other Web-derived Features: Switch The First Two Words

  • Predict right if we can reorder

    • adult male rat as

    • male adult rat


Our U + U + U Algorithm

  • Compute bigram estimates

  • Compute estimates from surface features

  • Compute estimates from paraphrases

  • Combine these scores with a voting algorithm to choose left or right bracketing.


Paraphrases

  • The semantics of a noun compound is often made overt by a paraphrase (Warren, 1978)

    • Prepositional

      • stem cells in the brain → right

      • cells from the brain stem → left

    • Verbal

      • virus causing human immunodeficiency → left

    • Copula

      • office building that is a skyscraper → right


Paraphrases

  • prepositional paraphrases:

    • We use: ~150 prepositions

  • verbal paraphrases:

    • We use: associated with, caused by, contained in, derived from, focusing on, found in, involved in, located at/in, made of, performed by, preventing, related to and used by/in/for.

  • copula paraphrases:

    • We use: is/was and that/which/who

  • optional elements:

    • articles: a, an, the

    • quantifiers: some, every, etc.

    • pronouns: this, these, etc.
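
A sketch of how such paraphrase queries might be generated for a three-word compound (the phrase templates below use only a handful of the prepositions and copula forms listed above, and are illustrative rather than the full inventory):

```python
def paraphrase_queries(w1, w2, w3):
    """Generate a few left- vs right-predicting paraphrase queries for "w1 w2 w3".

    Left bracketing [[w1 w2] w3]: w3 paraphrased as the head of "w1 w2".
    Right bracketing [w1 [w2 w3]]: "w2 w3" paraphrased as the head of w1.
    """
    preps = ["of", "in", "from", "for"]          # a tiny subset of the ~150 prepositions
    copulas = ["that is", "which is"]
    left = [f'"{w3} {p} {w1} {w2}"' for p in preps]        # e.g. officer of law enforcement
    left += [f'"{w3} {c} a {w1} {w2}"' for c in copulas]
    right = [f'"{w2} {w3} {p} the {w1}"' for p in preps]   # e.g. stem cells in the brain
    right += [f'"{w2} {w3} {c} a {w1}"' for c in copulas]
    return {"left": left, "right": right}

queries = paraphrase_queries("law", "enforcement", "officer")
print(queries["left"][0])   # "officer of law enforcement"
```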


Our U + U + U Algorithm

  • Compute bigram estimates

  • Compute estimates from surface features

  • Compute estimates from paraphrases

  • Combine these scores with a voting algorithm to choose left or right bracketing.
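
A sketch of the combination step as a simple majority vote over the individual estimators (the vote list and the tie-breaking default are illustrative; the actual system combines the signals in a more refined way):

```python
def combine_votes(votes):
    """Majority vote over 'left' / 'right' / None decisions from the individual
    estimators (bigram statistics, surface features, paraphrases)."""
    valid = [v for v in votes if v in ("left", "right")]
    lefts, rights = valid.count("left"), valid.count("right")
    if lefts == rights:
        return "left"   # tie-break toward left bracketing, the more common case overall
    return "left" if lefts > rights else "right"

votes = ["left",    # bigram (dependency) estimate
         "left",    # dash / possessive surface features
         None,      # paraphrases found nothing
         "right"]   # concatenation feature
print(combine_votes(votes))  # -> left
```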


Evaluation: Datasets

  • Lauer Set

    • 244 noun compounds (NCs)

      • from Grolier’s encyclopedia

      • inter-annotator agreement: 81.5%

  • Biomedical Set

    • 430 NCs

      • from MEDLINE

      • inter-annotator agreement: 88% (κ = 0.606)


Co-occurrence Statistics

  • Lauer set

  • Bio set


Paraphrase and Surface Features Performance

  • Lauer Set

  • Biomedical Set


Individual Surface Features Performance: Bio


Individual Surface Features Performance: Bio


Results: Lauer


Results: Comparing with Others


Results: Bio


Results for Noun Compound Bracketing

  • Introduced search engine statistics that go beyond the n-gram (applicable to other tasks)

    • surface features

    • paraphrases

  • Obtained new state-of-the-art results on NC bracketing

    • more robust than Lauer (1995)

    • more accurate than Keller & Lapata (2004)


Prepositional Phrase Attachment

Problem:

(a) Peter spent millions of dollars. (noun attach)

(b) Peter spent time with his family. (verb attach)

Which attachment for quadruple:

(v, n1, p, n2)

Results:

Much simpler than other algorithms

As good as or better than best unsupervised, and better than some supervised approaches


Noun Phrase Coordination

  • (Modified) real sentence:

    • The Department of Chronic Diseases and Health Promotion leads and strengthens global efforts to prevent and control chronic diseases or disabilities and to promote health and quality of life.


NC coordination: ellipsis

  • Ellipsis

    • car and truck production

    • means car production and truck production

  • No ellipsis

    • president and chief executive

  • All-way coordination

    • Securities and Exchange Commission


Results: 428 examples from the Penn Treebank


Semantic Relation Detection

  • Goal: automatically augment a lexical database

  • Many potential relation types:

    • ISA (hypernymy/hyponymy)

    • Part-Of (meronymy)

  • Idea: find unambiguous contexts which (nearly) always indicate the relation of interest


Lexico-Syntactic Patterns


Lexico-Syntactic Patterns


Adding a New Relation


Semantic Relation Detection

  • Lexico-syntactic Patterns:

    • Should occur frequently in text

    • Should (nearly) always suggest the relation of interest

    • Should be recognizable with little pre-encoded knowledge.

  • These patterns have been used extensively by other researchers.


Semantic Relation Detection

  • What relationship holds between two nouns?

    • olive oil – oil comes from olives

    • machine oil – oil used on machines

  • Assigning the meaning relations between these terms has been seen as a very difficult problem

  • Our solution:

    • Use clever queries against the web to figure out the relations.


Queries for Semantic Relations

  • Convert the noun-noun compound into a query of the form:

  • noun2 that * noun1

  • “oil that * olive(s)”

  • This returns search result snippets containing interesting verbs.

    • In this case:

      • Come from

      • Be obtained from

      • Be extracted from

      • Made from
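
A sketch of the query construction and verb harvesting step, assuming a source of result snippets for each query (search snippets are passed in; the part-of-speech filtering in the real system is more elaborate than this regex):

```python
import re
from collections import Counter

def relation_queries(noun1, noun2):
    """Build "noun2 that * noun1" style queries, allowing a plural noun1."""
    return [f'"{noun2} that * {noun1}"', f'"{noun2} that * {noun1}s"']

def harvest_verbs(snippets, noun2):
    """Very rough extraction: grab the word or two right after "noun2 that"."""
    verbs = Counter()
    for snip in snippets:
        for m in re.finditer(rf"{noun2} that (\w+(?: \w+)?)", snip.lower()):
            verbs[m.group(1)] += 1
    return verbs

print(relation_queries("olive", "oil"))
# ['"oil that * olive"', '"oil that * olives"']
snippets = ["... oil that is extracted from olives ...",
            "... oil that comes from the olive tree ..."]
print(harvest_verbs(snippets, "oil").most_common())
```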


Uncovering Semantic Relations

  • More examples:

    • Migraine drug -> treat, be used for, reduce, prevent

    • Wrinkle drug -> treat, be used for, reduce, smooth

    • Printer tray -> hold, come with, be folded, fit under, be inserted into

    • Student protest -> be led by, be sponsored by, pit, be, be organized by


Application: SAT Analogy Problems


Tackling the SAT Analogy Problem

  • First, issue queries to find the relations (features) that hold between each word pair

  • Compare the features for each answer pair to those of the question pair.

    • Weight the features with term counts and document counts

    • Compare the weighted feature sets using the Dice coefficient
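
A sketch of comparing two word pairs by their weighted feature overlap with the Dice coefficient (the weighting by term and document counts is collapsed into a single weight per feature here, and the feature sets are invented for illustration):

```python
def dice(features_a, features_b):
    """Dice coefficient over weighted feature sets:
    2 * (shared weight) / (total weight of A + total weight of B)."""
    shared = set(features_a) & set(features_b)
    overlap = sum(min(features_a[f], features_b[f]) for f in shared)
    total = sum(features_a.values()) + sum(features_b.values())
    return 2 * overlap / total if total else 0.0

# Hypothetical weighted features for the question pair and two candidate pairs
question = {"include": 5, "consist of": 3, "of": 8}
candidates = {"committee:member": {"include": 4, "of": 6, "and": 2},
              "car:engine":       {"be powered by": 3, "of": 2}}
best = max(candidates, key=lambda c: dice(question, candidates[c]))
print(best)  # -> committee:member
```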


Queries for SAT Analogy Problem


Extract Features from Retrieved Text

  • Verb

    • The committee includes many members.

    • This is a committee, which includes many members.

    • This is a committee, including many members.

  • Verb + Preposition

    • The committee consists of many members.

  • Preposition

    • He is a member of the committee.

  • Coordinating Conjunction

    • the committee and its members


Most Frequent Features for “committee member”


SAT Results: Nouns Only


Conclusions

  • The enormous size of the web opens new opportunities for text analysis

    • There are many words, but they are more likely to appear together in a huge dataset

    • This allows us to do word-specific analysis

  • To counter the labeled-data roadblock, we start with unambiguous features that we can find naturally.

    • We’ve applied this to structural and semantic language problems.

    • These are stepping stones towards sophisticated language understanding.


http://biotext.berkeley.edu

Supported in part by NSF DBI-0317510

Thank you!


Using n-grams to make predictions

  • Say trying to distinguish:

    [home health] care

    home [health care]

  • Main idea: compare these co-occurrence probabilities

    • “home health” vs

    • “health care”


Using n-grams to make predictions

  • Use search-engine page hits as a proxy for n-gram counts

    • compare Pr(w1→w2 | w2) to Pr(w1→w3 | w3)

    • Pr(w1→w2 | w2) = #(w1, w2) / #(w2)

      • #(w2): word frequency; query for “w2”

      • #(w1, w2): bigram frequency; query for “w1 w2”


Probabilities: Why? (1)

  • Why should we use:

    • (a) Pr(w1w2|w2), rather than

    • (b) Pr(w2w1|w1)?

  • Keller&Lapata (2004) calculate:

    • AltaVista queries:

      • (a): 70.49%

      • (b): 68.85%

    • British National Corpus:

      • (a): 63.11%

      • (b): 65.57%


Probabilities: Why? (2)

  • Why should we use:

    • (a) Pr(w1w2|w2), rather than

    • (b) Pr(w2w1|w1)?

  • Maybe to introduce a bracketing prior.

    • Just like Lauer (1995) did.

  • But otherwise, no reason to prefer either one.

    • Do we need probabilities? (association is OK)

    • Do we need a directed model? (symmetry is OK)


Adjacency & Dependency (2)

  • right bracketing: [ w1 [ w2 w3 ] ]

    • w2 w3 is a compound (modified by w1)

    • w1 and w2 independently modify w3

  • adjacency model

    • Is w2 w3 a compound?

    • (vs. w1 w2 being a compound)

  • dependency model

    • Does w1 modify w3?

    • (vs. w1 modifying w2)



Paraphrases: pattern (1)

  • v n1 p n2 → v n2 n1 (noun)

  • Can we turn “n1 p n2” into a noun compound “n2 n1”?

    • meet/v demands/n1 from/p customers/n2 →

    • meet/v the customer/n2 demands/n1

  • Problem: ditransitive verbs like give

    • gave/v an apple/n1 to/p him/n2 →

    • gave/v him/n2 an apple/n1

  • Solution:

    • no determiner before n1

    • determiner before n2 is required

    • the preposition cannot be to
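
A sketch of how pattern (1) could be instantiated as web queries under the constraints above (the query templates and the fixed determiner list are illustrative placeholders, and singular/plural normalization of n2 is omitted):

```python
def pattern1_queries(v, n1, p, n2):
    """Sketch: turn "v n1 p n2" into noun-paraphrase queries "v DET n2 n1".

    Constraints from the slide: no determiner before n1 in the original,
    a determiner before n2 is required in the paraphrase, and the
    preposition must not be "to" (to avoid ditransitives like "give")."""
    if p == "to":
        return None
    return [f'"{v} {det} {n2} {n1}"' for det in ("the", "a", "an")]

print(pattern1_queries("meet", "demands", "from", "customers"))
# ['"meet the customers demands"', '"meet a customers demands"', '"meet an customers demands"']
print(pattern1_queries("gave", "apple", "to", "him"))  # None (ditransitive trap)
```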


Paraphrases: pattern (2)

  • v n1 p n2 → v p n2 n1 (verb)

  • If “p n2” is an indirect object of v, then it could be switched with the direct object n1.

    • had/v a program/n1 in/p place/n2 →

    • had/v in/p place/n2 a program/n1

A determiner before n1 is required to prevent “n2 n1” from forming a noun compound.


Paraphrases: pattern (3)

  • v n1 p n2 → p n2 * v n1 (verb)

  • “*” indicates a wildcard position (up to three intervening words are allowed)

  • Looks for appositions, where the PP has moved in front of the verb, e.g.

    • I gave/v an apple/n1 to/p him/n2 →

    • to/p him/n2 I gave/v an apple/n1


Paraphrases: pattern (4)

  • v n1 p n2 → n1 p n2 v (noun)

  • Looks for appositions, where “n1 p n2” has moved in front of v

    • shaken/v confidence/n1 in/p markets/n2 →

    • confidence/n1 in/p markets/n2 shaken/v


Paraphrases: pattern (5)

  • v n1 p n2 → v PRONOUN p n2 (verb)

  • n1 is a pronoun → verb attachment (Hindle & Rooth, 93)

  • Pattern (5) substitutes n1 with a dative pronoun (him or her), e.g.

    • put/v a client/n1 at/p odds/n2 →

    • put/v him at/p odds/n2


Paraphrases: pattern (6)

  • v n1 p n2 → BE n1 p n2 (noun)

  • BE is typically used with a noun attachment

  • Pattern (6) substitutes v with a form of to be (is or are), e.g.

    • eat/v spaghetti/n1 with/p sauce/n2 →

    • is spaghetti/n1 with/p sauce/n2

