

Marti Hearst

School of Information, UC Berkeley

UCB Neyman Seminar

October 25, 2006

Unambiguous + Unlimited = Unsupervised: Using the Web for Natural Language Processing Problems

This research supported in part by NSF DBI-0317510


Natural Language Processing

  • The ultimate goal: write programs that read and understand stories and conversations.

    • This is too hard! Instead we tackle sub-problems.

  • There have been notable successes lately:

    • Machine translation is vastly improved

    • Speech recognition is decent in limited circumstances

    • Text categorization works with some accuracy


Automatic Help Desk Translation at MS


How can a machine understand these differences?

Get the cat with the gloves.


How can a machine understand these differences?

Get the sock from the cat with the gloves.

Get the glove from the cat with the socks.


How can a machine understand these differences?

  • Decorate the cake with the frosting.

  • Decorate the cake with the kids.

  • Throw out the cake with the frosting.

  • Throw out the cake with the kids.


Why is this difficult?

  • Same syntactic structure, different meanings.

  • Natural language processing algorithms have to deal with the specifics of individual words.

  • Enormous vocabulary sizes.

    • The average English speaker’s vocabulary is around 50,000 words,

    • Many of these can be combined with many others,

    • And they mean different things when they do!


How to tackle this problem?

  • The field was stuck for quite some time.

    • Hand-enter all semantic concepts and relations

  • A new approach started around 1990

    • Get large text collections

    • Compute statistics over the words in those collections

  • There are many different algorithms.


Size Matters

Recent realization: bigger is better than smarter!

Banko and Brill ’01: “Scaling to Very, Very Large Corpora for Natural Language Disambiguation”, ACL


Example Problem

  • Grammar checker example:

    Which word to use?

    principal vs. principle

  • Solution: use well-edited text and look at which words surround each use:

    • I am in my third year as the principal of Anamosa High School.

    • School-principal transfers caused some upset.

    • This is a simple formulation of the quantum mechanical uncertainty principle.

    • Power without principle is barren, but principle without power is futile. (Tony Blair)


Using Very, Very Large Corpora

  • Keep track of which words are the neighbors of each spelling in well-edited text, e.g.:

    • Principal: “high school”

    • Principle: “rule”

  • At grammar-check time, choose the spelling best predicted by the surrounding words.

  • Surprising results:

    • Log-linear improvement even to a billion words!

    • Getting more data is better than fine-tuning algorithms!
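
A minimal sketch of this recipe, assuming a tiny corpus of well-edited text (the training sentences, the two-word context window, and the confusion set below are illustrative toy choices, not the actual Banko & Brill setup):

```python
from collections import Counter

def train_neighbor_counts(sentences, confusion_words):
    """Count the words that appear within +/-2 positions of each confusable spelling."""
    counts = {w: Counter() for w in confusion_words}
    for sent in sentences:
        tokens = sent.lower().split()
        for i, tok in enumerate(tokens):
            if tok in counts:
                window = tokens[max(0, i - 2):i] + tokens[i + 1:i + 3]
                counts[tok].update(window)
    return counts

def choose_spelling(counts, context_words):
    """Pick the spelling whose learned neighbors best match the current context."""
    return max(counts, key=lambda w: sum(counts[w][c] for c in context_words))

# Toy "well-edited" training text (illustrative only)
corpus = ["she is the high school principal",
          "the uncertainty principle is fundamental"]
counts = train_neighbor_counts(corpus, ["principal", "principle"])
print(choose_spelling(counts, ["school", "transfers"]))     # -> principal
print(choose_spelling(counts, ["uncertainty", "quantum"]))  # -> principle
```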


The Effects of LARGE Datasets

  • From Banko & Brill ‘01


How to Extend this Idea?

  • This is an exciting result …

  • BUT relies on having huge amounts of text that has been appropriately annotated!


How to Avoid Manual Labeling?

  • “Web as a baseline” (Lapata & Keller 04,05)

  • Main idea: apply web-determined counts to every problem imaginable.

    • Example: for t in {principal, principle}

    • Compute f(w-1, t, w+1)

    • The largest count wins
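
A sketch of that recipe, with f(w-1, t, w+1) approximated by page-hit counts (hit_count and the numbers below are hypothetical stand-ins for a real search-engine API):

```python
def pick_by_web_counts(hit_count, prev_word, candidates, next_word):
    """Web-as-baseline sketch: for each candidate t, look up the count of the
    trigram "prev_word t next_word"; the candidate with the largest count wins.

    hit_count is a stand-in for a search-engine API call returning page hits."""
    return max(candidates,
               key=lambda t: hit_count(f'"{prev_word} {t} {next_word}"'))

# Usage with a fake count function (real counts would come from a search engine)
fake_counts = {'"school principal transfers"': 120, '"school principle transfers"': 2}
choice = pick_by_web_counts(lambda q: fake_counts.get(q, 0),
                            "school", ["principal", "principle"], "transfers")
print(choice)  # -> principal
```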


Web as a Baseline

  • Works very well in some cases

    • machine translation candidate selection

    • article generation

    • noun compound interpretation

    • noun compound bracketing

    • adjective ordering

  • But lacking in others

    • spelling correction

    • countability detection

    • prepositional phrase attachment

  • How to push this idea further?

Significantly better than the best supervised algorithm.

Not significantly different from the best supervised.


Using Unambiguous Cases

  • The trick: look for unambiguous cases to start

  • Use these to improve the results beyond what co-occurrence statistics indicate.

  • An Early Example:

    • Hindle and Rooth, “Structural Ambiguity and Lexical Relations”, ACL ’90, Comp Ling’93

    • Problem: Prepositional Phrase attachment

      • I eat/v spaghetti/n1 with/p a fork/n2.

      • I eat/v spaghetti/n1 with/p sauce/n2.

    • Question: does n2 attach to v or to n1?


Using Unambiguous Cases

  • How to do this with unlabeled data?

  • First try:

    • Parse some text into phrase structure

    • Then compute certain co-occurrences

      f(v, n1, p), f(n1, p), f(v, n1)

    • Problem: results not accurate enough

  • The trick: look for unambiguous cases:

    • Spaghetti with sauce is delicious. (pre-verbal)

    • I eat with a fork. (no direct object)

  • Use these to improve the results beyond what co-occurrence statistics indicate.


Unambiguous + Unlimited = Unsupervised

  • Apply the Unambiguous Case Idea to the Very, Very Large Corpora idea

    • The potential of these approaches is not fully realized

  • Our work (with Preslav Nakov):

    • Structural Ambiguity Decisions

      • PP-attachment

      • Noun compound bracketing

      • Coordination grouping

    • Semantic Relation Acquisition

      • Hypernym (ISA) relations

      • Verbal relations between nouns

        • SAT Analogy problems


Applying U + U = U to Structural Ambiguity

  • We introduce the use of (nearly) unambiguous features:

    • Surface features

    • Paraphrases

  • Combined with n-gram counts from very, very large corpora

  • Achieve state-of-the-art results without labeled examples.


Noun Compound Bracketing

(a) [ [ liver cell ] antibody ] (left bracketing)

(b) [ liver [ cell line ] ] (right bracketing)

In (a), the antibody targets the liver cell.

In (b), the cell line is derived from the liver.


Dependency Model

  • right bracketing: [ w1 [ w2 w3 ] ]

    • w2 w3 is a compound (modified by w1)

      • home health care

    • w1 and w2 independently modify w3

      • adult male rat

  • left bracketing: [ [ w1 w2 ] w3 ]

    • only 1 modificational choice possible

      • law enforcement officer



Our U + U + U Algorithm

  • Compute bigram estimates

  • Compute estimates from surface features

  • Compute estimates from paraphrases

  • Combine these scores with a voting algorithm to choose left or right bracketing.

  • We use the same general approach for two other structural ambiguity problems.


Computing Bigram Statistics

  • Dependency Model, Frequencies

    • Compare #(w1,w2) to #(w1,w3)

  • Dependency model, Probabilities

    • Pr(left) = Pr(w1→w2 | w2) × Pr(w2→w3 | w3)

    • Pr(right) = Pr(w1→w3 | w3) × Pr(w2→w3 | w3)

  • So we compare Pr(w1→w2 | w2) to Pr(w1→w3 | w3)



Using ngrams to estimate probabilities

  • Using page hits as a proxy for n-gram counts

    • Pr(w1w2|w2) = #(w1,w2) / #(w2)

      • #(w2) word frequency; query for “w2”

      • #(w1,w2) bigram frequency; query for “w1 w2”

    • smoothed by 0.5

  • Use 2 to determine if w1 is associated with w2 (thus indicating left bracketing), and same for w1 with w3


Our U + U + U Algorithm

  • Compute bigram estimates

  • Compute estimates from surface features

  • Compute estimates from paraphrases

  • Combine these scores with a voting algorithm to choose left or right bracketing.


Web-derived Surface Features

  • Authors often disambiguate noun compounds using surface markers, e.g.:

    • amino-acid sequence → left

    • brain stem’s cell → left

    • brain’s stem cell → right

  • The enormous size of the Web makes these frequent enough to be useful.


Web-derived Surface Features: Dash (hyphen)

  • Left dash

    • cell-cycle analysis → left

  • Right dash

    • donor T-cell → right

  • Double dash

    • T-cell-depletion → unusable…


Web-derived Surface Features: Possessive Marker

  • Attached to the first word

    • brain’s stem cell → right

  • Attached to the second word

    • brain stem’s cell → left

  • Combined features

    • brain’s stem-cell → right


Web-derived Surface Features: Capitalization

  • anycase – lowercase – uppercase

    • Plasmodium vivax Malaria → left

    • plasmodium vivax Malaria → left

  • lowercase – uppercase – anycase

    • brain Stem cell → right

    • brain Stem Cell → right

  • Disable this on:

    • Roman digits

    • Single-letter words: e.g. vitamin D deficiency
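
A toy sketch of turning this capitalization cue into a single left/right vote, with the exclusions from the last bullets (the exact matching rules in the original system may differ):

```python
def capitalization_vote(w1, w2, w3):
    """Return 'left', 'right', or None from the capitalization pattern of w1 w2 w3."""
    def excluded(w):
        # Skip Roman digits and single-letter words (e.g. "vitamin D deficiency")
        return len(w) == 1 or set(w) <= set("IVXLCDM")
    if any(excluded(w) for w in (w1, w2, w3)):
        return None
    if w2.islower() and w3[0].isupper():   # anycase - lowercase - Uppercase
        return "left"
    if w1.islower() and w2[0].isupper():   # lowercase - Uppercase - anycase
        return "right"
    return None

print(capitalization_vote("plasmodium", "vivax", "Malaria"))  # left
print(capitalization_vote("brain", "Stem", "cell"))           # right
print(capitalization_vote("vitamin", "D", "deficiency"))      # None (excluded)
```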


Web-derived Surface Features: Embedded Slash

  • Left embedded slash

    • leukemia/lymphoma cell → right


Web-derived Surface Features: Parentheses

  • Single-word

    • growth factor (beta) → left

    • (brain) stem cell → right

  • Two-word

    • (growth factor) beta → left

    • brain (stem cell) → right


Web-derived Surface Features: Comma, dot, semi-colon

  • Following the first word

    • home. health care → right

    • adult, male rat → right

  • Following the second word

    • health care, provider → left

    • lung cancer: patients → left


Web-derived Surface Features: Dash to External Word

  • External word to the left

    • mouse-brain stem cell → right

  • External word to the right

    • tumor necrosis factor-alpha → left


Other Web-derived Features: Abbreviation

  • After the second word

    • tumor necrosis (TN) factor → left

  • After the third word

    • tumor necrosis factor (NF) → right

  • We query for, e.g., “tumor necrosis tn factor”

  • Problems:

    • Roman digits: IV, VI

    • States: CA

    • Short words: me


Other Web-derived Features: Concatenation

  • Consider health care reform

    • healthcare : 79,500,000

    • carereform : 269

    • healthreform: 812

  • Adjacency model

    • healthcare vs. carereform

  • Dependency model

    • healthcare vs. healthreform

  • Triples

    • “healthcarereform” vs. “health carereform”
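
A sketch of how these concatenation counts could be compared under the adjacency and dependency views (hit_counts is a hypothetical lookup seeded with the numbers quoted above):

```python
# Counts quoted on the slide (normally obtained from search-engine queries)
hit_counts = {"healthcare": 79_500_000, "carereform": 269, "healthreform": 812}

def concatenation_vote(w1, w2, w3, model="dependency"):
    """Evidence for left bracketing = count of w1+w2 concatenated;
    the rival is w2+w3 (adjacency view) or w1+w3 (dependency view)."""
    left_evidence = hit_counts.get(w1 + w2, 0)
    rival = w2 + w3 if model == "adjacency" else w1 + w3
    right_evidence = hit_counts.get(rival, 0)
    return "left" if left_evidence > right_evidence else "right"

print(concatenation_vote("health", "care", "reform", model="adjacency"))   # left
print(concatenation_vote("health", "care", "reform", model="dependency"))  # left
```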


Other Web-derived Features: Reorder

  • Reorders for “health care reform”

    • “care reform health” → right

    • “reform health care” → left


Other Web-derived Features: Internal Inflection Variability

  • Vary inflection of second word

    • tyrosine kinase activation

    • tyrosine kinases activation


Other Web-derived Features: Switch The First Two Words

  • Predict right if we can reorder

    • adult male rat as

    • male adult rat


Our U + U + U Algorithm

  • Compute bigram estimates

  • Compute estimates from surface features

  • Compute estimates from paraphrases

  • Combine these scores with a voting algorithm to choose left or right bracketing.


Paraphrases

  • The semantics of a noun compound is often made overt by a paraphrase (Warren, 1978)

    • Prepositional

      • stem cells in the brain → right

      • cells from the brain stem → left

    • Verbal

      • virus causing human immunodeficiency → left

    • Copula

      • office building that is a skyscraper → right


Paraphrases

  • prepositional paraphrases:

    • We use: ~150 prepositions

  • verbal paraphrases:

    • We use: associated with, caused by, contained in, derived from, focusing on, found in, involved in, located at/in, made of, performed by, preventing, related to and used by/in/for.

  • copula paraphrases:

    • We use: is/was and that/which/who

  • optional elements:

    • articles: a, an, the

    • quantifiers: some, every, etc.

    • pronouns: this, these, etc.
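
A sketch of how such paraphrase queries might be generated for a three-word compound (the phrase templates below use only a handful of the prepositions and copula forms listed above, and are illustrative rather than the full inventory):

```python
def paraphrase_queries(w1, w2, w3):
    """Generate a few left- vs right-predicting paraphrase queries for "w1 w2 w3".

    Left bracketing [[w1 w2] w3]: w3 paraphrased as the head of "w1 w2".
    Right bracketing [w1 [w2 w3]]: "w2 w3" paraphrased as the head of w1.
    """
    preps = ["of", "in", "from", "for"]          # a tiny subset of the ~150 prepositions
    copulas = ["that is", "which is"]
    left = [f'"{w3} {p} {w1} {w2}"' for p in preps]        # e.g. officer of law enforcement
    left += [f'"{w3} {c} a {w1} {w2}"' for c in copulas]
    right = [f'"{w2} {w3} {p} the {w1}"' for p in preps]   # e.g. stem cells in the brain
    right += [f'"{w2} {w3} {c} a {w1}"' for c in copulas]
    return {"left": left, "right": right}

queries = paraphrase_queries("law", "enforcement", "officer")
print(queries["left"][0])   # "officer of law enforcement"
```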


Our U + U + U Algorithm

  • Compute bigram estimates

  • Compute estimates from surface features

  • Compute estimates from paraphrases

  • Combine these scores with a voting algorithm to choose left or right bracketing.
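
A sketch of the combination step as a simple majority vote over the individual estimators (the vote list and the tie-breaking default are illustrative; the actual system combines the signals in a more refined way):

```python
def combine_votes(votes):
    """Majority vote over 'left' / 'right' / None decisions from the individual
    estimators (bigram statistics, surface features, paraphrases)."""
    valid = [v for v in votes if v in ("left", "right")]
    lefts, rights = valid.count("left"), valid.count("right")
    if lefts == rights:
        return "left"   # tie-break toward left bracketing, the more common case overall
    return "left" if lefts > rights else "right"

votes = ["left",    # bigram (dependency) estimate
         "left",    # dash / possessive surface features
         None,      # paraphrases found nothing
         "right"]   # concatenation feature
print(combine_votes(votes))  # -> left
```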


Evaluation: Datasets

  • Lauer Set

    • 244 noun compounds (NCs)

      • from Grolier’s encyclopedia

      • inter-annotator agreement: 81.5%

  • Biomedical Set

    • 430 NCs

      • from MEDLINE

      • inter-annotator agreement: 88% (κ = 0.606)


Co-occurrence Statistics

  • Lauer set

  • Bio set


Paraphrase and Surface Features Performance

  • Lauer Set

  • Biomedical Set


Individual Surface Features Performance: Bio


Individual Surface Features Performance: Bio


Results: Lauer


Results: Comparing with Others


Results: Bio


Results for Noun Compound Bracketing

  • Introduced search engine statistics that go beyond the n-gram (applicable to other tasks)

    • surface features

    • paraphrases

  • Obtained new state-of-the-art results on NC bracketing

    • more robust than Lauer (1995)

    • more accurate than Keller & Lapata (2004)


Prepositional Phrase Attachment

Problem:

(a) Peter spent millions of dollars. (noun attach)

(b) Peter spent time with his family. (verb attach)

Which attachment for quadruple:

(v, n1, p, n2)

Results:

Much simpler than other algorithms

As good as or better than best unsupervised, and better than some supervised approaches


Noun Phrase Coordination

  • (Modified) real sentence:

    • The Department of Chronic Diseases and Health Promotion leads and strengthens global efforts to prevent and control chronic diseases or disabilities and to promote health and quality of life.


NC coordination: ellipsis

  • Ellipsis

    • car and truck production

    • means car production and truck production

  • No ellipsis

    • president and chief executive

  • All-way coordination

    • Securities and Exchange Commission


Results: 428 examples from the Penn Treebank


Semantic Relation Detection

  • Goal: automatically augment a lexical database

  • Many potential relation types:

    • ISA (hypernymy/hyponymy)

    • Part-Of (meronymy)

  • Idea: find unambiguous contexts which (nearly) always indicate the relation of interest


Lexico-Syntactic Patterns


Lexico-Syntactic Patterns


Adding a New Relation


Semantic Relation Detection

  • Lexico-syntactic Patterns:

    • Should occur frequently in text

    • Should (nearly) always suggest the relation of interest

    • Should be recognizable with little pre-encoded knowledge.

  • These patterns have been used extensively by other researchers.


Semantic Relation Detection

  • What relationship holds between two nouns?

    • olive oil – oil comes from olives

    • machine oil – oil used on machines

  • Assigning the meaning relations between these terms has been seen as a very difficult problem

  • Our solution:

    • Use clever queries against the web to figure out the relations.


Queries for Semantic Relations

  • Convert the noun-noun compound into a query of the form:

  • noun2 that * noun1

  • “oil that * olive(s)”

  • This returns search result snippets containing interesting verbs.

    • In this case:

      • Come from

      • Be obtained from

      • Be extracted from

      • Made from
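
A sketch of the query construction and verb harvesting step, assuming a source of result snippets for each query (search snippets are passed in; the part-of-speech filtering in the real system is more elaborate than this regex):

```python
import re
from collections import Counter

def relation_queries(noun1, noun2):
    """Build "noun2 that * noun1" style queries, allowing a plural noun1."""
    return [f'"{noun2} that * {noun1}"', f'"{noun2} that * {noun1}s"']

def harvest_verbs(snippets, noun2):
    """Very rough extraction: grab the word or two right after "noun2 that"."""
    verbs = Counter()
    for snip in snippets:
        for m in re.finditer(rf"{noun2} that (\w+(?: \w+)?)", snip.lower()):
            verbs[m.group(1)] += 1
    return verbs

print(relation_queries("olive", "oil"))
# ['"oil that * olive"', '"oil that * olives"']
snippets = ["... oil that is extracted from olives ...",
            "... oil that comes from the olive tree ..."]
print(harvest_verbs(snippets, "oil").most_common())
```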


Uncovering Semantic Relations

  • More examples:

    • Migraine drug -> treat, be used for, reduce, prevent

    • Wrinkle drug -> treat, be used for, reduce, smooth

    • Printer tray -> hold, come with, be folded, fit under, be inserted into

    • Student protest -> be led by, be sponsored by, pit, be, be organized by


Application: SAT Analogy Problems


Tackling the SAT Analogy Problem

  • First, issue queries to find the relations (features) that hold between each word pair

  • Compare the features for each answer pair to those of the question pair.

    • Weight the features with term counts and document counts

    • Compare the weighted feature sets using the Dice coefficient
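
A sketch of comparing two word pairs by their weighted feature overlap with the Dice coefficient (the weighting by term and document counts is collapsed into a single weight per feature here, and the feature sets are invented for illustration):

```python
def dice(features_a, features_b):
    """Dice coefficient over weighted feature sets:
    2 * (shared weight) / (total weight of A + total weight of B)."""
    shared = set(features_a) & set(features_b)
    overlap = sum(min(features_a[f], features_b[f]) for f in shared)
    total = sum(features_a.values()) + sum(features_b.values())
    return 2 * overlap / total if total else 0.0

# Hypothetical weighted features for the question pair and two candidate pairs
question = {"include": 5, "consist of": 3, "of": 8}
candidates = {"committee:member": {"include": 4, "of": 6, "and": 2},
              "car:engine":       {"be powered by": 3, "of": 2}}
best = max(candidates, key=lambda c: dice(question, candidates[c]))
print(best)  # -> committee:member
```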


Queries for SAT Analogy Problem


Extract Features from Retrieved Text

  • Verb

    • The committee includes many members.

    • This is a committee, which includes many members.

    • This is a committee, including many members.

  • Verb + Preposition

    • The committee consists of many members.

  • Preposition

    • He is a member of the committee.

  • Coordinating Conjunction

    • the committee and its members


Most Frequent Features for “committee member”


SAT Results: Nouns Only


Conclusions

  • The enormous size of the web opens new opportunities for text analysis

    • There are many words, but they are more likely to appear together in a huge dataset

    • This allows us to do word-specific analysis

  • To counter the labeled-data roadblock, we start with unambiguous features that we can find naturally.

    • We’ve applied this to structural and semantic language problems.

    • These are stepping stones towards sophisticated language understanding.


http://biotext.berkeley.edu

Supported in part by NSF DBI-0317510

Thank you!


Using n-grams to make predictions

  • Say trying to distinguish:

    [home health] care

    home [health care]

  • Main idea: compare these co-occurrence probabilities

    • “home health” vs

    • “health care”


Using n-grams to make predictions

  • Use search-engine page hits as a proxy for n-gram counts

    • compare Pr(w1→w2 | w2) to Pr(w1→w3 | w3)

    • Pr(w1→w2 | w2) = #(w1, w2) / #(w2)

      • #(w2): word frequency; query for “w2”

      • #(w1, w2): bigram frequency; query for “w1 w2”


Probabilities: Why? (1)

  • Why should we use:

    • (a) Pr(w1w2|w2), rather than

    • (b) Pr(w2w1|w1)?

  • Keller&Lapata (2004) calculate:

    • AltaVista queries:

      • (a): 70.49%

      • (b): 68.85%

    • British National Corpus:

      • (a): 63.11%

      • (b): 65.57%


Probabilities: Why? (2)

  • Why should we use:

    • (a) Pr(w1w2|w2), rather than

    • (b) Pr(w2w1|w1)?

  • Maybe to introduce a bracketing prior.

    • Just like Lauer (1995) did.

  • But otherwise, no reason to prefer either one.

    • Do we need probabilities? (association is OK)

    • Do we need a directed model? (symmetry is OK)


Adjacency & Dependency (2)

  • right bracketing: [ w1 [ w2 w3 ] ]

    • w2 w3 is a compound (modified by w1)

    • w1 and w2 independently modify w3

  • adjacency model

    • Is w2 w3 a compound?

    • (vs. w1 w2 being a compound)

  • dependency model

    • Does w1 modify w3?

    • (vs. w1 modifying w2)



Paraphrases: pattern (1)

  • v n1 p n2 → v n2 n1 (noun)

  • Can we turn “n1 p n2” into a noun compound “n2 n1”?

    • meet/v demands/n1 from/p customers/n2 →

    • meet/v the customer/n2 demands/n1

  • Problem: ditransitive verbs like give

    • gave/v an apple/n1 to/p him/n2 →

    • gave/v him/n2 an apple/n1

  • Solution:

    • no determiner before n1

    • determiner before n2 is required

    • the preposition cannot be to
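
A sketch of how pattern (1) could be instantiated as web queries under the constraints above (the query templates and the fixed determiner list are illustrative placeholders, and singular/plural normalization of n2 is omitted):

```python
def pattern1_queries(v, n1, p, n2):
    """Sketch: turn "v n1 p n2" into noun-paraphrase queries "v DET n2 n1".

    Constraints from the slide: no determiner before n1 in the original,
    a determiner before n2 is required in the paraphrase, and the
    preposition must not be "to" (to avoid ditransitives like "give")."""
    if p == "to":
        return None
    return [f'"{v} {det} {n2} {n1}"' for det in ("the", "a", "an")]

print(pattern1_queries("meet", "demands", "from", "customers"))
# ['"meet the customers demands"', '"meet a customers demands"', '"meet an customers demands"']
print(pattern1_queries("gave", "apple", "to", "him"))  # None (ditransitive trap)
```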


Paraphrases: pattern (2)

  • v n1 p n2 → v p n2 n1 (verb)

  • If “p n2” is an indirect object of v, then it could be switched with the direct object n1.

    • had/v a program/n1 in/p place/n2 →

    • had/v in/p place/n2 a program/n1

A determiner before n1 is required to prevent “n2 n1” from forming a noun compound.


Paraphrases: pattern (3)

  • v n1 p n2 → p n2 * v n1 (verb)

  • “*” indicates a wildcard position (up to three intervening words are allowed)

  • Looks for appositions, where the PP has moved in front of the verb, e.g.

    • I gave/v an apple/n1 to/p him/n2 →

    • to/p him/n2 I gave/v an apple/n1


Paraphrases: pattern (4)

  • v n1 p n2 → n1 p n2 v (noun)

  • Looks for appositions, where “n1 p n2” has moved in front of v

    • shaken/v confidence/n1 in/p markets/n2 →

    • confidence/n1 in/p markets/n2 shaken/v


Paraphrases: pattern (5)

  • v n1 p n2 → v PRONOUN p n2 (verb)

  • n1 is a pronoun → verb attachment (Hindle & Rooth, 93)

  • Pattern (5) substitutes n1 with a dative pronoun (him or her), e.g.

    • put/v a client/n1 at/p odds/n2 →

    • put/v him at/p odds/n2


Paraphrases: pattern (6)

  • v n1 p n2 → BE n1 p n2 (noun)

  • BE is typically used with a noun attachment

  • Pattern (6) substitutes v with a form of to be (is or are), e.g.

    • eat/v spaghetti/n1 with/p sauce/n2 →

    • is spaghetti/n1 with/p sauce/n2

